in userland rename in-kernel getenv()/setenv() to kern_setenv()/kern_getenv().
This fixes a namespace collision with libc symbols.
Submitted by: kmacy
Tested by: make universe
This includes support for:
- Read-Write Error Recovery mode page;
- Informational Exceptions Control mode page;
- Logical Block Provisioning mode page;
- LOG SENSE command.
No real Informational Exceptions features yet. This is only a placeholder.
Sponsored by: iXsystems, Inc.
SPC-4 r2 allows to return empty defect list if the list is not supported.
We don't reallu support defect data lists, but this suppresses some errors.
MFC after: 1 week
Make this subcommand less FC-specific, reporting target and port addresses
in more generic way. Also make it report list of connected initiators in
unified way, working for both FC and iSCSI, and potentially others.
MFC after: 1 week
Queued async events handling in CAM opened race, that may lead to duplicate
AC_PATH_REGISTERED events delivery during boot. That was not happening
before r272935 because the driver was initialized later. After that change
it started create duplicate ports in CTL.
Target mode operation does not depend on the initiator mode scan process.
This change allows the target driver to attach earlier and receive some
async events (like AC_CONTRACT) that could be lost otherwise.
MFC after: 1 week
Such LUNs will be visible to initiators, but return "not ready" status
on media access commands. If backing storage become available later,
`ctladm modify ...` or `service ctld reload` can trigger its reopen.
It allows to push out some final data from the send queue to the socket
before its close. In particular, it increases chances for logout response
to be delivered to the initiator.
Before this change target could send R2T request for write transfer of any
size, that could violate iSCSI RFC, which allows initiator to limit maximum
R2T size by negotiating MaxBurstLength connection parameter.
Also report an error in case of write underflow, when initiator provides
less data than initiator expects. Previously in such case our target
sent R2T request for non-existing data, violating the RFC, and confusing
some initiators. SCSI specs don't explicitly define how write underflows
should be handled and there are different oppinions, but reporting error
is hopefully better then violating iSCSI RFC with unpredictable results.
MFC after: 2 weeks
SPC-2 tells REPORT LUNS shall be supported by devices supporting LUNs other
then LUN 0. If we see LUN 0 disconnected, guess there may be others, and
so REPORT LUNS shall be supported.
MFC after: 1 month
Previous logic was not differentiating disconnected LUNs and absent targets.
That made it to stop scan if LUN 0 was not found for any reason. That made
problematic, for example, using iSCSI targets declaring SPC-2 compliance and
having no LUN 0 configured.
The new logic continues sequential LUN scan if:
-- we have more configured LUNs that need recheck;
-- this LUN is connected and its SCSI version allows more LUNs;
-- this LUN is disconnected, its SCSI version allows more LUNs and we
guess they may be connected (we haven't scanned first 8 LUNs yet or
kern.cam.cam_srch_hi sysctl is set to scan more).
Reported by: trasz
MFC after: 1 month
or I_T NEXUS LOSS, clear all minor UAs for the LUN, redundant in this case.
All SAM specifications tell that target MAY do it, but libiscsi initiator
seems require it to be done, terminating connection with error if some more
UAs happen to be reported during iSCSI connection.
MFC after: 3 days
Using pointer from the cdev directly is dangerous since we have no reference
on it, and it may change any time. That caused panic if device has gone.
While there, report capacity change only if it really changed.
MFC after: 3 days
Without clustering support we any way have only one group of permanently
active ports, but that gives us one more supported VMWare feature. ;)
Solaris' Comstar also reports it even when only one port is present.
devq_openings counter lost its meaning after allocation queues has gone.
held counter is still meaningful, but problematic to update due to separate
locking of CCB allocation and queuing.
To fix that replace devq_openings counter with allocated counter. held is
now calculated on request as difference between number of allocated, queued
and active CCBs.
MFC after: 1 month
It allows to bypass range checks between UNMAP and READ/WRITE commands,
which may introduce additional delays while waiting for UNMAP parameters.
READ and WRITE commands are always processed in safe order since their
range checks are almost free.
Before this change UNMAP completely blocked other I/Os while running.
Now it blocks only colliding ones, slowing down others only due to ZFS
locks collisions.
Sponsored by: iXsystems, Inc.
None of existing STEC devices need UNMAP or even support it well, having
many limitations and even hanging sometimes executing those commands.
New devices that may use UNMAP going to be released under HGST name.
MFC after: 3 days
kern.cam.ctl.iscsi.ping_timeout to 0. This fixes interoperability with
some initiators that don't properly support NOP-Ins, namely iPXE/gPXE.
Submitted by: Chen Wen <pokkys@gmail.com>
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
At this moment it works only for files and ZVOLs in device mode since BIOs
have no respective respective cache control flags (DPO/FUA).
MFC after: 1 month
Sponsored by: iXsystems, Inc.
that's ATAPI specific. Instead, skip to PROBE_SET_MULTI instead for
non ATAPI protocols. The prior code incorrectly terminated the probe
with a break, rather than arranging for probedone to get called. This
caused panics or worse on some systems.
that's only mostly similar. Specifically word 78 bits are defined for
IDENTIFY DEVICE as
5 Supports Hardware Feature Control
while a IDENTIFY PACKET DEVICE defines them as
5 Asynchronous notification supported
Therefore, only pay attention to bit 5 when we're talking to ATAPI
devices (we don't use the hardware feature control at this time).
Ignore it for ATA devices. Remove kludge that papered over this issue
for Samsung SATA SSDs, since Micron drives also have the bit set and
the error was caused by this bad interpretation of the spec (which is
quite easy to do, since bits aren't normally overlapping like this).
sizeof(struct scsi_inquiry_data) of 256 bytes combined with off-by-one
error in the changed code gave total INQUIRY data length above 255 bytes,
that was maximal INQUIRY length in SPC-2. While SPC-3 increased the
maximal length to 64K, at least sg3_utils are still confused by that.
MFC after: 1 week
This allows to avoid extra network traffic when copying files on NTFS iSCSI
disks within one storage host by drag'n'dropping them in Windows Explorer
of Windows 8/2012. It should also accelerate Hyper-V VM operations, etc.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
left some of the decisions based on its counterpart, SA_CCB_BUFFER_IO
being random. As a result, propagation of the residual information
for the SPACE command was broken, so the number of filemarks
encountered during a SPACE operation was miscalculated. Consequently,
systems relying on properly tracked filemark counters (like Bacula)
fell apart.
The change also removes a switch/case in sadone() which r256843
degraded to a single remaining case label.
PR: 192285
Approved by: ken
MFC after: 2 weeks
Unlike disk devices ZVOLs process all requests synchronously. That makes
impossible sending multiple requests to them from single thread. From the
other side ZVOLs have real d_read/d_write methods, which unlike d_strategy
can handle uio scatter/gather and have no strict I/O size limitations.
So, if ZVOL in "dev" mode is detected, use of d_read/d_write methods instead
of d_strategy allows to avoid pointless splitting of large requests into
MAXPHYS (128K) sized chunks.
MFC after: 1 week
link_elf_obj: symbol icl_pdu_new_bhs undefined
PR: 192031
Submitted by: Nils Beyer (earlier version)
MFC after: 3 days
Sponsored by: FreeBSD Foundation
After I gave each iSCSI target its own port, the old limit appeared to be
not so big. This change almost proportionally increases per-LUN memory
use, but it is still three times better then it was before r268807.
MFC after: 2 weeks
CTL never had use for CA support code since SPI has gone, and there is no
even frontends supporting that. But it still was reserving 256 bytes of
memory per LUN per every possible initiator on every possible port.
Wrap unused code with ifdef's in case somebody even need it.
MFC after: 2 weeks
This allows to clone VMs and move them between LUNs inside one storage
host without generating extra network traffic to the initiator and back,
and without being limited by network bandwidth.
LUNs participating in copy operation should have UNIQUE NAA or EUI IDs set.
For LUNs without these IDs VMWare will use traditional copy operations.
Beware: the above LUN IDs explicitly set to values non-unique from the VM
cluster point of view may cause data corruption if wrong LUN is addressed!
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
That should make operation more kind to multi-initiator environment.
Without this, other initiators may find out that something bad happened
to their commands only via command timeout.
Testing shown that both original queued design with separate task queue,
and recent direct execution design had significant flaw: If abort request
arrives just after the victim, the last one may not be in the ooa_queue
yet, and so invisible for the task management function.
Unlike original queued implementation, use same queue for all SCSI and
TASK requests from the same initiator. That avoids races between them:
task functions are always executed in proper time, relatively to other
requests.
If port passed negative IID value, the function will try to allocate IID
from the pool of unused, based on passed wwpn or name arguments. It does
all its best to make IID unique and persistent across reconnects.
This makes persistent reservation properly work for iSCSI. Previously,
in case of reconnects, reservation could be unexpectedly lost, or even
migrate between intiators.
teardown, and new port creation during `service ctld restart`.
Close it by returning iSCSI port internal state, that allows to identify
dying ports, which should not be counted as existing, from really alive.
Instead make ports provide wanted port and target IDs, and LUNs provide
wanted LUN IDs. After that core Device ID VPD code only had to link all
of them together and add relative port and port group numbers.
LUN ID for iSCSI LUNs no longer created by CTL, but by ctld, and passed
to CTL as "scsiname" LUN option. This makes LUNs to report the same set
of IDs, independently from the port through which it is accessed, as
required by SCSI specifications.
Having single port for all iSCSI connections makes problematic implementing
some more advanced SCSI functionality in CTL, that require proper ports
enumeration and identification.
This change extends CTL iSCSI API, making ctld daemon to control list of
iSCSI ports in CTL. When new target is defined in config fine, ctld will
create respective port in CTL. When target is removed -- port will be
also removed after all active commands through that port properly aborted.
This change require ctld to be rebuilt to match the kernel.
As a minor side effect, this allows to have iSCSI targets without LUNs.
While that may look odd and not very useful, that is not incorrect.
Before iSCSI implementation CTL had no knowledge about frontend drivers,
it had only frontends, which really were ports (alike to LUNs, if comparing
to backends). But iSCSI added there ioctl() method, which does not belong
to frontend as a port, but belongs to a frontend driver.
camcontrol(8) now supports a new 'persist' subcommand that allows users to
issue SCSI PERSISTENT RESERVE IN / OUT commands.
sbin/camcontrol/Makefile:
Add persist.c.
sbin/camcontrol/persist.c:
New persistent reservation support for camcontrol(8).
We have support for all known operation modes for PERSISTENT RESERVE
IN and PERSISTENT RESERVE OUT.
exceptions noted above.
sbin/camcontrol/camcontrol.8:
Document the new 'persist' subcommand.
In the section on the Transport ID (-I) option, explain what
Transport IDs for each protocol should look like. At some point
some of this information could probably get moved off in a
separate man page, either on Transport IDs alone or a man page
documenting the Transport ID parsing code.
Add a number of examples of persistent reservation commands.
Persistent Reservations are complex enough that the average user
probably won't be able to get the commands exactly right by just
reading the man page. These examples show a few basic and
advanced examples of how to use persistent reservations.
sbin/camcontrol/camcontrol.h:
Move the definition for camcontrol_optret here, so we can use it
for the persistent reservation code.
Add a definition for the new scsipersist() function.
sbin/camcontrol/camcontrol.c:
Add 'persist' to the list of subcommands.
Document 'persist' in the help text.
sys/cam/scsi/scsi_all.c:
Add the scsi_persistent_reserve_in() and
scsi_persistent_reserve_out() CCB building functions.
Add a new function, scsi_transportid_sbuf(). This takes a
SCSI Transport ID (documented in SPC-4), and prints it to
an sbuf(9). There are some transports (like ATA, USB, and
SSA) for which there is no transport defined. We need to
come up with a reasonable thing to do if we're presented
with a Transport ID that claims to be for one of those
protocols.
Add new routines scsi_get_nv() and scsi_nv_to_str().
These functions do a table lookup to go between a string and an
integer. There are lots of table lookups needed in the
persistent reservation code in camcontrol(8).
Add a new function, scsi_parse_transportid(), along with leaf node
functions to parse:
FC, 1394 and SAS (scsi_parse_transportid_64bit())
iSCSI (scsi_parse_transportid_iscsi())
SPI (scsi_parse_transportid_spi())
RDMA (scsi_parse_transportid_rdma())
PCIe (scsi_parse_transportid_sop())
Transport IDs. Given a string with the general form proto,id these
functions create a SCSI Transport ID structure.
sys/cam/scsi/scsi_all.h:
Update the various persistent reservation data structures to
SPC4r36l, but also rename some fields that were previously
obsolete with the proper names from older SCSI specs. This
allows using older, obsolete persistent reservation types when
desired.
Add function prototypes for the new persistent reservation CCB
building functions.
Add a data strucure for the READ FULL STATUS service action
of the PERSISTENT RESERVE IN command.
Add Transport ID structures for all protocols described in SPC-4.
Add a new series of SCSI_PROTO_XXX definitions, and
redefine other defines in terms of these new definitions.
Add a prototype for scsi_transportid_sbuf().
Change a couple of "obsolete" persistent reservation data
structure fields into something more meaningful, based on
what the field was called when it was defined in the spec.
(e.g. SPC, SPC-2, etc.)
Create a new define, SPRI_MAX_LEN, for the maximum allocation
length allowed for the PERSISTENT RESERVE IN command.
Add data structures and enumerations for the new name/value
translation functions.
Add data structures for SCSI over PCIe Routing IDs.
Bring the PERSISTENT RESERVE OUT Register and Move parameter list
structure (struct scsi_per_res_out_parms) up to date with SPC-4.
Add a data structure for the transport IDs that can optionally be
appended to the basic PERSISTENT RESERVE OUT parameter list.
Move SCSI protocol macro definitions out of the VPD page 0x83
definition and combine them with the more up to date protocol
definitions higher in the file.
Add function prototypes for scsi_nv_to_str(), scsi_get_nv(),
scsi_parse_transportid_64bit(), scsi_parse_transportid_spi(),
scsi_parse_transportid_rdma(), scsi_parse_transportid_iscsi(),
scsi_parse_transportid_sop(), and scsi_parse_transportid().
Sponsored by: Spectra Logic Corporation
MFC after: 1 week
requests on the trim_queue, even for the CFA ERASE. This allows us, in
the future, to collapse adjacent requests. Since CFA ERASE is only for
CF cards, and it is so restrictive in what it can do, the collapse
code is not presently here. This also brings the ada driver more in
line with the da driver's treatment of BIO_DELETEs.
Reviewed by: mav@
For every supported command define CDB length and mask of bits that are
allowed to be set. This allows to remove bunch of checks through the code
and still make the validation more strict. To properly do it for commands
supporting multiple service actions, formalize their parsing by adding
subtables for each of such commands.
As visible effect, this change allows to add support for REPORT SUPPORTED
OPERATION CODES command, reporting to client all the data about supported
SCSI commands, except timeouts.
MFC after: 2 weeks
These changes prevent sysctl(8) from returning proper output,
such as:
1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.
Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.
MFC after: 2 weeks
Sponsored by: Mellanox Technologies
Instead of trying to guess size of disk I/O operations (it just won't work
that way for newly added commands, and is equal to data move size for old
ones), account data move traffic. If disk I/Os are that interesting, then
backends have to account and provide that information.
Block backend already exports the information about disk I/Os via devstat,
so having it here too is excessive.
MFC after: 2 weeks
This gives some use to 512KB per-LUN buffers, allocated for Copan-specific
processor code and not used. It allows, for example, to test transport
performance and/or correctness without accessing the media, as supported
by Linux version of sg3_utils.
MFC after: 2 weeks
Split global ctl_lock, historically protecting most of CTL context:
- remaining ctl_lock now protects lists of fronends and backends;
- per-LUN lun_lock(s) protect LUN-specific information;
- per-thread queue_lock(s) protect request queues.
This allows to radically reduce congestion on ctl_lock.
Create multiple worker threads, depending on number of CPUs, and assign
each LUN to one of them. This allows to spread load between multiple CPUs,
still avoiging congestion on queues and LUNs locks.
On 40-core server, exporting 5 LUNs, each backed by gstripe of SATA SSDs,
accessed via 6 iSCSI connections, this change improves peak request rate
from 250K to 680K IOPS.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
While for FreeBSD client that is only a minor optimization, VMWare client
doesn't support additional data requests after all data being sent once as
immediate.
MFC after: 1 week
Sponsored by: iXsystems, Inc.
From one side it allows to remove CTL_FLAG_TASK_PENDING flag, handling of
which significantly complicates fine-grained locking. From the other side
it reduces task management requests latency even below then that flag could.
As downside, it denies task management code to sleep, but that is not needed
any way now.
Discussed with: ken
This should allow to abort commands doing mostly disk I/O, such as VERIFY
or WRITE SAME. Before this change CTL_FLAG_ABORT was only checked around
data moves, which for these commands may not happen for a very long time.
MFC after: 2 weeks
SPC-4 recommends T10 vendor ID based LUN ID was created by concatenating
product name and serial number (and istgt follows that). But product name
is 16 bytes long by itself, so 16 bytes total length is clearly not enough
to fit both.
To keep compatibility with existing configurations, pad short device IDs
to old length of 16, same as before.
This change probably breaks CTL user-level ABI, so control tools should
be rebuilt after this change.
MFC after: 2 weeks
we're now back to the pre-r228483 level of default verbosity. This in
turn again typically allows for reading information that userland might
have printed on the screen before initiating a halt, but still permits
to debug potential device shutdown problems on system shutdown via
CAM_DEBUG etc.
Reviewed by: mav
MFC after: 3 days
Sponsored by: Bally Wulff Games & Entertainment GmbH
for any outstanding commands to be properly aborted by CTL.
Without it, in some cases (such as files backing the LUNs
stored on failing disk drives), terminating a busy session
would result in panic.
Reviewed by: mav@ (earlier version)
Sponsored by: The FreeBSD Foundation
Make data_submit backends method support not only read and write requests,
but also two new ones: verify and compare. Verify just checks readability
of the data in specified location without transferring them outside.
Compare reads the specified data and compares them to received data,
returning error if they are different.
VERIFY(10/12/16) commands request either verify or compare from backend,
depending on BYTCHK CDB field. COMPARE AND WRITE command executed in two
stages: first it requests compare, and then, if succeesed, requests write.
Atomicity of operation is guarantied by CTL request ordering code.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
trims to the device assumes the list is sorted. Don't apply the
optimization of not sorting the queue when we have SSDs to the
delete_queue, since it causes more discard traffic to the drive. While
one could argue that the higher levels should coalesce the trims,
that's not done today, so some optimization at this level is needed.
CR: https://phabric.freebsd.org/D142
Make it really work for native FreeBSD programs. Before this it was broken
for years due to different number of pointer dereferences in Linux and
FreeBSD IOCTL paths, permanently returning errors to FreeBSD programs.
This change breaks the driver FreeBSD IOCTL ABI, making it more strict,
but since it was not working any way -- who bother.
Add shims for 32-bit programs on 64-bit host, translating the argument
of the SG_IO IOCTL for both FreeBSD and Linux ABIs.
With this change I was able to run 32-bit Linux sg3_utils tools and simple
32 and 64-bit FreeBSD test tools on both 32 and 64-bit FreeBSD systems.
MFC after: 1 month
Nobody yet reported disk supporting I/Os less then our MAXPHYS value, but
since we any way have code to read Block Limits VPD page, that is easy.
MFC after: 2 weeks
Instead of rereading VPD pages on every device open, do it only on initial
device probe, and in cases when device reported via UNIT ATTENTIONs that
something has changed. Capacity is still rereaded on every open because
it is more critical for operation and more probable to change in run time.
On my tests with Intel 530 SSDs on mps(4) HBA this change reduces time
GEOM needs to retaste the device (that includes few open/close cycles)
from ~150ms to ~30ms.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Instead of allocating up to 16MB or RAM at once to handle whole I/O,
allocate up to 1MB at a time, but do multiple ctl_datamove() and storage
I/Os if needed.
Unfortunately we can't check range collisions for UNMAP commands alike
to writes, because they include multiple ranges, which are also passed
in data block, not in CDB. As result, UNMAP commands have to be treated
as colliding with any other command accessing the media.
From the other side all UNMAPs are equal (we don't support ANCHOR flag),
so we can execute several UNMAPs same time.
This code was heavily broken few months ago during CAM locking changes.
Fixing it would require almost complete rewrite. Since there are no
known devices on market using this interface younger then ~15 years, and
they are CD, not even DVD, I don't see much reason to rewrite it.
This change does not mean those devices won't work. They will just work
slower due to inefficient disks load/unload schedule if several LUNs
accessed same time.
Discussed with: ken@
Silence on: scsi@, hardware@
MFC after: 1 week
This patch adds support for three new SCSI commands: UNMAP, WRITE SAME(10)
and WRITE SAME(16). WRITE SAME commands support both normal write mode
and UNMAP flag. To properly report UNMAP capabilities this patch also adds
support for reporting two new VPD pages: Block limits and Logical Block
Provisioning.
UNMAP support can be enabled per-LUN by adding "-o unmap=on" to `ctladm
create` command line or "option unmap on" to lun sections of /etc/ctl.conf.
At this moment UNMAP supported for ramdisks and device-backed block LUNs.
It was tested to work great with ZFS ZVOLs. For file-backed LUNs UNMAP
support is unfortunately missing due to absence of respective VFS KPI.
Reviewed by: ken
MFC after: 1 month
Sponsored by: iXsystems, Inc
This avoids extra locking in icl_pdu_queue(); the upper layer needs to call
it while holding its own lock anyway, to avoid sending PDUs out of order.
Sponsored by: The FreeBSD Foundation