Commit Graph

610 Commits

Author SHA1 Message Date
John Baldwin
c261b6ea4e iscsi: Teach the iSCSI stack about "large" received PDUs.
When using iSCSI PDU offload (cxgbei) on T6 adapters, a burst of
received PDUs can be reported via a single message to the driver.

Previously the driver passed these multi-PDU bursts up to the iSCSI
stack up as a single "large" PDU by rewriting the buffer offset, data
segment length, and DataSN fields in the iSCSI header.  The DataSN
field in particular was rewritten so that each of the "large" PDUs
used consecutively increasing values.  While this worked, the forged
DataSN values did not match the ExpDataSN value in the subsequent SCSI
Response PDU.  The initiator does not currently verify this value, but
the forged DataSN values prevent adding a check.

To avoid this, allow a logical iSCSI PDU (struct icl_pdu) to describe
a burst of PDUs via a new 'ip_additional_pdus' field.  Normally this
field is set to zero when 'struct icl_pdu' represents a single PDU.
If logical PDU represents a burst of on-the-wire PDUs, then 'ip_npdus'
contains the count of additional on-the-wire PDUs.  The header of this
"large" PDU is still modified, but the DataSN field now contains the
DataSN value of the first on-the-wire PDU in the burst.

Reviewed by:	mav
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D31577
2021-08-18 10:56:28 -07:00
Alexander Motin
303477d325 cam(4): Mark all sysctls as CTLFLAG_MPSAFE.
This code does not use Giant lock for very long time.

MFC after:	2 weeks
2021-08-10 20:07:19 -04:00
John Baldwin
f0594f52f6 iSCSI: Add support for segmentation offload for hardware offloads.
Similar to TSO, iSCSI segmentation offload permits the upper layers to
submit a "large" virtual PDU which is split up into multiple segments
(PDUs) on the wire.  Similar to how the TCP/IP headers are used as
templates for TSO, the BHS at the start of a large PDU is used as a
template to construct the specific BHS at the start of each PDU.  In
particular, the DataSN is incremented for each subsequent PDU, and the
'F' flag is only set on the last PDU.

struct icl_conn has a new 'ic_hw_isomax' field which defaults to 0,
but can be set to the largest virtual PDU a backend supports.  If this
value is non-zero, the iSCSI target and initiator use this size
instead of 'ic_max_send_data_segment_length' to determine the maximum
size for SCSI Data-In and SCSI Data-Out PDUs.  Note that since PDUs
can be constructed from multiple buffers before being dispatched, the
target and initiator must wait for the PDU to be fully constructed
before determining the number of DataSN values were consumed (and thus
updating the per-transfer DataSN value used for the start of the next
PDU).

The target generates large PDUs for SCSI Data-In PDUs in
cfiscsi_datamove_in().  The initiator generates large PDUs for SCSI
Data-Out PDUs generated in response to an R2T.

Reviewed by:	mav
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D31222
2021-08-06 14:03:00 -07:00
Konstantin Belousov
0ef5eee9d9 Add vn_lktype_write()
and remove repetetive code that calculates vnode locking type for write.

Reviewed by:	khng, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D31405
2021-08-04 19:40:13 +03:00
Edward Tomasz Napierala
616a676a05 cam: clear stack-allocated CCB in the target layer
Note that, as pointed out by scottl@, this code should really look
a bit different, in that the stack allocations should be replaced
with dynamic allocation, and the periph creation should be moved
to a context where one can use M_WAITOK.  See the review for more
details.  For now let's go with a minimal fix until we're done with
UMA CCBs.

Reviewed By:	mav, imp
Sponsored by:	NetApp, Inc.
Sponsored by:	Klara, Inc.
Differential Revision:	https://reviews.freebsd.org/D30298
2021-07-21 10:18:28 +01:00
John Baldwin
0cc7d64a2a iscsi: Move the maximum data segment limits into 'struct icl_conn'.
This fixes a few bugs in iSCSI backends where the backends were using
the limits they advertised initially during the login phase as the
final values instead of the values negotiated with the other end.

Reported by:	Jithesh Arakkan @ Chelsio
Reviewed by:	mav
Differential Revision:	https://reviews.freebsd.org/D30271
2021-05-20 09:59:11 -07:00
John Baldwin
71e3d1b3a0 iscsi: Always free a cdw before its associated ctl_io.
cxgbei stores state about a target transfer in the ctl_private[] array
of a ctl_io that is freed when a target transfer (represented by the
cdw) is freed.  As such, freeing a ctl_io before a cdw that references
it can result in a use after free in cxgbei.  Two of the four places
freed the cdw first, and the other two freed the ctl_io first.  Fix
the latter two places to free the cdw first.

Reported by:	Jithesh Arakkan @ Chelsio
Reviewed by:	mav
Differential Revision:	https://reviews.freebsd.org/D30270
2021-05-20 09:58:59 -07:00
Alexander Motin
ac503c194c Introduce "soft" serseq variant.
With new ZFS prefetcher improvements it is no longer needed to fully
serialize reads to reach decent prediction hit rate.  Softer variant
only creates small time window to reduce races instead of completely
blocking following reads while previous is running.  It much less
hurts the performance in case of prediction miss.

MFC after:	1 month
2021-04-06 17:27:16 -04:00
Alexander Motin
6ed39db257 Do not exit ctl_be_block_worker() prematurely.
Return while there are any I/Os in a queue may result in them stuck
indefinitely, since there is only one taskqueue task for all of them.
I think I've reproduced this by switching ha_role to secondary under
heavy load.

MFC after:	3 days
2021-03-05 22:45:47 -05:00
Alexander Motin
a59e2982fe Optimize out few extra memory accesses.
MFC after:	1 week
2021-03-01 18:36:33 -05:00
Alexander Motin
9d9fd8b79f Micro-optimize OOA queue processing.
- Move ctl_get_cmd_entry() calls from every OOA traversal to when
  the requests first inserted, storing seridx in struct ctl_scsiio.
- Move some checks out of the loop in ctl_check_ooa().
- Replace checks for errors that can not happen with asserts.
- Transpose ctl_serialize_table, so that any OOA traversal accessed
  only one row (cache line).  Compact it from enum to uint8_t.
- Optimize static branch predictions in hottest places.

Due to O(n) nature on deep LUN queues this can be the hottest code
path in CTL, and additional 20% of IOPS I see in some 4KB I/O tests
are good to have in reserve.  About 50% of CPU time here according
to the profiles is now spent in two memory accesses per traversed
request in OOA.

Sponsored by:	iXsystems, Inc.
MFC after:	2 weeks
2021-02-27 10:40:24 -05:00
Alexander Motin
a9bd22814f Remove pointless lun->be_lun checks.
There is no such thing as LUN without backend, at least for years.

MFC after:	1 week
2021-02-25 19:48:03 -05:00
Alexander Motin
7d4c444374 Bump CTL block backend threads from 14 to 32 per LUN.
This makes random read benchmarks look better on a wide ZFS pools.
I am not sure where the original value goes from, but it is there
for too long now.

MFC after:	1 week
2021-02-23 11:03:32 -05:00
Alexander Motin
c02a28754b Fix build after 2c7dc6bae9.
MFC after:	1 month
2021-02-21 17:21:14 -05:00
Alexander Motin
2c7dc6bae9 Refactor CTL datamove KPI.
- Make frontends call unified CTL core method ctl_datamove_done()
to report move completion.  It allows to reduce code duplication
in differerent backends by accounting DMA time in common code.
 - Add to ctl_datamove_done() and be_move_done() callback samethr
argument, reporting whether the callback is called in the same
context as ctl_datamove().  It allows for some cases like iSCSI
write with immediate data or camsim frontend write save one context
switch, since we know that the context is sleepable.
 - Remove data_move_done() methods from struct ctl_backend_driver,
unused since forever.

MFC after:	 1 month
2021-02-21 16:52:33 -05:00
Alexander Motin
05d882b780 Microoptimize CTL I/O queues.
Switch OOA queue from TAILQ to LIST and change its direction, so that
we traverse it forward, not backward.  There is only one place where
we really need other direction, and it is not critical.

Use STAILQ_REMOVE_HEAD() instead of STAILQ_REMOVE() in backends.

Replace few impossible conditions with assertions.

MFC after:	1 month
2021-02-19 15:49:36 -05:00
Alexander Motin
812c9f48a2 Save context switch per I/O for iSCSI and IOCTL frontends.
Introduce new CTL core KPI ctl_run(), preprocessing I/Os in the caller
context instead of scheduling another thread just for that.  This call
may sleep, that is not acceptable for some frontends like the original
CAM/FC one, but iSCSI already has separate sleepable per-connection RX
threads, and another thread scheduling is mostly just a waste of time.
IOCTL frontend actually waits for the I/O completion in the caller
thread, so the use of another thread for this has even less sense.

With this change I can measure ~5% IOPS improvement on 4KB iSCSI I/Os
to ZFS.

MFC after:	1 month
2021-02-18 22:29:38 -05:00
Alexander Motin
c67a2909a6 Move XPT_IMMEDIATE_NOTIFY handling out of periph lock.
It is a rare, but still better to not have lock dependencies.

MFC after:	1 month
2021-02-18 16:31:38 -05:00
Alexander Motin
b31dae0caa Exclude reserved iSCSI Target Transfer Tag.
RFC 7143 (11.7.4):
   The Target Transfer Tag values are not specified by this protocol,
   except that the value 0xffffffff is reserved and means that the
   Target Transfer Tag is not supplied.

MFC after:	1 month
2021-01-24 13:58:29 -05:00
Konstantin Belousov
cd85379104 Make MAXPHYS tunable. Bump MAXPHYS to 1M.
Replace MAXPHYS by runtime variable maxphys. It is initialized from
MAXPHYS by default, but can be also adjusted with the tunable kern.maxphys.

Make b_pages[] array in struct buf flexible.  Size b_pages[] for buffer
cache buffers exactly to atop(maxbcachebuf) (currently it is sized to
atop(MAXPHYS)), and b_pages[] for pbufs is sized to atop(maxphys) + 1.
The +1 for pbufs allow several pbuf consumers, among them vmapbuf(),
to use unaligned buffers still sized to maxphys, esp. when such
buffers come from userspace (*).  Overall, we save significant amount
of otherwise wasted memory in b_pages[] for buffer cache buffers,
while bumping MAXPHYS to desired high value.

Eliminate all direct uses of the MAXPHYS constant in kernel and driver
sources, except a place which initialize maxphys.  Some random (and
arguably weird) uses of MAXPHYS, e.g. in linuxolator, are converted
straight.  Some drivers, which use MAXPHYS to size embeded structures,
get private MAXPHYS-like constant; their convertion is out of scope
for this work.

Changes to cam/, dev/ahci, dev/ata, dev/mpr, dev/mpt, dev/mvs,
dev/siis, where either submitted by, or based on changes by mav.

Suggested by: mav (*)
Reviewed by:	imp, mav, imp, mckusick, scottl (intermediate versions)
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D27225
2020-11-28 12:12:51 +00:00
Alexander Motin
8054320e07 Make CTL nicer to increased MAXPHYS.
Before this CTL always allocated MAXPHYS-sized buffers, even for 4KB I/O,
that is even more overkill for MAXPHYS of 1MB.  This change limits maximum
allocation to 512KB if MAXPHYS is bigger, plus if one is above 128KB, adds
new 128KB UMA zone for smaller I/Os.  The patch factors out alloc/free,
so later we could make it use more zones or malloc() if we'd like.

MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2020-11-11 21:59:39 +00:00
Edward Tomasz Napierala
bce7ee9d41 Drop "All rights reserved" from all my stuff. This includes
Foundation copyrights, approved by emaste@.  It does not include
files which carry other people's copyrights; if you're one
of those people, feel free to make similar change.

Reviewed by:	emaste, imp, gbe (manpages)
Differential Revision:	https://reviews.freebsd.org/D26980
2020-10-28 13:46:11 +00:00
Alexander Motin
8836496815 Introduce support of SCSI Command Priority.
SAM-3 specification introduced concept of Task Priority, that was renamed
to Command Priority in SAM-4, and supported by all modern SCSI transports.
It provides 15 levels of relative priorities: 1 - highest, 15 - lowest and
0 - default.  SAT specification for SATA devices translates priorities 1-3
into NCQ high priority.

This change adds new "priority" field into empty spots of struct ccb_scsiio
and struct ccb_accept_tio of CAM and struct ctl_scsiio of CTL.  Respective
support is added into iscsi(4), isp(4), mpr(4), mps(4) and ocs_fc(4) drivers
for both initiator and where applicable target roles.  Minimal support was
added to CTL to receive the priority value from different frontends, pass it
between HA controllers and report in few places.

This patch does not add consumers of this functionality, so nothing should
really change yet, since the field is still set to 0 (default) on initiator
and not actively used on target.  Those are to be implemented separately.

I've confirmed priority working on WD Red SATA disks connected via mpr(4)
and properly transferred to CTL target via iscsi(4), isp(4) and ocs_fc(4).

While there, added missing tag_action support to ocs_fc(4) initiator role.

MFC after:	1 month
Relnotes:	yes
Sponsored by:	iXsystems, Inc.
2020-10-25 19:34:02 +00:00
Mateusz Guzik
27dcd3d90b cam: clean up empty lines in .c and .h files 2020-09-01 22:13:48 +00:00
Alexander Motin
7758c80f74 Fix CTL ioctl port creation error handling.
Submitted by:	Bret Ketchum
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D26143
2020-08-21 20:10:29 +00:00
Mateusz Guzik
7ad2a82da2 vfs: drop the error parameter from vn_isdisk, introduce vn_isdisk_error
Most consumers pass NULL.
2020-08-19 02:51:17 +00:00
Alexander Motin
8bdf81e4d1 Add CTL support for REPORT IDENTIFYING INFORMATION command.
It allows to report to initiator LU identifying information, preset via
"ident_info" and "text_ident_info" options.

Unfortunately it is impossible to implement SET IDENTIFYING INFORMATION,
since we have no persistent storage it requires, so the information is
read-only for initiator and has to be set out-of-band.

MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2020-08-06 19:16:11 +00:00
Alexander Motin
9a4510ac32 Implement zero-copy iSCSI target transmission/read.
Add ICL_NOCOPY flag to icl_pdu_append_data(), specifying that the method
can just reference the data buffer instead of immediately copying it.

Extend the offload KPI with optional PDU queue method, allowing to specify
completion callback, called when all the data referenced by above has been
transferred and won't be accessed any more (the buffers can be freed).

Implement the above functionality in software iSCSI driver using mbufs
with external storage and reference counter.  Note that some NICs (ixl(4))
may keep the mbuf in TX queue for a long time, so CTL has to be ready.

Add optional method to struct ctl_scsiio for buffer reference counting.
Implement it for CTL block backend, allowing to delay free of the struct
ctl_be_block_io and memory it references as needed.  In first reincarnation
of the patch I tried to delay whole I/O as it is done for FibreChannel,
that was cleaner, but due to the above callback delays I had to rewrite
it this way to not leave LUN referenced potentially for hours or more.

All together on sequential read from ZFS ARC this saves about 30% of CPU
time and memory bandwidth by avoiding one of 3 memory copies (the other
two are from ZFS ARC to DMU cache and then from DMU cache to CTL buffers).
On tests with 2x Xeon Silver 4114 this allows to reach full line rate of
100GigE NIC.  Tests with Gold CPUs and two 100GigE NICs are stil TBD,
but expectations to saturate them are pretty high. ;)

Discussed with:	Chelsio
Sponsored by:	iXsystems, Inc.
2020-06-08 20:53:57 +00:00
Alexander Motin
ec18cf79e6 Remove session locking from cfiscsi_pdu_update_cmdsn().
cs_cmdsn can be incremented with single atomic.  expcmdsn/maxcmdsn set in
cfiscsi_pdu_prepare() based on cs_cmdsn are not required to be updated
synchronously, only monotonically, that is achieved with lock there.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2020-05-29 17:52:20 +00:00
Alexander Motin
dbcf7598b0 Report STATUS_QUEUED/SENT in ctladm dumpooa output.
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2020-05-29 13:07:52 +00:00
Alexander Motin
353c460050 Move EXPDATASN/R2TSN from PDU to CTL_PRIV_FRONTEND.
We any way have per-I/O space in CTL_PRIV_FRONTEND, while for PDU private
fields I have better use ideas.  Plus to me such use of PDU fields looked
a layering violation.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2020-05-29 02:32:48 +00:00
Alexander Motin
30a31f6c71 Remove PDU_TOTAL_TRANSFER_LEN() macro.
I don't see a point to copy io->scsiio.kern_total_len into the request
PDU private field.  The io is going to stay with us till the end, and
kern_total_len field is not changed after being first initialized.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2020-05-28 23:55:46 +00:00
Alexander Motin
767300e87a Make struct ctl_be_lun first element of struct ctl_be_*_lun.
It allows to remove some extra pointer dereferences and slightly tightens
up the code by unification.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2020-05-28 21:30:29 +00:00
Alexander Motin
0d7fed74c7 Remove ctl_free_beio() LUN and ctl_io dependencies.
This slightly simplifies the code, plus may be a ground for asynchronous
buffer free.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2020-05-28 18:12:05 +00:00
Alexander Motin
2bbcd07e39 Properly check kern_sg_entries for S/G list.
ctl_data_print() is called in core context, so does not even know meaning
of ext_sg_entries.

MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2020-05-26 19:09:19 +00:00
Alexander Motin
3873b14991 Fix fallout of r319722 in CTL HA.
ha_lso is a listening socket (unless bind() has failed), so should use
solisten_upcall_set(NULL, NULL), not soupcall_clear().

MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2020-05-26 15:08:35 +00:00
Alexander Motin
fd10265cd2 Do not remove upcall if we haven't yet.
This fixes assertion if we failed to bind listening HA socket.

MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2020-05-26 13:57:14 +00:00
Alexander Motin
8a1cd3cee3 Add session locking in cfiscsi_ioctl_handoff().
While there, remove ifdef around cs_target check in cfiscsi_ioctl_list().
I am not sure why this ifdef was added, but without this check code will
crash below on NULL dereference.

Submitted by:	Aleksandr Fedorov <aleksandr.fedorov@itglobal.com>
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D24587
2020-05-03 16:14:55 +00:00
Alexander Motin
34144c2c71 Cleanup LUN addition/removal.
- Make ctl_add_lun() synchronous.  Asynchronous addition was used by
Copan's proprietary code long ago and never for upstream FreeBSD.
 - Move LUN enable/disable calls from backends to CTL core.
 - Serialize LUN modification and partially removal to avoid double frees.
 - Slightly unify backends code.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2020-05-02 16:54:59 +00:00
Alexander Motin
efeedddcb5 Fix panic on kern.cam.ctl.ha_role change after r333446.
MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2020-04-07 03:19:00 +00:00
Ed Maste
3709674072 sys/cam: remove doubled ;s 2020-03-20 16:15:45 +00:00
Warner Losh
9ac30e0b66 Remove unused cam ccb flags
These flags have been unused for some time. Some of them were in the
CAM2 specification, but CAM has moved on a bit from that. Some were
used in the old Pluto VideoSpace (and AirSpace) systems which had the
video playback I/O scheduler in userspace, but have been unused since
then.

Reviewed by: chuck, ken
Differential Revision:  https://reviews.freebsd.org/D24008
2020-03-10 23:58:41 +00:00
Warner Losh
51447e4962 Remove pre-FreeBSD 11 compat code. 2020-03-01 23:01:47 +00:00
Pawel Biernacki
7029da5c36 Mark more nodes as CTLFLAG_MPSAFE or CTLFLAG_NEEDGIANT (17 of many)
r357614 added CTLFLAG_NEEDGIANT to make it easier to find nodes that are
still not MPSAFE (or already are but aren’t properly marked).
Use it in preparation for a general review of all nodes.

This is non-functional change that adds annotations to SYSCTL_NODE and
SYSCTL_PROC nodes using one of the soon-to-be-required flags.

Mark all obvious cases as MPSAFE.  All entries that haven't been marked
as MPSAFE before are by default marked as NEEDGIANT

Approved by:	kib (mentor, blanket)
Commented by:	kib, gallatin, melifaro
Differential Revision:	https://reviews.freebsd.org/D23718
2020-02-26 14:26:36 +00:00
Alexander Motin
12373e9519 Bind CTL backends taskqueues to the CTL process.
MFC after:	2 weeks
2020-02-08 21:59:46 +00:00
Mateusz Guzik
b249ce48ea vfs: drop the mostly unused flags argument from VOP_UNLOCK
Filesystems which want to use it in limited capacity can employ the
VOP_UNLOCK_FLAGS macro.

Reviewed by:	kib (previous version)
Differential Revision:	https://reviews.freebsd.org/D21427
2020-01-03 22:29:58 +00:00
Alexander Motin
024932aae9 Use atomic for start_count in devstat_start_transaction().
Combined with earlier nstart/nend removal it allows to remove several locks
from request path of GEOM and few other places.  It would be cool if we had
more SMP-friendly statistics, but this helps too.

Sponsored by:	iXsystems, Inc.
2019-12-30 03:13:38 +00:00
Mateusz Guzik
abd80ddb94 vfs: introduce v_irflag and make v_type smaller
The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.

v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.

Reviewed by:	kib, jeff
Differential Revision:	https://reviews.freebsd.org/D22715
2019-12-08 21:30:04 +00:00
Alexander Motin
45577133ef Remove lock from CTL camsim frontend.
CAM does not need a SIM lock for quite a while, and CTL never needed it.

MFC after:	2 weeks
2019-11-03 00:13:23 +00:00
Alexander Motin
1173e5a721 Reenable UNMAP support on ramdisks by default.
For some reason, I guess just mechanical editing, it was disable in r333446.

MFC after:	2 weeks
2019-07-27 18:07:46 +00:00