According to the analysis, the largest name size is
24 not including '\0' (NVMF_RDMA_WRITE_COMPLETE),
so change the the size of name. Also add a check
to avoid the str exceeding our defined name size.
Change-Id: Iddf2cb52a3f5358306a59fc66bb997fa8098cde0
Signed-off-by: Ziye Yang <ziye.yang@intel.com>
This avoids corner case where a buffer gets allocated on the 100th
try.
Signed-off-by: Jim Harris <james.r.harris@intel.com>
Change-Id: If65053d539d458d9a53c8850bbb4cbe4ee84f604
this patch fix the potential possibility of coredump when
we have NVMe device hot inserted.
Change-Id: Idac255f25f42b4746c2d3ae6dfc57a19b7001160
Signed-off-by: Cunyin Chang <cunyin.chang@intel.com>
This is the initial commit for "blobfs", a lightweight
filesystem built on top of the SPDK blobstore.
Also included in this patch:
1) a shim for using SPDK bdevs as the backing store for
SPDK blobstore/blobfs
2) documentation for using blobfs as the storage engine
with RocksDB
3) scripts for running a set of workloads and collecting
profiling data with RocksDB and blobfs
See doc/blobfs/getting_started.md included in this commit
for more details on blobfs, including some of the current
limitations.
Signed-off-by: Jim Harris <james.r.harris@intel.com>
Change-Id: I2a6d3d4b87236730051228ed62c0c04e04c42c73
Avoid division by zero in the event mempool cache size calculation.
Change-Id: Ic117ef2dc3a798fb0a57572f1178233e83e73849
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
It was causing segfaults and infinite looping.
Change-Id: I4c19b5d3af1ba1360250cd5f6aa573a27003409f
Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
This enables the vhost library to build on systems missing the (fairly
recent) linux/virtio_scsi.h header.
Change-Id: I680863b26961ec3cbe4ad4e575555454f6461bbf
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
If we do not do a bounds check, this can run off the end
of an array.
Change-Id: I43cc4848fca7d68218e507db20e33823f8b550e4
Signed-off-by: Ziye Yang <ziye.yang@intel.com>
Attempting to add a listen address for an unavailable transport will
fail with a better error message.
Change-Id: If4cf5b66c16dadcb6e0f0b28cea4aa510ba6a9fc
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Rather than failing silently, let the user know why the listen address
failed.
Change-Id: I41c2a51c6071ee739b282a1a39198a2887a73c4d
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
The message about the uevent socket is not a fatal error; it just means
that hotplug monitoring will not work.
Change-Id: I29f6a253e96a86420c0fde9e19135f9f1d229bb9
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
This is the initial commit for the "blobstore", a lightweight,
highly parallel, persistent, power-fail safe block allocator.
Documentation will be added in future patches.
Change-Id: I20a4daf899f1215d396f7931c3ec9a2e2bb269d0
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
The user now must choose the name for each AIO bdev. This
provides consistency for names across restarts.
Change-Id: I13ced1d02bb28c51d314512d60f739499b0c7d8d
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
Simplify code that previously needed to check for subsystem type by
factoring out the discovery controller operations into a new ops
instance.
Change-Id: Id87b498e4623451993fe779ffb765be5a6743fd9
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
No functional change, just rearranging code.
Change-Id: I28328dfefd7de269d326834c484f2c2fca4e6c1f
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
When data needs to be transferred from the controller
to the host, do a single ibv_post_send containing
both the data and the completion.
Change-Id: I072c545b31593e0e324c97ed700b42c6a4c358e1
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
This call had been reduced to a simple wrapper
around the ibv call. Delete it.
Change-Id: I42926d123db262617119a9cff77bc0d0eb1e8f31
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
These functions were only called from one place and
their functionality has been reduced to a wrapper
around the underlying ibv call. Remove them.
Change-Id: I65182012dbe6393b9d57f4191fd327bcd025a6c8
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
This keeps all SGL handling in the prep_data function.
Change-Id: I9bfeed3748c1b329288350b85aa87bd604cfce4e
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
Now that all of the SGL mappings are static,
this function just called ibv_post_recv. Delete
the function and call ibv_post_recv directly.
Change-Id: I45216170a157709249b08c4cb0ebdb1adb906049
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
This patch makes create_vhost_scsi_controller check if given file is a socket before deleting it
Change-Id: I7a37c12913b461f779732e724c85e2f7b5d67442
Signed-off-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
For an NVMe read, send the completion immediately
following the RDMA WRITE, without waiting for
the acknowledgement. RDMA is strictly ordered,
so the WRITE will arrive before the completion.
Change-Id: I7e4e01d7a02c2130b655ef90f5fdaec992d9361a
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
Except for a CONNECT capsule, always use the central data
pool for RDMA READ/WRITE operations. The in-capsule
data buffer is associated with the receive operation
while the pool data buffers are associated with the
completion, and using the in-capsule data buffer
causes a lifetime mismatch.
Change-Id: Ieb45e521d78daa7c706078a3dd5c5a146f8dc1d6
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
After commit b654e9b, this is no longer required.
Change-Id: I0cf1a7059d7fba0303aca5ad5a15afe3890b4172
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
The RDMA protocol this module uses is strictly ordered,
which means messages are delivered in exactly the order
they are sent. However, we have detected a number of
cases where the acknowledgements for those messages
arrive out of order. This patch attempts to handle
that case.
Separate the data required to post a recv from the
data required to send a response. If a recv arrives
when no response object is available, queue the
recv.
Change-Id: I2d6f2f8636b820d0c746505e5a5e3d3442ce5ba4
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
Names for the NVMe bdevs are now assigned by the user.
This means the same name will always be assigned to the
same device, even across restarts.
Change-Id: If9825ec9abcb5236b4671bc44a825e4f0d704fe3
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
remove the unnecessary rte_eal_pci_probe_one() in function
spdk_pci_device_detach(), this could cause error message when we
terminate the application, it will also not make sense try to probe one
device after we detach it, we could call spdk_pci_nvme_device_attach()
instead of spdk_pci_nvme_enumerate() when we have one given device address,
dpdk will try to scan the device and add it back to pci device list then.
Change-Id: I35f5bb412249bb20da57394f0531c10a49691906
Signed-off-by: Cunyin Chang <cunyin.chang@intel.com>
This clarifies the relation between the values assigned to sg_list and
num_sge (no functional change).
Change-Id: I8e81d47dd97a033b17cd3b813b06e4887127146c
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
All devices must be specified by BDF. Add support for scripts
to use lspci to grab the available NVMe device BDFs for the
current machine.
Change-Id: I4a53b335e3d516629f050ae1b2ab7aff8dd7f568
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
The mappings are all static, so it isn't interesting
to print them out on each I/O.
Change-Id: I85301b4518d4523a7c031f6ca9ff678d91428504
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
This allows pipelining of READ/WRITE with completion.
Change-Id: Ib3ab5bffb8e3e5de8cbae7a3b2fff7d9f6646d2d
Signed-off-by: Ziye Yang <ziye.yang@intel.com>
This allows static initialization of the scatter
gather list as well as future optimizations
around pipelining commands with data.
Change-Id: I8af8f3e3425610bc720677c9bc84f163cfb6278a
Signed-off-by: Ziye Yang <ziye.yang@intel.com>
The first version of the Linux kernel NVMe-oF initiator had
a bug when reporting queue size where it was off by 1. We
had a workaround to deal with this. Now that the kernel
has been fixed, remove the workaround.
Change-Id: I0ad4a5c6db68cfa9683ab93e6f5210772c713b55
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
Move claimed flag to struct spdk_scsi_lun and remove RPC call that allow
SCSI LUN to be deleted by user.
Change-Id: I0fe57d33ab017816ab4799bce259807735e0c783
Signed-off-by: Pawel Wodkowski <pawelx.wodkowski@intel.com>
Register all spdk_malloc() memory regions as ibv_mr in a spdk_mem_map
so we can look up the RDMA key for the user's buffer and pass it in the SGL
directly, rather than copying through a pre-registered bounce buffer.
Change-Id: I7340bc2020b5256750c95dbd24ba67961404e5e7
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
The extended LBA format flag should be initialized after namespace
capability flag.
Change-Id: Iad479b454bb4e31120c17d40ae23937a099c6f8f
Signed-off-by: Changpeng Liu <changpeng.liu@intel.com>
Change SCSI device configuration format from "DevX LUN0" to "Dev X LUN0"
This allow checking configuration against silly errors when device
number is out of range.
Also assert exactly only one LUN is given.
Change-Id: Idccd6878119282fc51947b092bdda7ae06aa94ad
Signed-off-by: Pawel Wodkowski <pawelx.wodkowski@intel.com>
The send completions must be processed prior to the
recv completions. However, if the completion queues
are separate this leaves a small window where
a send+recv completion arrive between polling
the send_cq and the recv_cq, resulting in the code
seeing the recv completion prior to the send
completion.
By combining the completion queues, this eliminates
any potential gap. The send completion will always
be processed before the recv completion.
Change-Id: I06bfef6af48559d0b9e00524ebc10f1a102e7387
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
The sq_head handling is already done in
spdk_nvmf_rdma_request_send_completion, so do not need to
do again.
Change-Id: I527ff8adfcbdf43ac79794cb5c7777c0e8ef6973
Signed-off-by: Ziye Yang <ziye.yang@intel.com>
The env layer already understands that shm_id < 0 means that
multi-process is not enabled. Leave shm_id defaulted to -1 so that
other code can detect when it is not set.
Change-Id: Ifd1667598d55c216f95f13561dc2a550677db5f4
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
These options are only necessary for applications that intend to be used
in a multi-process configuration.
Change-Id: I3e1fa0682611d92267d0ad1b3f2016dc926b96b6
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Previously, if the maximum number of virtual namespaces had already been
reached, adding a bdev to a subsystem would claim it without actually
adding it to the ns_list array.
Change-Id: Iab68ad1a75748c0e88232240185695aac08d71d2
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
They are not used outside of their respective files.
Change-Id: I754834e7354caec877cd2fe193e56854e5a34e20
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
This patch fix the issue when large IO failed:
when we handle the read command which need split, we need make
sure all the subtasks to be handled if one of the subtask failed,
this will make sure the command have chance return back to initiator.
Change-Id: I0c01e1a34c6179fce37ab52c8121268b6ee31102
Signed-off-by: Cunyin Chang <cunyin.chang@intel.com>
The actual uses of intrinsics are already guarded by feature-specific
ifdefs in nvme_pcie_copy_command(), but the header itself should also
only be included when it will actually be needed.
Change-Id: Ife65d6432b8dfd9d9db80fe4e385ab76491874c0
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
SPDK_COUNTOF works like sizeof, except it returns the number of elements
in an array instead of the number of bytes.
Change-Id: I38ff4dd3485ed9b630cc5660ff84851d0031911f
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
This patch adds a library, application and test scripts for extending
SPDK to present virtio-scsi controllers to QEMU-based VMs and
process I/O submitted to devices attached to those controllers.
This functionality is dependent on QEMU patches to enable
vhost-scsi in userspace - those patches are currently working their
way through the QEMU mailing list, but temporary patches to enable
this functionality in QEMU will be made available shortly through the
SPDK github repository.
Signed-off-by: Jim Harris <james.r.harris@intel.com>
Signed-off-by: Krzysztof Jakimiak <krzysztof.jakimiak@intel.com>
Signed-off-by: Michal Kosciowski <michal.kosciowski@intel.com>
Signed-off-by: Karol Latecki <karolx.latecki@intel.com>
Signed-off-by: Piotr Pelplinski <piotr.pelplinski@intel.com>
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Signed-off-by: Pawel Wodkowski <pawelx.wodkowski@intel.com>
Signed-off-by: Tomasz Zawadzki <tomasz.zawadzki@intel.com>
Signed-off-by: Krzysztof Jakimiak <krzysztof.jakimiak@intel.com>
Change-Id: I138e4021f0ac4b1cd9a6e4041783cdf06e6f0efb
This avoids registering PMDs that are not used by a given
application. For example, an app may wish to *not* use
ioat - in this case, ioat PMD would not be registered with
DPDK, and we would not waste time probing these devices
when probing other devices like NVMe.
Signed-off-by: Jim Harris <james.r.harris@intel.com>
Change-Id: If378e40bde9057c7808603aa1918bcfe80fa0e9d
These allocations need to be from memory registered with the SPDK env
library to allow future work on automatic ibverbs memory registration.
Change-Id: I6ec6999ecd6d6bf6ba4ab159630f7d01f3d46154
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
It is no longer used now that AER handling holds the request until it is
triggerred.
Change-Id: I71a75e86f82bc06f677cf26defa701e60b9aa1bd
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
This will allow returning a different default value per mem map.
Change-Id: I94d3de197acfb2e6ad40092ab0588ba4e951af80
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Add a top-level structure that can be reused for other kinds of memory
address translations.
Change-Id: I046f98b76b4e98087d90095d6e9dea5cd6ab7898
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
This patch make the function spdk_scsi_task_process_null_lun() as public and
finish the task immediately once we get task in iscsi layer.
Change-Id: I4ada027d3a324dce8ef0d0f7706dbc14184ead96
Signed-off-by: Cunyin Chang <cunyin.chang@intel.com>
After checking the code, aerl in our session is 0,
so there will be only 1 AER. So currently,
we will only handle 1 AER case.
When the AER event is triggered by real NVMe device owned
by the subsystem, it notifies all sessions belonging to
the subsystem.
Change-Id: Ia80fb0f03e893c20d8dd14afbed8db10db38301c
Signed-off-by: Ziye Yang <ziye.yang@intel.com>
By the time the module is unloaded, the reactors
have already stopped. That means the event will never
actually fire. Simply remove it.
Change-Id: I4fe371ae7a679d51254d0267fbbbf74c3e9cf477
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
The wr_id should never be NULL - it will always correspond to a request
we previously posted. Convert the check to an assert() so we notice if
this ever happens (which would indicate a programming error somewhere
else).
While we're here, add a more robust check to make sure the request is
actually in the correct array of requests for the connection being
polled (also in an assert, since this should never fail in normal
execution).
Change-Id: I855763d7d827fb8cf00a775c7bc2ccb579db8d0f
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Kernel nvmf host always tries to connect nvmf target
when we does not issue nvme disconnect command. Thus,
we face rdma_create_qp issue, the reason is that we call
rdma_listen too early, and the event retrieved from
rdma_cm_get_event is too late.
And this patch solves this issue.
Change-Id: I153a8aea7420a86a236301dad9bd54af97f60865
Signed-off-by: Ziye Yang <ziye.yang@intel.com>
The function spdk_iscsi_transfer_in will handle the task if the
status is not SPDK_SCSI_STATUS_GOOD.
Change-Id: I61155ffa056b3eac551f215d50e1808e5389fdb5
Signed-off-by: cunyinch <cunyin.chang@intel.com>
Calling spdk_nvmf_request_complete to complete spdk_nvmf_request
causes some fields in completion queue entry not set correctly.
Calling spdk_nvmf_request_complete fixes the problem.
This can be used for issuing an abort for the timed-out command.
Change-Id: I3c5727fdddc156cd7c8f99afbc3e6da8e73bba56
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Make sure the compiler arranges the fast path as the fallthrough case by
annotating the checks in spdk_vtophys().
Change-Id: If0fc3149297131894b5c7a94bff31bf8ee40326e
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Now that all DPDK memory is registered at startup, spdk_vtophys() never
needs to add new translations to the vtophys map. This means that any
lookup that fails to find an allocated map_1gb will always return
SPDK_VTOPHYS_ERROR rather than trying to allocate it and then failing
the lookup anyway.
Change-Id: I7e6f7af183199651f5808a17810a17970b0e3331
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
vtophys_get_paddr() and vtophys_get_dpdk_paddr() are doing similar
things; combine them into one function that works for all DPDK
memory addresses.
Part of the vtophys test is temporarily disabled until the next commit,
which will register all DPDK memory at startup and stop lookiing up
addresses at runtime.
Change-Id: I91312837aa1e6170bacaf3b0d2adbdc4391d3afa
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
This just moves the lookup of the physical address up one level - now
_spdk_vtophys_register_one() is only responsible for filling out the
mapping table, not looking up the translation.
Change-Id: I9fd5b85da623e403fda0563b6bdebd4aaaf42864
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Rather than storing the page frame number, just store the full physical
address of each 2 MB page. This simplifies the lookup code and makes
the map generic (values are inserted and retrieved without any
modification) for future uses.
Change-Id: Ib1081513a0682f6b8b908f3401c00d87b00f484c
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
No need to build a whitelist and scan anymore - the NVMe
driver can directly attach to a specified device.
Change-Id: Ie60c09b6ab37a7f068c496f0cad53bfdc8617349
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
Move the ibv_recv_wr initialization in
nvme_rdma_alloc_rsps. Thus we can save some
CPU times
Change-Id: Id449b2684290431f8b3ba97ec4058171d34038bf
Signed-off-by: Ziye Yang <ziye.yang@intel.com>
We do not need to set it for submission since the contents
are same
Change-Id: I345094e2e8a858b318be73d28f09393566587d95
Signed-off-by: Ziye Yang <ziye.yang@intel.com>
When performed limitation iSCSI tests, 128 target nodes with 1
connection for each target node, for IO bigger than 256KiB iSCSI
target will report run of out task pool issue sometimes. When
all the iSCSI parameters with default values, each connection
will consume maximum 189 tasks, we hardcoded the task pool with
16384, so 189 * 128 connection will exceed 16384. Increase the
default number from 16384 to 32768 will fix the issue.
With 1MiB block size and queue depth with 128 for each connection,
there will be 64 outstanding iSCSI commands in the iSCSI target,
for Writes, the maximum R2T number is 4, so the maximum tasks for
the 4 R2T is (1 + 16) * 4 = 68, 8KiB for the first burst task, 16
for the data segment. For Reads, the maximum 64 data in segment can
be used as 4 iSCSI Read commands. The rest 56 iSCSI commands will
cost 56 tasks, so the total number is 56 + 64 + 68 = 188, 1 additional
task for NOP task.
Change-Id: I945871cbe3076139f08c2ef647af2d9c84601dcb
Signed-off-by: Changpeng Liu <changpeng.liu@intel.com>
This is consistent with the rest of the RPC calls that report a number
of blocks, and it matches the field in the split_disk structure.
Change-Id: Ie25534617112d65979c317fe13e05a6c32520a15
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
The driver_specific object should contain a single object with the
blockdev driver's name so that the user can determine how to interpret
it. This matches the NVMe blockdev driver.
Change-Id: I434b910a95dd527363af78469dc900e9d19ec12e
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Now that namespace splitting support has been removed from the NVMe bdev
in commit efccac8 ("bdev/nvme: remove NvmeLunsPerNs and LunSizeInMB"),
the block_size and total_size fields in the NVMe bdev's driver_specific
config data are redundant. The generic get_bdevs num_blocks and
block_size fields provide the same information.
Change-Id: I080d2017d608716a593bb553ee667e9c4017ffb7
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Move cb_arg to the first argument to match the other NVMe callback
function signatures.
Change-Id: I4e699c8071dcb7ba4ce3cdb82ee985600208204c
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Whether a nvme command having data transfer cannot be completely
determined by command opcode. For set features command, some features
don't require data transfer.
Change spdk_nvmf_request_prep_data to fix this issue.
This has been reported for a number of different device
types. We suspect these devices are technically out of
spec, but they work with most other available NVMe
drivers on accident.
Change-Id: I529cfc03fc314cbab2a1cd40620bf1dd5b54182d
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
Reason: the 4 fields of struct ibv_recv_wr is already
set in the following 4 lines.
Change-Id: I97437ee2e4c6e944154813bb48b1740b182220df
Signed-off-by: Ziye Yang <ziye.yang@intel.com>
The responsibility for writing the pid file should lie with the init
system, not the application itself.
This was also broken by the recent instance ID/shared memory ID rework;
the pid file was named based on the pid, making it fairly worthless.
Change-Id: Ifb4f2d3ce5cf132f2c2e8bd3d0ba605ff8ccd8fe
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
This was added by mistake in commit 18d26e42a3 ("env: Move DPDK
intialization into the env library."). It is always dead code, because
shm_id is set to getpid() right above it, and it will never be -1.
Change-Id: I19c798a87bf7a3b12547d772b981b038857abcaa
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
This patch makes spdk_scsi_lun_construct behave as documented.
spdk_scsi_lun_construct will return only newly created LUN.
If LUN with that name already exists, NULL will be returned.
Unit test relevant to this behaviour is now changed to show
this functionality is now working.
Signed-off-by: Tomasz Zawadzki <tomasz.zawadzki@intel.com>
Change-Id: I002903d6f96555c638aba3fa99cc2c2504ced603
This is necessary to prevent claiming the same LUN twice
and properly cleanup in case of an error during spdk_scsi_dev_construct.
This patch addresses three issues:
- spdk_scsi_lun_claim error is correctly handled in spdk_scsi_dev_add_lun
- on error when constructing scsi dev, it is now correctly removed along with attached luns
- spdk_scsi_dev_destruct not only unclaims, but calls spdk_scsi_lun_destruct on each lun in dev
Unit tests relevant to this behaviour are changed to show this functionality is now working.
Signed-off-by: Tomasz Zawadzki <tomasz.zawadzki@intel.com>
Change-Id: I111c320f875e5003e3f1f7748a2630097301ce1b
This patch adds two new unit tests for scsi device:
- creating two different devices, each containing the same lun
- creating one device, with the same lun twice
As noted in code, three asserts are incorrectly set to show functionality
that is not working currently.
Next patch in series implements that functionality and changes asserts
in the unit tests.
Signed-off-by: Tomasz Zawadzki <tomasz.zawadzki@intel.com>
Change-Id: I2645401fee4f2cd986458e0a4db108ce4e1bf9db
By default, all SPDK applications will not share memory.
To share memory, start the applications with the same
shared memory id.
Change-Id: Ib6180369ef0ed12d05983a21d7943e467402b21a
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
Funnel all of the return paths in the main parsing routine through the
code that sets the *end pointer so that all error cases set it.
Change-Id: I0565913f7b9488470ede79dc1af84eb4b9a03225
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
The direct and virtual mode code is identical; move it to session.c like
the other virtualized get/set features.
Change-Id: I0a0e2dd795197c142ad5d9d0e4ddedb2aa5c8c2a
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Even for direct mode, each session should use its own
async event configuration like virtual mode instead of
passthrough.
Change-Id: I9c1175f3677c672c0cad684341b8a46a575d753e
Signed-off-by: Ziye Yang <ziye.yang@intel.com>
Instance ID is too overloaded and the uses are beginning
to conflict. Separate the RPC configuration out.
Change-Id: I712731130339fee4fc8de4dc2d0fea7040773c58
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
The host and port output parameters point into the (non-const) char *ip,
so it makes more sense for them to be non-const as well.
This allows the flexibility to pass non-const char pointers as the
output parameters, which will be used in the nvmf_tgt/conf.c parsing
code.
Change-Id: I1d5b102fc389c06d36432904e4fda944437b659e
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
For namespaces with end-to-end protection information, metadata size
of exactly 8 bytes, and extended LBA configured, the NVMe driver would
calculate the size of the data block incorrectly. The NVMe spec has a
special provision for this specific case (8-byte metadata only) and
PRACT = 1 that requires that the host does not send the metadata as part
of the host memory buffer.
To fix this, clean up the calculation of the per-block data transfer
size by adding a new extended_lba_size field in the namespace, which
represents the total size of data to be transferred per block based on
the namespace's configured metadata size and whether it transfers
metadata as part of the data buffer. Then add the special case for
PRACT = 1 and PI configured and extended LBA in the R/W helper
functions.
Change-Id: I0b383a58c773cac06e6c018858b57129064c6059
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
These were repeated a few different places, so pull them into a common
header file.
Change-Id: Id807fa2cfec0de2e0363aeb081510fb801781985
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
This removes one addition from the submission path (negligible, but a
nice side effect), but also opens up the possibility of reporting the
total time an I/O took - since we are always tracking the submission
time anyway, there is no extra cost to report it in the completion
callback.
Change-Id: I7129e7c09d20da8082042a7622d045846461dd9c
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
The phenoemon is that we can not shutdown the nvmf tgt.
The solution is that we need to adjust the shutting down orders of
nvmf tgt subsystem and rdma trasport layer.
Change-Id: Ie39657370b1574960e0ee7cf604cc5872db0bed3
Signed-off-by: Ziye Yang <ziye.yang@intel.com>
This function parses in place by inserting null terminators.
Change-Id: I61cb97b87ec05d0183fbaa993fd3d7580a188bde
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
Split the ref_count field of the bottom level of the vtophys map tree to
a separate array so that the pfn_2mb field can be expanded to a full 64
bits again. This doesn't change behavior for the current use as a page
frame number; it is setup work for storing an arbitrary 64-bit pointer
value in the bottom level.
Change-Id: I0bc44df3edc9df4a479229d69c2f3884d43a340d
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Since the io_channel will be passed to the underlying bdev's
read/write/... functions later, we need to also acquire an io_channel
for the underlying bdev, not for the virtual bdev.
Change-Id: Ica13076973fef875ea636770fce8eb27017aa1c3
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
These strings are not modified by the functions they are passed to, so
they can be const char *.
Change-Id: I11532f232990a305d706c14aac1b0f8f93b8f576
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
For infinite timeout states, instead of printing UINT64_MAX as a
decimal number, interpret it as "no timeout" instead.
Change-Id: I579f5857f96286734940ab5f493261e60354c4fe
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
The PCIe transport initializes the quirks directly, so the generic hook
to get PCI ID is no longer necessary. This path was dead code.
Change-Id: I25bdaa598db53e4312a264d9d8356d1b416696e5
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
The logic to fail queue pairs when the controller is failed should be
handled in the generic code, not in the individual transports.
This also allows nvme_qpair_fail() to be private to nvme_qpair.c.
Change-Id: I6194576dceb35073b9af8847e59314900028637c
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Avoid a runtime check for the rte_ring type - we know that the event
ring is multi-producer/single-consumer at compile time.
Change-Id: I5d42aee9c635db86e545b661361a68818d80961d
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Avoid accessing the internals of the bdev_io from outside of the bdev
library.
Change-Id: I01dfc38b2520353ad42bcd8587b90f197eadf101
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
This is more CPU efficient than only grabbing one
completion per call to ibv_poll_cq.
Change-Id: I0c70d33639f0f345482d9e7c810f9c6723937058
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
This virtual block device takes an underlying block device and splits it
into several smaller equal-sized block devices.
Change-Id: I6f6e686c1177b2e4885f7e88809ad329caae55bd
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
These were only intended for testing and should be replaced by a virtual
blockdev that can be layered on top of any kind of bdev.
Change-Id: I3ba2cc94630a6c6748d96e3401fee05aaabe20e0
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
The work item queueing code was replaced with the current reactor/event
model, but the block comment above _spdk_reactor_run() wasn't updated to
match. Replace the pseudo-code with something resembling the current
behavior, and delete the outdated paragraph below it.
Change-Id: If0686c6a5d063f56d8ea3df9bf3a1e98eef40207
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
When we do frequent same subsystem add/delete,
we will face the adding issue. For example,
1 Add subsystem A
2 Delete subsystem A
3 Add subsystem A (Fail in this step).
The reason is that we did not correctly free
the listener resources of subsystems, and this patch
can solve this issue.
Change-Id: I6765a306a3f10c9a0f38c95dbba12e2a4073e705
Signed-off-by: Ziye Yang <ziye.yang@intel.com>
Detect whether the specified DPDK directory contains static or shared
libraries, and use the appropriate extension when building the library
list. Static libraries are still preferred.
Change-Id: I78c68fd38fba1ea42dd605fb77209651f8cdca75
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
The $(ENV_LIBS) variable was including system library linker arguments
like '-ldl', but $(ENV_LIBS) is intended to be used as a dependency for
other Makefile targets, and those arguments don't belong there.
Add the system library linker arguments to ENV_LINKER_ARGS instead.
Change-Id: I247264d287047f1423365806042982b492eec311
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
In our previous code, we did not ack the event in
exceptional cases when we get a event via rdma_get_cm_event.
Thus, the code may block with in this statement:
rdma_destroy_id(rqpair->cm_id);
in some exceptiaonal cases. And this patch will solve this
issue.
Change-Id: Iddb6fb5356a5ee0ed04e261a040ba53042fca302
Signed-off-by: Ziye Yang <ziye.yang@intel.com>
This make sure the qpair failure could be started from upper level application.
Change-Id: I7e04fe36929cc634ddf0078db96fbc40afb38f8c
Signed-off-by: Cunyin Chang <cunyin.chang@intel.com>
Instead, check them every 5 iterations by default.
Change-Id: I9c42922868f8e965a0c801109e59e06aff5adf62
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
Simplify and remove a direct call to a DPDK function.
Change-Id: I08eaf86a48df67e3248eeaa764ae924b784d9277
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Store each reactor's per-socket event mempool in the spdk_reactor
structure to avoid calling rte_lcore_to_socket_id() on every iteration,
and make the function definition an internal, inlineable version
that takes the reactor pointer directly.
Change-Id: I841f7d7594308d7c572f5b7f609913c428bd13d7
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Minimize the number of times spdk_get_ticks() is called
because it is expensive.
Change-Id: I2f34ca724ec28f42866b76d224dacbe1f31e7a41
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
64k sessions over the lifetime of a single target is something
that really could happen, so handle this case.
Change-Id: Iaed92b9ff6cd078fcd7c1efe88cf0c860c77c4ac
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
For iscsi read/write, expected_data_xfer_len
is 0, dxfer_dir is set to SPDK_SCSI_DIR_NONE.
But we can still have read/write op in SCSI layer.
This patch solves this issue.
Change-Id: I950e163fffb06fefaf8a913d1f6de29c96a52264
Signed-off-by: Ziye Yang <ziye.yang@intel.com>
The g_thread_mmio_ctrlr should be not NULL pointer when it enter the
handler function.
Change-Id: I45dba601c672b16e2c6feafd9059bafde0d8f1b4
Signed-off-by: Cunyin Chang <cunyin.chang@intel.com>
If namespace is formatted with per lba metadata feature and also disable end-to-end protection
feature, host couldn't use per extended-lba metadata area.
Signed-off-by: Zhihao Zhang <thomas.zzh@alibaba-inc.com>
If the user asked for a specific PCI address in spdk_nvme_probe(), we
need to return 1, not 0, for the other PCI addresses that don't match
when enumerating. 0 means to attach the PCI driver, whereas 1 means to
continue enumerating.
With the previous behavior of returning 0, all NVMe devices would be
attached to the DPDK PCI driver, even if the user did not request for
them to be probed, and further calls to spdk_nvme_probe() would not find
any devices.
Change-Id: Ifbbcd7d1abe8ab535b6957855172e66a3e69fbe4
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
This is not actually optional - it contains required
information for setting up the connection.
Change-Id: I21136de12794a0f4f5c14c5d3e2e3f2306c5c102
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
This isn't used anywhere yet, but it will be for
NVMe-oF 1.1.
Change-Id: Ieae0688e6ad5b7a44568e5760382b5716b02e6f0
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
The code doesn't actually use this property of cntlid
for anything yet, but we will need it later.
Change-Id: I5fd514d75b903cc8769e7b9f196a4624e9cf876c
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
This is necessary to process asynchronous events, as well as keep-alive
support for NVMe over Fabrics connections.
Based on a patch by Edward Yang <eyang@us.fujitsu.com>
Change-Id: I3e81f3d5061f75b12b625fa1a06629c6dc3dc61b
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
There is only a single device ID for all channels on the SKX
implementation of I/OAT.
Change-Id: I90ee79b1b673a199754f1ca4c9e38e934294e261
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
This prevents the need for bdev users and modules to manipulate the
internal bdev_io error.nvme fields.
For now, all non-NVMe error types are treated as a generic device error,
but translation from SCSI to NVMe could be added in the future.
Change-Id: I4e831b26a2f41bf2f405c7576d5019bb898d4d1b
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Currently we use the pci functions provided by DPDK,
it identifies the device by class id related
info but not by pci bdf info, so we can add the filering
by pci_addr in pcie_nvme_enum_cb function.
Change-Id: I5942e98853f00fc10fa6aae5c113517653d1b357
Signed-off-by: Ziye Yang <ziye.yang@intel.com>
Since nvme_ns_cmd.c now walks the SGL, some of the test code
needs to also be updated to initialize and return correct values
such as ctrlr->flags and sge_length.
Signed-off-by: Jim Harris <james.r.harris@intel.com>
Change-Id: I521213695def35d0897aabf57a0638a6c347632e
Convert the number parsing function into a linear sequence with a goto
label for each state, rather than a single loop with a state variable.
This makes the code easier to read and also improves speed (better
branch prediction and smaller inner loops for the common case).
On my test system, jsoncat citylots.json > /dev/null improves from
~1.7s to ~1.2s.
This changes behavior of some number parsing test cases: inputs matching
the number grammar as defined by JSON will be returned even if there is
trailing garbage, consistent with the rest of the parser. For example,
the input 01 will be parsed as a valid number 0 followed by trailing 1.
This only makes any difference when the full input is a single
number value, since if the value was nested in an object or array, the
trailing garbage will not match the expected syntax and the whole parse
will fail with SPDK_JSON_PARSE_INVALID (e.g. [00 will parse the first 0
as a number and then fail on the second 0, since only a comma or right
square bracket would be accepted).
Change-Id: Ifabfaed611219b3e0a06c8677190a28b87e8a13b
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
This improves output speed significantly, especially if the write
callback is expensive (e.g. issues a syscall or takes a lock).
On my test system, jsoncat citylots.json > /dev/null improves from
~2.8s to ~1.7s.
citylots.json: https://github.com/zemirco/sf-city-lots-json (~181 MiB)
Change-Id: I7d411ce92366712ed87ad5fc6e9b64828541db4d
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
If a blockdev module calls spdk_bdev_io_complete() within its
submit_request function, and the user's completion callback issues a new
I/O, it is possible to cause infinite recursion, consuming all available
stack space.
To avoid this, track whether a bdev_io is being processed by
submit_request, and if io_complete() is called in this case, defer the
completion via an event.
Change-Id: I6ccdb8ed4ee0d5738e6c9840d35431de52bd5fa2
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Preivously, we only supports probe the NVMf target
via discovery info, now we can support to directly
to connect it.
Change-Id: I08ce1d95de6744286357e68b48c97b773b902ac8
Signed-off-by: Ziye Yang <ziye.yang@intel.com>
I do not see any reason to ignore using this channel. If that,
we should give comments in the file, otherwise we need to add it.
Change-Id: I56ad491c67a23831befc8c761ad0a02e721a15a4
Signed-off-by: Ziye Yang <ziye.yang@intel.com>
Because of the addition of io_channel support to the bdev layer, there
is no longer a need to re-run a completed I/O through the submission
event pipeline; it can be freed directly.
Change-Id: I2b9163c87293345acf0e85f6d0c1032f30209659
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Include timer-based pollers in the active/idle check that uses
last_action to determine when a reactor last executed an action.
Change-Id: Ib8f1253675b57aeb59206d099c6257f6d07f5acf
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
One microsecond is not really long enough to detect an idle condition
where calling the OS usleep() makes sense. Increase the minimum time
spent spin-waiting on events and pollers from one microsecond to one
millisecond.
Change-Id: I678118e357330f133251f4cfada8ff27e10158a5
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
When a connection enters full-feature phase and is assigned to an lcore,
we need to increment the counter for the new lcore, not the connection's
existing lcore.
Change-Id: Idced4090b6e8ac35a767fd223fbd81ba824615d3
Signed-off-by: Cunyin Chang <cunyin.chang@intel.com>
Claim the block devices used by iSCSI LUNs and NVMe-oF subsystems so
they can't accidentally be reused.
This will also be used by virtual block devices to allow layering of
bdevs.
Change-Id: I5384923fbf24f13f4ce720a797c5a628053d49f4
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
(1) Add nvme_rdma_build_sgl_request function
(2) Merge nvme_rdma_pre/post_copy_mem to nvme_rdma_copy_mem
Change-Id: I86abab821b32b4da0aa9489a6b9f7dc430333159
Signed-off-by: Ziye Yang <ziye.yang@intel.com>
Use a plain function pointer + callback context for the bdev I/O
completion callback. This is possible now because each I/O channel will
be polled on the core that submitted the I/O.
Change-Id: I29ee8e4a3430df11c74845adab840395b9bc5010
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
An old prototype SPDK AHCI driver would return
TASK_SET_FULL if all NCQ slots were full on a given
disk. This would kick the SCSI task back to the LUN
to be retried later. Since then, we have pushed
responsibility onto the bdev modules themselves
to handle this kind of queueing/retry logic.
Removing this logic allows us to make some additional
changes that enable tasks to get completed inline without
an extra event callback to handle completion. We also
no longer need to worry about checking if pending tasks
need to be executed in the complete_task() routine, since
the execute() routine will now always exhaust the pending_task
list.
Signed-off-by: Jim Harris <james.r.harris@intel.com>
Change-Id: If2dc3ab017e0dbc225c8f627e1f87c5a8e9b1e3e
Now that the hotplug code is isolated in nvme_pcie.c, it can call the
PCIe transport attach function directly.
Change-Id: I2df3b9168473b537cc9b13367e06d3d3b6fa22be
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
The reactor structures are allocated in a contiguous array, and each
reactor is accessed from a different core, so align the reactor
structure to avoid false sharing.
Change-Id: I95162620ccb58fae060b2d95e47a38621dfbd140
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
It is private to lib/event/reactor.c and does not need to be exposed in
the global namespace.
Change-Id: Idfff0365a0afdd90a0567825d520adf61d99fd2b
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Previously, we did not calculate the ref for the LUN.
Change-Id: If2b7bc7d129e7efd994a7987ae2c421048969acb
Signed-off-by: Ziye Yang <ziye.yang@intel.com>
An SGE could be for a payload that is greater than the NVMe
devices MDTS (i.e. 128KB), but that SGE may not be aligned
on a sector-size boundary. We can safely assume that each
iov is individually physically contiguous - the DPDK
mempools for example guarantee this.
Signed-off-by: Jim Harris <james.r.harris@intel.com>
Change-Id: I8143ed01814c3154d0a06b8bbc548484437c1e88
The spdk_nvme_qpair::num_entries value is never used in the common code,
so move it to the individual transport qpairs to make it clear that it
is a transport-specific implementation detail.
Change-Id: I5c8f0de4fcd808912ba6d248cf5cee816079fd32
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
The 'next' event pointer was never used in the entire code base (always
NULL).
Change-Id: I75f999d3a2e10512d86edec1a5a46ef263e2635b
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Use 'struct spdk_event *' directly for consistency with the rest of the
API.
Change-Id: Ib41a9bf47f5b18f4aebf5f4dee055455cb12ef7d
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
This allows the elimination of the spdk_event_get_arg1() and
spdk_event_get_arg2() macros, which accessed the event structure
directly; this was preventing the event structure definition from being
moved out of the public API header.
Change-Id: I74eced799ad7df61ff0b1390c63fb533e3fae8eb
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
The public API user is supposed to retrieve the defaults via the
spdk_app_opts_init() function.
Change-Id: Ie2bd6e809b2d47dbd5d62d396e8715f89f4052d9
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
The spdk_poller_register() function provides a way to pass an event to
call once the poller is registered, but it is always NULL in the current
code base.
Change-Id: I459bf40ae4d050589577d113b7984f1563aaa9cc
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
The event->next field can be accessed directly from within the event
library implementation, and public API users should not be using it.
Change-Id: I98a1f0017e03e951d0c4eee3c7989b04324e57d1
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
It is only used within bdev.c and can be static.
Change-Id: Id6e2cd9e5dd61a3ef1e1a27993d7a5ea7728bff2
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
This is consistent with the other internal-only API headers.
Change-Id: I2c4748977d38a6c173311d26197d6273c168da7d
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
The definition of SPDK_UNREACHABLE uses the build-time DEBUG definition,
which is not available in the public API.
Change-Id: I1862c99fa5c85ccd3483f94e9c35de531da57f3c
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
Instead of passing the work completion, just pass the
response index. This keeps the work completions localized
to the polling function.
Change-Id: I0e6a1d8564200b5ac3aa43dfd58ae152d439bbd8
Signed-off-by: Ben Walker <benjamin.walker@intel.com>