doc: Add the SPDK NVMe-oF Target Programming Guide
Add the SPDK NVMe-oF target programming guide based on the most recent
implementation of the SPDK NVMe-oF library.

Change-Id: Idcd5a5ba9a2e4e04392adeb6230d4b18e98dd8e5
Signed-off-by: GangCao <gang.cao@intel.com>
Reviewed-on: https://review.gerrithub.io/393631
Tested-by: SPDK Automated Test System <sys_sgsw@intel.com>
Reviewed-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-by: Dariusz Stojaczyk <dariuszx.stojaczyk@intel.com>
Reviewed-by: Daniel Verkamp <daniel.verkamp@intel.com>
Reviewed-by: John Kariuki <John.K.Kariuki@intel.com>
@@ -802,6 +802,7 @@ INPUT = ../include/spdk \
                          nvme.md \
                          nvme-cli.md \
                          nvmf.md \
+                         nvmf_tgt_pg.md \
                          ssd_internals.md \
                          userspace.md \
                          vagrant.md \
@@ -31,6 +31,7 @@
 - @ref bdev_module
 - @ref directory_structure
 - [Public API header files](files.html)
+- @ref nvmf_tgt_pg
 - @ref event
 - @ref blob
doc/nvmf_tgt_pg.md (new file, 204 lines)
@@ -0,0 +1,204 @@
# NVMe over Fabrics Target Programming Guide {#nvmf_tgt_pg}

## Target Audience

This programming guide is intended for developers authoring applications that
use the SPDK NVMe-oF target library (`lib/nvmf`). It provides background
context, architectural insight, and design recommendations. This guide does
not cover how to use the SPDK NVMe-oF target application; for a guide on how
to use the existing application as-is, see @ref nvmf.
## Introduction

The SPDK NVMe-oF target library is located in `lib/nvmf`. The library
implements all logic required to create an NVMe-oF target application. It is
used in the implementation of the example NVMe-oF target application in
`app/nvmf_tgt`, but is intended to be consumed independently.

This guide is written assuming that the reader is familiar with both NVMe and
NVMe over Fabrics. The best way to become familiar with those is to read their
[specifications](http://nvmexpress.org/resources/specifications/).
## Primitives

The library exposes a number of primitives - basic objects that the user
creates and interacts with. They are:

`struct spdk_nvmf_tgt`: An NVMe-oF target. This concept, surprisingly, does
not appear in the NVMe-oF specification. SPDK defines this to mean the
collection of subsystems with the associated namespaces, plus the set of
transports and their associated network connections. This will be referred to
throughout this guide as a **target**.

`struct spdk_nvmf_subsystem`: An NVMe-oF subsystem, as defined by the NVMe-oF
specification. Subsystems contain namespaces and controllers and perform
access control. This will be referred to throughout this guide as a
**subsystem**.

`struct spdk_nvmf_ns`: An NVMe-oF namespace, as defined by the NVMe-oF
specification. Namespaces are **bdevs**. See @ref bdev for an explanation of
the SPDK bdev layer. This will be referred to throughout this guide as a
**namespace**.

`struct spdk_nvmf_qpair`: An NVMe-oF queue pair, as defined by the NVMe-oF
specification. These map 1:1 to network connections. This will be referred to
throughout this guide as a **qpair**.

`struct spdk_nvmf_transport`: An abstraction for a network fabric, as defined
by the NVMe-oF specification. The specification is designed to allow for many
different network fabrics, so the code mirrors that and implements a plugin
system. Currently, only the RDMA transport is available. This will be referred
to throughout this guide as a **transport**.

`struct spdk_nvmf_poll_group`: An abstraction for a collection of network
connections that can be polled as a unit. This is an SPDK-defined concept that
does not appear in the NVMe-oF specification. Often, network transports have
facilities to check for incoming data on groups of connections more
efficiently than checking each one individually (e.g. epoll), so poll groups
provide a generic abstraction for that. This will be referred to throughout
this guide as a **poll group**.

`struct spdk_nvmf_listener`: A network address at which the target will accept
new connections.

`struct spdk_nvmf_host`: An NVMe-oF NQN representing a host (initiator)
system. This is used for access control.
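As a point of reference, the sketch below shows how an application might group
these opaque handles. The `nvmf_app_ctx` structure is purely illustrative and
is not part of the library; it simply names the objects an application
typically holds on to.

~~~{.c}
#include "spdk/nvmf.h"

/*
 * Illustrative only: the library does not require any particular grouping.
 * An application typically keeps a handle to the target, to each subsystem
 * it creates, and to one poll group per thread.
 */
struct nvmf_app_ctx {
	struct spdk_nvmf_tgt		*tgt;        /* the target */
	struct spdk_nvmf_subsystem	*subsystem;  /* one of possibly many subsystems */
	struct spdk_nvmf_poll_group	*poll_group; /* one per application thread */
};
~~~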
## The Basics

A user of the NVMe-oF target library begins by creating a target using
spdk_nvmf_tgt_create(), setting up a set of addresses to accept connections on
by calling spdk_nvmf_tgt_listen(), then creating a subsystem using
spdk_nvmf_subsystem_create().
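A minimal sketch of that sequence is shown below. The exact prototypes (for
example, whether spdk_nvmf_tgt_create() takes an options structure) vary
between SPDK versions, so treat the signatures as assumptions and consult
`include/spdk/nvmf.h`; the NQN and the RDMA address are placeholders.

~~~{.c}
#include "spdk/stdinc.h"
#include "spdk/nvme.h"
#include "spdk/nvmf.h"

/* Sketch: create a target, listen on one address, create one subsystem.
 * Signatures are assumptions - check include/spdk/nvmf.h for your version. */
static struct spdk_nvmf_subsystem *
setup_target(struct spdk_nvmf_tgt **out_tgt)
{
	struct spdk_nvmf_tgt *tgt;
	struct spdk_nvmf_subsystem *subsystem;
	struct spdk_nvme_transport_id trid = {};

	tgt = spdk_nvmf_tgt_create(NULL); /* NULL: default target options */
	if (tgt == NULL) {
		return NULL;
	}

	/* Accept connections over RDMA on a placeholder IPv4 address/port. */
	trid.trtype = SPDK_NVME_TRANSPORT_RDMA;
	trid.adrfam = SPDK_NVMF_ADRFAM_IPV4;
	snprintf(trid.traddr, sizeof(trid.traddr), "192.168.0.1");
	snprintf(trid.trsvcid, sizeof(trid.trsvcid), "4420");
	spdk_nvmf_tgt_listen(tgt, &trid);

	/* An NVMe subsystem with room for up to 32 namespaces (placeholder NQN). */
	subsystem = spdk_nvmf_subsystem_create(tgt, "nqn.2018-01.io.spdk:cnode1",
					       SPDK_NVMF_SUBTYPE_NVME, 32);

	*out_tgt = tgt;
	return subsystem;
}
~~~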
Subsystems begin in an inactive state and must be activated by calling
spdk_nvmf_subsystem_start(). Subsystems may be modified at run time, but only
when in the paused or inactive state. A running subsystem may be paused by
calling spdk_nvmf_subsystem_pause() and resumed by calling
spdk_nvmf_subsystem_resume().
Namespaces may be added to the subsystem by calling
spdk_nvmf_subsystem_add_ns() when the subsystem is inactive or paused.
Namespaces are bdevs. See @ref bdev for more information about the SPDK bdev
layer. A bdev may be obtained by calling spdk_bdev_get_by_name().
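Putting the last two paragraphs together, a namespace can be attached to an
already-running subsystem by pausing it first. The callback shapes and the
spdk_nvmf_subsystem_add_ns() arguments below are assumptions, and `"Malloc0"`
is a placeholder bdev name.

~~~{.c}
#include "spdk/bdev.h"
#include "spdk/nvmf.h"

/* Sketch: pause a running subsystem, attach a bdev as a namespace, resume.
 * Prototypes are assumptions - check include/spdk/nvmf.h. */
static void
resume_done(struct spdk_nvmf_subsystem *subsystem, void *cb_arg, int status)
{
	/* The subsystem is active again and the new namespace is visible. */
}

static void
subsystem_paused(struct spdk_nvmf_subsystem *subsystem, void *cb_arg, int status)
{
	struct spdk_bdev *bdev = spdk_bdev_get_by_name("Malloc0");

	if (bdev != NULL) {
		/* Passing 0 lets the library pick the next free namespace ID. */
		spdk_nvmf_subsystem_add_ns(subsystem, bdev, 0);
	}

	spdk_nvmf_subsystem_resume(subsystem, resume_done, NULL);
}

static void
add_namespace(struct spdk_nvmf_subsystem *subsystem)
{
	/* Namespaces may only be added while the subsystem is inactive or paused. */
	spdk_nvmf_subsystem_pause(subsystem, subsystem_paused, NULL);
}
~~~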
Once a subsystem exists and the target is listening on an address, new
connections may be accepted by polling spdk_nvmf_tgt_accept().
All I/O to a subsystem is driven by a poll group, which polls for incoming
network I/O. Poll groups may be created by calling
spdk_nvmf_poll_group_create(). Upon creation, a poll group automatically
begins polling on the thread from which it was created. Most importantly, *a
poll group may only be accessed from the thread it was created on.*
When spdk_nvmf_tgt_accept() detects a new connection, it will construct a new
struct spdk_nvmf_qpair object and call the user provided `new_qpair_fn`
callback for each new qpair. In response to this callback, the user must
assign the qpair to a poll group by calling spdk_nvmf_poll_group_add().
Remember, a poll group may only be accessed from the thread it was created on,
so making a call to spdk_nvmf_poll_group_add() may require passing a message
to the appropriate thread.
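A hedged sketch of that flow is shown below, assuming one poll group and one
`struct spdk_thread` per core and simple round-robin assignment. The
`g_poll_groups`/`g_threads` arrays are hypothetical application state, and the
use of spdk_thread_send_msg() (declared in `spdk/io_channel.h` in older
releases, `spdk/thread.h` in newer ones) is one way to hop to the owning
thread.

~~~{.c}
#include "spdk/stdinc.h"
#include "spdk/nvmf.h"
#include "spdk/thread.h"

#define APP_NUM_GROUPS 4

/* Hypothetical application state: one poll group and one spdk_thread per
 * core, populated elsewhere during start-up. */
static struct spdk_nvmf_poll_group *g_poll_groups[APP_NUM_GROUPS];
static struct spdk_thread *g_threads[APP_NUM_GROUPS];
static uint32_t g_next_group;

struct new_qpair_ctx {
	struct spdk_nvmf_qpair *qpair;
	struct spdk_nvmf_poll_group *group;
};

static void
_add_qpair(void *arg)
{
	struct new_qpair_ctx *ctx = arg;

	/* Now running on the thread that owns ctx->group, as required. */
	spdk_nvmf_poll_group_add(ctx->group, ctx->qpair);
	free(ctx);
}

static void
new_qpair(struct spdk_nvmf_qpair *qpair)
{
	struct new_qpair_ctx *ctx = calloc(1, sizeof(*ctx));
	uint32_t i = g_next_group++ % APP_NUM_GROUPS;

	if (ctx == NULL) {
		return; /* a real application would also reject the qpair */
	}
	ctx->qpair = qpair;
	ctx->group = g_poll_groups[i];

	/* Poll groups may only be touched from their own thread, so send a
	 * message instead of calling spdk_nvmf_poll_group_add() directly. */
	spdk_thread_send_msg(g_threads[i], _add_qpair, ctx);
}

/* Called periodically (e.g. from a poller) to pick up new connections. */
static void
accept_poll(struct spdk_nvmf_tgt *tgt)
{
	spdk_nvmf_tgt_accept(tgt, new_qpair);
}
~~~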
## Access Control

Access control is performed at the subsystem level by adding allowed listen
addresses and hosts to a subsystem (see spdk_nvmf_subsystem_add_listener() and
spdk_nvmf_subsystem_add_host()). By default, a subsystem will not accept
connections from any host or over any established listen address. Listeners
and hosts may only be added to inactive or paused subsystems.
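For example, assuming the subsystem is currently inactive or paused, a sketch
of allowing one address and one host might look like this (placeholder address
and host NQN; prototypes are assumptions):

~~~{.c}
#include "spdk/stdinc.h"
#include "spdk/nvme.h"
#include "spdk/nvmf.h"

/* Sketch: allow connections to a subsystem only over one listen address and
 * only from one host NQN. Prototypes are assumptions. */
static void
allow_one_host(struct spdk_nvmf_subsystem *subsystem)
{
	struct spdk_nvme_transport_id trid = {};

	trid.trtype = SPDK_NVME_TRANSPORT_RDMA;
	trid.adrfam = SPDK_NVMF_ADRFAM_IPV4;
	snprintf(trid.traddr, sizeof(trid.traddr), "192.168.0.1");
	snprintf(trid.trsvcid, sizeof(trid.trsvcid), "4420");

	/* Connections are only accepted for this subsystem on this address... */
	spdk_nvmf_subsystem_add_listener(subsystem, &trid);

	/* ...and only from this host NQN (placeholder). */
	spdk_nvmf_subsystem_add_host(subsystem, "nqn.2018-01.io.spdk:host1");
}
~~~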
## Discovery Subsystems

A discovery subsystem, as defined by the NVMe-oF specification, is
automatically created for each NVMe-oF target constructed. Connections to the
discovery subsystem are handled in the same way as any other subsystem - new
qpairs are created in response to spdk_nvmf_tgt_accept() and they must be
assigned to a poll group.
## Transports

The NVMe-oF specification defines multiple network transports (the "Fabrics"
in NVMe over Fabrics) and has an extensible system for adding new fabrics
in the future. The SPDK NVMe-oF target library implements a plugin system for
network transports to mirror the specification. The API a new transport must
implement is located in lib/nvmf/transport.h. As of this writing, only an RDMA
transport has been implemented.

The SPDK NVMe-oF target is designed to be able to process I/O from multiple
fabrics simultaneously.
## Choosing a Threading Model

The SPDK NVMe-oF target library does not strictly dictate a threading model,
but poll groups do all of their polling and I/O processing on the thread they
are created on. Given that, it almost always makes sense to create one poll
group per thread used in the application. New qpairs created in response to
spdk_nvmf_tgt_accept() can be handed out round-robin to the poll groups. This
is how the SPDK NVMe-oF target application currently functions.

More advanced algorithms for distributing qpairs to poll groups are possible.
For instance, a NUMA-aware algorithm would be an improvement over basic
round-robin, where NUMA-aware means assigning qpairs to poll groups running on
CPU cores that are on the same NUMA node as the network adapter and storage
device. Load-aware algorithms may also have benefits.
## Scaling Across CPU Cores

Incoming I/O requests are picked up by the poll group that is polling their
assigned qpair. For regular NVMe commands such as READ and WRITE, the I/O
request is processed on the initial thread from start to the point where it is
submitted to the backing storage device, without interruption. Completions are
discovered by polling the backing storage device and also processed to
completion on the polling thread. **Regular NVMe commands (READ, WRITE, etc.)
do not require any cross-thread coordination, and therefore take no locks.**

NVMe ADMIN commands, which are used for managing the NVMe device itself, may
modify global state in the subsystem. For instance, an NVMe ADMIN command may
perform namespace management, such as shrinking a namespace. For these
commands, the subsystem will temporarily enter a paused state by sending a
message to each thread in the system. All new incoming I/O on any thread
targeting the subsystem will be queued during this time. Once the subsystem is
fully paused, the state change will occur, and messages will be sent to each
thread to release queued I/O and resume. Management commands are rare, so this
style of coordination is preferable to forcing all commands to take locks in
the I/O path.
## Zero Copy Support

For the RDMA transport, data is transferred from the RDMA NIC to host memory
and then from host memory to the SSD (or vice versa), without any intermediate
copies. Data is never moved from one location in host memory to another. Other
transports in the future may require data copies.
## RDMA

The SPDK NVMe-oF RDMA transport is implemented on top of the libibverbs and
rdmacm libraries, which are packaged and available on most Linux
distributions. It does not use a user-space RDMA driver stack through DPDK.
In order to scale to large numbers of connections, the SPDK NVMe-oF RDMA
transport allocates a single RDMA completion queue per poll group. All new
qpairs assigned to the poll group are given their own RDMA send and receive
queues, but share this common completion queue. This allows the poll group to
poll a single queue for incoming messages instead of iterating through each
one.
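Outside of SPDK, the same pattern can be expressed directly with libibverbs
and rdmacm: create one completion queue per poll group, point every queue
pair's send and receive side at it, and poll only that queue. The sketch below
illustrates the idea and is not SPDK's actual transport code; the queue depths
are arbitrary.

~~~{.c}
#include <infiniband/verbs.h>
#include <rdma/rdma_cma.h>

/* Illustration of the shared completion queue pattern (not SPDK code). */
struct rdma_group {
	struct ibv_cq *cq;	/* one CQ shared by every qpair in the group */
};

static int
group_attach_connection(struct rdma_group *group, struct rdma_cm_id *cm_id,
			struct ibv_pd *pd)
{
	struct ibv_qp_init_attr attr = {};

	attr.qp_type = IBV_QPT_RC;
	attr.send_cq = group->cq;	/* share the group's completion queue */
	attr.recv_cq = group->cq;
	attr.cap.max_send_wr = 128;	/* arbitrary illustrative depths */
	attr.cap.max_recv_wr = 128;
	attr.cap.max_send_sge = 1;
	attr.cap.max_recv_sge = 1;

	/* The new queue pair gets its own send/receive queues but no new CQ. */
	return rdma_create_qp(cm_id, pd, &attr);
}

static void
group_poll(struct rdma_group *group)
{
	struct ibv_wc wc[32];
	int i, n;

	/* A single poll covers completions from every connection in the group. */
	n = ibv_poll_cq(group->cq, 32, wc);
	for (i = 0; i < n; i++) {
		/* dispatch wc[i] based on wc[i].wr_id and wc[i].opcode */
	}
}
~~~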
Each RDMA request is handled by a state machine that walks the request through
a number of states. This keeps the code organized and makes all of the corner
cases much more obvious.
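The real state names and transitions live in `lib/nvmf/rdma.c`; the fragment
below only illustrates the style with invented, simplified state names: each
request carries an explicit state, and a single processing function advances
it as far as it can go on each pass.

~~~{.c}
/* Schematic only - the actual states are defined in lib/nvmf/rdma.c. */
enum example_request_state {
	EXAMPLE_STATE_NEW,		/* capsule received, not yet parsed  */
	EXAMPLE_STATE_NEED_BUFFER,	/* waiting for a data buffer         */
	EXAMPLE_STATE_TRANSFERRING,	/* RDMA READ/WRITE data phase        */
	EXAMPLE_STATE_EXECUTING,	/* submitted to the backing device   */
	EXAMPLE_STATE_COMPLETING,	/* sending the completion capsule    */
	EXAMPLE_STATE_COMPLETED,
};

struct example_request {
	enum example_request_state state;
};

static void
example_request_process(struct example_request *req)
{
	/* Advance the request as far as possible; anything that must wait
	 * (e.g. for a buffer) simply remains in its current state and is
	 * retried on a later pass. */
	switch (req->state) {
	case EXAMPLE_STATE_NEW:
		req->state = EXAMPLE_STATE_NEED_BUFFER;
		break;
	case EXAMPLE_STATE_NEED_BUFFER:
		/* try to acquire a buffer, then move to TRANSFERRING ... */
		break;
	default:
		break;
	}
}
~~~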
RDMA SEND, READ, and WRITE operations are ordered with respect to one another,
but RDMA RECVs are not necessarily ordered with SEND acknowledgements. For
instance, it is possible to detect an incoming RDMA RECV message containing a
new NVMe-oF capsule prior to detecting the acknowledgement of a previous SEND
containing an NVMe completion. This is problematic at full queue depth because
there may not yet be a free request structure. To handle this, the RDMA
request structure is broken into two parts - an rdma_recv and an rdma_request.
New RDMA RECVs will always grab a free rdma_recv, but may need to wait in a
queue for a SEND acknowledgement before they can acquire a full rdma_request
object.
Further, RDMA NICs expose different queue depths for READ/WRITE operations
than they do for SEND/RECV operations. The RDMA transport reports available
queue depth based on SEND/RECV operation limits and will queue in software as
necessary to accommodate (usually lower) limits on READ/WRITE operations.