# NVMe over Fabrics Target Programming Guide {#nvmf_tgt_pg}

## Target Audience

This programming guide is intended for developers authoring applications that
use the SPDK NVMe-oF target library (`lib/nvmf`). It provides background
context, architectural insight, and design recommendations. It does not cover
how to use the SPDK NVMe-oF target application. For a guide on how to use the
existing application as-is, see @ref nvmf.

## Introduction

The SPDK NVMe-oF target library is located in `lib/nvmf`. The library
implements all logic required to create an NVMe-oF target application. It is
used in the implementation of the example NVMe-oF target application in
`app/nvmf_tgt`, but is intended to be consumed independently.

This guide is written assuming that the reader is familiar with both NVMe and
NVMe over Fabrics. The best way to become familiar with those is to read their
[specifications](http://nvmexpress.org/resources/specifications/).

## Primitives

The library exposes a number of primitives - basic objects that the user
creates and interacts with. They are:

`struct spdk_nvmf_tgt`: An NVMe-oF target. This concept, surprisingly, does
not appear in the NVMe-oF specification. SPDK defines this to mean the
collection of subsystems with the associated namespaces, plus the set of
transports and their associated network connections. This will be referred to
throughout this guide as a **target**.

`struct spdk_nvmf_subsystem`: An NVMe-oF subsystem, as defined by the NVMe-oF
specification. Subsystems contain namespaces and controllers and perform
access control. This will be referred to throughout this guide as a
**subsystem**.

`struct spdk_nvmf_ns`: An NVMe-oF namespace, as defined by the NVMe-oF
specification. Namespaces are **bdevs**. See @ref bdev for an explanation of
the SPDK bdev layer. This will be referred to throughout this guide as a
**namespace**.

`struct spdk_nvmf_qpair`: An NVMe-oF queue pair, as defined by the NVMe-oF
specification. These map 1:1 to network connections. This will be referred to
throughout this guide as a **qpair**.

`struct spdk_nvmf_transport`: An abstraction for a network fabric, as defined
by the NVMe-oF specification. The specification is designed to allow for many
different network fabrics, so the code mirrors that and implements a plugin
system. Currently, only the RDMA transport is available. This will be referred
to throughout this guide as a **transport**.

`struct spdk_nvmf_poll_group`: An abstraction for a collection of network
connections that can be polled as a unit. This is an SPDK-defined concept that
does not appear in the NVMe-oF specification. Often, network transports have
facilities to check for incoming data on groups of connections more
efficiently than checking each one individually (e.g. epoll), so poll groups
provide a generic abstraction for that. This will be referred to throughout
this guide as a **poll group**.

`struct spdk_nvmf_listener`: A network address at which the target will accept
new connections.

`struct spdk_nvmf_host`: An NVMe-oF NQN representing a host (initiator)
system. This is used for access control.

## The Basics

A user of the NVMe-oF target library begins by creating a target using
spdk_nvmf_tgt_create(), setting up a set of addresses on which to accept
connections by calling spdk_nvmf_tgt_listen(), then creating a subsystem
using spdk_nvmf_subsystem_create().

Subsystems begin in an inactive state and must be activated by calling
spdk_nvmf_subsystem_start(). Subsystems may be modified at run time, but only
when in the paused or inactive state. A running subsystem may be paused by
calling spdk_nvmf_subsystem_pause() and resumed by calling
spdk_nvmf_subsystem_resume().
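
The lifecycle described above can be pictured as a small state machine. The
following is a minimal sketch only, not the library's implementation: the
`subsys_model` type and every function name in it are hypothetical and exist
purely to illustrate which transitions are legal and when configuration
changes are allowed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the subsystem lifecycle; not SPDK code. */
enum subsys_state { SUBSYS_INACTIVE, SUBSYS_ACTIVE, SUBSYS_PAUSED };

struct subsys_model {
	enum subsys_state state;
	int num_namespaces;
};

/* inactive -> active */
bool subsys_model_start(struct subsys_model *s)
{
	assert(s != NULL);
	if (s->state != SUBSYS_INACTIVE) {
		return false;
	}
	s->state = SUBSYS_ACTIVE;
	return true;
}

/* active -> paused */
bool subsys_model_pause(struct subsys_model *s)
{
	assert(s != NULL);
	if (s->state != SUBSYS_ACTIVE) {
		return false;
	}
	s->state = SUBSYS_PAUSED;
	return true;
}

/* paused -> active */
bool subsys_model_resume(struct subsys_model *s)
{
	assert(s != NULL);
	if (s->state != SUBSYS_PAUSED) {
		return false;
	}
	s->state = SUBSYS_ACTIVE;
	return true;
}

/* Modifications are rejected while the subsystem is active. */
bool subsys_model_add_ns(struct subsys_model *s)
{
	assert(s != NULL);
	if (s->state == SUBSYS_ACTIVE) {
		return false;
	}
	s->num_namespaces++;
	return true;
}
```

In this model, adding a namespace succeeds before `subsys_model_start()` and
between `subsys_model_pause()` and `subsys_model_resume()`, and fails while
the subsystem is active.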

Namespaces may be added to the subsystem by calling
spdk_nvmf_subsystem_add_ns() when the subsystem is inactive or paused.
Namespaces are bdevs. See @ref bdev for more information about the SPDK bdev
layer. A bdev may be obtained by calling spdk_bdev_get_by_name().

Once a subsystem exists and the target is listening on an address, new
connections will be automatically assigned to poll groups as they are
detected.

All I/O to a subsystem is driven by a poll group, which polls for incoming
network I/O. Poll groups may be created by calling
spdk_nvmf_poll_group_create(). Upon creation, a poll group automatically
begins polling on the thread from which it was created. Most importantly, *a
poll group may only be accessed from the thread on which it was created.*

## Access Control

Access control is performed at the subsystem level by adding allowed listen
addresses and hosts to a subsystem (see spdk_nvmf_subsystem_add_listener() and
spdk_nvmf_subsystem_add_host()). By default, a subsystem will not accept
connections from any host or over any established listen address. Listeners
and hosts may only be added to inactive or paused subsystems.
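
The default-deny behavior can be illustrated with a toy allow list. This is a
sketch under stated assumptions rather than the library's data structures: the
`acl_model` type, its fixed-size arrays, and the address strings are all
hypothetical, and the real checks operate on transport IDs and NQNs inside the
subsystem.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define ACL_MAX_ENTRIES 8

/* Hypothetical allow-list model; names are illustrative only. */
struct acl_model {
	const char *hosts[ACL_MAX_ENTRIES];     /* allowed host NQNs */
	size_t num_hosts;
	const char *listeners[ACL_MAX_ENTRIES]; /* allowed listen addresses */
	size_t num_listeners;
};

static bool list_contains(const char *const *list, size_t n, const char *key)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (strcmp(list[i], key) == 0) {
			return true;
		}
	}
	return false;
}

bool acl_model_add_host(struct acl_model *acl, const char *nqn)
{
	assert(acl != NULL && nqn != NULL);
	if (acl->num_hosts == ACL_MAX_ENTRIES) {
		return false;
	}
	acl->hosts[acl->num_hosts++] = nqn;
	return true;
}

bool acl_model_add_listener(struct acl_model *acl, const char *addr)
{
	assert(acl != NULL && addr != NULL);
	if (acl->num_listeners == ACL_MAX_ENTRIES) {
		return false;
	}
	acl->listeners[acl->num_listeners++] = addr;
	return true;
}

/* Accept only if both the host and the listen address were explicitly added. */
bool acl_model_allows(const struct acl_model *acl, const char *host_nqn,
		      const char *listen_addr)
{
	assert(acl != NULL && host_nqn != NULL && listen_addr != NULL);
	return list_contains(acl->hosts, acl->num_hosts, host_nqn) &&
	       list_contains(acl->listeners, acl->num_listeners, listen_addr);
}
```

A zero-initialized `acl_model` rejects every connection, which mirrors the
default described above. Access is granted only after both a matching host and
a matching listener have been added.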

## Discovery Subsystems

A discovery subsystem, as defined by the NVMe-oF specification, is
automatically created for each NVMe-oF target constructed. Connections to the
discovery subsystem are handled in the same way as any other subsystem.

## Transports

The NVMe-oF specification defines multiple network transports (the "Fabrics"
in NVMe over Fabrics) and has an extensible system for adding new fabrics
in the future. The SPDK NVMe-oF target library implements a plugin system for
network transports to mirror the specification. The API a new transport must
implement is located in `lib/nvmf/transport.h`. As of this writing, only an RDMA
transport has been implemented.
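
A plugin system of this kind is commonly built from a table of function
pointers plus a registry keyed by name. The sketch below shows that general
pattern only. The `transport_ops_model` structure, its two callbacks, and the
registry functions are hypothetical and much smaller than the real interface
in `lib/nvmf/transport.h`, which should be consulted for the actual API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical ops table; the real interface differs. */
struct transport_ops_model {
	const char *name;
	int (*listen)(const char *addr);
	int (*poll)(void);
};

#define MAX_TRANSPORTS 4

static const struct transport_ops_model *g_transports[MAX_TRANSPORTS];
static size_t g_num_transports;

/* Register a transport plugin. Returns 0 on success, -1 if the table is full. */
int transport_model_register(const struct transport_ops_model *ops)
{
	assert(ops != NULL && ops->name != NULL);
	if (g_num_transports == MAX_TRANSPORTS) {
		return -1;
	}
	g_transports[g_num_transports++] = ops;
	return 0;
}

/* Look up a registered transport by name. Returns NULL if not found. */
const struct transport_ops_model *transport_model_find(const char *name)
{
	size_t i;

	assert(name != NULL);
	for (i = 0; i < g_num_transports; i++) {
		if (strcmp(g_transports[i]->name, name) == 0) {
			return g_transports[i];
		}
	}
	return NULL;
}

/* Stand-in plugin used only to exercise the registry. */
static int example_rdma_listen(const char *addr)
{
	(void)addr;
	return 0;
}

static int example_rdma_poll(void)
{
	return 0;
}

const struct transport_ops_model example_rdma_ops = {
	"RDMA", example_rdma_listen, example_rdma_poll
};
```

Generic code looks a transport up by name and calls through the table, so it
never needs to know which fabric sits behind the callbacks.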

The SPDK NVMe-oF target is designed to be able to process I/O from multiple
fabrics simultaneously.

## Choosing a Threading Model

The SPDK NVMe-oF target library does not strictly dictate a threading model, but
poll groups do all of their polling and I/O processing on the thread they are
created on. Given that, it almost always makes sense to create one poll group
per thread used in the application.

## Scaling Across CPU Cores

Incoming I/O requests are picked up by the poll group polling their assigned
qpair. For regular NVMe commands such as READ and WRITE, the I/O request is
processed on the initial thread from start to the point where it is submitted
to the backing storage device, without interruption. Completions are
discovered by polling the backing storage device and also processed to
completion on the polling thread. **Regular NVMe commands (READ, WRITE, etc.)
do not require any cross-thread coordination, and therefore take no locks.**

NVMe ADMIN commands, which are used for managing the NVMe device itself, may
modify global state in the subsystem. For instance, an NVMe ADMIN command may
perform namespace management, such as shrinking a namespace. For these
commands, the subsystem will temporarily enter a paused state by sending a
message to each thread in the system. All new incoming I/O on any thread
targeting the subsystem will be queued during this time. Once the subsystem is
fully paused, the state change will occur, and messages will be sent to each
thread to release queued I/O and resume. Management commands are rare, so this
style of coordination is preferable to forcing all commands to take locks in
the I/O path.
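
The pause, change, resume sequence can be sketched as follows. This is a
single-threaded illustration under simplifying assumptions: the real library
sends messages between threads, whereas this sketch uses plain loops, and the
`pg_subsys_model` type and all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAUSE_QUEUE_DEPTH 16

/* Hypothetical per-poll-group view of one subsystem. */
struct pg_subsys_model {
	bool paused;
	int queued[PAUSE_QUEUE_DEPTH]; /* request IDs held while paused */
	size_t num_queued;
	size_t num_submitted;          /* requests passed to the backing device */
};

/* Returns true if the request was accepted (submitted or queued). */
bool pg_model_handle_io(struct pg_subsys_model *pg, int req_id)
{
	assert(pg != NULL);
	if (!pg->paused) {
		pg->num_submitted++;
		return true;
	}
	if (pg->num_queued == PAUSE_QUEUE_DEPTH) {
		return false;
	}
	pg->queued[pg->num_queued++] = req_id;
	return true;
}

void pg_model_pause(struct pg_subsys_model *pg)
{
	assert(pg != NULL);
	pg->paused = true;
}

/* Release everything that was held while paused. */
void pg_model_resume(struct pg_subsys_model *pg)
{
	assert(pg != NULL);
	pg->paused = false;
	pg->num_submitted += pg->num_queued;
	pg->num_queued = 0;
}

/* Pause every group, apply the change, then resume every group. */
void subsys_model_update(struct pg_subsys_model *groups, size_t n,
			 int *config, int new_value)
{
	size_t i;

	assert(groups != NULL && config != NULL);
	for (i = 0; i < n; i++) {
		pg_model_pause(&groups[i]);
	}
	*config = new_value; /* no group is submitting I/O at this point */
	for (i = 0; i < n; i++) {
		pg_model_resume(&groups[i]);
	}
}
```

The I/O path in `pg_model_handle_io()` only reads a flag that belongs to its
own poll group, so regular commands never contend on a shared lock. The cost
of coordination is paid only by the rare management operation.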

## Zero Copy Support

For the RDMA transport, data is transferred from the RDMA NIC to host memory
and then host memory to the SSD (or vice versa), without any intermediate
copies. Data is never moved from one location in host memory to another. Other
transports in the future may require data copies.

## RDMA

The SPDK NVMe-oF RDMA transport is implemented on top of the libibverbs and
rdmacm libraries, which are packaged and available on most Linux
distributions. It does not use a user-space RDMA driver stack through DPDK.

In order to scale to large numbers of connections, the SPDK NVMe-oF RDMA
transport allocates a single RDMA completion queue per poll group. All new
qpairs assigned to the poll group are given their own RDMA send and receive
queues, but share this common completion queue. This allows the poll group to
poll a single queue for incoming messages instead of iterating through each
one.
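
The benefit of a shared completion queue can be seen in a toy model. The
sketch below does not use libibverbs. The `poll_group_cq_model` ring, its
entry layout, and the function names are hypothetical stand-ins that show only
how one poll call can service completions belonging to many qpairs.

```c
#include <assert.h>
#include <stddef.h>

#define CQ_DEPTH   32
#define MAX_QPAIRS 4

/* Hypothetical completion entry tagged with the qpair it belongs to. */
struct cq_entry_model {
	int qpair_id;
	int wr_id;
};

struct poll_group_cq_model {
	struct cq_entry_model cq[CQ_DEPTH];
	size_t head;                 /* consumer index */
	size_t tail;                 /* producer index */
	int completions[MAX_QPAIRS]; /* per-qpair completion counters */
};

/* Simulates the NIC posting a completion. Returns 0 on success, -1 on error. */
int cq_model_push(struct poll_group_cq_model *pg, int qpair_id, int wr_id)
{
	assert(pg != NULL);
	if (qpair_id < 0 || qpair_id >= MAX_QPAIRS) {
		return -1;
	}
	if (pg->tail - pg->head == CQ_DEPTH) {
		return -1;
	}
	pg->cq[pg->tail % CQ_DEPTH] = (struct cq_entry_model){ qpair_id, wr_id };
	pg->tail++;
	return 0;
}

/* Drains up to max entries from the single shared queue and dispatches
 * each one to its qpair. Returns the number of entries processed. */
int cq_model_poll(struct poll_group_cq_model *pg, int max)
{
	int n = 0;

	assert(pg != NULL);
	while (n < max && pg->head != pg->tail) {
		const struct cq_entry_model *e = &pg->cq[pg->head % CQ_DEPTH];

		pg->completions[e->qpair_id]++;
		pg->head++;
		n++;
	}
	return n;
}
```

The work done by `cq_model_poll()` depends on the number of completions that
are ready, not on the number of qpairs in the poll group. That property is
what lets a single queue scale to many connections.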

Each RDMA request is handled by a state machine that walks the request through
a number of states. This keeps the code organized and makes all of the corner
cases much more obvious.

RDMA SEND, READ, and WRITE operations are ordered with respect to one another,
but RDMA RECVs are not necessarily ordered with SEND acknowledgements. For
instance, it is possible to detect an incoming RDMA RECV message containing a
new NVMe-oF capsule prior to detecting the acknowledgement of a previous SEND
containing an NVMe completion. This is problematic at full queue depth because
there may not yet be a free request structure. To handle this, the RDMA
request structure is broken into two parts - an rdma_recv and an rdma_request.
New RDMA RECVs will always grab a free rdma_recv, but may need to wait in a
queue for a SEND acknowledgement before they can acquire a full rdma_request
object.
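
The split between the two structures can be modeled with simple counters. The
following sketch makes no claim about the transport's actual data layout: the
`rdma_qpair_model` type and its functions are hypothetical, and only the
bookkeeping of "capsule received" versus "full request available" is shown.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for one qpair. */
struct rdma_qpair_model {
	int free_requests;   /* rdma_request objects not in use */
	int pending_recvs;   /* capsules received but not yet paired */
	int active_requests; /* requests in use until their SEND is acknowledged */
};

void rdma_model_init(struct rdma_qpair_model *q, int depth)
{
	assert(q != NULL && depth > 0);
	q->free_requests = depth;
	q->pending_recvs = 0;
	q->active_requests = 0;
}

/* Pair waiting capsules with free request objects. */
static void rdma_model_pair(struct rdma_qpair_model *q)
{
	while (q->pending_recvs > 0 && q->free_requests > 0) {
		q->pending_recvs--;
		q->free_requests--;
		q->active_requests++;
	}
}

/* A new capsule arrived. It is always accepted, but may have to wait. */
void rdma_model_on_recv(struct rdma_qpair_model *q)
{
	assert(q != NULL);
	q->pending_recvs++;
	rdma_model_pair(q);
}

/* The SEND carrying a completion was acknowledged. The request object
 * becomes free and is handed to a waiting capsule, if there is one. */
void rdma_model_on_send_ack(struct rdma_qpair_model *q)
{
	assert(q != NULL);
	if (q->active_requests == 0) {
		return;
	}
	q->active_requests--;
	q->free_requests++;
	rdma_model_pair(q);
}
```

With a depth of 2, a third capsule that arrives before any SEND
acknowledgement waits in `pending_recvs` and is paired as soon as
`rdma_model_on_send_ack()` frees a request object.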

Further, RDMA NICs expose different queue depths for READ/WRITE operations
than they do for SEND/RECV operations. The RDMA transport reports available
queue depth based on SEND/RECV operation limits and will queue in software as
necessary to accommodate (usually lower) limits on READ/WRITE operations.
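
Software queueing against a lower READ/WRITE limit can be sketched as a small
limiter. This is an illustration only: the `rw_limiter_model` type, its field
names, and the depth used in the example are assumptions, not values or
structures taken from the transport.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical limiter for RDMA READ/WRITE operations on one qpair. */
struct rw_limiter_model {
	int max_rw_depth; /* NIC limit on outstanding READ/WRITE operations */
	int outstanding;  /* operations currently posted to the NIC */
	int sw_queued;    /* operations held in software */
};

/* Returns 1 if the operation was posted immediately, 0 if it was queued. */
int rw_model_submit(struct rw_limiter_model *m)
{
	assert(m != NULL);
	if (m->outstanding < m->max_rw_depth) {
		m->outstanding++;
		return 1;
	}
	m->sw_queued++;
	return 0;
}

/* A READ/WRITE finished. Post one queued operation if any are waiting. */
void rw_model_complete(struct rw_limiter_model *m)
{
	assert(m != NULL);
	if (m->outstanding == 0) {
		return;
	}
	m->outstanding--;
	if (m->sw_queued > 0) {
		m->sw_queued--;
		m->outstanding++;
	}
}
```

The host still sees the queue depth derived from the SEND/RECV limits.
Commands beyond the READ/WRITE limit are accepted and wait in `sw_queued`
until an earlier data transfer completes.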