numam-spdk/doc/nvme.md

192 lines
8.5 KiB
Markdown
Raw Normal View History

# NVMe Driver {#nvme}
# Public Interface {#nvme_interface}
- spdk/nvme.h
# Key Functions {#nvme_key_functions}
Function | Description
------------------------------------------- | -----------
spdk_nvme_probe() | @copybrief spdk_nvme_probe()
spdk_nvme_ns_cmd_read() | @copybrief spdk_nvme_ns_cmd_read()
spdk_nvme_ns_cmd_write() | @copybrief spdk_nvme_ns_cmd_write()
spdk_nvme_ns_cmd_dataset_management() | @copybrief spdk_nvme_ns_cmd_dataset_management()
spdk_nvme_ns_cmd_flush() | @copybrief spdk_nvme_ns_cmd_flush()
spdk_nvme_qpair_process_completions() | @copybrief spdk_nvme_qpair_process_completions()
spdk_nvme_ctrlr_cmd_admin_raw() | @copybrief spdk_nvme_ctrlr_cmd_admin_raw()
spdk_nvme_ctrlr_process_admin_completions() | @copybrief spdk_nvme_ctrlr_process_admin_completions()
# NVMe Initialization {#nvme_initialization}
\msc
app [label="Application"], nvme [label="NVMe Driver"];
app=>nvme [label="nvme_probe()"];
app<<nvme [label="probe_cb(pci_dev)"];
nvme=>nvme [label="nvme_attach(devhandle)"];
nvme=>nvme [label="nvme_ctrlr_start(nvme_controller ptr)"];
nvme=>nvme [label="identify controller"];
nvme=>nvme [label="create queue pairs"];
nvme=>nvme [label="identify namespace(s)"];
app<<nvme [label="attach_cb(pci_dev, nvme_controller)"];
app=>app [label="create block devices based on controller's namespaces"];
\endmsc
# NVMe I/O Submission {#nvme_io_submission}
I/O is submitted to an NVMe namespace using nvme_ns_cmd_xxx functions
defined in nvme_ns_cmd.c. The NVMe driver submits the I/O request
as an NVMe submission queue entry on the queue pair specified in the command.
The application must poll for I/O completion on each queue pair with outstanding I/O
to receive completion callbacks.
@sa spdk_nvme_ns_cmd_read, spdk_nvme_ns_cmd_write, spdk_nvme_ns_cmd_dataset_management,
spdk_nvme_ns_cmd_flush, spdk_nvme_qpair_process_completions
# NVMe Asynchronous Completion {#nvme_async_completion}
The userspace NVMe driver follows an asynchronous polled model for
I/O completion.
## I/O commands {#nvme_async_io}
The application may submit I/O from one or more threads on one or more queue pairs
and must call spdk_nvme_qpair_process_completions()
for each queue pair that submitted I/O.
When the application calls spdk_nvme_qpair_process_completions(),
if the NVMe driver detects completed I/Os that were submitted on that queue,
it will invoke the registered callback function
for each I/O within the context of spdk_nvme_qpair_process_completions().
## Admin commands {#nvme_async_admin}
The application may submit admin commands from one or more threads
and must call spdk_nvme_ctrlr_process_admin_completions()
from at least one thread to receive admin command completions.
The thread that processes admin completions need not be the same thread that submitted the
admin commands.
When the application calls spdk_nvme_ctrlr_process_admin_completions(),
if the NVMe driver detects completed admin commands submitted from any thread,
it will invote the registered callback function
for each command within the context of spdk_nvme_ctrlr_process_admin_completions().
It is the application's responsibility to manage the order of submitted admin commands.
If certain admin commands must be submitted while no other commands are outstanding,
it is the application's responsibility to enforce this rule
using its own synchronization method.
# NVMe over Fabrics Host Support {#nvme_fabrics_host}
The NVMe driver supports connecting to remote NVMe-oF targets and
interacting with them in the same manner as local NVMe controllers.
## Specifying Remote NVMe over Fabrics Targets {#nvme_fabrics_trid}
The method for connecting to a remote NVMe-oF target is very similar
to the normal enumeration process for local PCIe-attached NVMe devices.
To connect to a remote NVMe over Fabrics subsystem, the user may call
spdk_nvme_probe() with the `trid` parameter specifying the address of
the NVMe-oF target.
The caller may fill out the spdk_nvme_transport_id structure manually
or use the spdk_nvme_transport_id_parse() function to convert a
human-readable string representation into the required structure.
The spdk_nvme_transport_id may contain the address of a discovery service
or a single NVM subsystem. If a discovery service address is specified,
the NVMe library will call the spdk_nvme_probe() `probe_cb` for each
discovered NVM subsystem, which allows the user to select the desired
subsystems to be attached. Alternatively, if the address specifies a
single NVM subsystem directly, the NVMe library will call `probe_cb`
for just that subsystem; this allows the user to skip the discovery step
and connect directly to a subsystem with a known address.
# NVMe Multi Process {#nvme_multi_process}
This capability enables the SPDK NVMe driver to support multiple processes accessing the
same NVMe device. The NVMe driver allocates critical structures from shared memory, so
that each process can map that memory and create its own queue pairs or share the admin
queue. There is a limited number of I/O queue pairs per NVMe controller.
The primary motivation for this feature is to support management tools that can attach
to long running applications, perform some maintenance work or gather information, and
then detach.
## Configuration {#nvme_multi_process_configuration}
DPDK EAL allows different types of processes to be spawned, each with different permissions
on the hugepage memory used by the applications.
There are two types of processes:
1. a primary process which initializes the shared memory and has full privileges and
2. a secondary process which can attach to the primary process by mapping its shared memory
regions and perform NVMe operations including creating queue pairs.
This feature is enabled by default and is controlled by selecting a value for the shared
memory group ID. This ID is a positive integer and two applications with the same shared
memory group ID will share memory. The first application with a given shared memory group
ID will be considered the primary and all others secondary.
Example: identical shm_id and non-overlapping core masks
~~~{.sh}
./perf options [AIO device(s)]...
[-c core mask for I/O submission/completion]
[-i shared memory group ID]
./perf -q 1 -s 4096 -w randread -c 0x1 -t 60 -i 1
./perf -q 8 -s 131072 -w write -c 0x10 -t 60 -i 1
~~~
## Scalability and Performance {#nvme_multi_process_scalability_performance}
To maximize the I/O bandwidth of an NVMe device, ensure that each application has its own
queue pairs.
The optimal threading model for SPDK is one thread per core, regardless of which processes
that thread belongs to in the case of multi-process environment. To achieve maximum
performance, each thread should also have its own I/O queue pair. Applications that share
memory should be given core masks that do not overlap.
However, admin commands may have some performance impact as there is only one admin queue
pair per NVMe SSD. The NVMe driver will automatically take a cross-process capable lock
to enable the sharing of admin queue pair. Further, when each process polls the admin
queue for completions, it will only see completions for commands that it originated.
## Limitations {#nvme_multi_process_limitations}
1. Two processes sharing memory may not share any cores in their core mask.
2. If a primary process exits while secondary processes are still running, those processes
will continue to run. However, a new primary process cannot be created.
3. Applications are responsible for coordinating access to logical blocks.
@sa spdk_nvme_probe, spdk_nvme_ctrlr_process_admin_completions
# NVMe Hotplug {#nvme_hotplug}
At the NVMe driver level, we provide the following support for Hotplug:
1. Hotplug events detection:
The user of the NVMe library can call spdk_nvme_probe() periodically to detect
hotplug events. The probe_cb, followed by the attach_cb, will be called for each
new device detected. The user may optionally also provide a remove_cb that will be
called if a previously attached NVMe device is no longer present on the system.
All subsequent I/O to the removed device will return an error.
2. Hot remove NVMe with IO loads:
When a device is hot removed while I/O is occurring, all access to the PCI BAR will
result in a SIGBUS error. The NVMe driver automatically handles this case by installing
a SIGBUS handler and remapping the PCI BAR to a new, placeholder memory location.
This means I/O in flight during a hot remove will complete with an appropriate error
code and will not crash the application.
@sa spdk_nvme_probe