Doxygen interprets each Markdown input file as a separate section (chapter). Concatenate all of the .md files in directories into a single file per section to get a correctly-nested table of contents. In particular, this matters for the navigation in the PDF output. Change-Id: I778849d89da9a308136e43ac6cb630c4c2bbb3a5 Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
8.5 KiB
NVMe Driver
Public Interface
- spdk/nvme.h
Key Functions
Function | Description |
---|---|
spdk_nvme_probe() | @copybrief spdk_nvme_probe() |
spdk_nvme_ns_cmd_read() | @copybrief spdk_nvme_ns_cmd_read() |
spdk_nvme_ns_cmd_write() | @copybrief spdk_nvme_ns_cmd_write() |
spdk_nvme_ns_cmd_dataset_management() | @copybrief spdk_nvme_ns_cmd_dataset_management() |
spdk_nvme_ns_cmd_flush() | @copybrief spdk_nvme_ns_cmd_flush() |
spdk_nvme_qpair_process_completions() | @copybrief spdk_nvme_qpair_process_completions() |
spdk_nvme_ctrlr_cmd_admin_raw() | @copybrief spdk_nvme_ctrlr_cmd_admin_raw() |
spdk_nvme_ctrlr_process_admin_completions() | @copybrief spdk_nvme_ctrlr_process_admin_completions() |
NVMe Initialization
\msc
app [label="Application"], nvme [label="NVMe Driver"];
app=>nvme [label="nvme_probe()"];
app<<nvme [label="probe_cb(pci_dev)"];
nvme=>nvme [label="nvme_attach(devhandle)"];
nvme=>nvme [label="nvme_ctrlr_start(nvme_controller ptr)"];
nvme=>nvme [label="identify controller"];
nvme=>nvme [label="create queue pairs"];
nvme=>nvme [label="identify namespace(s)"];
app<<nvme [label="attach_cb(pci_dev, nvme_controller)"];
app=>app [label="create block devices based on controller's namespaces"];
\endmsc
NVMe I/O Submission
I/O is submitted to an NVMe namespace using nvme_ns_cmd_xxx functions defined in nvme_ns_cmd.c. The NVMe driver submits the I/O request as an NVMe submission queue entry on the queue pair specified in the command. The application must poll for I/O completion on each queue pair with outstanding I/O to receive completion callbacks.
@sa spdk_nvme_ns_cmd_read, spdk_nvme_ns_cmd_write, spdk_nvme_ns_cmd_dataset_management, spdk_nvme_ns_cmd_flush, spdk_nvme_qpair_process_completions
NVMe Asynchronous Completion
The userspace NVMe driver follows an asynchronous polled model for I/O completion.
I/O commands
The application may submit I/O from one or more threads on one or more queue pairs and must call spdk_nvme_qpair_process_completions() for each queue pair that submitted I/O.
When the application calls spdk_nvme_qpair_process_completions(), if the NVMe driver detects completed I/Os that were submitted on that queue, it will invoke the registered callback function for each I/O within the context of spdk_nvme_qpair_process_completions().
Admin commands
The application may submit admin commands from one or more threads and must call spdk_nvme_ctrlr_process_admin_completions() from at least one thread to receive admin command completions. The thread that processes admin completions need not be the same thread that submitted the admin commands.
When the application calls spdk_nvme_ctrlr_process_admin_completions(), if the NVMe driver detects completed admin commands submitted from any thread, it will invote the registered callback function for each command within the context of spdk_nvme_ctrlr_process_admin_completions().
It is the application's responsibility to manage the order of submitted admin commands. If certain admin commands must be submitted while no other commands are outstanding, it is the application's responsibility to enforce this rule using its own synchronization method.
NVMe over Fabrics Host Support
The NVMe driver supports connecting to remote NVMe-oF targets and interacting with them in the same manner as local NVMe controllers.
Specifying Remote NVMe over Fabrics Targets
The method for connecting to a remote NVMe-oF target is very similar
to the normal enumeration process for local PCIe-attached NVMe devices.
To connect to a remote NVMe over Fabrics subsystem, the user may call
spdk_nvme_probe() with the trid
parameter specifying the address of
the NVMe-oF target.
The caller may fill out the spdk_nvme_transport_id structure manually
or use the spdk_nvme_transport_id_parse() function to convert a
human-readable string representation into the required structure.
The spdk_nvme_transport_id may contain the address of a discovery service
or a single NVM subsystem. If a discovery service address is specified,
the NVMe library will call the spdk_nvme_probe() probe_cb
for each
discovered NVM subsystem, which allows the user to select the desired
subsystems to be attached. Alternatively, if the address specifies a
single NVM subsystem directly, the NVMe library will call probe_cb
for just that subsystem; this allows the user to skip the discovery step
and connect directly to a subsystem with a known address.
NVMe Multi Process
This capability enables the SPDK NVMe driver to support multiple processes accessing the same NVMe device. The NVMe driver allocates critical structures from shared memory, so that each process can map that memory and create its own queue pairs or share the admin queue. There is a limited number of I/O queue pairs per NVMe controller.
The primary motivation for this feature is to support management tools that can attach to long running applications, perform some maintenance work or gather information, and then detach.
Configuration
DPDK EAL allows different types of processes to be spawned, each with different permissions on the hugepage memory used by the applications.
There are two types of processes:
- a primary process which initializes the shared memory and has full privileges and
- a secondary process which can attach to the primary process by mapping its shared memory regions and perform NVMe operations including creating queue pairs.
This feature is enabled by default and is controlled by selecting a value for the shared memory group ID. This ID is a positive integer and two applications with the same shared memory group ID will share memory. The first application with a given shared memory group ID will be considered the primary and all others secondary.
Example: identical shm_id and non-overlapping core masks
./perf options [AIO device(s)]...
[-c core mask for I/O submission/completion]
[-i shared memory group ID]
./perf -q 1 -s 4096 -w randread -c 0x1 -t 60 -i 1
./perf -q 8 -s 131072 -w write -c 0x10 -t 60 -i 1
Scalability and Performance
To maximize the I/O bandwidth of an NVMe device, ensure that each application has its own queue pairs.
The optimal threading model for SPDK is one thread per core, regardless of which processes that thread belongs to in the case of multi-process environment. To achieve maximum performance, each thread should also have its own I/O queue pair. Applications that share memory should be given core masks that do not overlap.
However, admin commands may have some performance impact as there is only one admin queue pair per NVMe SSD. The NVMe driver will automatically take a cross-process capable lock to enable the sharing of admin queue pair. Further, when each process polls the admin queue for completions, it will only see completions for commands that it originated.
Limitations
- Two processes sharing memory may not share any cores in their core mask.
- If a primary process exits while secondary processes are still running, those processes will continue to run. However, a new primary process cannot be created.
- Applications are responsible for coordinating access to logical blocks.
@sa spdk_nvme_probe, spdk_nvme_ctrlr_process_admin_completions
NVMe Hotplug
At the NVMe driver level, we provide the following support for Hotplug:
-
Hotplug events detection: The user of the NVMe library can call spdk_nvme_probe() periodically to detect hotplug events. The probe_cb, followed by the attach_cb, will be called for each new device detected. The user may optionally also provide a remove_cb that will be called if a previously attached NVMe device is no longer present on the system. All subsequent I/O to the removed device will return an error.
-
Hot remove NVMe with IO loads: When a device is hot removed while I/O is occurring, all access to the PCI BAR will result in a SIGBUS error. The NVMe driver automatically handles this case by installing a SIGBUS handler and remapping the PCI BAR to a new, placeholder memory location. This means I/O in flight during a hot remove will complete with an appropriate error code and will not crash the application.
@sa spdk_nvme_probe