numam-spdk/doc/nvme.md
Daniel Verkamp 1a787169b2 doc: flatten Markdown docs into chapter-per-file
Doxygen interprets each Markdown input file as a separate section
(chapter).  Concatenate all of the .md files in directories into a
single file per section to get a correctly-nested table of contents.

In particular, this matters for the navigation in the PDF output.

Change-Id: I778849d89da9a308136e43ac6cb630c4c2bbb3a5
Signed-off-by: Daniel Verkamp <daniel.verkamp@intel.com>
2017-04-30 07:58:26 -07:00


NVMe Driver

Public Interface

  • spdk/nvme.h

Key Functions

Function                                    | Description
------------------------------------------- | -----------
spdk_nvme_probe()                           | @copybrief spdk_nvme_probe()
spdk_nvme_ns_cmd_read()                     | @copybrief spdk_nvme_ns_cmd_read()
spdk_nvme_ns_cmd_write()                    | @copybrief spdk_nvme_ns_cmd_write()
spdk_nvme_ns_cmd_dataset_management()       | @copybrief spdk_nvme_ns_cmd_dataset_management()
spdk_nvme_ns_cmd_flush()                    | @copybrief spdk_nvme_ns_cmd_flush()
spdk_nvme_qpair_process_completions()       | @copybrief spdk_nvme_qpair_process_completions()
spdk_nvme_ctrlr_cmd_admin_raw()             | @copybrief spdk_nvme_ctrlr_cmd_admin_raw()
spdk_nvme_ctrlr_process_admin_completions() | @copybrief spdk_nvme_ctrlr_process_admin_completions()

NVMe Initialization

\msc

app [label="Application"], nvme [label="NVMe Driver"];
app=>nvme [label="nvme_probe()"];
app<<nvme [label="probe_cb(pci_dev)"];
nvme=>nvme [label="nvme_attach(devhandle)"];
nvme=>nvme [label="nvme_ctrlr_start(nvme_controller ptr)"];
nvme=>nvme [label="identify controller"];
nvme=>nvme [label="create queue pairs"];
nvme=>nvme [label="identify namespace(s)"];
app<<nvme [label="attach_cb(pci_dev, nvme_controller)"];
app=>app [label="create block devices based on controller's namespaces"];

\endmsc
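
The callback flow in the diagram can be modeled in a few lines of standalone C. This is only a sketch: `mock_probe`, `mock_dev`, `mock_ctrlr`, and both callback types are hypothetical stand-ins, not the SPDK API, and the driver-internal steps (attach, start, identify, queue pair creation) are collapsed into a single comment. It assumes that the probe callback returns true to request attachment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a PCI device and an attached controller. */
struct mock_dev {
    int id;
    int num_ns;
};

struct mock_ctrlr {
    int dev_id;
    int num_ns;
};

typedef bool (*mock_probe_cb)(void *ctx, const struct mock_dev *dev);
typedef void (*mock_attach_cb)(void *ctx, const struct mock_dev *dev,
                               const struct mock_ctrlr *ctrlr);

/* Offer each device to the application; attach only the accepted ones.
 * Returns the number of controllers attached. */
static int mock_probe(const struct mock_dev *devs, size_t n, void *ctx,
                      mock_probe_cb probe_cb, mock_attach_cb attach_cb)
{
    int attached = 0;

    assert(probe_cb != NULL && attach_cb != NULL);
    for (size_t i = 0; i < n; i++) {
        if (!probe_cb(ctx, &devs[i])) {
            continue; /* application declined this device */
        }
        /* Driver-internal steps from the diagram would run here:
         * attach, start, identify controller, create queue pairs,
         * identify namespaces. */
        struct mock_ctrlr ctrlr = { devs[i].id, devs[i].num_ns };
        attach_cb(ctx, &devs[i], &ctrlr);
        attached++;
    }
    return attached;
}

/* Example application callbacks (assumed policy: skip empty devices). */
static bool accept_if_has_ns(void *ctx, const struct mock_dev *dev)
{
    (void)ctx;
    return dev->num_ns > 0;
}

static void count_namespaces(void *ctx, const struct mock_dev *dev,
                             const struct mock_ctrlr *ctrlr)
{
    (void)dev;
    *(int *)ctx += ctrlr->num_ns; /* one block device per namespace */
}
```

An application would pass its own context pointer through `ctx`; here it is an `int` that accumulates the number of block devices to create.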

NVMe I/O Submission

I/O is submitted to an NVMe namespace using the spdk_nvme_ns_cmd_xxx functions defined in nvme_ns_cmd.c. The NVMe driver submits the I/O request as an NVMe submission queue entry on the queue pair specified in the command. The application must poll for I/O completion on each queue pair with outstanding I/O to receive completion callbacks.

@sa spdk_nvme_ns_cmd_read, spdk_nvme_ns_cmd_write, spdk_nvme_ns_cmd_dataset_management, spdk_nvme_ns_cmd_flush, spdk_nvme_qpair_process_completions

NVMe Asynchronous Completion

The userspace NVMe driver follows an asynchronous polled model for I/O completion.

I/O commands

The application may submit I/O from one or more threads on one or more queue pairs and must call spdk_nvme_qpair_process_completions() for each queue pair that submitted I/O.

When the application calls spdk_nvme_qpair_process_completions(), if the NVMe driver detects completed I/Os that were submitted on that queue, it will invoke the registered callback function for each I/O within the context of spdk_nvme_qpair_process_completions().
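
The polled model described above can be illustrated with a small standalone mock. Everything here (`mock_qpair`, `mock_submit`, `mock_device_complete`, `mock_process_completions`) is a hypothetical stand-in rather than the driver's implementation. The point it shows is that submission only records a callback, and the callback runs later, inside the polling call, on the polling thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MOCK_QDEPTH 8

typedef void (*mock_cmd_cb)(void *cb_arg, int status);

/* Hypothetical queue pair: a fixed table of outstanding requests. */
struct mock_req {
    bool in_use;
    bool done;
    int status;
    mock_cmd_cb cb;
    void *cb_arg;
};

struct mock_qpair {
    struct mock_req reqs[MOCK_QDEPTH];
};

/* Submit: record the callback and return immediately.
 * Returns the slot used, or -1 if the queue is full. */
static int mock_submit(struct mock_qpair *q, mock_cmd_cb cb, void *cb_arg)
{
    for (int i = 0; i < MOCK_QDEPTH; i++) {
        if (!q->reqs[i].in_use) {
            q->reqs[i] = (struct mock_req){ true, false, 0, cb, cb_arg };
            return i;
        }
    }
    return -1;
}

/* Stand-in for the device finishing a command. No callback runs here. */
static void mock_device_complete(struct mock_qpair *q, int slot, int status)
{
    assert(slot >= 0 && slot < MOCK_QDEPTH && q->reqs[slot].in_use);
    q->reqs[slot].done = true;
    q->reqs[slot].status = status;
}

/* Poll: invoke callbacks for finished requests in the caller's context.
 * Returns the number of completions processed. */
static int mock_process_completions(struct mock_qpair *q)
{
    int n = 0;

    for (int i = 0; i < MOCK_QDEPTH; i++) {
        struct mock_req *r = &q->reqs[i];
        if (r->in_use && r->done) {
            r->in_use = false; /* free the slot before calling back */
            r->cb(r->cb_arg, r->status);
            n++;
        }
    }
    return n;
}

/* Example callback: count successful completions. */
static void count_done(void *cb_arg, int status)
{
    if (status == 0) {
        (*(int *)cb_arg)++;
    }
}
```

If the application never calls the polling function, the completion is never delivered, which is why every queue pair with outstanding I/O must be polled.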

Admin commands

The application may submit admin commands from one or more threads and must call spdk_nvme_ctrlr_process_admin_completions() from at least one thread to receive admin command completions. The thread that processes admin completions need not be the same thread that submitted the admin commands.

When the application calls spdk_nvme_ctrlr_process_admin_completions(), if the NVMe driver detects completed admin commands submitted from any thread, it will invoke the registered callback function for each command within the context of spdk_nvme_ctrlr_process_admin_completions().

It is the application's responsibility to manage the order of submitted admin commands. If certain admin commands must be submitted while no other commands are outstanding, it is the application's responsibility to enforce this rule using its own synchronization method.
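
One possible shape for such application-side ordering is a small gate that tracks outstanding admin commands. This is a hypothetical sketch, not something the driver provides: `admin_gate` and its functions are invented for illustration, and the version below is single-threaded. An application that submits admin commands from several threads would need to protect the gate with its own lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical application-side gate; not part of the NVMe driver. */
struct admin_gate {
    int outstanding;       /* admin commands submitted but not completed */
    bool exclusive_active; /* an exclusive command is in flight */
};

/* Ordinary admin command: allowed unless an exclusive command is running. */
static bool admin_gate_try_begin(struct admin_gate *g)
{
    if (g->exclusive_active) {
        return false;
    }
    g->outstanding++;
    return true;
}

/* Exclusive admin command: allowed only when nothing else is outstanding. */
static bool admin_gate_try_begin_exclusive(struct admin_gate *g)
{
    if (g->exclusive_active || g->outstanding > 0) {
        return false;
    }
    g->exclusive_active = true;
    g->outstanding = 1;
    return true;
}

/* Call from the completion callback of every gated admin command. */
static void admin_gate_end(struct admin_gate *g)
{
    assert(g->outstanding > 0);
    g->outstanding--;
    if (g->outstanding == 0) {
        g->exclusive_active = false;
    }
}
```

A caller that gets `false` back would defer the command and retry after the next round of admin completion processing.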

NVMe over Fabrics Host Support

The NVMe driver supports connecting to remote NVMe-oF targets and interacting with them in the same manner as local NVMe controllers.

Specifying Remote NVMe over Fabrics Targets

The method for connecting to a remote NVMe-oF target is very similar to the normal enumeration process for local PCIe-attached NVMe devices. To connect to a remote NVMe over Fabrics subsystem, the user may call spdk_nvme_probe() with the trid parameter specifying the address of the NVMe-oF target. The caller may fill out the spdk_nvme_transport_id structure manually or use the spdk_nvme_transport_id_parse() function to convert a human-readable string representation into the required structure.

The spdk_nvme_transport_id may contain the address of a discovery service or a single NVM subsystem. If a discovery service address is specified, the NVMe library will call the spdk_nvme_probe() probe_cb for each discovered NVM subsystem, which allows the user to select the desired subsystems to be attached. Alternatively, if the address specifies a single NVM subsystem directly, the NVMe library will call probe_cb for just that subsystem; this allows the user to skip the discovery step and connect directly to a subsystem with a known address.
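
To make the "human-readable string representation" concrete, the following standalone sketch parses a whitespace-separated list of `key:value` pairs into a structure. It is a simplified illustration, not the spdk_nvme_transport_id_parse() implementation: `mock_trid`, its field sizes, and `mock_trid_parse` are hypothetical, and the `key:value` string layout is assumed rather than taken from the text above. Only the first colon in a token separates key from value, so values that contain colons (such as an NQN) survive intact.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified mirror of a transport ID; sizes are arbitrary. */
struct mock_trid {
    char trtype[16];
    char adrfam[16];
    char traddr[64];
    char trsvcid[16];
    char subnqn[224];
};

/* Parse "key:value key:value ...".
 * Returns 0 on success, -1 on a malformed token, unknown key,
 * or a value that does not fit its field. */
static int mock_trid_parse(struct mock_trid *t, const char *str)
{
    assert(t != NULL && str != NULL);
    memset(t, 0, sizeof(*t));

    struct {
        const char *key;
        char *dst;
        size_t cap;
    } fields[] = {
        { "trtype",  t->trtype,  sizeof(t->trtype)  },
        { "adrfam",  t->adrfam,  sizeof(t->adrfam)  },
        { "traddr",  t->traddr,  sizeof(t->traddr)  },
        { "trsvcid", t->trsvcid, sizeof(t->trsvcid) },
        { "subnqn",  t->subnqn,  sizeof(t->subnqn)  },
    };
    const size_t nfields = sizeof(fields) / sizeof(fields[0]);
    const char *p = str;

    while (*p != '\0') {
        p += strspn(p, " \t\n");              /* skip whitespace */
        if (*p == '\0') {
            break;
        }
        size_t tok = strcspn(p, " \t\n");     /* length of this token */
        const char *colon = memchr(p, ':', tok);
        if (colon == NULL || colon == p) {
            return -1;                        /* no key, or no separator */
        }
        size_t klen = (size_t)(colon - p);
        size_t vlen = tok - klen - 1;
        size_t i;
        for (i = 0; i < nfields; i++) {
            if (strlen(fields[i].key) == klen &&
                memcmp(p, fields[i].key, klen) == 0) {
                break;
            }
        }
        if (i == nfields || vlen >= fields[i].cap) {
            return -1;                        /* unknown key or value too long */
        }
        memcpy(fields[i].dst, colon + 1, vlen);
        fields[i].dst[vlen] = '\0';
        p += tok;
    }
    return 0;
}
```

For example, a made-up target string such as `trtype:RDMA adrfam:IPv4 traddr:192.168.100.8 trsvcid:4420 subnqn:nqn.2016-06.io.spdk:cnode1` fills all five fields, while a token without a colon is rejected.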

NVMe Multi Process

This capability enables the SPDK NVMe driver to support multiple processes accessing the same NVMe device. The NVMe driver allocates critical structures from shared memory, so that each process can map that memory and create its own queue pairs or share the admin queue. There is a limited number of I/O queue pairs per NVMe controller.

The primary motivation for this feature is to support management tools that can attach to long running applications, perform some maintenance work or gather information, and then detach.

Configuration

DPDK EAL allows different types of processes to be spawned, each with different permissions on the hugepage memory used by the applications.

There are two types of processes:

  1. a primary process which initializes the shared memory and has full privileges and
  2. a secondary process which can attach to the primary process by mapping its shared memory regions and perform NVMe operations including creating queue pairs.

This feature is enabled by default and is controlled by selecting a value for the shared memory group ID. This ID is a positive integer and two applications with the same shared memory group ID will share memory. The first application with a given shared memory group ID will be considered the primary and all others secondary.

Example: identical shm_id and non-overlapping core masks

./perf options [AIO device(s)]...
	[-c core mask for I/O submission/completion]
	[-i shared memory group ID]

./perf -q 1 -s 4096 -w randread -c 0x1 -t 60 -i 1
./perf -q 8 -s 131072 -w write -c 0x10 -t 60 -i 1
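
The two rules this example relies on (the first process with a given shared memory group ID is the primary, and processes that share memory use disjoint core masks) can be checked with a few lines of standalone C. The `launch` structure and both helpers are hypothetical and exist only to illustrate the rules; they are not part of SPDK or DPDK.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum proc_role { ROLE_PRIMARY, ROLE_SECONDARY };

/* Hypothetical record of one application launch. */
struct launch {
    int shm_id;         /* value passed with -i */
    uint64_t core_mask; /* value passed with -c */
};

/* The first launch with a given shm_id is the primary; later ones are
 * secondaries. Launches are assumed to be listed in start order. */
static enum proc_role launch_role(const struct launch *l, size_t idx)
{
    assert(l[idx].shm_id > 0); /* the ID is a positive integer */
    for (size_t i = 0; i < idx; i++) {
        if (l[i].shm_id == l[idx].shm_id) {
            return ROLE_SECONDARY;
        }
    }
    return ROLE_PRIMARY;
}

/* True if no two launches that share memory also share a core. */
static bool launches_valid(const struct launch *l, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {
            if (l[i].shm_id == l[j].shm_id &&
                (l[i].core_mask & l[j].core_mask) != 0) {
                return false;
            }
        }
    }
    return true;
}
```

With the two commands above, `{1, 0x1}` is the primary and `{1, 0x10}` is a secondary, and the masks do not intersect because `0x1 & 0x10` is zero.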

Scalability and Performance

To maximize the I/O bandwidth of an NVMe device, ensure that each application has its own queue pairs.

The optimal threading model for SPDK is one thread per core, regardless of which process each thread belongs to in a multi-process environment. To achieve maximum performance, each thread should also have its own I/O queue pair. Applications that share memory should be given core masks that do not overlap.

However, admin commands may have some performance impact, as there is only one admin queue pair per NVMe SSD. The NVMe driver will automatically take a cross-process capable lock to enable sharing of the admin queue pair. Further, when each process polls the admin queue for completions, it will only see completions for commands that it originated.

Limitations

  1. Two processes sharing memory may not share any cores in their core masks.
  2. If a primary process exits while secondary processes are still running, those processes will continue to run. However, a new primary process cannot be created.
  3. Applications are responsible for coordinating access to logical blocks.

@sa spdk_nvme_probe, spdk_nvme_ctrlr_process_admin_completions

NVMe Hotplug

At the NVMe driver level, we provide the following support for Hotplug:

  1. Hotplug event detection: The user of the NVMe library can call spdk_nvme_probe() periodically to detect hotplug events. The probe_cb, followed by the attach_cb, will be called for each new device detected. The user may optionally also provide a remove_cb that will be called if a previously attached NVMe device is no longer present on the system. All subsequent I/O to the removed device will return an error.

  2. Hot removal of NVMe devices under I/O load: When a device is hot removed while I/O is occurring, all access to the PCI BAR will result in a SIGBUS error. The NVMe driver automatically handles this case by installing a SIGBUS handler and remapping the PCI BAR to a new, placeholder memory location. This means I/O in flight during a hot remove will complete with an appropriate error code and will not crash the application.
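
The periodic detection in item 1 amounts to comparing the set of devices seen on the previous probe with the set present now. The standalone sketch below models that comparison with bitmasks of device slots. It is purely illustrative: `mock_hotplug_poll`, the slot numbering, and the single-argument callbacks are hypothetical and much simpler than the real probe, attach, and remove callbacks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*slot_cb)(void *ctx, int slot);

/* Compare the previously attached slots with the slots present now.
 * Calls attach_cb for each new slot and remove_cb for each missing one.
 * Returns the new attached mask, to be passed to the next poll. */
static uint32_t mock_hotplug_poll(uint32_t attached, uint32_t present,
                                  void *ctx, slot_cb attach_cb,
                                  slot_cb remove_cb)
{
    uint32_t added = present & ~attached;
    uint32_t removed = attached & ~present;

    assert(attach_cb != NULL && remove_cb != NULL);
    for (int s = 0; s < 32; s++) {
        uint32_t bit = UINT32_C(1) << s;
        if (added & bit) {
            attach_cb(ctx, s);
        }
        if (removed & bit) {
            remove_cb(ctx, s);
        }
    }
    return present;
}

/* Example callbacks: ctx points to int[2] = { attach count, remove count }. */
static void on_attach(void *ctx, int slot)
{
    (void)slot;
    ((int *)ctx)[0]++;
}

static void on_remove(void *ctx, int slot)
{
    (void)slot;
    ((int *)ctx)[1]++;
}
```

An application would call the poll function from a timer or its main loop and stop submitting I/O to a device once its remove callback has fired.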

@sa spdk_nvme_probe