
# Writing a Custom Block Device Module {#bdev_module}

## Target Audience

This programming guide is intended for developers authoring their own block
device modules to integrate with SPDK's bdev layer. For a guide on how to use
the bdev layer, see @ref bdev_pg.

## Introduction

A block device module is SPDK's equivalent of a device driver in a traditional
operating system. The module provides a set of function pointers that are
called to service block device I/O requests. SPDK provides a number of block
device modules including NVMe, RAM-disk, and Ceph RBD. However, some users
will want to write their own to interact with either custom hardware or an
existing storage software stack. This guide demonstrates how to write such a
module.

## Creating A New Module

Block device modules are located in subdirectories under lib/bdev today. It is not
currently possible to place the code for a bdev module elsewhere, but updates
to the build system could be made to enable this in the future. To create a
module, add a new directory with a single C file and a Makefile. A great
starting point is to copy the existing 'null' bdev module.

The primary interface that bdev modules will interact with is in
include/spdk/bdev_module.h. That header defines a macro,
SPDK_BDEV_MODULE_REGISTER, that registers a new bdev module. The macro takes
as an argument a pointer to the spdk_bdev_module structure that describes the
module being registered.

The spdk_bdev_module structure describes the module's properties: its
initialization (`module_init`) and teardown (`module_fini`) functions, a
function that returns the context size (`get_ctx_size`), which is the amount
of scratch space allocated in each I/O request for use by this module, and
callbacks that are called each time a new bdev is registered by another module
(`examine_config` and `examine_disk`). See the documentation of
struct spdk_bdev_module for more details.
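
To make the role of `get_ctx_size` more concrete, the sketch below uses a cut-down stand-in for the module structure that holds only the three callbacks named above. Everything prefixed with `demo_` is hypothetical and exists only for illustration; a real module would fill in struct spdk_bdev_module from include/spdk/bdev_module.h and register it with SPDK_BDEV_MODULE_REGISTER instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-I/O scratch space that this module wants in every request. */
struct demo_io_ctx {
	uint64_t submit_tick;	/* when the request was handed to the backend */
	int retries;		/* how many times the request was resubmitted */
};

/*
 * Stand-in for the module structure. Only the callbacks discussed in the
 * text are modeled; this is NOT the definition from bdev_module.h.
 */
struct demo_module {
	const char *name;
	int (*module_init)(void);
	void (*module_fini)(void);
	int (*get_ctx_size)(void);
};

static int demo_initialized;

static int
demo_init(void)
{
	demo_initialized = 1;
	return 0;	/* zero reports success in this sketch */
}

static void
demo_fini(void)
{
	demo_initialized = 0;
}

/* Report how many bytes of scratch space each I/O request should carry. */
static int
demo_get_ctx_size(void)
{
	return (int)sizeof(struct demo_io_ctx);
}

static struct demo_module demo_if = {
	.name = "demo",
	.module_init = demo_init,
	.module_fini = demo_fini,
	.get_ctx_size = demo_get_ctx_size,
};
```

The point of the sketch is that the module only reports a size; the scratch space itself is allocated per I/O request on the module's behalf.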

## Creating Bdevs

New bdevs are created within the module by calling spdk_bdev_register(). The
module must allocate a struct spdk_bdev, fill it out appropriately, and pass
it to the register call. The most important field to fill out is `fn_table`,
which points at this data structure:

~~~{.c}
/*
 * Function table for a block device backend.
 *
 * The backend block device function table provides a set of APIs to allow
 * communication with a backend. The main commands are read/write API
 * calls for I/O via submit_request.
 */
struct spdk_bdev_fn_table {
	/* Destroy the backend block device object */
	int (*destruct)(void *ctx);

	/* Process the IO. */
	void (*submit_request)(struct spdk_io_channel *ch, struct spdk_bdev_io *);

	/* Check if the block device supports a specific I/O type. */
	bool (*io_type_supported)(void *ctx, enum spdk_bdev_io_type);

	/* Get an I/O channel for the specific bdev for the calling thread. */
	struct spdk_io_channel *(*get_io_channel)(void *ctx);

	/*
	 * Output driver-specific configuration to a JSON stream. Optional - may be NULL.
	 *
	 * The JSON write context will be initialized with an open object, so the bdev
	 * driver should write a name (based on the driver name) followed by a JSON value
	 * (most likely another nested object).
	 */
	int (*dump_config_json)(void *ctx, struct spdk_json_write_ctx *w);

	/* Get spin-time per I/O channel in microseconds.
	 *  Optional - may be NULL.
	 */
	uint64_t (*get_spin_time)(struct spdk_io_channel *ch);
};
~~~

The bdev module must implement these function callbacks, except for those
marked optional in the comments above.

The `destruct` function is called to tear down the device when the system no
longer needs it. What `destruct` does is up to the module - it may just be
freeing memory or it may be shutting down a piece of hardware.

The `io_type_supported` function returns whether a particular I/O type is
supported. The available I/O types are:

~~~{.c}
/** bdev I/O type */
enum spdk_bdev_io_type {
	SPDK_BDEV_IO_TYPE_INVALID = 0,
	SPDK_BDEV_IO_TYPE_READ,
	SPDK_BDEV_IO_TYPE_WRITE,
	SPDK_BDEV_IO_TYPE_UNMAP,
	SPDK_BDEV_IO_TYPE_FLUSH,
	SPDK_BDEV_IO_TYPE_RESET,
	SPDK_BDEV_IO_TYPE_NVME_ADMIN,
	SPDK_BDEV_IO_TYPE_NVME_IO,
	SPDK_BDEV_IO_TYPE_NVME_IO_MD,
	SPDK_BDEV_IO_TYPE_WRITE_ZEROES,
};
~~~

For the simplest bdev modules, only `SPDK_BDEV_IO_TYPE_READ` and
`SPDK_BDEV_IO_TYPE_WRITE` are necessary. `SPDK_BDEV_IO_TYPE_UNMAP` is often
referred to as "trim" or "deallocate", and is a request to mark a set of
blocks as no longer containing valid data. `SPDK_BDEV_IO_TYPE_FLUSH` is a
request to make all previously completed writes durable. Many devices do not
require flushes. `SPDK_BDEV_IO_TYPE_WRITE_ZEROES` is just like a regular
write, but does not provide a data buffer (it would have just contained all
0's). If it isn't supported, the generic bdev code is capable of emulating it
by sending regular write requests.
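
A minimal sketch of an `io_type_supported` callback for such a read/write-only module might look like the following. The enum is a local copy of the one shown above so that the example builds without the SPDK headers, and `demo_io_type_supported` is a hypothetical name; a real module would include the SPDK headers rather than redeclare the enum.

```c
#include <stdbool.h>
#include <stddef.h>

/* Local copy of the enum shown above, so the sketch builds on its own. */
enum spdk_bdev_io_type {
	SPDK_BDEV_IO_TYPE_INVALID = 0,
	SPDK_BDEV_IO_TYPE_READ,
	SPDK_BDEV_IO_TYPE_WRITE,
	SPDK_BDEV_IO_TYPE_UNMAP,
	SPDK_BDEV_IO_TYPE_FLUSH,
	SPDK_BDEV_IO_TYPE_RESET,
	SPDK_BDEV_IO_TYPE_NVME_ADMIN,
	SPDK_BDEV_IO_TYPE_NVME_IO,
	SPDK_BDEV_IO_TYPE_NVME_IO_MD,
	SPDK_BDEV_IO_TYPE_WRITE_ZEROES,
};

/* Hypothetical callback for a module that handles only reads and writes. */
static bool
demo_io_type_supported(void *ctx, enum spdk_bdev_io_type io_type)
{
	(void)ctx;	/* this sketch keeps no per-device state */

	switch (io_type) {
	case SPDK_BDEV_IO_TYPE_READ:
	case SPDK_BDEV_IO_TYPE_WRITE:
		return true;
	default:
		/* Everything else, including WRITE_ZEROES, is reported as unsupported. */
		return false;
	}
}
```

Returning `false` from the `default` branch keeps the callback conservative: an I/O type is advertised only once the module explicitly lists it.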

`SPDK_BDEV_IO_TYPE_RESET` is a request to abort all I/O and return the
underlying device to its initial state. Do not complete the reset request
until all I/O has been completed in some way.
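
One way to honor that rule is to count outstanding I/O and hold the reset until the count reaches zero. The sketch below shows only that bookkeeping, using a hypothetical `demo_dev` structure that is not part of SPDK. It does not abort anything; it simply waits for in-flight I/O to drain, and a real module would complete the reset request at the marked point.

```c
#include <stdbool.h>

/* Hypothetical per-device bookkeeping; not an SPDK structure. */
struct demo_dev {
	int outstanding;	/* I/O submitted but not yet completed */
	bool reset_pending;	/* a reset request is waiting for I/O to drain */
	bool reset_done;	/* the reset request has been completed */
};

/* Complete the pending reset only once no I/O remains in flight. */
static void
demo_try_finish_reset(struct demo_dev *dev)
{
	if (dev->reset_pending && dev->outstanding == 0) {
		dev->reset_pending = false;
		dev->reset_done = true;	/* a real module would complete the reset here */
	}
}

static void
demo_io_submitted(struct demo_dev *dev)
{
	dev->outstanding++;
}

static void
demo_io_finished(struct demo_dev *dev)
{
	dev->outstanding--;
	demo_try_finish_reset(dev);
}

/* Handle a reset request: finish immediately if idle, otherwise defer. */
static void
demo_reset(struct demo_dev *dev)
{
	dev->reset_pending = true;
	dev->reset_done = false;
	demo_try_finish_reset(dev);
}
```

With two requests in flight, calling `demo_reset()` leaves `reset_done` false until the second `demo_io_finished()` call.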

`SPDK_BDEV_IO_TYPE_NVME_ADMIN`, `SPDK_BDEV_IO_TYPE_NVME_IO`, and
`SPDK_BDEV_IO_TYPE_NVME_IO_MD` are all mechanisms for passing raw NVMe
commands through the SPDK bdev layer. They're strictly optional, and it
probably only makes sense to implement those if the backing storage device is
capable of handling NVMe commands.

The `get_io_channel` function should return an I/O channel. For a detailed
explanation of I/O channels, see @ref concurrency. The generic bdev layer will
call `get_io_channel` once per thread, cache the result, and pass that
result to `submit_request`. It always passes the channel that belongs to the
thread on which it calls `submit_request`.

The `submit_request` function is called to actually submit I/O requests to the
block device. Once the I/O request is completed, the module must call
spdk_bdev_io_complete(). The I/O does not have to finish within the calling
context of `submit_request`.
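
The deferred-completion pattern can be sketched as follows. The types and functions here are stand-ins invented for illustration: `demo_io` plays the role of struct spdk_bdev_io, `demo_channel` the per-channel state, and `demo_io_complete()` the call to spdk_bdev_io_complete(). None of them are SPDK definitions. The submit function only queues the request; a separate function, called later, completes it.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for an I/O request; the real type is struct spdk_bdev_io. */
struct demo_io {
	struct demo_io *next;
	bool completed;
	bool success;
};

/* Stand-in for per-channel state: a FIFO of requests still in flight. */
struct demo_channel {
	struct demo_io *head;
	struct demo_io *tail;
};

/* Stand-in for spdk_bdev_io_complete(): records the outcome on the request. */
static void
demo_io_complete(struct demo_io *io, bool success)
{
	io->completed = true;
	io->success = success;
}

/* submit_request analogue: queue the request and return without completing it. */
static void
demo_submit_request(struct demo_channel *ch, struct demo_io *io)
{
	io->next = NULL;
	io->completed = false;
	if (ch->tail != NULL) {
		ch->tail->next = io;
	} else {
		ch->head = io;
	}
	ch->tail = io;
}

/*
 * Called later, outside the submit path: complete every queued request.
 * Returns the number of requests completed.
 */
static int
demo_poll(struct demo_channel *ch)
{
	int count = 0;

	while (ch->head != NULL) {
		struct demo_io *io = ch->head;

		ch->head = io->next;
		if (ch->head == NULL) {
			ch->tail = NULL;
		}
		demo_io_complete(io, true);
		count++;
	}
	return count;
}
```

After two calls to `demo_submit_request()`, neither request is marked completed; both are completed by the next `demo_poll()` call.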

## Creating Virtual Bdevs

Block devices are considered virtual if they handle I/O requests by routing
the I/O to other block devices. The canonical example would be a bdev module
that implements RAID. Virtual bdevs are created in the same way as regular
bdevs, with one additional step. The module can look up the underlying
bdevs it wishes to route I/O to using spdk_bdev_get_by_name(), where the string
name is provided by the user in a configuration file or via an RPC. The module
may then proceed as normal by opening the bdev to obtain a descriptor and
creating I/O channels for the bdev (probably in response to the
`get_io_channel` callback). The final step is to have the module use its open
descriptor to call spdk_bdev_module_claim_bdev(), indicating that it is
consuming the underlying bdev. This prevents other users from opening
descriptors with write permissions. It effectively 'promotes' the descriptor
to write-exclusive and is an operation only available to bdev modules.