numam-spdk/doc/ssd_internals.md

# NAND Flash SSD Internals {#ssd_internals}

Solid State Devices (SSD) are complex devices and their performance depends on
how they're used. The following description is intended to help software
developers understand what is occurring inside the SSD, so that they can come
up with better software designs. It should not be thought of as a strictly
accurate guide to how SSD hardware really works.

 As of this writing, SSDs are generally implemented on top of
 [NAND Flash](https://en.wikipedia.org/wiki/Flash_memory) memory. At a
 very high level, this media has a few important properties:

* The media is grouped onto chips called NAND dies and each die can
  operate in parallel.
* Flipping a bit is a highly asymmetric process. Flipping it one way is
  easy, but flipping it back is quite hard.

NAND Flash media is grouped into large units often referred to as **erase
blocks**. The size of an erase block is highly implementation specific, but
can be thought of as somewhere between 1MiB and 8MiB. For each erase block,
each bit may be written to (i.e. have its bit flipped from 0 to 1) with
bit-granularity once. In order to write to the erase block a second time, the
entire block must be erased (i.e. all bits in the block are flipped back to
0). This is the asymmetry part from above. Erasing a block causes a measurable
amount of wear and each block may only be erased a limited number of times.

SSDs expose an interface to the host system that makes it appear as if the
drive is composed of a set of fixed size **logical blocks** which are usually
512B or 4KiB in size. These blocks are entirely logical constructs of the
device firmware and they do not statically map to a location on the backing
media. Instead, upon each write to a logical block, a new location on the NAND
Flash is selected and written and the mapping of the logical block to its
physical location is updated. The algorithm for choosing this location is a
key part of overall SSD performance and is often called the **flash
translation layer** or FTL. This algorithm must correctly distribute the
blocks to account for wear (called **wear-leveling**) and spread them across
NAND dies to improve total available performance. The simplest model is to
group all of the physical media on each die together using an algorithm
similar to RAID and then write to that set sequentially. Real SSDs are far
more complicated, but this is an excellent simple model for software
developers - imagine they are simply logging to a RAID volume and updating an
in-memory hash-table.

One consequence of the flash translation layer is that logical blocks do not
necessarily correspond to physical locations on the NAND at all times. In
fact, there is a command that clears the translation for a block. In NVMe,
this command is called deallocate, in SCSI it is called unmap, and in SATA it
is called trim. When a user attempts to read a block that doesn't have a
mapping to a physical location, drives will do one of two things:

1. Immediately complete the read request successfully, without performing any
   data transfer. This is acceptable because the data the drive would return
   is no more valid than the data already in the user's data buffer.
2. Return all 0's as the data.

Choice #1 is much more common and performing reads to a fully deallocated
device will often show performance far beyond what the drive claims to be
capable of precisely because it is not actually transferring any data. Write
to all blocks prior to reading them when benchmarking!

As SSDs are written to, the internal log will eventually consume all of the
available erase blocks. In order to continue writing, the SSD must free some
of them. This process is often called **garbage collection**. All SSDs reserve
some number of erase blocks so that they can guarantee there are free erase
blocks available for garbage collection. Garbage collection generally proceeds
by:

1. Selecting a target erase block (a good mental model is that it picks the least recently used erase block)
2. Walking through each entry in the erase block and determining if it is still a valid logical block.
3. Moving valid logical blocks by reading them and writing them to a different erase block (i.e. the current head of the log)
4. Erasing the entire erase block and marking it available for use.

Garbage collection is clearly far more efficient when step #3 can be skipped
because the erase block is already empty. There are two ways to make it much
more likely that step #3 can be skipped. The first is that SSDs reserve
additional erase blocks beyond their reported capacity (called
**over-provisioning**), so that statistically its much more likely that an
erase block will not contain valid data. The second is software can write to
the blocks on the device in sequential order in a circular pattern, throwing
away old data when it is no longer needed. In this case, the software
guarantees that the least recently used erase blocks will not contain any
valid data that must be moved.

The amount of over-provisioning a device has can dramatically impact the
performance on random read and write workloads, if the workload is filling up
the entire device. However, the same effect can typically be obtained by
simply reserving a given amount of space on the device in software. This
understanding is critical to producing consistent benchmarks. In particular,
if background garbage collection cannot keep up and the drive must switch to
on-demand garbage collection, the latency of writes will increase
dramatically. Therefore the internal state of the device must be forced into
some known state prior to running benchmarks for consistency. This is usually
accomplished by writing to the device sequentially two times, from start to
finish. For a highly detailed description of exactly how to force an SSD into
a known state for benchmarking see this
[SNIA Article](http://www.snia.org/sites/default/files/SSS_PTS_Enterprise_v1.1.pdf).
doc: Rename directory_structure.md to overview.md Change title, move to Concepts section. Otherwise the document remains the same. Change-Id: I15dfe541447a620e270d94df32c76b36082ef4be Signed-off-by: Ben Walker <benjamin.walker@intel.com> Reviewed-on: https://review.gerrithub.io/c/spdk/spdk/+/453451 Tested-by: SPDK CI Jenkins <sys_sgci@intel.com> Reviewed-by: Jim Harris <james.r.harris@intel.com> Reviewed-by: Shuhei Matsumoto <shuhei.matsumoto.xt@hitachi.com> 2019-05-06 17:29:35 +00:00			`# NAND Flash SSD Internals {#ssd_internals}`
doc: Add some documentation about SSD internals Change-Id: I0b215aea443782bdf8ee6772dd2db562597f0d3c Signed-off-by: Ben Walker <benjamin.walker@intel.com> Reviewed-on: https://review.gerrithub.io/394116 Reviewed-by: Jim Harris <james.r.harris@intel.com> Reviewed-by: Daniel Verkamp <daniel.verkamp@intel.com> Tested-by: SPDK Automated Test System <sys_sgsw@intel.com> 2017-04-03 20:01:43 +00:00
			`Solid State Devices (SSD) are complex devices and their performance depends on`
			`how they're used. The following description is intended to help software`
			`developers understand what is occurring inside the SSD, so that they can come`
			`up with better software designs. It should not be thought of as a strictly`
			`accurate guide to how SSD hardware really works.`

			`As of this writing, SSDs are generally implemented on top of`
			`[NAND Flash](https://en.wikipedia.org/wiki/Flash_memory) memory. At a`
			`very high level, this media has a few important properties:`

			`* The media is grouped onto chips called NAND dies and each die can`
			`operate in parallel.`
			`* Flipping a bit is a highly asymmetric process. Flipping it one way is`
			`easy, but flipping it back is quite hard.`

			`NAND Flash media is grouped into large units often referred to as **erase`
			`blocks**. The size of an erase block is highly implementation specific, but`
			`can be thought of as somewhere between 1MiB and 8MiB. For each erase block,`
			`each bit may be written to (i.e. have its bit flipped from 0 to 1) with`
			`bit-granularity once. In order to write to the erase block a second time, the`
			`entire block must be erased (i.e. all bits in the block are flipped back to`
			`0). This is the asymmetry part from above. Erasing a block causes a measurable`
			`amount of wear and each block may only be erased a limited number of times.`

			`SSDs expose an interface to the host system that makes it appear as if the`
			`drive is composed of a set of fixed size logical blocks which are usually`
			`512B or 4KiB in size. These blocks are entirely logical constructs of the`
			`device firmware and they do not statically map to a location on the backing`
			`media. Instead, upon each write to a logical block, a new location on the NAND`
			`Flash is selected and written and the mapping of the logical block to its`
			`physical location is updated. The algorithm for choosing this location is a`
			`key part of overall SSD performance and is often called the **flash`
			`translation layer** or FTL. This algorithm must correctly distribute the`
			`blocks to account for wear (called wear-leveling) and spread them across`
			`NAND dies to improve total available performance. The simplest model is to`
			`group all of the physical media on each die together using an algorithm`
			`similar to RAID and then write to that set sequentially. Real SSDs are far`
			`more complicated, but this is an excellent simple model for software`
			`developers - imagine they are simply logging to a RAID volume and updating an`
			`in-memory hash-table.`

			`One consequence of the flash translation layer is that logical blocks do not`
			`necessarily correspond to physical locations on the NAND at all times. In`
			`fact, there is a command that clears the translation for a block. In NVMe,`
			`this command is called deallocate, in SCSI it is called unmap, and in SATA it`
			`is called trim. When a user attempts to read a block that doesn't have a`
			`mapping to a physical location, drives will do one of two things:`

			`1. Immediately complete the read request successfully, without performing any`
			`data transfer. This is acceptable because the data the drive would return`
			`is no more valid than the data already in the user's data buffer.`
			`2. Return all 0's as the data.`

			`Choice #1 is much more common and performing reads to a fully deallocated`
			`device will often show performance far beyond what the drive claims to be`
			`capable of precisely because it is not actually transferring any data. Write`
			`to all blocks prior to reading them when benchmarking!`

			`As SSDs are written to, the internal log will eventually consume all of the`
			`available erase blocks. In order to continue writing, the SSD must free some`
			`of them. This process is often called garbage collection. All SSDs reserve`
			`some number of erase blocks so that they can guarantee there are free erase`
			`blocks available for garbage collection. Garbage collection generally proceeds`
			`by:`

			`1. Selecting a target erase block (a good mental model is that it picks the least recently used erase block)`
			`2. Walking through each entry in the erase block and determining if it is still a valid logical block.`
			`3. Moving valid logical blocks by reading them and writing them to a different erase block (i.e. the current head of the log)`
			`4. Erasing the entire erase block and marking it available for use.`

			`Garbage collection is clearly far more efficient when step #3 can be skipped`
			`because the erase block is already empty. There are two ways to make it much`
			`more likely that step #3 can be skipped. The first is that SSDs reserve`
			`additional erase blocks beyond their reported capacity (called`
			`over-provisioning), so that statistically its much more likely that an`
			`erase block will not contain valid data. The second is software can write to`
			`the blocks on the device in sequential order in a circular pattern, throwing`
			`away old data when it is no longer needed. In this case, the software`
			`guarantees that the least recently used erase blocks will not contain any`
			`valid data that must be moved.`

			`The amount of over-provisioning a device has can dramatically impact the`
			`performance on random read and write workloads, if the workload is filling up`
			`the entire device. However, the same effect can typically be obtained by`
			`simply reserving a given amount of space on the device in software. This`
			`understanding is critical to producing consistent benchmarks. In particular,`
			`if background garbage collection cannot keep up and the drive must switch to`
			`on-demand garbage collection, the latency of writes will increase`
			`dramatically. Therefore the internal state of the device must be forced into`
			`some known state prior to running benchmarks for consistency. This is usually`
			`accomplished by writing to the device sequentially two times, from start to`
			`finish. For a highly detailed description of exactly how to force an SSD into`
			`a known state for benchmarking see this`
			`[SNIA Article](http://www.snia.org/sites/default/files/SSS_PTS_Enterprise_v1.1.pdf).`