doc: Add some documentation about SSD internals
Change-Id: I0b215aea443782bdf8ee6772dd2db562597f0d3c
Signed-off-by: Ben Walker <benjamin.walker@intel.com>
Reviewed-on: https://review.gerrithub.io/394116
Reviewed-by: Jim Harris <james.r.harris@intel.com>
Reviewed-by: Daniel Verkamp <daniel.verkamp@intel.com>
Tested-by: SPDK Automated Test System <sys_sgsw@intel.com>
parent d9243af2b1
commit 937f1c2bc3
@@ -800,6 +800,7 @@ INPUT = ../include/spdk \
 nvme.md \
 nvme-cli.md \
 nvmf.md \
+ssd_internals.md \
 userspace.md \
 vagrant.md \
 vhost.md \

@@ -13,6 +13,7 @@
 - @ref userspace
 - @ref memory
 - @ref concurrency
+- @ref ssd_internals
 - @ref porting

 # User Guides {#user_guides}

doc/ssd_internals.md (new file, 96 lines)
@@ -0,0 +1,96 @@
# SSD Internals {#ssd_internals}

Solid State Devices (SSDs) are complex devices and their performance depends on
how they're used. The following description is intended to help software
developers understand what is occurring inside the SSD, so that they can come
up with better software designs. It should not be thought of as a strictly
accurate guide to how SSD hardware really works.

As of this writing, SSDs are generally implemented on top of
[NAND Flash](https://en.wikipedia.org/wiki/Flash_memory) memory. At a
very high level, this media has a few important properties:

* The media is grouped onto chips called NAND dies and each die can
  operate in parallel.
* Flipping a bit is a highly asymmetric process. Flipping it one way is
  easy, but flipping it back is quite hard.

NAND Flash media is grouped into large units often referred to as **erase
blocks**. The size of an erase block is highly implementation-specific, but
can be thought of as somewhere between 1MiB and 8MiB. For each erase block,
each bit may be written to (i.e. have its bit flipped from 0 to 1) at bit
granularity, but only once. In order to write to the erase block a second
time, the entire block must be erased (i.e. all bits in the block are flipped
back to 0). This is the asymmetry mentioned above. Erasing a block causes a
measurable amount of wear, and each block may only be erased a limited number
of times.

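
To make these constraints concrete, here is a minimal C sketch of an erase
block as software might model it: an append-only region that can only be
reused after erasing the whole thing, with a bounded number of erase cycles.
The type and function names are invented for this illustration and do not
correspond to any real firmware or SPDK interface.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative size only; real erase blocks vary by implementation. */
#define ERASE_BLOCK_SIZE (4 * 1024 * 1024)

struct erase_block {
	uint8_t  data[ERASE_BLOCK_SIZE];
	size_t   write_offset; /* next writable byte; only ever moves forward */
	uint32_t erase_count;  /* wear: bounded over the block's lifetime */
};

/* Programming is append-only. Bytes that have already been written cannot
 * be rewritten until the entire block is erased. */
static bool
erase_block_program(struct erase_block *blk, const void *buf, size_t len)
{
	if (blk->write_offset + len > ERASE_BLOCK_SIZE) {
		return false; /* block is full; an erase is required first */
	}
	memcpy(&blk->data[blk->write_offset], buf, len);
	blk->write_offset += len;
	return true;
}

/* Erasing resets the whole block at once and consumes one erase cycle. */
static void
erase_block_erase(struct erase_block *blk)
{
	memset(blk->data, 0, sizeof(blk->data));
	blk->write_offset = 0;
	blk->erase_count++;
}
```

The `erase_count` field is what wear-leveling tries to keep roughly equal
across all blocks in the drive.
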
SSDs expose an interface to the host system that makes it appear as if the
drive is composed of a set of fixed size **logical blocks** which are usually
512B or 4KiB in size. These blocks are entirely logical constructs of the
device firmware and they do not statically map to a location on the backing
media. Instead, upon each write to a logical block, a new location on the NAND
Flash is selected and written, and the mapping of the logical block to its
physical location is updated. The algorithm for choosing this location is a
key part of overall SSD performance and is often called the **flash
translation layer** or FTL. This algorithm must correctly distribute the
blocks to account for wear (called **wear-leveling**) and spread them across
NAND dies to improve total available performance. The simplest model is to
group all of the physical media on each die together using an algorithm
similar to RAID and then write to that set sequentially. Real SSDs are far
more complicated, but this is an excellent simple model for software
developers - imagine they are simply logging to a RAID volume and updating an
in-memory hash-table.

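
The "log plus in-memory hash table" mental model can be written down directly.
The sketch below is hypothetical: `struct toy_ftl`, `media_program()`, and the
plain array standing in for the hash table are all invented for illustration
and are not how any particular drive or SPDK component is implemented.

```c
#include <stdint.h>

#define NUM_LOGICAL_BLOCKS (1024 * 1024) /* illustrative drive size */

/* Toy FTL state: a logical-to-physical table (standing in for the in-memory
 * hash table) and the current head of the media log. */
struct toy_ftl {
	uint64_t l2p[NUM_LOGICAL_BLOCKS];
	uint64_t log_head;
};

/* Stand-in for programming one logical block of data at a physical media
 * location (in the mental model, appending to a RAID-like set of dies). */
static void
media_program(uint64_t physical_location, const void *data)
{
	(void)physical_location;
	(void)data;
}

/* Writes never update data in place: each write is appended at the head of
 * the log and the table entry for the logical block is redirected there.
 * The old physical location now holds stale data for garbage collection. */
static void
toy_ftl_write(struct toy_ftl *ftl, uint64_t lba, const void *data)
{
	uint64_t new_location = ftl->log_head++;

	media_program(new_location, data);
	ftl->l2p[lba] = new_location;
}
```

Because the mapping is updated on every write, the same logical block can live
at a different physical location each time it is written, which is what lets
the drive do wear-leveling and spread work across dies behind a simple block
interface.
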
One consequence of the flash translation layer is that logical blocks do not
necessarily correspond to physical locations on the NAND at all times. In
fact, there is a command that clears the translation for a block. In NVMe,
this command is called deallocate, in SCSI it is called unmap, and in SATA it
is called trim. When a user attempts to read a block that doesn't have a
mapping to a physical location, drives will do one of two things:

1. Immediately complete the read request successfully, without performing any
   data transfer. This is acceptable because the data the drive would return
   is no more valid than the data already in the user's data buffer.
2. Return all 0's as the data.

Choice #1 is much more common, and performing reads against a fully
deallocated device will often show performance far beyond what the drive
claims to be capable of, precisely because it is not actually transferring any
data. Write to all blocks prior to reading them when benchmarking!

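
Continuing the hypothetical toy FTL sketch from above, deallocate (or
unmap/trim) only has to clear the table entry, and a read of an unmapped block
can complete without any media access or data transfer, which is exactly why
benchmarking reads against a deallocated drive produces misleading numbers.

```c
#include <stdbool.h>
#include <stdint.h>

#define UNMAPPED UINT64_MAX /* sentinel for "no physical location" */

/* Stand-in for reading one logical block's worth of data from the media. */
static void
media_read(uint64_t physical_location, void *buf)
{
	(void)physical_location;
	(void)buf;
}

/* NVMe deallocate / SCSI unmap / SATA trim in the toy model: just forget
 * where the logical block lives. No media access is needed. */
static void
toy_ftl_deallocate(struct toy_ftl *ftl, uint64_t lba)
{
	ftl->l2p[lba] = UNMAPPED;
}

/* Reading an unmapped block completes immediately without a data transfer
 * (choice #1 above); the caller's buffer is simply left untouched. */
static bool
toy_ftl_read(struct toy_ftl *ftl, uint64_t lba, void *buf)
{
	if (ftl->l2p[lba] == UNMAPPED) {
		return false; /* "success", but nothing was read */
	}
	media_read(ftl->l2p[lba], buf);
	return true;
}
```

(This assumes the table starts out with every entry set to `UNMAPPED` for
blocks that have never been written.)
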
As SSDs are written to, the internal log will eventually consume all of the
available erase blocks. In order to continue writing, the SSD must free some
of them. This process is often called **garbage collection**. All SSDs reserve
some number of erase blocks so that they can guarantee there are free erase
blocks available for garbage collection. Garbage collection generally proceeds
by:

1. Selecting a target erase block (a good mental model is that it picks the least recently used erase block).
2. Walking through each entry in the erase block and determining if it is still a valid logical block.
3. Moving valid logical blocks by reading them and writing them to a different erase block (i.e. the current head of the log).
4. Erasing the entire erase block and marking it available for use.

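
The loop those four steps describe might look roughly like the sketch below.
The helper functions are only declared, not implemented, because how a drive
tracks validity and free blocks is exactly the implementation-specific part;
every name here is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define ENTRIES_PER_ERASE_BLOCK 1024 /* illustrative only */

/* Hypothetical helpers; real firmware tracks this with reverse maps and
 * per-block metadata rather than anything this simple. */
static uint32_t pick_least_recently_used_erase_block(void);
static bool     entry_is_still_valid(uint32_t erase_block, uint32_t entry);
static void     relocate_entry_to_log_head(uint32_t erase_block, uint32_t entry);
static void     erase_and_mark_free(uint32_t erase_block);

/* One round of garbage collection, following the four steps in the text. */
static void
toy_garbage_collect(void)
{
	/* 1. Select a target erase block (least recently used is a fine
	 *    mental model). */
	uint32_t victim = pick_least_recently_used_erase_block();

	for (uint32_t entry = 0; entry < ENTRIES_PER_ERASE_BLOCK; entry++) {
		/* 2. Check whether this entry still backs a valid logical
		 *    block (i.e. the L2P table still points at it). */
		if (!entry_is_still_valid(victim, entry)) {
			continue;
		}
		/* 3. Move valid data by rewriting it at the current head of
		 *    the log, which updates the L2P table as usual. */
		relocate_entry_to_log_head(victim, entry);
	}

	/* 4. Erase the whole block and make it available for new writes. */
	erase_and_mark_free(victim);
}
```
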
Garbage collection is clearly far more efficient when step #3 can be skipped
because the erase block is already empty. There are two ways to make it much
more likely that step #3 can be skipped. The first is that SSDs reserve
additional erase blocks beyond their reported capacity (called
**over-provisioning**), so that statistically it's much more likely that an
erase block will not contain valid data. The second is that software can write
to the blocks on the device in sequential order in a circular pattern,
throwing away old data when it is no longer needed. In this case, the software
guarantees that the least recently used erase blocks will not contain any
valid data that must be moved.

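
The second approach amounts to the application choosing its own write
placement. A tiny hypothetical sketch: confine writes to a fixed range of
logical blocks and always advance sequentially, wrapping around, so the least
recently written region only ever holds data the application has already
agreed to discard.

```c
#include <stdint.h>

/* Hypothetical circular write placement: the application uses only the first
 * `num_blocks` logical blocks and always writes the next block in sequence,
 * wrapping around. Data older than one full pass is, by design of the
 * application, no longer needed. */
struct circular_writer {
	uint64_t num_blocks; /* size of the region used, in logical blocks */
	uint64_t next_lba;   /* where the next write will land */
};

static uint64_t
circular_writer_next(struct circular_writer *w)
{
	uint64_t lba = w->next_lba;

	w->next_lba = (w->next_lba + 1) % w->num_blocks;
	return lba;
}
```

Writing this way means step #3 of garbage collection almost never has to move
any data.
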
The amount of over-provisioning a device has can dramatically impact the
performance of random read and write workloads if the workload is filling up
the entire device. However, the same effect can typically be obtained by
simply reserving a given amount of space on the device in software. This
understanding is critical to producing consistent benchmarks. In particular,
if background garbage collection cannot keep up and the drive must switch to
on-demand garbage collection, the latency of writes will increase
dramatically. Therefore, the internal state of the device must be forced into
some known state prior to running benchmarks for consistency. This is usually
accomplished by writing to the device sequentially two times, from start to
finish. For a highly detailed description of exactly how to force an SSD into
a known state for benchmarking, see this
[SNIA Article](http://www.snia.org/sites/default/files/SSS_PTS_Enterprise_v1.1.pdf).

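
As a rough illustration of that preconditioning step, the sketch below writes
the whole device sequentially, start to finish, for a given number of passes
using plain POSIX I/O. The device path, transfer size, and lack of O_DIRECT or
detailed error handling are all simplifications for this example; in practice
a tool such as fio is normally used, and the SNIA document above describes the
full procedure.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define IO_SIZE (128 * 1024) /* large sequential writes; size is an example */

/* Write the entire device sequentially `passes` times so that every logical
 * block is mapped and garbage collection reaches a steady state before any
 * measurements are taken. Any tail smaller than IO_SIZE is ignored here. */
static int
precondition(const char *dev_path, uint64_t dev_size_bytes, int passes)
{
	char *buf = malloc(IO_SIZE);
	int fd = open(dev_path, O_WRONLY);

	if (buf == NULL || fd < 0) {
		if (fd >= 0) {
			close(fd);
		}
		free(buf);
		return -1;
	}
	memset(buf, 0xA5, IO_SIZE);

	for (int pass = 0; pass < passes; pass++) {
		for (uint64_t off = 0; off + IO_SIZE <= dev_size_bytes; off += IO_SIZE) {
			if (pwrite(fd, buf, IO_SIZE, (off_t)off) != IO_SIZE) {
				close(fd);
				free(buf);
				return -1;
			}
		}
	}

	close(fd);
	free(buf);
	return 0;
}

int
main(void)
{
	/* Example only: two sequential passes over a hypothetical 100 GiB
	 * device node. Substitute the device actually under test. */
	return precondition("/dev/nvme0n1", 100ULL * 1024 * 1024 * 1024, 2);
}
```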