OpenZFS 7090 - zfs should throttle allocations
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Ported-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
When write I/Os are issued, they are issued in block order but the ZIO
pipeline will drive them asynchronously through the allocation stage
which can result in blocks being allocated out-of-order. It would be
nice to preserve as much of the logical order as possible.
In addition, the allocations are equally scattered across all top-level
VDEVs but not all top-level VDEVs are created equally. The pipeline
should be able to detect devices that are more capable of handling
allocations and should allocate more blocks to those devices. This
allows for dynamic allocation distribution when devices are imbalanced
as fuller devices will tend to be slower than empty devices.
The change includes a new pool-wide allocation queue which would
throttle and order allocations in the ZIO pipeline. The queue would be
ordered by issued time and offset and would provide an initial amount of
allocation of work to each top-level vdev. The allocation logic utilizes
a reservation system to reserve allocations that will be performed by
the allocator. Once an allocation is successfully completed it's
scheduled on a given top-level vdev. Each top-level vdev maintains a
maximum number of allocations that it can handle (mg_alloc_queue_depth).
The pool-wide reserved allocations (top-levels * mg_alloc_queue_depth)
are distributed across the top-level vdevs metaslab groups and round
robin across all eligible metaslab groups to distribute the work. As
top-levels complete their work, they receive additional work from the
pool-wide allocation queue until the allocation queue is emptied.
OpenZFS-issue: https://www.illumos.org/issues/7090
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4756c3d7Closes#5258
Porting Notes:
- Maintained minimal stack in zio_done
- Preserve linux-specific io sizes in zio_write_compress
- Added module params and documentation
- Updated to use optimize AVL cmp macros
The existing algorithm selects a preferred leaf vdev based on offset of the zio
request modulo the number of members in the mirror. It assumes the devices are
of equal performance and that spreading the requests randomly over both drives
will be sufficient to saturate them. In practice this results in the leaf vdevs
being under utilized.
The new algorithm takes into the following additional factors:
* Load of the vdevs (number outstanding I/O requests)
* The locality of last queued I/O vs the new I/O request.
Within the locality calculation additional knowledge about the underlying vdev
is considered such as; is the device backing the vdev a rotating media device.
This results in performance increases across the board as well as significant
increases for predominantly streaming loads and for configurations which don't
have evenly performing devices.
The following are results from a setup with 3 Way Mirror with 2 x HD's and
1 x SSD from a basic test running multiple parrallel dd's.
With pre-fetch disabled (vfs.zfs.prefetch_disable=1):
== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 161 seconds @ 95 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 297 seconds @ 51 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 54 seconds @ 284 MB/s
With pre-fetch enabled (vfs.zfs.prefetch_disable=0):
== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 91 seconds @ 168 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 108 seconds @ 142 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 48 seconds @ 320 MB/s
In addition to the performance changes the code was also restructured, with
the help of Justin Gibbs, to provide a more logical flow which also ensures
vdevs loads are only calculated from the set of valid candidates.
The following additional sysctls where added to allow the administrator
to tune the behaviour of the load algorithm:
* vfs.zfs.vdev.mirror.rotating_inc
* vfs.zfs.vdev.mirror.rotating_seek_inc
* vfs.zfs.vdev.mirror.rotating_seek_offset
* vfs.zfs.vdev.mirror.non_rotating_inc
* vfs.zfs.vdev.mirror.non_rotating_seek_inc
These changes where based on work started by the zfsonlinux developers:
https://github.com/zfsonlinux/zfs/pull/1487
Reviewed by: gibbs, mav, will
MFC after: 2 weeks
Sponsored by: Multiplay
References:
https://github.com/freebsd/freebsd@5c7a6f5dhttps://github.com/freebsd/freebsd@31b7f68dhttps://github.com/freebsd/freebsd@e186f564
Performance Testing:
https://github.com/zfsonlinux/zfs/pull/4334#issuecomment-189057141
Porting notes:
- The tunables were adjusted to have ZoL-style names.
- The code was modified to use ZoL's vd_nonrot.
- Fixes were done to make cstyle.pl happy
- Merge conflicts were handled manually
- freebsd/freebsd@e186f564bc by my
collegue Andriy Gapon has been included. It applied perfectly, but
added a cstyle regression.
- This replaces 556011dbec entirely.
- A typo "IO'a" has been corrected to say "IO's"
- Descriptions of new tunables were added to man/man5/zfs-module-parameters.5.
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4334
5244 zio pipeline callers should explicitly invoke next stage
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Approved by: Gordon Ross <gwr@nexenta.com>
References:
https://www.illumos.org/issues/5244https://github.com/illumos/illumos-gate/commit/738f37b
Porting Notes:
1. The unported "2932 support crash dumps to raidz, etc. pools"
caused a merge conflict due to a copyright difference in
module/zfs/vdev_raidz.c.
2. The unported "4128 disks in zpools never go away when pulled"
and additional Linux-specific changes caused merge conflicts in
module/zfs/vdev_disk.c.
Ported-by: Richard Yao <richard.yao@clusterhq.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2828
By marking DMU transaction processing contexts with PF_FSTRANS
we can revert the KM_PUSHPAGE -> KM_SLEEP changes. This brings
us back in line with upstream. In some cases this means simply
swapping the flags back. For others fnvlist_alloc() was replaced
by nvlist_alloc(..., KM_PUSHPAGE) and must be reverted back to
fnvlist_alloc() which assumes KM_SLEEP.
The one place KM_PUSHPAGE is kept is when allocating ARC buffers
which allows us to dip in to reserved memory. This is again the
same as upstream.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When selecting a mirror child it's possible that map allocated by
vdev_mirror_map_allc() contains a NULL for the child vdev. In
this case the child should be skipped and the read issues to
another member of the mirror.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#1744
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045illumos/illumos-gate@69962b5647
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1913
During pool import stack overflows may still occur due to the
potentially deep recursion of traverse_visitbp(). This is most
likely to occur when additional layers are added to the block
device stack such as DM multipath. To minimize the stack usage
for this call path the following changes were made:
1) Added the keywork 'noinline' to the vdev_*_map_alloc() functions
to prevent them from being inlined by gcc. This reduced the
stack usage of vdev_raidz_io_start() from 208 to 128 bytes, and
vdev_mirror_io_start() from 144 to 128 bytes.
2) The 'saved_poolname' charater array in zfsdev_ioctl() was moved
from the stack to the heap. This reduced the stack usage of
zfsdev_ioctl() from 368 to 112 bytes.
3) The major saving came from slimming down traverse_visitbp() from
from 224 to 144 bytes. Since this function is called recursively
the 80 bytes saved per invokation adds up. The following changes
were made:
a) The 'hard' local variable was replaced by a TD_HARD() macro.
b) The 'pd' local variable was replaced by 'td->td_pfd' references.
c) The zbookmark_t was moved to the heap. This does cost us an
additional memory allocation per recursion by that cost should
still be minimal. The cost could be further reduced by adding
a dedicated zbookmark_t slab cache.
d) The variable declarations in 'if (BP_GET_LEVEL()) { }' were
restructured to use the minimum amount of stack. This includes
removing the 'cbp' local variable.
Overall for the offending use case roughly 1584 of total stack space
has been saved. This is enough to avoid overflowing the stack on
stock kernels with 8k stacks. See #1778 for additional details.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#1778
3598 want to dtrace when errors are generated in zfs
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
https://www.illumos.org/issues/3598illumos/illumos-gate@be6fd75a69
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
Porting notes:
1. include/sys/zfs_context.h has been modified to render some new
macros inert until dtrace is available on Linux.
2. Linux-specific changes have been adapted to use SET_ERROR().
3. I'm NOT happy about this change. It does nothing but ugly
up the code under Linux. Unfortunately we need to take it to
avoid more merge conflicts in the future. -Brian
Locking mutex &vq->vq_lock in vdev_mirror_pending is unneeded:
* no data is modified
* only vq_pending_tree is read
* in case garbage is returned (eg. vq_pending_tree being updated
while the read is made) the worst case would be that a single
read could be queued on a mirror side which more busy than thought
The benefit of this change is streamlining of the code path since
it is taken for *every* mirror member on *every* read.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1739
The read bandwidth of an N-way mirror can by increased by 50%,
and the IOPs by 10%, by more carefully selecting the preferred
leaf vdev.
The existing algorthm selects a perferred leaf vdev based on
offset of the zio request modulo the number of members in the
mirror. It assumes the drives are of equal performance and
that spreading the requests randomly over both drives will be
sufficient to saturate them. In practice this results in the
leaf vdevs being under utilized.
Utilization can be improved by preferentially selecting the leaf
vdev with the least pending IO. This prevents leaf vdevs from
being starved and compensates for performance differences between
disks in the mirror. Faster vdevs will be sent more work and
the mirror performance will not be limitted by the slowest drive.
In the common case where all the pending queues are full and there
is no single least busy leaf vdev a batching stratagy is employed.
Of the N least busy vdevs one is selected with equal probability
to be the preferred vdev for T microseconds. Compared to randomly
selecting a vdev to break the tie batching the requests greatly
improves the odds of merging the requests in the Linux elevator.
The testing results show a significant performance improvement
for all four workloads tested. The workloads were generated
using the fio benchmark and are as follows.
1) 1MB sequential reads from 16 threads to 16 files (MB/s).
2) 4KB sequential reads from 16 threads to 16 files (MB/s).
3) 1MB random reads from 16 threads to 16 files (IOP/s).
4) 4KB random reads from 16 threads to 16 files (IOP/s).
| Pristine | With 1461 |
| Sequential Random | Sequential Random |
| 1MB 4KB 1MB 4KB | 1MB 4KB 1MB 4KB |
| MB/s MB/s IO/s IO/s | MB/s MB/s IO/s IO/s |
---------------+-----------------------+------------------------+
2 Striped | 226 243 11 304 | 222 255 11 299 |
2 2-Way Mirror | 302 324 16 534 | 433 448 23 571 |
2 3-Way Mirror | 429 458 24 714 | 648 648 41 808 |
2 4-Way Mirror | 562 601 36 849 | 816 828 82 926 |
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1461
Differences between how paging is done on Solaris and Linux can cause
deadlocks if KM_SLEEP is used in any the following contexts.
* The txg_sync thread
* The zvol write/discard threads
* The zpl_putpage() VFS callback
This is because KM_SLEEP will allow for direct reclaim which may result
in the VM calling back in to the filesystem or block layer to write out
pages. If a lock is held over this operation the potential exists to
deadlock the system. To ensure forward progress all memory allocations
in these contexts must us KM_PUSHPAGE which disables performing any I/O
to accomplish the memory allocation.
Previously, this behavior was acheived by setting PF_MEMALLOC on the
thread. However, that resulted in unexpected side effects such as the
exhaustion of pages in ZONE_DMA. This approach touchs more of the zfs
code, but it is more consistent with the right way to handle these cases
under Linux.
This is patch lays the ground work for being able to safely revert the
following commits which used PF_MEMALLOC:
21ade34 Disable direct reclaim for z_wr_* threads
cfc9a5c Fix zpl_writepage() deadlock
eec8164 Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool))
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
Fix non-c90 compliant code, for the most part these changes
simply deal with where a particular variable is declared.
Under c90 it must alway be done at the very start of a block.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>