freebsd-dev

Matthew Ahrens 8542ef852a OpenZFS 8005 - poor performance of 1MB writes on certain RAID-Z configurations		2017-04-10 15:21:45 -07:00

Authored by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Don Brady <don.brady@intel.com>
Ported-by: Matt Ahrens <mahrens@delphix.com>

RAID-Z requires that space be allocated in multiples of P+1 sectors, because this is the minimum size block that can have the required amount of parity. Thus blocks on RAIDZ1 must be allocated in a multiple of 2 sectors; on RAIDZ2 multiple of 3; and on RAIDZ3 multiple of 4. A sector is a unit of 2^ashift bytes, typically 512B or 4KB.

To satisfy this constraint, the allocation size is rounded up to the proper multiple, resulting in up to 3 "pad sectors" at the end of some blocks. The contents of these pad sectors are not used, so we do not need to read or write these sectors. However, some storage hardware performs much worse (around 1/2 as fast) on mostly-contiguous writes when there are small gaps of non-overwritten data between the writes. Therefore, ZFS creates "optional" zio's when writing RAID-Z blocks that include pad sectors. If writing a pad sector will fill the gap between two (required) writes, we will issue the optional zio, thus doubling performance. The gap-filling performance improvement was introduced in July 2009.

Writing the optional zio is done by the io aggregation code in vdev_queue.c. The problem is that it is also subject to the limit on the size of aggregate writes, zfs_vdev_aggregation_limit, which is by default 128KB. For a given block, if the amount of data plus padding written to a leaf device exceeds zfs_vdev_aggregation_limit, the optional zio will not be written, resulting in a ~2x performance degradation.

The problem occurs only for certain values of ashift, compressed block size, and RAID-Z configuration (number of parity and data disks). It cannot occur with the default recordsize=128KB. If compression is enabled, all configurations with recordsize=1MB or larger will be impacted to some degree.

The problem notably occurs with recordsize=1MB, compression=off, with 10 disks in a RAIDZ2 or RAIDZ3 group (with 512B or 4KB sectors). Therefore this problem has been known as "the 1MB 10-wide RAIDZ2 (or 3) problem".

The problem also occurs with the following configurations:

With recordsize=512KB or 256KB, compression=off, the problem occurs only in rarely-used configurations:
* 4-wide RAIDZ1 with recordsize=512KB and ashift=12 (4KB sectors)
* 4-wide RAIDZ2 (either recordsize, either ashift)
* 5-wide RAIDZ2 with recordsize=512KB (either ashift)
* 6-wide RAIDZ2 with recordsize=512KB (either ashift)

With recordsize=1MB, compression=off, ashift=9 (512B sectors)
* RAIDZ1 with 4 or 8 disks
* RAIDZ2 with 4, 8, or 10 disks
* RAIDZ3 with 6, 8, 9, or 10 disks

With recordsize=1MB, compression=off, ashift=12 (4KB sectors)
* RAIDZ1 with 7 or 8 disks
* RAIDZ2 with 4, 5, or 10 disks
* RAIDZ3 with 6, 9, or 10 disks

With recordsize=2MB and larger (which can only be selected by changing kernel tunables), many configurations are affected, including with higher numbers of disks (up to 18 disks with recordsize=2MB).

Increase zfs_vdev_aggregation_limit to allow the optional zio to be aggregated, thus eliminating the problem. Setting it to 256KB fixes all commonly-used configurations.

The solution is to aggregate optional zio's regardless of the aggregation size limit.

Analysis sponsored by Intel Corp.

OpenZFS-issue: https://www.illumos.org/issues/8005
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/321
Closes #5931
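The rounding rule in the commit message can be checked with a little arithmetic. The following is a minimal sketch, not the code from vdev_raidz.c or vdev_queue.c: the function name `raidz_layout` and its return shape are made up for illustration, and the per-device figure is a simplification that assumes an uncompressed block spread evenly over the data disks. It computes how many sectors a block occupies once rounded up to a multiple of P+1, how many of those are pad sectors, and how many bytes of the block land on one leaf device.

```python
AGGREGATION_LIMIT = 128 * 1024  # default zfs_vdev_aggregation_limit per the commit message


def raidz_layout(psize, ashift, ndisks, nparity):
    """Return (allocated_sectors, pad_sectors, bytes_per_device) for one block.

    Hypothetical helper for illustration only; it follows the rule stated in
    the commit message (allocate in multiples of P+1 sectors).
    """
    sector = 1 << ashift
    data = -(-psize // sector)             # data sectors, rounded up
    data_cols = ndisks - nparity           # disks holding data in each row
    rows = -(-data // data_cols)           # stripe rows needed for the data
    total = data + nparity * rows          # data sectors plus parity sectors
    unit = nparity + 1
    allocated = -(-total // unit) * unit   # round up to a multiple of P+1
    return allocated, allocated - total, rows * sector


for ashift in (9, 12):
    allocated, pad, per_dev = raidz_layout(1 << 20, ashift, ndisks=10, nparity=2)
    print(f"ashift={ashift}: {allocated} sectors, {pad} pad, "
          f"{per_dev} bytes per device (limit {AGGREGATION_LIMIT})")
```

For a 1MB block on a 10-wide RAIDZ2, each device already receives exactly 128KB under this simplified model, so a write that also covers a pad sector would exceed the default limit. This is consistent with the "1MB 10-wide RAIDZ2" case named in the message.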
..
abd.c	Remove dependency on linear ABD	2017-03-29 12:24:51 -07:00
arc.c	OpenZFS 7968 - multi-threaded spa_sync()	2017-03-20 18:36:00 -07:00
blkptr.c	DLPX-44812 integrate EP-220 large memory scalability	2016-11-29 14:34:27 -08:00
bplist.c
bpobj.c	panic in bpobj_space(): null pointer dereference	2017-02-09 10:19:12 -08:00
bptree.c	OpenZFS 7082 - bptree_iterate() passes wrong args to zfs_dbgmsg()	2017-01-17 14:49:24 -08:00
bqueue.c	Fix coverity defects: CID 147565-147567	2016-10-07 13:19:43 -07:00
dbuf_stats.c	OpenZFS 6950 - ARC should cache compressed data	2016-09-13 09:58:33 -07:00
dbuf.c	OpenZFS 8023 - Panic destroying a metaslab deferred range tree	2017-04-09 16:12:35 -07:00
ddt_zap.c
ddt.c	Cache ddt_get_dedup_dspace() value if there was no ddt changes	2016-12-02 16:59:35 -07:00
dmu_diff.c	OpenZFS 6950 - ARC should cache compressed data	2016-09-13 09:58:33 -07:00
dmu_object.c	Clean up by-dnode code in dmu_tx.c	2017-02-24 13:34:26 -08:00
dmu_objset.c	OpenZFS 7968 - multi-threaded spa_sync()	2017-03-20 18:36:00 -07:00
dmu_send.c	OpenZFS 7247 - zfs receive of deduplicated stream fails	2017-02-04 09:10:24 -08:00
dmu_traverse.c	Add TASKQID_INVALID	2016-11-02 12:14:45 -07:00
dmu_tx.c	OpenZFS 7801 - add more by-dnode routines (lint)	2017-03-20 12:33:17 -07:00
dmu_zfetch.c	Use cstyle -cpP in `make cstyle` check	2016-12-12 10:46:26 -08:00
dmu.c	Add missing module_param for zfs_per_txg_dirty_frees_percent	2017-02-07 09:44:03 -08:00
dnode_sync.c	OpenZFS 7968 - multi-threaded spa_sync()	2017-03-20 18:36:00 -07:00
dnode.c	OpenZFS 7968 - multi-threaded spa_sync()	2017-03-20 18:36:00 -07:00
dsl_bookmark.c	OpenZFS 1300 - filename normalization doesn't work for removes	2017-02-02 14:13:41 -08:00
dsl_dataset.c	OpenZFS 7968 - multi-threaded spa_sync()	2017-03-20 18:36:00 -07:00
dsl_deadlist.c	panic in bpobj_space(): null pointer dereference	2017-02-09 10:19:12 -08:00
dsl_deleg.c	Performance optimization of AVL tree comparator functions	2016-08-31 14:35:34 -07:00
dsl_destroy.c	OpenZFS 7254 - ztest failed assertion in ztest_dataset_dirobj_verify: dirobjs + 1 == usedobjs	2017-01-27 11:43:42 -08:00
dsl_dir.c	OpenZFS 7793 - ztest fails assertion in dmu_tx_willuse_space	2017-03-07 09:51:59 -08:00
dsl_pool.c	OpenZFS 8027 - tighten up dsl_pool_dirty_delta	2017-04-09 16:04:35 -07:00
dsl_prop.c	Fix dsl_props_set_sync_impl to work with nested nvlist	2016-12-20 18:46:59 -08:00
dsl_scan.c	OpenZFS 7254 - ztest failed assertion in ztest_dataset_dirobj_verify: dirobjs + 1 == usedobjs	2017-01-27 11:43:42 -08:00
dsl_synctask.c	Illumos 4951 - ZFS administrative commands should use reserved space	2015-05-04 09:41:10 -07:00
dsl_userhold.c	OpenZFS 6314 - buffer overflow in dsl_dataset_name	2016-06-28 13:47:03 -07:00
edonr_zfs.c	DLPX-44812 integrate EP-220 large memory scalability	2016-11-29 14:34:27 -08:00
fm.c	Use cstyle -cpP in `make cstyle` check	2016-12-12 10:46:26 -08:00
gzip.c	GZIP compression offloading with QAT accelerator	2017-03-22 17:58:47 -07:00
lz4.c	Fix spelling	2017-01-03 11:31:18 -06:00
lzjb.c
Makefile.in	GZIP compression offloading with QAT accelerator	2017-03-22 17:58:47 -07:00
metaslab.c	OpenZFS 8023 - Panic destroying a metaslab deferred range tree	2017-04-09 16:12:35 -07:00
multilist.c	OpenZFS 7968 - multi-threaded spa_sync()	2017-03-20 18:36:00 -07:00
pathname.c	Add pn_alloc()/pn_free() functions	2016-04-21 09:49:25 -07:00
policy.c	codebase style improvements for OpenZFS 6459 port	2017-01-22 13:25:40 -08:00
qat_compress.c	GZIP compression offloading with QAT accelerator	2017-03-22 17:58:47 -07:00
qat_compress.h	GZIP compression offloading with QAT accelerator	2017-03-22 17:58:47 -07:00
range_tree.c	Performance optimization of AVL tree comparator functions	2016-08-31 14:35:34 -07:00
refcount.c	Linux 4.11 compat: avoid refcount_t name conflict	2017-02-28 16:10:18 -08:00
rrwlock.c	Fix spelling	2017-01-03 11:31:18 -06:00
sa.c	OpenZFS 6676 - Race between unique_insert() and unique_remove() causes ZFS fsid change	2017-01-26 14:43:28 -08:00
sha256.c	DLPX-44812 integrate EP-220 large memory scalability	2016-11-29 14:34:27 -08:00
skein_zfs.c	DLPX-44812 integrate EP-220 large memory scalability	2016-11-29 14:34:27 -08:00
spa_boot.c
spa_config.c	Fix spelling	2017-01-03 11:31:18 -06:00
spa_errlog.c
spa_history.c	Fix indefinite article	2016-08-11 11:23:49 -07:00
spa_misc.c	OpenZFS 8023 - Panic destroying a metaslab deferred range tree	2017-04-09 16:12:35 -07:00
spa_stats.c	Fix spelling	2017-01-03 11:31:18 -06:00
spa.c	OpenZFS 3821 - Race in rollback, zil close, and zil flush	2017-03-23 18:20:58 -07:00
space_map.c	OpenZFS 8023 - Panic destroying a metaslab deferred range tree	2017-04-09 16:12:35 -07:00
space_reftree.c	OpenZFS 6328 - Fix cstyle errors in zfs codebase	2017-01-12 09:42:11 -08:00
trace.c	OpenZFS 6531 - Provide mechanism to artificially limit disk performance	2016-05-26 10:11:51 -07:00
txg.c	Refactor txg history kstat	2016-12-02 16:57:49 -07:00
uberblock.c	Illumos 5347 - idle pool may run itself out of space	2015-07-14 10:35:21 -07:00
unique.c	Performance optimization of AVL tree comparator functions	2016-08-31 14:35:34 -07:00
vdev_cache.c	Fix wrong offset args in vdev_cache_write	2017-03-28 11:06:22 -07:00
vdev_disk.c	OpenZFS 7448 - ZFS doesn't notice when disk vdevs have no write cache	2017-02-04 09:23:50 -08:00
vdev_file.c	Use a dedicated taskq for vdev_file	2016-12-21 10:47:15 -08:00
vdev_label.c	OpenZFS 6328 - Fix cstyle errors in zfs codebase	2017-01-12 09:42:11 -08:00
vdev_mirror.c	codebase style improvements for OpenZFS 6459 port	2017-01-22 13:25:40 -08:00
vdev_missing.c	Illumos #5244 - zio pipeline callers should explicitly invoke next stage	2015-04-30 15:07:47 -07:00
vdev_queue.c	OpenZFS 8005 - poor performance of 1MB writes on certain RAID-Z configurations	2017-04-10 15:21:45 -07:00
vdev_raidz_math_aarch64_neon_common.h	ABD raidz NEON support	2016-11-29 14:34:33 -08:00
vdev_raidz_math_aarch64_neon.c	codebase style improvements for OpenZFS 6459 port	2017-01-22 13:25:40 -08:00
vdev_raidz_math_aarch64_neonx2.c	ABD raidz NEON support	2016-11-29 14:34:33 -08:00
vdev_raidz_math_avx2.c	ABD raidz avx512f support	2016-11-29 14:34:33 -08:00
vdev_raidz_math_avx512bw.c	ABD: Adapt avx512bw raidz assembly	2016-12-15 17:31:33 -08:00
vdev_raidz_math_avx512f.c	Use cstyle -cpP in `make cstyle` check	2016-12-12 10:46:26 -08:00
vdev_raidz_math_impl.h	codebase style improvements for OpenZFS 6459 port	2017-01-22 13:25:40 -08:00
vdev_raidz_math_scalar.c	ABD Vectorized raidz	2016-11-29 14:34:33 -08:00
vdev_raidz_math_sse2.c	ABD raidz avx512f support	2016-11-29 14:34:33 -08:00
vdev_raidz_math_ssse3.c	codebase style improvements for OpenZFS 6459 port	2017-01-22 13:25:40 -08:00
vdev_raidz_math.c	codebase style improvements for OpenZFS 6459 port	2017-01-22 13:25:40 -08:00
vdev_raidz.c	Remove dependency on linear ABD	2017-03-29 12:24:51 -07:00
vdev_root.c
vdev.c	OpenZFS 7885 - zpool list can report 16.0e for expandsz	2017-04-05 09:33:20 -07:00
zap_leaf.c	OpenZFS 1300 - filename normalization doesn't work for removes	2017-02-02 14:13:41 -08:00
zap_micro.c	OpenZFS 7793 - ztest fails assertion in dmu_tx_willuse_space	2017-03-07 09:51:59 -08:00
zap.c	OpenZFS 7793 - ztest fails assertion in dmu_tx_willuse_space	2017-03-07 09:51:59 -08:00
zfeature_common.c	OpenZFS 2932 - support crash dumps to raidz, etc. pools	2017-04-10 10:24:17 -07:00
zfeature.c	OpenZFS 6328 - Fix cstyle errors in zfs codebase	2017-01-12 09:42:11 -08:00
zfs_acl.c	Rename zfs_sb_t -> zfsvfs_t	2017-03-10 09:51:33 -08:00
zfs_byteswap.c
zfs_ctldir.c	Restructure mount option handling	2017-03-10 09:51:41 -08:00
zfs_debug.c	OpenZFS 7277 - zdb should be able to print zfs_dbgmsg's	2017-01-28 12:16:43 -08:00
zfs_dir.c	Rename zfs_sb_t -> zfsvfs_t	2017-03-10 09:51:33 -08:00
zfs_fm.c	Fix regression in zfs_ereport_start()	2017-04-05 14:24:26 -07:00
zfs_fuid.c	Rename zfs_sb_t -> zfsvfs_t	2017-03-10 09:51:33 -08:00
zfs_ioctl.c	Restructure mount option handling	2017-03-10 09:51:41 -08:00
zfs_log.c	OpenZFS 6328 - Fix cstyle errors in zfs codebase	2017-01-12 09:42:11 -08:00
zfs_onexit.c	zfsdev_getminor() should check for invalid file handles	2015-06-22 17:02:13 -07:00
zfs_replay.c	Rename zfs_sb_t -> zfsvfs_t	2017-03-10 09:51:33 -08:00
zfs_rlock.c	Fix spelling	2017-01-03 11:31:18 -06:00
zfs_sa.c	Rename zfs_sb_t -> zfsvfs_t	2017-03-10 09:51:33 -08:00
zfs_vfsops.c	OpenZFS 6874 - rollback and receive need to reset ZPL state to what's on disk	2017-03-13 17:42:42 -07:00
zfs_vnops.c	Rename zfs_* functions	2017-03-10 09:51:35 -08:00
zfs_znode.c	Retry zfs_znode_alloc() in zfs_mknode()	2017-03-23 18:26:50 -07:00
zil.c	OpenZFS 3821 - Race in rollback, zil close, and zil flush	2017-03-23 18:20:58 -07:00
zio_checksum.c	Remove dependency on linear ABD	2017-03-29 12:24:51 -07:00
zio_compress.c	DLPX-44812 integrate EP-220 large memory scalability	2016-11-29 14:34:27 -08:00
zio_inject.c	Compile zio.h and zio_impl.h mutual include	2016-12-01 16:36:25 -07:00
zio.c	Remove dependency on linear ABD	2017-03-29 12:24:51 -07:00
zle.c
zpl_ctldir.c	Linux 4.11 compat: iops.getattr and friends	2017-03-20 17:51:16 -07:00
zpl_export.c	Use cstyle -cpP in `make cstyle` check	2016-12-12 10:46:26 -08:00
zpl_file.c	Rename zfs_sb_t -> zfsvfs_t	2017-03-10 09:51:33 -08:00
zpl_inode.c	Linux 4.11 compat: iops.getattr and friends	2017-03-20 17:51:16 -07:00
zpl_super.c	Restructure mount option handling	2017-03-10 09:51:41 -08:00
zpl_xattr.c	Rename zfs_sb_t -> zfsvfs_t	2017-03-10 09:51:33 -08:00
zrlock.c	OpenZFS 3746 - ZRLs are racy	2017-01-23 10:35:58 -08:00
zvol.c	Fix ZVOL BLKFLSBUF ioctl	2017-03-09 17:43:36 -08:00