freebsd-nq

Etienne Dechamps 920dd524fb Add FASTWRITE algorithm for synchronous writes.

Currently, ZIL blocks are spread over vdevs using hint block pointers managed by the ZIL commit code and passed to metaslab_alloc(). Spreading log blocks across vdevs is important for performance: using multiple disks in parallel decreases the ZIL commit latency, which is the main performance metric for synchronous writes. However, the current implementation suffers from the following issues:

1) It would be best if the ZIL module were not aware of such low-level details. They should be handled by the ZIO and metaslab modules;
2) Because the hint block pointer is managed per log, simultaneous commits from multiple logs might use the same vdevs at the same time, which is inefficient;
3) Because dmu_write() does not honor the block pointer hint, indirect writes are not spread.

The naive solution of rotating the metaslab rotor each time a block is allocated for the ZIL or dmu_sync() doesn't work in practice because the first ZIL block to be written is actually allocated during the previous commit. Consequently, when metaslab_alloc() decides the vdev for this block, it will do so while a bunch of other allocations are happening at the same time (from dmu_sync() and other ZILs). This means the vdev for this block is chosen more or less at random. When the next commit happens, there is a high chance (especially when the number of blocks per commit is slightly less than the number of disks) that one disk will have to write two blocks (with a potential seek) while other disks are sitting idle, which defeats spreading and increases the commit latency.

This commit introduces a new concept in the metaslab allocator: fastwrites. Basically, each top-level vdev maintains a counter indicating the number of synchronous writes (from dmu_sync() and the ZIL) which have been allocated but not yet completed.
When the metaslab is called with the FASTWRITE flag, it will choose the vdev with the fewest pending synchronous writes. If there are multiple vdevs with the same value, the first matching vdev (starting from the rotor) is used. Once metaslab_alloc() has decided which vdev the block is allocated to, it updates the fastwrite counter for this vdev.

The rationale goes like this: when an allocation is done with FASTWRITE, it "reserves" the vdev until the data is written. Until then, all future allocations will naturally avoid this vdev, even after a full rotation of the rotor. As a result, pending synchronous writes at a given point in time will be nicely spread over all vdevs. This contrasts with the previous algorithm, which is based on the implicit assumption that blocks are written instantaneously after they're allocated.

metaslab_fastwrite_mark() and metaslab_fastwrite_unmark() are used to manually increase or decrease fastwrite counters, respectively. They should be used with caution, as there is no per-BP tracking of fastwrite information, so leaks and "double-unmarks" are possible. There is, however, an assert in the vdev teardown code which will fire if the fastwrite counters are not zero when the pool is exported or the vdev removed. Note that as stated above, marking is also done implicitly by metaslab_alloc().

ZIO also got a new FASTWRITE flag; when it is used, ZIO will pass it to the metaslab when allocating (assuming ZIO does the allocation, which is only true in the case of dmu_sync). This flag will also trigger an unmark when zio_done() fires.

A side-effect of the new algorithm is that when a ZIL stops being used, its last block can stay in the pending state (allocated but not yet written) for a long time, polluting the fastwrite counters. To avoid that, I've implemented a somewhat crude but working solution which unmarks these pending blocks in zil_sync(), thus guaranteeing that lingering fastwrites will get pruned at each sync event.

The best performance improvements are observed with pools using a large number of top-level vdevs and heavy synchronous write workloads (especially indirect writes and concurrent writes from multiple ZILs). Real-life testing shows a 200% to 300% performance increase with indirect writes and various commit sizes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1013		2012-10-17 08:56:41 -07:00
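
The selection rule described in the commit message can be illustrated with a small standalone sketch. This is not the code from the commit: the types `sketch_vdev_t` and `sketch_pool_t` and the functions `sketch_fastwrite_alloc()` and `sketch_fastwrite_unmark()` are hypothetical names invented for illustration, and the sketch assumes the rotor is simply an index that the caller supplies (how the real allocator moves the rotor is outside its scope). It only models the per-vdev pending counter, the "fewest pending, first match from the rotor" choice, and the mark/unmark pairing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a top-level vdev: only the counter is modeled. */
typedef struct sketch_vdev {
	uint64_t pending_fastwrites;	/* sync writes allocated, not yet completed */
} sketch_vdev_t;

/* Hypothetical stand-in for the set of top-level vdevs plus a rotor index. */
typedef struct sketch_pool {
	sketch_vdev_t *vdevs;
	size_t nvdevs;
	size_t rotor;			/* assumption: maintained elsewhere */
} sketch_pool_t;

/*
 * Choose the vdev with the fewest pending synchronous writes, scanning
 * from the rotor so that ties go to the first matching vdev, then mark
 * it by bumping its counter. Returns the index of the chosen vdev.
 */
size_t
sketch_fastwrite_alloc(sketch_pool_t *p)
{
	assert(p->nvdevs > 0);

	size_t best = p->rotor % p->nvdevs;
	for (size_t i = 1; i < p->nvdevs; i++) {
		size_t cand = (p->rotor + i) % p->nvdevs;
		/* Strict '<' keeps the earliest vdev on a tie. */
		if (p->vdevs[cand].pending_fastwrites <
		    p->vdevs[best].pending_fastwrites)
			best = cand;
	}

	p->vdevs[best].pending_fastwrites++;	/* implicit mark on allocation */
	return (best);
}

/*
 * Release the "reservation" once the write has completed. The assert
 * catches a double-unmark in this sketch; the commit message notes the
 * real code has no per-BP tracking, so callers must pair these by hand.
 */
void
sketch_fastwrite_unmark(sketch_pool_t *p, size_t vdev)
{
	assert(vdev < p->nvdevs);
	assert(p->vdevs[vdev].pending_fastwrites > 0);
	p->vdevs[vdev].pending_fastwrites--;
}
```

With three idle vdevs and the rotor at index 1, three successive calls to `sketch_fastwrite_alloc()` pick vdevs 1, 2 and 0. If the write on vdev 2 then completes and is unmarked, the next allocation goes back to vdev 2, because it is the only one with no pending write.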
..
fm	Remove autotools products	2012-08-27 11:47:44 -07:00
fs	Illumos #2703 : add mechanism to report ZFS send progress	2012-09-19 13:39:06 -07:00
arc.h	Fix zfs_write_limit_max integer size mismatch on 32-bit systems	2012-10-11 11:09:25 -07:00
avl_impl.h
avl.h
bplist.h
bpobj.h
dbuf.h	Switch KM_SLEEP to KM_PUSHPAGE	2012-08-27 12:01:37 -07:00
ddt.h
dmu_impl.h	Illumos #2703 : add mechanism to report ZFS send progress	2012-09-19 13:39:06 -07:00
dmu_objset.h	Illumos #3100 : zvol rename fails with EBUSY when dirty.	2012-10-03 13:59:02 -07:00
dmu_traverse.h
dmu_tx.h	Add --enable-debug-dmu-tx configure option	2012-03-23 12:25:17 -07:00
dmu_zfetch.h	Add missing ZFS tunables	2011-05-04 10:02:37 -07:00
dmu.h	Illumos #2703 : add mechanism to report ZFS send progress	2012-09-19 13:39:06 -07:00
dnode.h
dsl_dataset.h	Illumos #3100 : zvol rename fails with EBUSY when dirty.	2012-10-03 13:59:02 -07:00
dsl_deadlist.h
dsl_deleg.h	Illumos #1644 , #1645 , #1646 , #1647 , #1708	2012-07-31 09:25:30 -07:00
dsl_dir.h	Switch KM_SLEEP to KM_PUSHPAGE	2012-08-27 12:01:37 -07:00
dsl_pool.h
dsl_prop.h
dsl_scan.h
dsl_synctask.h
efi_partition.h	Move partition scanning from userspace to module.	2012-07-17 09:17:31 -07:00
Makefile.am	Add .zfs control directory	2012-03-22 13:03:47 -07:00
metaslab_impl.h	Add FASTWRITE algorithm for synchronous writes.	2012-10-17 08:56:41 -07:00
metaslab.h	Add FASTWRITE algorithm for synchronous writes.	2012-10-17 08:56:41 -07:00
nvpair_impl.h
nvpair.h
refcount.h
rrwlock.h
sa_impl.h
sa.h	Add sa_spill_rele() interface	2012-03-07 16:28:00 -08:00
spa_boot.h
spa_impl.h	Illumos #1693 : persistent 'comment' field for a zpool	2012-08-08 11:49:37 -07:00
spa.h	Switch KM_SLEEP to KM_PUSHPAGE	2012-08-27 12:01:37 -07:00
space_map.h
txg_impl.h
txg.h	Fix zfs_txg_timeout module parameter	2012-10-11 15:07:09 -07:00
u8_textprep_data.h
u8_textprep.h
uberblock_impl.h
uberblock.h
uio_impl.h
unique.h
uuid.h
vdev_disk.h
vdev_file.h
vdev_impl.h	Add FASTWRITE algorithm for synchronous writes.	2012-10-17 08:56:41 -07:00
vdev.h	Illumos #1949 , #1953	2012-07-11 13:33:31 -07:00
xvattr.h
zap_impl.h
zap_leaf.h
zap.h	Export symbols for the full ZAP API	2011-09-27 16:12:36 -07:00
zfs_acl.h	Fix build failures on PaX/GRSecurity patched kernels	2012-07-17 09:22:43 -07:00
zfs_context.h	Fix VOP_CLOSE() in userspace.	2012-10-03 13:32:48 -07:00
zfs_ctldir.h	Use ULL suffix in constants	2012-07-10 11:31:55 -07:00
zfs_debug.h	Cleanly support debug packages	2012-02-27 14:08:17 -08:00
zfs_dir.h
zfs_fuid.h
zfs_ioctl.h	Speed up 'zfs list -t snapshot -o name -s name'	2012-06-14 09:49:04 -07:00
zfs_onexit.h
zfs_rlock.h	Range lock performance improvements	2011-03-08 12:44:06 -08:00
zfs_sa.h	Implement SA based xattrs	2011-11-28 15:45:51 -08:00
zfs_stat.h
zfs_vfsops.h	Add .zfs control directory	2012-03-22 13:03:47 -07:00
zfs_vnops.h	Cleanup mmap(2) writes	2011-08-02 10:34:55 -07:00
zfs_znode.h	Add .zfs control directory	2012-03-22 13:03:47 -07:00
zil_impl.h	Add FASTWRITE algorithm for synchronous writes.	2012-10-17 08:56:41 -07:00
zil.h	Add ZIL statistics.	2012-06-29 09:56:51 -07:00
zio_checksum.h
zio_compress.h
zio_impl.h
zio.h	Add FASTWRITE algorithm for synchronous writes.	2012-10-17 08:56:41 -07:00
zpl.h	Linux 3.3 compat, iops->create()/mkdir()/mknod()	2012-04-30 12:52:38 -07:00
zrlock.h
zvol.h