freebsd-dev

Author	SHA1	Message	Date
Adam Leventhal	64dbba3679	Illumos 5174 - add sdt probe for blocked read in dbuf_read() 5174 add sdt probe for blocked read in dbuf_read() Reviewed by: Basil Crow <basil.crow@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Reviewed by: Richard Elling <richard.elling@gmail.com> Reviewed by: Boris Protopopov <bprotopopov@hotmail.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Reviewed by: Garrett D'Amore <garrett@damore.org> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/5174 https://github.com/illumos/illumos-gate/commit/f6164ad Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2710	2014-09-22 14:20:25 -07:00
Matthew Ahrens	6d9036f350	Illumos 5140 - message about "%recv could not be opened" is printed when booting after crash Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Max Grossman <max.grossman@delphix.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/projects/illumos-gate//issues/5140 https://github.com/illumos/illumos-gate/commit/2243853 Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2676	2014-09-18 15:04:59 -07:00
Brian Behlendorf	2d50158343	Fix z_teardown_inactive_lock deadlock When rolling back a mounted filesystem zfs_suspend() is called which acquires the z_teardown_inactive_lock. This lock can not be dropped until the filesystem has been rolled back and resumed in zfs_resume_fs(). Therefore, we must not call iput() under this lock because it may result in the inode->evict() handler being called which also takes this lock. Instead use zfs_iput_async() to ensure dropping the last reference is deferred and runs in a safe context. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2670	2014-09-11 09:11:45 -07:00
Tim Chase	223df0161f	Implement fallocate FALLOC_FL_PUNCH_HOLE Add support for the FALLOC_FL_PUNCH_HOLE \| FALLOC_FL_KEEP_SIZE mode of fallocate(2). Mimic the behavior of other native file systems such as ext4 in cases where the file might be extended. If the offset is beyond the end of the file, return success without changing the file. If the extent of the punched hole would extend the file, only the existing tail of the file is punched. Add the zfs_zero_partial_page() function, modeled after update_page(), to handle zeroing partial pages in a hole-punching operation. It must be used under a range lock for the requested region in order that the ARC and page cache stay in sync. Move the existing page cache truncation via truncate_setsize() into zfs_freesp() for better source structure compatibility with upstream code. Add page cache truncation to zfs_freesp() and zfs_free_range() to handle hole punching. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #2619	2014-09-08 13:52:25 -07:00
George Wilson	4f68d7878f	Illumos 5117 - spacemap reallocation can cause corruption 5117 space map reallocation can cause corruption Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Approved by: Richard Lowe <richlowe@richlowe.net> References: https://www.illumos.org/projects/illumos-gate/issues/5117 https://github.com/illumos/illumos-gate/commit/e503a68 Ported by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2662	2014-09-08 09:42:39 -07:00
Brian Behlendorf	ceb49b0acd	Add object type checking to zap_lockdir() If a non-ZAP object is passed to zap_lockdir() it will be treated as a valid ZAP object. This can result in zap_lockdir() attempting to read what it believes are leaf blocks from invalid disk locations. The SCSI layer will eventually generate errors for these bogus IOs but the caller will hang in zap_get_leaf_byblk(). The good news is that is a situation which can not occur unless the pool has been damaged. The bad news is that there are reports from both FreeBSD and Solaris of damaged pools. Specifically, there are normal files in the filesystem which reference another normal file as their parent. Since pools like this are known to exist the zap_lockdir() function has been updated to verify the type of the object. If a non-ZAP object has been passed it EINVAL will be returned immediately. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2597 Issue #2602	2014-09-08 09:15:38 -07:00
Richard Yao	cd3939c5f0	Linux AIO Support nfsd uses do_readv_writev() to implement fops->read and fops->write. do_readv_writev() will attempt to read/write using fops->aio_read and fops->aio_write, but it will fallback to fops->read and fops->write when AIO is not available. However, the fallback will perform a call for each individual data page. Since our default recordsize is 128KB, sequential operations on NFS will generate 32 DMU transactions where only 1 transaction was needed. That was unnecessary overhead and we implement fops->aio_read and fops->aio_write to eliminate it. ZFS originated in OpenSolaris, where the AIO API is entirely implemented in userland's libc by intelligently mapping them to VOP_WRITE, VOP_READ and VOP_FSYNC. Linux implements AIO inside the kernel itself. Linux filesystems therefore must implement their own AIO logic and nearly all of them implement fops->aio_write synchronously. Consequently, they do not implement aio_fsync(). However, since the ZPL works by mapping Linux's VFS calls to the functions implementing Illumos' VFS operations, we instead implement AIO in the kernel by mapping the operations to the VOP_READ, VOP_WRITE and VOP_FSYNC equivalents. We therefore implement fops->aio_fsync. One might be inclined to make our fops->aio_write implementation synchronous to make software that expects this behavior safe. However, there are several reasons not to do this: 1. Other platforms do not implement aio_write() synchronously and since the majority of userland software using AIO should be cross platform, expectations of synchronous behavior should not be a problem. 2. We would hurt the performance of programs that use POSIX interfaces properly while simultaneously encouraging the creation of more non-compliant software. 3. The broader community concluded that userland software should be patched to properly use POSIX interfaces instead of implementing hacks in filesystems to cater to broken software. This concept is best described as the O_PONIES debate. 4. Making an asynchronous write synchronous is non sequitur. Any software dependent on synchronous aio_write behavior will suffer data loss on ZFSOnLinux in a kernel panic / system failure of at most zfs_txg_timeout seconds, which by default is 5 seconds. This seems like a reasonable consequence of using non-compliant software. It should be noted that this is also a problem in the kernel itself where nfsd does not pass O_SYNC on files opened with it and instead relies on a open()/write()/close() to enforce synchronous behavior when the flush is only guarenteed on last close. Exporting any filesystem that does not implement AIO via NFS risks data loss in the event of a kernel panic / system failure when something else is also accessing the file. Exporting any file system that implements AIO the way this patch does bears similar risk. However, it seems reasonable to forgo crippling our AIO implementation in favor of developing patches to fix this problem in Linux's nfsd for the reasons stated earlier. In the interim, the risk will remain. Failing to implement AIO will not change the problem that nfsd created, so there is no reason for nfsd's mistake to block our implementation of AIO. It also should be noted that `aio_cancel()` will always return `AIO_NOTCANCELED` under this implementation. It is possible to implement aio_cancel by deferring work to taskqs and use `kiocb_set_cancel_fn()` to set a callback function for cancelling work sent to taskqs, but the simpler approach is allowed by the specification: ``` Which operations are cancelable is implementation-defined. ``` http://pubs.opengroup.org/onlinepubs/009695399/functions/aio_cancel.html The only programs on my system that are capable of using `aio_cancel()` are QEMU, beecrypt and fio use it according to a recursive grep of my system's `/usr/src/debug`. That suggests that `aio_cancel()` users are rare. Implementing aio_cancel() is left to a future date when it is clear that there are consumers that benefit from its implementation to justify the work. Lastly, it is important to know that handling of the iovec updates differs between Illumos and Linux in the implementation of read/write. On Linux, it is the VFS' responsibility whle on Illumos, it is the filesystem's responsibility. We take the intermediate solution of copying the iovec so that the ZFS code can update it like on Solaris while leaving the originals alone. This imposes some overhead. We could always revisit this should profiling show that the allocations are a problem. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #223 Closes #2373	2014-09-05 15:11:43 -07:00
Alex Reece	f38dfec3fd	Illumos 5049 - panic when removing log device Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Mattew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Saso Kiselkov <skiselkov@gmail.com> Approved by: Rich Lowe <richlowe@richlowe.net> References: https://www.illumos.org/issues/5049 https://github.com/illumos/illumos-gate/commit/2986efa Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2636	2014-09-05 09:06:27 -07:00
Stanislav Seletskiy	2078f21015	Fix invalid locking order in rename operation This commit should prevent a deadlock on dp_config_rwlock when running `zfs rename` by ensuring zvol_rename_minors() is not called under this lock. Signed-off-by: Stanislav Seletskiy <s.seletskiy@gmail.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2652. Closes #2525.	2014-09-04 09:50:46 -07:00
Alexey Smirnoff	0dfc732416	Change the default 'zfs_dedup_prefetch' value to '0' This gives a huge performance improvement in operations with deduped datasets especially when the bottleneck is the amount of ram available for zfs. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2639	2014-09-04 09:50:45 -07:00
Dan Swartzendruber	287be44f53	Improve handling of filesystem versions Change mount code to diagnose filesystem versions that are not supported by the current implementation. Change upgrade code to do likewise and refuse to upgrade a pool if any filesystems on it are a version which is not supported by the current implementation. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Dan Swartzendruber <dswartz@druber.com> Closes: #2616	2014-09-03 09:17:14 -07:00
Matthew Ahrens	dea377c0d9	Illumos 4970-4974 - extreme rewind enhancements 4970 need controls on i/o issued by zpool import -XF 4971 zpool import -T should accept hex values 4972 zpool import -T implies extreme rewind, and thus a scrub 4973 spa_load_retry retries the same txg 4974 spa_load_verify() reads all data twice Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/4970 https://www.illumos.org/issues/4971 https://www.illumos.org/issues/4972 https://www.illumos.org/issues/4973 https://www.illumos.org/issues/4974 https://github.com/illumos/illumos-gate/commit/e42d205 Notes: This set of patches adds a set of tunable parameters for the "extreme rewind" mode of pool import which allows control over the traversal performed during such an import. Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2598	2014-08-26 16:29:57 -07:00
Matthew Ahrens	49ddb31506	Illumos 5034 - ARC's buf_hash_table is too small 5034 ARC's buf_hash_table is too small Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Reviewed by: Richard Elling <richard.elling@gmail.com> Approved by: Gordon Ross <gwr@nexenta.com> References: https://www.illumos.org/issues/5034 https://github.com/illumos/illumos-gate/commit/63e911b Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2615	2014-08-26 16:14:49 -07:00
Isaac Huang	0426c16804	Fixed memory leaks in zevent handling Some nvlist_t could be leaked in error handling paths. Also make sure cb argument to zfs_zevent_post() cannnot be NULL. Signed-off-by: Isaac Huang <he.huang@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2158	2014-08-20 10:45:16 -07:00
Matthew Ahrens	bd089c5477	Illumos 4631 - zvol_get_stats triggering too many reads 4631 zvol_get_stats triggering too many reads Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/4631 https://github.com/illumos/illumos-gate/commit/bbfa8ea Ported-by: Boris Protopopov <bprotopopov@hotmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2612 Closes #2480	2014-08-20 09:17:00 -07:00
Tim Chase	8b0a0840b4	Don't upgrade a metaslab when the pool is not writable Illumos 4982 added code to metaslab_fragmentation() to proactively update space maps when the spacemap_histogram feature is enabled. This should only happen when the pool is writeable. References: https://www.illumos.org/issues/4982 https://github.com/illumos/illumos-gate/commit/2e4c998 Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2595	2014-08-18 08:47:19 -07:00
George Wilson	f3a7f6610f	Illumos 4976-4984 - metaslab improvements 4976 zfs should only avoid writing to a failing non-redundant top-level vdev 4978 ztest fails in get_metaslab_refcount() 4979 extend free space histogram to device and pool 4980 metaslabs should have a fragmentation metric 4981 remove fragmented ops vector from block allocator 4982 space_map object should proactively upgrade when feature is enabled 4983 need to collect metaslab information via mdb 4984 device selection should use fragmentation metric Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Adam Leventhal <adam.leventhal@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/4976 https://www.illumos.org/issues/4978 https://www.illumos.org/issues/4979 https://www.illumos.org/issues/4980 https://www.illumos.org/issues/4981 https://www.illumos.org/issues/4982 https://www.illumos.org/issues/4983 https://www.illumos.org/issues/4984 https://github.com/illumos/illumos-gate/commit/2e4c998 Notes: The "zdb -M" option has been re-tasked to display the new metaslab fragmentation metric and the new "zdb -I" option is used to control the maximum number of in-flight I/Os. The new fragmentation metric is derived from the space map histogram which has been rolled up to the vdev and pool level and is presented to the user via "zpool list". Add a number of module parameters related to the new metaslab weighting logic. Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2595	2014-08-18 08:40:49 -07:00
Turbo Fredriksson	f67d709080	Create an 'overlay' property Add a new 'overlay' property (default 'off') that controls whether the filesystem should be mounted even if the mountpoint is busy or if it should fail with a 'mountpoint not empty'. Doing overlay mounts is the default mount behavior on Linux, but not in ZFS. It have been decided that following the ZFS behavior should be the default, but this overlay allows for site administrator to override this decision on a per-dataset basis. Signed-off-by: Turbo Fredriksson <turbo@bayour.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes: #2503	2014-08-15 13:39:19 -07:00
Brian Behlendorf	0d5c500d6c	Revert "Revert "Revert "Fix unlink/xattr deadlock""" This reverts commit `7973e46` which brings the basic flow of the code back in line with the other ZFS implementations. This was possible due to the following related changes. `e89260a` Directory xattr znodes hold a reference on their parent `6f9548c` Fix deadlock in zfs_zget() `0a50679` Add zfs_iput_async() interface `4dd1893` Avoid 128K kmem allocations in mzap_upgrade() Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Closes #457 Closes #2058 Closes #2128 Closes #2240	2014-08-11 16:12:36 -07:00
Brian Behlendorf	0a50679ce9	Add zfs_iput_async() interface Handle all iputs in zfs_purgedir() and zfs_inode_destroy() asynchronously to prevent deadlocks. When the iputs are allowed to run synchronously in the destroy call path deadlocks between xattr directory inodes and their parent file inodes are possible. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Closes #457	2014-08-11 16:11:43 -07:00
Brian Behlendorf	4dd18932ba	Avoid 128K kmem allocations in mzap_upgrade() As originally implemented the mzap_upgrade() function will perform up to SPA_MAXBLOCKSIZE allocations using kmem_alloc(). These large allocations can potentially block indefinitely if contiguous memory is not available. Since this allocation is done under the zap->zap_rwlock it can appear as if there is a deadlock in zap_lockdir(). This is shown below. The optimal fix for this would be to rework mzap_upgrade() such that no large allocations are required. This could be done but it would result in us diverging further from the other implementations. Therefore I've opted against doing this unless it becomes absolutely necessary. Instead mzap_upgrade() has been updated to use zio_buf_alloc() which can reliably provide buffers of up to SPA_MAXBLOCKSIZE. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Close #2580	2014-08-11 16:10:32 -07:00
Brian Behlendorf	50b25b2187	Avoid dynamic allocation of 'search zio' As part of commit `e8b96c6` the search zio used by the vdev_queue_io_to_issue() function was moved to the heap to minimize stack usage. Functionally this is fine, but to maximize performance it's best to minimize the number of dynamic allocations. To avoid this allocation temporary space for the search zio has been reserved in the vdev_queue structure. All access must be serialized through the vq_lock. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #2572	2014-08-11 08:44:54 -07:00
Brian Behlendorf	ab6f407faa	Use KM_PUSHPAGE in dsl_dataset_rollback_check() The dsl_dataset_rollback_check() function is executed in the txg_sync context. To prevent a potential deadlock due to direct memory reclaim it must use KM_PUSHPAGE. This was introduced by the recent 'zfs bookmark' features, commit `da53684`. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Eric Dillmann <eric@jave.fr> Closes #2569	2014-08-06 16:09:28 -07:00
Matthew Ahrens	5dbd68a352	Illumos 4914 - zfs on-disk bookmark structure should be named _phys_t 4914 zfs on-disk bookmark structure should be named _phys_t Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Approved by: Robert Mustacchi <rm@joyent.com> References: https://www.illumos.org/issues/4914 https://github.com/illumos/illumos-gate/commit/7802d7b Porting notes: There were a number of zfsonlinux-specific uses of zbookmark_t which needed to be updated. This should reduce the likelihood of further problems like issue #2094 from occurring. Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2558	2014-08-06 14:48:41 -07:00
Matthew Ahrens	1fa8f795d5	Illumos 4881 - zfs send performance regression with embedded data 4881 zfs send performance degradation when embedded block pointers are encountered Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/4881 https://github.com/illumos/illumos-gate/commit/06315b7 Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2547	2014-08-06 14:44:10 -07:00
Saso Kiselkov	3bec585e6c	Illumos 4897 - Space accounting mismatch in L2ARC/zpool 4897 Space accounting mismatch in L2ARC/zpool Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Boris Protopopov <bprotopopov@hotmail.com> Approved by: Dan McDonald <danmcd@omniti.com> From the illumos issue tracker: L2ARC vdev space usage statistics are calculated as the delta between the maximum and minimum vdev offset ever written to by the L2ARC fill thread, but do not inform the user of how much space in between these two offsets is actually taken up by cached buffers. This fix changes that so that vdev space usage stats on L2ARC devices accurately track the volume of buffers stored on them, allowing users to see the exact L2ARC usage in "zpool iostat -v". References: https://www.illumos.org/issues/4897 https://github.com/illumos/illumos-gate/commit/3038a2b Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2555	2014-08-06 13:44:10 -07:00
Matthew Ahrens	fbeddd60b7	Illumos 4390 - I/O errors can corrupt space map when deleting fs/vol 4390 i/o errors when deleting filesystem/zvol can lead to space map corruption Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/4390 https://github.com/illumos/illumos-gate/commit/7fd05ac Porting notes: Previous stack-reduction efforts in traverse_visitb() caused a fair number of un-mergable pieces of code. This patch should reduce its stack footprint a bit more. The new local bptree_entry_phys_t in bptree_add() is dynamically-allocated using kmem_zalloc() for the purpose of stack reduction. The new global zfs_free_leak_on_eio has been defined as an integer rather than a boolean_t as was the case with the related zfs_recover global. Also, zfs_free_leak_on_eio's definition has been inserted into zfs_debug.c for consistency with the existing definition of zfs_recover. Illumos placed it in spa_misc.c. Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2545	2014-08-04 11:50:52 -07:00
Matthew Ahrens	9b67f60560	Illumos 4757, 4913 4757 ZFS embedded-data block pointers ("zero block compression") 4913 zfs release should not be subject to space checks Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Max Grossman <max.grossman@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/4757 https://www.illumos.org/issues/4913 https://github.com/illumos/illumos-gate/commit/5d7b4d4 Porting notes: For compatibility with the fastpath code the zio_done() function needed to be updated. Because embedded-data block pointers do not require DVAs to be allocated the associated vdevs will not be marked and therefore should not be unmarked. Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2544	2014-08-01 14:28:05 -07:00
Matthew Ahrens	faf0f58c69	Illumos 3835 zfs need not store 2 copies of all metadata Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Richard Lowe <richlowe@richlowe.net> Description from Matt Ahrens's bug report at Delphix: Add a new zfs property, "redundant_metadata" which can have values "all" or "most". The default will be "all", which is the current behavior. Setting to "most" will cause us to only store 1 copy of level-1 indirect blocks of user data files. Additional notes: The new man page section for this property states "The exact behavior of which metadata blocks are stored redundantly may change in future releases." and: "When set to most, ZFS stores an extra copy of most types of metadata. This can improve performance of random writes, because less metadata must be written." The current implementation is as described above in Matt's blog. It is controlled by a new global integer "zfs_redundant_metadata_most_ditto_level", currently initialized to 2. When "redundant_metadata" is set to "most", only indirect blocks of the specified level and higher will have additional ditto blocks created. Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2542	2014-07-31 09:49:34 -07:00
George Wilson	672692c7b7	Illumos 4754, 4755 4754 io issued to near-full luns even after setting noalloc threshold 4755 mg_alloc_failures is no longer needed Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/4754 https://www.illumos.org/issues/4755 https://github.com/illumos/illumos-gate/commit/b6240e8 Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2533	2014-07-30 10:30:05 -07:00
Matthew Ahrens	9bd274ddd8	Illumos #4374 4374 dn_free_ranges should use range_tree_t Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Max Grossman <max.grossman@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com Reviewed by: Garrett D'Amore <garrett@damore.org> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/4374 https://github.com/illumos/illumos-gate/commit/bf16b11 Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2531	2014-07-30 09:20:35 -07:00
Matthew Ahrens	da536844d5	Illumos 4368, 4369. 4369 implement zfs bookmarks 4368 zfs send filesystems from readonly pools Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> References: https://www.illumos.org/issues/4369 https://www.illumos.org/issues/4368 https://github.com/illumos/illumos-gate/commit/78f1710 Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2530	2014-07-29 10:55:29 -07:00
Max Grossman	b0bc7a84d9	Illumos 4370, 4371 4370 avoid transmitting holes during zfs send 4371 DMU code clean up Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Approved by: Garrett D'Amore <garrett@damore.org>a References: https://www.illumos.org/issues/4370 https://www.illumos.org/issues/4371 https://github.com/illumos/illumos-gate/commit/43466aa Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2529	2014-07-28 14:29:58 -07:00
Matthew Ahrens	fa86b5dbb6	Illumos 4171, 4172 4171 clean up spa_feature_*() interfaces 4172 implement extensible_dataset feature for use by other zpool features Reviewed by: Max Grossman <max.grossman@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com> Approved by: Garrett D'Amore <garrett@damore.org>a References: https://www.illumos.org/issues/4171 https://www.illumos.org/issues/4172 https://github.com/illumos/illumos-gate/commit/2acef22 Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2528	2014-07-25 16:40:07 -07:00
Brian Behlendorf	7a8f0e80ea	zfs_trunc() should use dmu_tx_assign(tx, TXG_WAIT) As part of the write throttle & i/o schedule performance work the zfs_trunc() function should have been updated to use TXG_WAIT. Using TXG_WAIT ensures that the tx will be part of the next txg. If TXG_NOWAIT is used and retried for ERESTART errors then the tx can suffer from starvation. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #2488	2014-07-22 09:41:38 -07:00
George Wilson	080b310015	Illumos #4756 Fix metaslab_group_preload deadlock 4756 metaslab_group_preload() could deadlock Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Approved by: Garrett D'Amore <garrett@damore.org> The metaslab_group_preload() function grabs the mg_lock and then later tries to grab the metaslab lock. This lock ordering may lead to a deadlock since other consumers of the mg_lock will grab the metaslab lock first. References: https://www.illumos.org/issues/4756 https://github.com/illumos/illumos-gate/commit/30beaff Ported-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2488	2014-07-22 09:41:32 -07:00
George Wilson	3c51c5cb1f	Illumos #4730 destroy metaslab group taskq 4730 metaslab group taskq should be destroyed in metaslab_group_destroy() Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed by: Rich Lowe <richlowe@richlowe.net> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Dan McDonald <danmcd@omniti.com> References: https://www.illumos.org/issues/4730 https://github.com/illumos/illumos-gate/commit/be08211 Porting notes: Under ZFSonlinux, one of the effects of not destroying the taskq is that zdb would never exit (due to the SPL taskq implementation). Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2488	2014-07-22 09:41:06 -07:00
George Wilson	93cf20764a	Illumos #4101 , #4102 , #4103 , #4105 , #4106 4101 metaslab_debug should allow for fine-grained control 4102 space_maps should store more information about themselves 4103 space map object blocksize should be increased 4105 removing a mirrored log device results in a leaked object 4106 asynchronously load metaslab Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Sebastien Roy <seb@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> Prior to this patch, space_maps were preferred solely based on the amount of free space left in each. Unfortunately, this heuristic didn't contain any information about the make-up of that free space, which meant we could keep preferring and loading a highly fragmented space map that wouldn't actually have enough contiguous space to satisfy the allocation; then unloading that space_map and repeating the process. This change modifies the space_map's to store additional information about the contiguous space in the space_map, so that we can use this information to make a better decision about which space_map to load. This requires reallocating all space_map objects to increase their bonus buffer size sizes enough to fit the new metadata. The above feature can be enabled via a new feature flag introduced by this change: com.delphix:spacemap_histogram In addition to the above, this patch allows the space_map block size to be increase. Currently the block size is set to be 4K in size, which has certain implications including the following: * 4K sector devices will not see any compression benefit * large space_maps require more metadata on-disk * large space_maps require more time to load (typically random reads) Now the space_map block size can adjust as needed up to the maximum size set via the space_map_max_blksz variable. A bug was fixed which resulted in potentially leaking an object when removing a mirrored log device. The previous logic for vdev_remove() did not deal with removing top-level vdevs that are interior vdevs (i.e. mirror) correctly. The problem would occur when removing a mirrored log device, and result in the DTL space map object being leaked; because top-level vdevs don't have DTL space map objects associated with them. References: https://www.illumos.org/issues/4101 https://www.illumos.org/issues/4102 https://www.illumos.org/issues/4103 https://www.illumos.org/issues/4105 https://www.illumos.org/issues/4106 https://github.com/illumos/illumos-gate/commit/0713e23 Porting notes: A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also, the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary. Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2488	2014-07-22 09:39:16 -07:00
Prakash Surya	1be627f5c2	Move metaslab_group_alloc_update() call This changes moves the called to metaslab_group_alloc_update() to the metaslab_sync_reassess() function. The original placement of the call within metaslab_sync_done() appears to have been a simple mistake, introduced by `ac72fac3ea`. This aligns us more closely to the upstream illumos code base. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-07-22 09:38:16 -07:00
Brian Behlendorf	1e8db77102	Fix zil_commit() NULL dereference Update the current code to ensure inodes are never dirtied if they are part of a read-only file system or snapshot. If they do somehow get dirtied an attempt will make made to write them to disk. In the case of snapshots, which don't have a ZIL, this will result in a NULL dereference in zil_commit(). Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2405	2014-07-17 15:15:07 -07:00
George Wilson	2fbc542ebd	Illumos 4168, 4169, 4170: ztest, zdb and zhack fixes 4168 ztest assertion failure in dbuf_undirty 4169 verbatim import causes zdb to segfault 4170 zhack leaves pool in ACTIVE state Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Eric Schrock <eric.schrock@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@nexenta.com> References: https://www.illumos.org/issues/4168 https://www.illumos.org/issues/4169 https://www.illumos.org/issues/4170 https://github.com/illumos/illumos-gate/commit/7fdd916 Porting notes: Of particular interest when troubleshooting corrupted pools, the commonly-used "zdb -e" operation may perform verbatim imports and furthermore, it will soon have direct support for verbatim imports via a new "-V" option. The 4169 fix eliminates a common segfault case in which spa_history_log_version() tries to access an un-opened dsl_pool_t. Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2451 Closes #2283 Closes #2467	2014-07-17 11:37:57 -07:00
Tim Chase	f4a4046bd6	Convert zfs_mg_noalloc_threshold to a module parameter and document The parameter was added as illumos issue 4081 which was committed to zfsonlinux in `ac72fac3ea`. This patch documents the parameter and allows for it to be set as a module parameter. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2483	2014-07-16 16:49:25 -07:00
Andrew Barnes	61e99a73bc	Preserve asize when last mirror child promoted to top-level vdev If the smaller of 2 different sized child vdev's of a mirrored vdev is detached, and the pool has the autoexpand property set to off, as the remaining larger vdev is promoted to a top level vdev it fails to retain the asize of the original top level mirror vdev and therefore partially autoexpands. This partially autoexpanded state leaves the new vdev too large to re-mirror by adding the smaller vdev back in, and the pool fails to utilize the space until next imported. If the autoexpand property is set to on, the child vdev grows in size after it has been promoted to a top level vdev as expected. This commit causes the remaining child mirror to retain the asize of its old parent mirror vdev if the autoexpand property is set to off, this allows the smaller vdev to be re-added if required the vdev can then be told to expand if required by the usual using zpool online -e. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Andrew Barnes <barnes333@gmail.com> Signed-off-by: George Wilson <george.wilson@delphix.com> Closes #1208	2014-07-02 14:04:29 -07:00
Dan McDonald	ee4712284c	Illumos #4936 fix potential overflow in lz4 4936 lz4 could theoretically overflow a pointer with a certain input Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Reviewed by: Keith Wesolowski <keith.wesolowski@joyent.com> Approved by: Gordon Ross <gordon.ross@nexenta.com> Ported by: Tim Chase <tim@chase2k.com> References: https://illumos.org/issues/4936 https://github.com/illumos/illumos-gate/commit/58d0718 Porting notes: This fixes the widely-reported "20-year-old vulnerability" in LZO/LZ4 implementations which inherited said bug from the reference implementation. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2429	2014-07-01 14:10:47 -07:00
Tim Chase	4240dc332d	Comment the lack of real_LZ4_uncompress() Added several comments regarding the removal of real_LZ4_uncompress() which exists in the reference implementation but has been removed here since it's not used. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-07-01 14:09:56 -07:00
Brian Behlendorf	7f6884f419	Revert "Fix __zio_execute() asynchronous dispatch" This reverts commit `91579709fc` which limited the asynchronous dispatch to kernel space. We want to do this for two reasons: 1) While we have slightly more headroom in user space excessively deep stacks have been observed while running ztest, see #2293. 2) Removing this conditional makes the pipeline behave consistently regardless of if it's executing in kernel space or user space. This way we're more likely to uncover subtle issues with ztest. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2384	2014-06-11 16:32:57 -07:00
Brian Behlendorf	6795a698f4	Use default slab types We should not override the default memory type of the kmem cache. This was done previously to force certain objects which were slightly over object size limit cut off in to KMC_KMEM caches for better performance. The zfsonlinux/spl#356 patch slightly increases the default cut off from 511 bytes 1024 bytes for x86_64. This means there is long longer a need to override the default for the caches. And since the default values are now being used the new spl_kmem_cache_slab_limit and spl_kmem_cache_kmem_limit tunables will apply to all kmem caches. The following is a list of caches that will be impacted: \| object size \| forced type \| default type ----------------- \| ------------- \| ------------- \| -------------- dnode_t \| 936 bytes \| KMC_KMEM \| KMC_KMEM zio_cache \| 1104 bytes \| KMC_KMEM \| KMC_VMEM zio_link_cache \| 48 bytes \| KMC_KMEM \| KMC_KMEM zio_vdev_cache \| 131088 bytes \| KMC_VMEM \| KMC_VMEM zio_buf_512 \| 512 bytes \| KMC_KMEM \| KMC_KMEM zio_data_buf_512 \| 512 bytes \| KMC_KMEM \| KMC_KMEM zio_buf_1024 \| 1024 bytes \| KMC_KMEM \| KMC_KMEM zio_data_buf_1024 \| 1024 bytes \| +KMC_VMEM \| +KMC_KMEM * Cache memory type will change from KMC_KMEM to KMC_VMEM. + Cache memory type will change from KMC_VMEM to KMC_KMEM. This patch removes another slight point of divergence between ZoL and Illumos. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Prakash Surya <surya1@llnl.gov> Closes #2337	2014-05-22 10:39:52 -07:00
HC	f9a1ac4d59	Honor zfs_nocacheflush for file vdevs For consistency with disk vdevs honor the zfs_nocacheflush tunable. This setting is available primarily for debugging and performance analysis. Signed-off-by: HC <mmttdebbcc@yahoo.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2336	2014-05-19 13:30:48 -07:00
Tim Chase	83021b47c2	Calculate header size correctly in sa_find_sizes() In the case where a variable-sized SA overlaps the spill block pointer and a new variable-sized SA is being added, the header size was improperly calculated to include the to-be-moved SA. This problem could be reproduced when xattr=sa enabled as follows: ln -s $(perl -e 'print "x" x 120') blah setfattr -n security.selinux -v blahblah -h blah The symlink is large enough to interfere with the spill block pointer and has a typical SA registration as follows (shown in modified "zdb -dddd" <SA attr layout obj> format): [ ... ZPL_DACL_COUNT ZPL_DACL_ACES ZPL_SYMLINK ] Adding the SA xattr will attempt to extend the registration to: [ ... ZPL_DACL_COUNT ZPL_DACL_ACES ZPL_SYMLINK ZPL_DXATTR ] but since the ZPL_SYMLINK SA interferes with the spill block pointer, it must also be moved to the spill block which will have a registration of: [ ZPL_SYMLINK ZPL_DXATTR ] This commit updates extra_hdrsize when this condition occurs, allowing hdrsize to be subsequently decreased appropriately. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Issue #2214 Issue #2228 Issue #2316 Issue #2343	2014-05-19 11:55:50 -07:00
Tim Chase	3937ab20f3	Allow for lock-free reading zfsdev_state_list. Restructure the zfsdev_state_list to allow for lock-free reading by converting to a simple singly-linked list from which items are never deleted and over which only forward iterations are performed. It depends on, among other things, the atomicity of accessing the zs_minor integer and zs_next pointer. This fixes a lock inversion in which the zfsdev_state_lock is used by both the sync task (txg_sync) and indirectly by any user program which uses /dev/zfs; the zfsdev_release method uses the same lock and then blocks on the sync task. The most typical failure scenerio occurs when the sync task is cleaning up a user hold while various concurrent "zfs" commands are in progress. Neither Illumos nor Solaris are affected by this issue because they use DDI interface which provides lock-free reading of device state via the ddi_get_soft_state() function. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2301	2014-05-19 11:45:11 -07:00
Chunwei Chen	bc25c9325b	Use a dedicated taskq for vdev_file Originally, vdev_file used system_taskq. This would cause a deadlock, especially on system with few CPUs. The reason is that the prefetcher threads, which are on system_taskq, will sometimes be blocked waiting for I/O to finish. If the prefetcher threads consume all the tasks in system_taskq, the I/O cannot be served and thus results in a deadlock. We fix this by creating a dedicated vdev_file_taskq for vdev_file I/O. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2270	2014-05-14 16:20:21 -07:00
Brian Behlendorf	2c33b91275	Handle vdev_lookup_top() failure in dva_get_dsize_sync() The dva_get_dsize_sync() function incorrectly assumes that the call to vdev_lookup_top() cannot fail. However, the NULL dereference at clearly shows that under certain circumstances it is possible. Note that offset 0x570 (1376) maps as expected to vd->vdev_deflate_ratio. BUG: unable to handle kernel NULL pointer dereference at 00000570 crash> struct -o vdev struct vdev { [0] uint64_t vdev_id; ... ... [1376] uint64_t vdev_deflate_ratio; Given that this can happen this patch add the required error handling. In the case where vdev_lookup_top() fails assume that no deflation will occur for the DVA and use the asize. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Alexey Zhuravlev <alexey.zhuravlev@intel.com> Closes #1707 Closes #1987 Closes #1891 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-05-06 10:41:48 -07:00
Tim Chase	962d524212	Check the dataset type more rigorously when fetching properties. When fetching property values of snapshots, a check against the head dataset type must be performed. Previously, this additional check was performed only when fetching "version", "normalize", "utf8only" or "case". This caused the ZPL properties "acltype", "exec", "devices", "nbmand", "setuid" and "xattr" to be erroneously displayed with meaningless values for snapshots of volumes. It also did not allow for the display of "volsize" of a snapshot of a volume. This patch adds the headcheck flag paramater to zfs_prop_valid_for_type() and zprop_valid_for_type() to indicate the check is being done against a head dataset's type in order that properties valid only for snapshots are handled correctly. This allows the the head check in get_numeric_property() to be performed when fetching a property for a snapshot. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2265	2014-05-06 10:41:46 -07:00
Brian Behlendorf	1ce0457348	Fix style A minor style issue was accidentally introduced by `aa7d06a`. This change resolves that style problem. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-05-06 10:41:17 -07:00
George Wilson	aa7d06a98a	Illumos #4101 finer-grained control of metaslab_debug Today the metaslab_debug logic performs two tasks: - load all metaslabs on import/open - don't unload metaslabs at the end of spa_sync This change provides knobs for each of these independently. References: https://illumos.org/issues/4101 https://github.com/illumos/illumos-gate/commit/0713e23 Notes: 1) This is a small piece of the metaslab improvement patch from Illumos. It was worth bringing over before the rest, since it's low risk and it can be useful on fragmented pools (e.g. Lustre MDTs). metaslab_debug_unload would give the performance benefit of the old metaslab_debug option without causing unwanted delay during pool import. Ported-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2227	2014-05-06 09:46:04 -07:00
Brian Behlendorf	cc79a5c263	Treat spill block dbufs as meta data When the system attributes (SAs) for an object exceed what can can be stored in the bonus area of a dnode a spill block is allocated. These spill blocks are currently considered data blocks. However, they should be accounted for as meta data because they are effectively an extension of the dnode. While this may seem like a minor accounting issue it has broader implications. The key thing to be aware of is that each spill block will hold a reference on its parent dnode. The dnode in turn holds a reference on its dbuf in the dnode object. This means that a single 512 byte data buffer for a spill block can pin over 16k of meta data. This is analogous to the small file situation described in `2b13331` where a relatively small number of data buffer can cause the ARC to exceed the meta limit. However, unlike the small file case a spill block can legitimately be considered meta data. By changing the spill block to meta data they will now be dropped from the cache when the meta limit is reached. This then allows the dnodes and dbufs which the spill block was pinning to be released. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Prakash Surya <surya1@llnl.gov> Closes #2294	2014-05-05 13:56:59 -07:00
Brian Behlendorf	12f9a6a3f9	dmu_tx_assign() should not return ENOMEM As described in the comment above dmu_tx_assign() this function must only fail if the pool is out of space. If for some other reason the TX cannot be assigned (such as memory pressure) ERESTART must be returned. Alternately, EAGAIN could be returned to inject a delay but that isn't required because the caller will block on the condition variable waiting for the next TXG. /* * Assign tx to a transaction group. txg_how can be one of: * * (1) TXG_WAIT. If the current open txg is full, waits until there's * a new one. This should be used when you're not holding locks. * It will only fail if we're truly out of space (or over quota). * ... */ Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ned Bass <bass6@llnl.gov> Closes #2287	2014-05-01 12:08:53 -07:00
Richard Yao	9d317793aa	Implement File Attribute Support We add support for lsattr and chattr to resolve a regression caused by `88c283952f` that broke Python's xattr.list(). That changet broke Gentoo Portage's FEATURES=xattr, which depended on Python's xattr.list(). Only attributes common to both Solaris and Linux are supported. These are 'a', 'd' and 'i' in Linux's lsattr and chattr commands. File attributes exclusive to Solaris are present in the ZFS code, but cannot be accessed or modified through this method. That was the case prior to this patch. The resolution of issue zfsonlinux/zfs#229 should implement some method to permit access and modification of Solaris-specific attributes. References: https://bugs.gentoo.org/show_bug.cgi?id=483516 Original-patch-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1691	2014-05-01 10:11:18 -07:00
Chunwei Chen	17584980b9	Add assertion to catch 0-count page Some network related block device uses tcp_sendpage, which doesn't behave well when using 0-count page. Add assertion to catch them. This has a runtime dependency on: zfsonlinux/spl@ae16ed9 Fix crash when using ZFS on Ceph rbd Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2277	2014-04-25 15:41:19 -07:00
Ned Bass	de39ec11b8	Fix LZ4 endianness autodetection Endianness detection in LZ4 is broken in user-space builds. This bug corrupts compressed data and manifests itself in several ztest failures. When LZ4 was originally ported to Illumos ZFS, the proper checks for Linux were stripped out. The Linux port then inherited the remaining detection code that works on Illumos but not on Linux. The current LZ4 endianness check misuses the condition defined(__BIG_ENDIAN) to indicate a big-endian system. On Linux __BIG_ENDIAN is defined uncondtionally in the user-space header /usr/include/endian.h, regardless of the endianness of the system. The kernel does not use this header, so only user-space builds are affected. While we could fix this by restoring the upstream LZ4 endianness detection code, reliable checks already exist in libspl/include/sys/isa_defs.h. This change uses the libspl results to replace the word-size and endianness checks in LZ4, simplifying the code and reducing duplication. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: DHE <git@dehacked.net> Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Fixes #1963 Fixes #1964 Fixes #1965	2014-04-20 16:55:42 -07:00
Brian Behlendorf	4fd762f8ad	Fix zfsdev_ioctl() kmem leak warning Due to an asymmetry in the kmem accounting a memory leak was being reported when it was only an accounting issue. All memory allocated with kmem_alloc() must be released with kmem_free() or it will not be properly accounted for. In this case the code used strfree() to release the memory allocated by kmem_alloc(). Presumably this was done because the size of the memory region wasn't available when the memory needed to be freed. To resolve this issue the code has been updated to use strdup() instead of kmem_alloc() to allocate the memory. Like strfree(), strdup() is not integrated with the memory accounting. This means we can use strfree() to release it like Illumos. SPL: kmem leaked 10/4368729 bytes address size data func:line ffff880067e9aa40 10 ZZZZZZZZZZ zfsdev_ioctl:5655 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Closes #2262	2014-04-18 13:30:15 -07:00
DHE	2dbedf5484	Uninitialized variable spa_autoreplace used Caught by ztest and valgrind. Signed-off-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2259	2014-04-16 10:59:24 -07:00
Chunwei Chen	0b75bdb369	Use ddi_time_after and friends to compare time Also, make sure we use clock_t for ddi_get_lbolt to prevent type conversion from screwing things. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2142	2014-04-14 13:27:56 -07:00
Chunwei Chen	b761912b34	Linux 3.14 compat: rq_for_each_segment in dmu_req_copy rq_for_each_segment changed from taking bio_vec * to taking bio_vec. We provide rq_for_each_segment4 which takes both. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2124	2014-04-10 14:28:51 -07:00
Chunwei Chen	22760eebef	Revert "Fix zvol+btrfs hang" After the dmu_req_copy change, bi_io_vecs are not touched, so this is no longer needed. This reverts commit `e26ade5101`. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2124	2014-04-10 14:28:47 -07:00
Chunwei Chen	215b4634c7	Refactor dmu_req_copy for immutable biovec changes Originally, dmu_req_copy modifies bv_len and bv_offset in bio_vec so that it can continue in subsequent passes. However, after the immutable biovec changes in Linux 3.14, this is not allowed. So instead, we just tell dmu_req_copy how many bytes are already copied and it will skip to the right spot accordingly. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2124	2014-04-10 14:28:43 -07:00
Chunwei Chen	d4541210f3	Linux 3.14 compat: Immutable biovec changes in vdev_disk.c bi_sector, bi_size and bi_idx are moved from bio to bio->bi_iter. This patch creates BIO_BI_*(bio) macros to hide the differences. Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2124	2014-04-10 14:28:38 -07:00
Chunwei Chen	408ec0d2e1	Linux 3.14 compat: posix_acl_{create,chmod} posix_acl_{create,chmod} is changed to __posix_acl_{create_chmod} Signed-off-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2124	2014-04-10 14:27:03 -07:00
Richard Yao	f3ad9cd67a	Fix locking order in zfs_zget() Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-04-04 09:12:47 -07:00
Richard Yao	6f9548c487	Fix deadlock in zfs_zget() zfsonlinux/zfs#180 occurred because of a race between inode eviction and zfs_zget(). zfsonlinux/zfs@36df284 tried to address it by making a call to the VFS to learn whether an inode is being evicted. If it was being evicted the operation was retried after dropping and reacquiring the relevant resources. Unfortunately, this introduced another deadlock. INFO: task kworker/u24:6:891 blocked for more than 120 seconds. Tainted: P O 3.13.6 #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. kworker/u24:6 D ffff88107fcd2e80 0 891 2 0x00000000 Workqueue: writeback bdi_writeback_workfn (flush-zfs-5) ffff8810370ff950 0000000000000002 ffff88103853d940 0000000000012e80 ffff8810370fffd8 0000000000012e80 ffff88103853d940 ffff880f5c8be098 ffff88107ffb6950 ffff8810370ff980 ffff88103a9a5b78 0000000000000000 Call Trace: [<ffffffff813dd1d4>] schedule+0x24/0x70 [<ffffffff8115fc09>] __wait_on_freeing_inode+0x99/0xc0 [<ffffffff8115fdd8>] find_inode_fast+0x78/0xb0 [<ffffffff811608c5>] ilookup+0x65/0xd0 [<ffffffffa035c5ab>] zfs_zget+0xdb/0x260 [zfs] [<ffffffffa03589d6>] zfs_get_data+0x46/0x340 [zfs] [<ffffffffa035fee1>] zil_add_block+0xa31/0xc00 [zfs] [<ffffffffa0360642>] zil_commit+0x12/0x20 [zfs] [<ffffffffa036a6e4>] zpl_putpage+0x174/0x840 [zfs] [<ffffffff811071ec>] do_writepages+0x1c/0x40 [<ffffffff8116df2b>] __writeback_single_inode+0x3b/0x2b0 [<ffffffff8116ecf7>] writeback_sb_inodes+0x247/0x420 [<ffffffff8116f5f3>] wb_writeback+0xe3/0x320 [<ffffffff81170b8e>] bdi_writeback_workfn+0xfe/0x490 [<ffffffff8106072c>] process_one_work+0x16c/0x490 [<ffffffff810613f3>] worker_thread+0x113/0x390 [<ffffffff81066edf>] kthread+0xdf/0x100 This patch implements the original fix in a slightly different manner in order to avoid both deadlocks. Instead of relying on a call to ilookup() which can block in __wait_on_freeing_inode() the return value from igrab() is used. This gives us the information that ilookup() provided without the risk of a deadlock. Alternately, this race could be closed by registering an sops->drop_inode() callback. The callback would need to detect the active SA hold thereby informing the VFS that this inode should not be evicted. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #180	2014-04-04 09:11:54 -07:00
Brian Behlendorf	8ac67298b1	Revert "Fixed a use-after-free bug in zfs_zget()." This reverts commit `36df284366`. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-04-03 16:23:28 -07:00
Brian Behlendorf	904ea2763e	Add automatic hot spare functionality When a vdev starts getting I/O or checksum errors it is now possible to automatically rebuild to a hot spare device. To cleanly support this functionality in a shell script some additional information was added to all zevent ereports which include a vdev. This covers both io and checksum zevents but may be used but other scripts. In the Illumos FMA solution the same information is required but it is retrieved through the libzfs library interface. Specifically the following members were added: vdev_spare_paths - List of vdev paths for all hot spares. vdev_spare_guids - List of vdev guids for all hot spares. vdev_read_errors - Read errors for the problematic vdev vdev_write_errors - Write errors for the problematic vdev vdev_cksum_errors - Checksum errors for the problematic vdev. By default the required hot spare scripts are installed but this functionality is disabled. To enable hot sparing uncomment the ZED_SPARE_ON_IO_ERRORS and ZED_SPARE_ON_CHECKSUM_ERRORS in the /etc/zfs/zed.d/zed.rc configuration file. These scripts do no add support for the autoexpand property. At a minimum this requires adding a new udev rule to detect when a new device is added to the system. It also requires that the autoexpand policy be ported from Illumos, see: https://github.com/illumos/illumos-gate/blob/master/usr/src/cmd/syseventd/modules/zfs_mod/zfs_mod.c Support for detecting the correct name of a vdev when it's not a whole disk was added by Turbo Fredriksson. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlap <cdunlap@llnl.gov> Signed-off-by: Turbo Fredriksson <turbo@bayour.com> Issue #2	2014-04-02 13:10:08 -07:00
Brian Behlendorf	9b101a7320	Clarify zpool_events_next() comment Due to the very poorly chosen argument name 'cleanup_fd' it was completely unclear that this file descriptor is used to track the current cursor location. When the file descriptor is created by opening ZFS_DEV a private cursor is created in the kernel for the returned file descriptor. Subsequent calls to zpool_events_next() and zpool_events_seek() then require the file descriptor as an argument to reposition the cursor. When the file descriptor is closed the kernel state tracking the cursor is destroyed. This patch contains no functional change, it just changes a few variable names and clarifies the documentation. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlap <cdunlap@llnl.gov> Issue #2	2014-03-31 16:11:08 -07:00
Brian Behlendorf	75e3ff58fe	Add zpool_events_seek() functionality The ZFS_IOC_EVENTS_SEEK ioctl was added to allow user space callers to seek around the zevent file descriptor by EID. When a specific EID is passed and it exists the cursor will be positioned there. If the EID is no longer cached by the kernel ENOENT is returned. The caller may also pass ZEVENT_SEEK_START or ZEVENT_SEEK_END to seek to those respective locations. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlap <cdunlap@llnl.gov> Issue #2	2014-03-31 16:10:57 -07:00
Brian Behlendorf	a2f1945ee3	Add a unique "eid" value to all zevents Tagging each zevent with a unique monotonically increasing EID (Event IDentifier) provides the required infrastructure for a user space daemon to reliably process zevents. By writing the EID to persistent storage the daemon can safely resume where it left off in the event stream when it's restarted. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlap <cdunlap@llnl.gov> Issue #2	2014-03-31 16:10:41 -07:00
Boris Protopopov	0ed212dc0e	Illumos #4089 NULL pointer dereference in arc_read() 4089 NULL pointer dereference in arc_read() Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Reviewed by: Garrett D'Amore <garrett@damore.org> Approved by: Dan McDonald <danmcd@nexenta.com> References: https://www.illumos.org/issues/4089 illumos/illumos-gate@57815f6b95 Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2171 Issue #2165 Closes #2198	2014-03-24 11:06:57 -07:00
Richard Yao	26b42f3f9d	Implement -t option to zpool import for temporary pool names Originally, users had to handle spa namespace collisions by either exporting the already imported pool or by specifying a new name for the pool with a conflicting name. In the case of root pools from virtual guests, neither approach to collision resolution is reasonable. This is addressed by extending the new name syntax with a -t option to specify that the new name is temporary. When specified, this sets an internal flag that is passed into the kernel to tell it that all label updates should refer to the name used in the original label. Consequently, the original pool name will be retained on export. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2189	2014-03-20 12:05:30 -07:00
Andrey Vesnovaty	00fcdee1f8	Fix regression introduced in port of Illumos #3744 Remove the redundant call to zfs_unmount_snap() which was being done after char array was freed, This fixes an upstream regression that was introduced in commit zfsonlinux/zfs@d09f25dc66, which ported the Illumos 3744 changes. Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Closes #2156	2014-03-20 11:00:48 -07:00
Boris Protopopov	47fe91b54c	Illumos #4088 use after free in arc_release() 4088 use after free in arc_release() Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Garrett D'Amore <garrett@damore.org> Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com> Approved by: Dan McDonald <danmcd@nexenta.com> References: https://www.illumos.org/issues/4088 illumos/illumos-gate@ccc22e1304 From the illumos issue: A race-induced use after free occurs in arc_release() where the ARC header is used outside the critical section protected by the hash_lock. Ported by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <ryao@gentoo.org> Closes #2162	2014-03-10 09:11:15 -07:00
Tim Chase	a45fc6a677	Use KM_PUSHPAGE in spa_add() for spa_label_features. The spa_label_features nvlist is used in the sync context during pool version upgrade. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2168	2014-03-10 09:09:30 -07:00
Brian Behlendorf	e74400155f	Export symbols dsl_sync_task{_nowait} These are needed by consumers (i.e. Lustre) who wish to perform a callback in the syncing context. Both a blocking and non-blocking version are available to callers. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2014-03-07 10:01:36 -08:00
Ned Bass	a77c4c8332	Improve reporting of tx assignment wait times Some callers of dmu_tx_assign() use the TXG_NOWAIT flag and call dmu_tx_wait() themselves before retrying if the assignment fails. The wait times for such callers are not accounted for in the dmu_tx_assign kstat histogram, because the histogram only records time spent in dmu_tx_assign(). This change moves the histogram update to dmu_tx_wait() to properly account for all time spent there. One downside of this approach is that it is possible to call dmu_tx_wait() multiple times before successfully assigning a transaction, in which case the cumulative wait time would not be recorded. However, this case should not often arise in practice, because most callers currently use one of these forms: dmu_tx_assign(tx, TXG_WAIT); dmu_tx_assign(tx, waited ? TXG_WAITED : TXG_NOWAIT); The first form should make just one call to dmu_tx_delay() inside of dmu_tx_assign(). The second form retries with TXG_WAITED if the first assignment fails and incurs a delay, in which case no further waiting is performed. Therefore transaction delays normally occur in one call to dmu_tx_wait() so the histogram should be fairly accurate. Another possible downside of this approach is that the histogram will no longer record overhead outside of dmu_tx_wait() such as in dmu_tx_try_assign(). While I'm not aware of any reason for concern on this point, it is conceivable that lock contention, long list traversal, etc. could cause assignment delays that would not be reflected in the histogram. Therefore the histogram should strictly be used for visibility in to the normal delay mechanisms and not as a profiling tool for code performance. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1915	2014-03-04 12:22:24 -08:00
Ned Bass	3ccab25205	replace nreserved with ndirty in txgs kstat The nreserved column in the txgs kstat file always contains 0 following the write throttle restructuring of commit `e8b96c6007`. Prior to that commit, the nreserved column showed the number of bytes temporarily reserved in the pool by a transaction group at sync time. The new write throttle did away with temporary reservations and uses the amount of dirty data instead. To approximate the old output of the txgs kstat, the number of dirty bytes per-txg was passed in as the nreserved value to spa_txg_history_set_io(). This approach did not work as intended, because the per-txg dirty value is decremented as data is written out to disk, so it is zero by the time we call spa_txg_history_set_io(). To fix this, save the number of dirty bytes before calling spa_sync(), and pass this value in to spa_txg_history_set_io(). Also, since the name "nreserved" is now a misnomer, the column heading is now labeled "ndirty". Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #1696	2014-03-04 12:22:24 -08:00
Ned Bass	3d920a1567	dmu_tx kstat cleanup A few counters in the dmu_tx kstats are obsolete or no longer bumped properly. - The sync task restructuring commit `13fe019870` removed the code that bumpted dmu_tx_quota. The counter is now bumped in two cases, instead of just the one case as before (after the result of dsl_dataset_check_quota call). The second case is where we check the requested reservation against the actual pool size, as this is an implicit quota of sorts. - The write throttle restructuring commit `e8b96c6007` makes dmu_tx_how and dmu_tx_inflight obsolete, so they are removed. Signed-off-by: Kohsuke Kawaguchi <kk@kohsuke.org> Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #1914	2014-03-04 12:22:24 -08:00
Richard Yao	cecb7487fc	Invalidate Linux buffer cache on vdevs upon each flush Userland tools such as blkid, grub2-probe and zdb will go through the buffer cache. However, ZFS uses on submit_bio() to bypass the buffer cache when performing IO operations on vdevs for efficiency purposes. This permits the on-disk state and buffer cache to fall out of synchronization. That causes seemingly random failures when tools reading stale metadata from the buffer cache try to access references to data that is no longer there. A particularly bad failure this causes involves grub2-probe, which is used by grub2-mkconfig. Ordinarily, a rootfs might be called rpool/ROOT/gentoo. However, when a failure occurs in grub2-probe, grub2-mkconfig will generate a configuration file containing /ROOT/gentoo, which omits the pool name and causes a boot failure. This is avoidable by calling invalidate_bdev() on each flush, which is a simple way to ensure that all non-dirty pages are wiped. Since userland tools rarely access vdevs directly, this should be a fancy noop >99.999% of the time and have little impact on IO. We could have tried a finer grained approach for the rare instances in which the vdevs are accessed frequently by userland. However, that would require consideration of corner cases and it is not worth the effort. Memory-wise, it would have been better to use a Linux kernel API hook to disable the buffer cache on such devices, but it provides us no way of doing that, so we opt for this approach instead. We should revisit that idea in the future when higher priority issues have been tackled. Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2150	2014-03-04 12:22:03 -08:00
Alexander Stetsenko	36f92e93e5	Illumos #4574 get_clones_stat does not call zap_count in non-debug kernel 4574 get_clones_stat does not call zap_count in non-debug kernel Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Marcel Telka <marcel@telka.sk> Approved by: Gordon Ross <gwr@nexenta.com> References: https://www.illumos.org/issues/4574 illumos/illumos-gate@03d1795fa6 Ported-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2147	2014-03-04 11:50:13 -08:00
Tim Chase	13a7ba1c2c	Fix zap_lookup() in feature_is_supported(). The length (number of integers) argument passed to zap_lookup was wrong; likely as a result of performing stack-reduction on the function. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2141	2014-03-04 11:44:44 -08:00
Andrew Barnes	1ba1615925	Remove recursion from dsl_dir_willuse_space() Remove recursion from dsl_dir_willuse_space() to reduce stack usage. Issues with stack overflow were observed in zfs recv of zvols, likelihood of an overflow is proportional to the depth of the dataset as dsl_dir_willuse_space() recurses to parent datasets. Signed-off-by: Andrew Barnes <barnes333@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #2069	2014-03-04 11:22:27 -08:00
Prakash Surya	2b13331d62	Set "arc_meta_limit" to 3/4 arc_c_max by default Unfortunately, this change is an cheap attempt to work around a pathological workload for the ARC. A "real" solution still needs to be fleshed out, so this patch is intended to alleviate the situation in the meantime. Let me try and describe the problem.. Data buffers residing in the dbuf hash table (dbuf cache) will keep a hold on their respective dnode, this dnode will in turn keep a hold on its backing dbuf (the physical block of the dnode object backing it). Since the dnode has a hold on its backing dbuf, the arc buffer for this dbuf is unevictable. What this essentially boils down to, "data" buffers have the potential to pin "metadata" in the arc (as a result of these dnode object buffers being unevictable). This scenario becomes a real problem when the workload consists of many small files (e.g. creating millions of 4K files). With this workload, the arc's "arc_meta_used" space get filled up with buffers for any resident directories as well as buffers for the objset's dnode object. Once the "arc_meta_limit" is reached, the directory buffers will be evicted and only the unevictable dnode object buffers will reside. If the workload is simply creating new small files, these dnode object buffers will never even be needed again, whereas the directory buffers will be used constantly until the creates move to a new directory. If "arc_c" and "arc_meta_limit" are sized appropriately, this situation wont occur. This is because as the data buffers accumulate, "arc_size" will eventually approach "arc_c" (before "arc_meta_used" reaches "arc_meta_limit"); at that point the data buffers will be evicted, which releases the hold on the dnode, which releases the hold on the dnode object's dbuf, which allows that buffer to be evicted from the arc in preference to more "useful" metadata. So, to side step the issue, we simply need to ensure "arc_size" reaches "arc_c" before "arc_meta_used" reaches "arc_meta_limit". In order to pick a proper limit, we have to do some math. To make things a little easier to follow, it is assumed that there will only be a single data buffer per file (which is probably always the case for "small" files anyways). Based on the current internals of the arc, if N files residing in the dbuf cache all pin a single dnode buffer (i.e. their dnodes all share the same physical dnode object block), then the following amount of "arc_meta_used" space will be consumed: - 16K for the dnode object's block - [ 16384 bytes] - N * sizeof(dnode_t) -------------- [ N * 928 bytes] - (N + 1) * sizeof(arc_buf_t) ------ [(N + 1) * 72 bytes] - (N + 1) * sizeof(arc_buf_hdr_t) -- [(N + 1) * 264 bytes] - (N + 1) * sizeof(dmu_buf_impl_t) - [(N + 1) * 280 bytes] To simplify, these N files will pin the following amount of "arc_meta_used" space as unevictable: Pinned "arc_meta_used" bytes = 16384 + N * 928 + (N + 1) * (72 + 264 + 280) Pinned "arc_meta_used" bytes = 17000 + N * 1544 This pinned space is regardless of the size of the files, and is only dependent on the number of pinned dnodes sharing a physical block (i.e. N). For example, 32 512b files sharing a single dnode object block would consume the same "arc_meta_used" space as 32 4K files sharing a single dnode object block. Now, given a files size of S, we can determine the total amount of space that will be consumed in the arc: Total = 17000 + N * 1544 + S * N ^^^^^^^^^^^^^^^^ ^^^^^ metadata data So, given these formulas, we can generate a table which states the ratio of pinned metadata to total arc (meta + data) using different values of N (number of pinned dnodes per pinned physical dnode block) and S (size of the file). File Sizes (S) \| 512 \| 1024 \| 2048 \| 4096 \| 8192 \| 16384 \| ---+----------+----------+----------+----------+----------+----------+ 1 \| 0.973132 \| 0.947670 \| 0.900544 \| 0.819081 \| 0.693597 \| 0.530921 \| 2 \| 0.951497 \| 0.907481 \| 0.830632 \| 0.710325 \| 0.550779 \| 0.380051 \| N 4 \| 0.918807 \| 0.849809 \| 0.738842 \| 0.585844 \| 0.414271 \| 0.261250 \| 8 \| 0.877541 \| 0.781803 \| 0.641770 \| 0.472505 \| 0.309333 \| 0.182965 \| 16 \| 0.835819 \| 0.717945 \| 0.559996 \| 0.388885 \| 0.241376 \| 0.137253 \| 32 \| 0.802106 \| 0.669597 \| 0.503304 \| 0.336277 \| 0.202123 \| 0.112423 \| As you can see, if we wanted to support the absolute worst case of 1 dnode per physical dnode block and 512b files, we would have to set the "arc_meta_limit" to something greater than 97.3132% of "arc_c_max". At that point, it essentially defeats the purpose of having an "arc_meta_limit" at all. This patch changes the default value of "arc_meta_limit" to be 75% of "arc_c_max", which should be good enough for "most" workloads (I think). Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2110	2014-02-21 16:10:49 -08:00
Prakash Surya	cc7f677c16	Split "data_size" into "meta" and "data" Previously, the "data_size" field in the arcstats kstat contained the amount of cached "metadata" and "data" in the ARC. The problem is this then made it difficult to extract out just the "metadata" size, or just the "data" size. To make it easier to distinguish the two values, "data_size" has been modified to count only buffers of type ARC_BUFC_DATA, and "meta_size" was added to count only buffers of type ARC_BUFC_METADATA. If one wants the old "data_size" value, simply sum the new "data_size" and "meta_size" values. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2110	2014-02-21 16:10:49 -08:00
Prakash Surya	da8ccd0ee0	Prioritize "metadata" in arc_get_data_buf When the arc is at it's size limit and a new buffer is added, data will be evicted (or recycled) from the arc to make room for this new buffer. As far as I can tell, this is to try and keep the arc from over stepping it's bounds (i.e. keep it below the size limitation placed on it). This makes sense conceptually, but there appears to be a subtle flaw in its current implementation, resulting in metadata buffers being throttled. When it evicts from the arc's lists, it also passes in a "type" so as to remove a buffer of the same type that it is adding. The problem with this is that once the size limit is hit, the ratio of "metadata" to "data" contained in the arc essentially becomes fixed. For example, consider the following scenario: * the size of the arc is capped at 10G * the meta_limit is capped at 4G * 9G of the arc contains "data" * 1G of the arc contains "metadata" Now, every time a new "metadata" buffer is created and added to the arc, an older "metadata" buffer(s) will be removed from the arc; preserving the 9G "data" to 1G "metadata" ratio that was in-place when the size limit was reached. This occurs even though the amount of "metadata" is far below the "metadata" limit. This can result in the arc behaving pathologically for certain workloads. To fix this, the arc_get_data_buf function was modified to evict "data" from the arc even when adding a "metadata" buffer; unless it's at the "metadata" limit. In addition, arc_evict now more closely resembles arc_evict_ghost; such that when evicting "data" from the arc, it may make a second pass over the arc lists and evict "metadata" if it cannot meet the eviction size the first time around. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2110	2014-02-21 16:10:49 -08:00
Prakash Surya	77765b540b	Remove "arc_meta_used" from arc_adjust calculation Using "arc_meta_used" to determine if the arc's mru list is over it's target value of "arc_p" doesn't seem correct. The size of the mru list and the value of "arc_meta_used", although related, are completely independent. Buffers contained in "arc_meta_used" may not even be contained in the arc's mru list. As such, this patch removes "arc_meta_used" from the calculation in arc_adjust. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2110	2014-02-21 16:10:49 -08:00
Prakash Surya	94520ca462	Prune metadata from ghost lists in arc_adjust_meta To maintain a strict limit on the metadata contained in the arc, while preventing the arc buffer headers from completely consuming the "arc_meta_used" space, we need to evict metadata buffers from the arc's ghost lists along with the regular lists. This change modifies arc_adjust_meta such that it more closely models the adjustments made in arc_adjust. "arc_meta_used" is used similarly to "arc_size", and "arc_meta_limit" is used similarly to "arc_c". Testing metadata intensive workloads (e.g. creating, copying, and removing millions of small files and/or directories) has shown this change to make a dramatic improvement to the hit rate maintained in the arc. While I think there is still room for improvement, this is a big step in the right direction. In addition, zpl_free_cached_objects was made into a no-op as I'm not yet sure how to properly implement that function. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2110	2014-02-21 16:10:49 -08:00
Prakash Surya	1e3cb67b53	Revert "Return -1 from arc_shrinker_func()" This reverts commit `c11a12bc3b`. Out of memory events were fixed by reverting this patch. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2110	2014-02-21 16:10:49 -08:00
Prakash Surya	624227854e	Disable arc_p adapt dampener by default It's unclear why adjustments to arc_p need to be dampened as they are in arc_adjust. With that said, it's removal significantly improves the arc's ability to "warm up" to a given workload. Thus, I'm disabling by default until its usefulness is better understood. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2110	2014-02-21 16:10:49 -08:00
Prakash Surya	f521ce1b9c	Allow "arc_p" to drop to zero or grow to "arc_c" Setting a limit on the minimum value of "arc_p" has been shown to have detrimental effects on the arc hit rate for certain "metadata" intensive workloads. Specifically, this has been exhibited with a workload that constantly dirties new "metadata" but also frequently touches a "small" amount of mfu data (e.g. mkdir's). What is seen is that the new anon data throttles the mfu list to a negligible size (because arc_p > anon + mru in arc_get_data_buf), even though the mfu ghost list receives a constant stream of hits. To remedy this, arc_p is now allowed to drop to zero if the algorithm deems it necessary. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2110	2014-02-21 16:10:27 -08:00
Prakash Surya	89c8cac493	Disable aggressive arc_p growth by default For specific workloads consisting mainly of mfu data and new anon data buffers, the aggressive growth of arc_p found in the arc_get_data_buf() function can have detrimental effects on the mfu list size and ghost list hit rate. Running a workload consisting of two processes: * Process 1 is creating many small files * Process 2 is tar'ing a directory consisting of many small files I've seen arc_p and the mru grow to their maximum size, while the mru ghost list receives 100K times fewer hits than the mfu ghost list. Ideally, as the mfu ghost list receives hits, arc_p should be driven down and the size of the mfu should increase. Given the specific workload I was testing with, the mfu list size should grow to a point where almost no mfu ghost list hits would occur. Unfortunately, this does not happen because the newly dirtied anon buffers constancy drive arc_p to its maximum value and keep it there (effectively prioritizing the mru list and starving the mfu list down to a negligible size). The logic to increment arc_p from within the arc_get_data_buf() function was introduced many years ago in this upstream commit: commit 641fbdae3a027d12b3c3dcd18927ccafae6d58bc Author: maybee <none@none> Date: Wed Dec 20 15:46:12 2006 -0800 6505658 target MRU size (arc.p) needs to be adjusted more aggressively and since I don't fully understand the motivation for the change, I am reluctant to completely remove it. As a way to test out how it's removal might affect performance, I've disabled that code by default, but left it tunable via a module option. Thus, if its removal is found to be grossly detrimental for certain workloads, it can be re-enabled on the fly, without a code change. Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2110	2014-02-21 14:53:28 -08:00
Prakash Surya	39e055c44b	Adjust arc_p based on "bytes" in arc_shrink In an attempt to prevent arc_c from collapsing "too fast", the arc_shrink() function was updated to take a "bytes" parameter by this change: commit `302f753f16` Author: Brian Behlendorf <behlendorf1@llnl.gov> Date: Tue Mar 13 14:29:16 2012 -0700 Integrate ARC more tightly with Linux Unfortunately, that change failed to make a similar change to the way that arc_p was updated. So, there still exists the possibility for arc_p to collapse to near 0 when the kernel start calling the arc's shrinkers. This change attempts to fix this, by decrementing arc_p by the "bytes" parameter in the same way that arc_c is updated. In addition, a minimum value of arc_p is attempted to be maintained, similar to the way a minimum arc_p value is maintained in arc_adapt(). Signed-off-by: Prakash Surya <surya1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2110	2014-02-21 14:53:08 -08:00
Brian Behlendorf	9141582592	Set zfs_arc_min to 4MB Decrease the mimimum ARC size from 1/32 of total system memory (or 64MB) to a much smaller 4MB. 1) Large systems with over a 1TB of memory are being deployed and reserving 1/32 of this memory (32GB) as the mimimum requirement is overkill. 2) Tiny systems like the raspberry pi may only have 256MB of memory in which case 64MB is far too large. The ARC should be reclaimable if the VFS determines it needs the memory for some other purpose. If you want to ensure the ARC is never completely reclaimed due to memory pressure you may still set a larger value with zfs_arc_min. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Prakash Surya <surya1@llnl.gov> Issue #2110	2014-02-21 14:52:02 -08:00
Richard Yao	4f2dcb3eee	Add erratum for issue #2094 ZoL commit `1421c89` unintentionally changed the disk format in a forward- compatible, but not backward compatible way. This was accomplished by adding an entry to zbookmark_t, which is included in a couple of on-disk structures. That lead to the creation of pools with incorrect dsl_scan_phys_t objects that could only be imported by versions of ZoL containing that commit. Such pools cannot be imported by other versions of ZFS or past versions of ZoL. The additional field has been removed by the previous commit. However, affected pools must be imported and scrubbed using a version of ZoL with this commit applied. This will return the pools to a state in which they may be imported by other implementations. The 'zpool import' or 'zpool status' command can be used to determine if a pool is impacted. A message similar to one of the following means your pool must be scrubbed to restore compatibility. $ zpool import pool: zol-0.6.2-173 id: 1165955789558693437 state: ONLINE status: Errata #1 detected. action: The pool can be imported using its name or numeric identifier, however there is a compatibility issue which should be corrected by running 'zpool scrub' see: http://zfsonlinux.org/msg/ZFS-8000-ER config: ... $ zpool status pool: zol-0.6.2-173 state: ONLINE scan: pool compatibility issue detected. see: https://github.com/zfsonlinux/zfs/issues/2094 action: To correct the issue run 'zpool scrub'. config: ... If there was an async destroy in progress 'zpool import' will prevent the pool from being imported. Further advice on how to proceed will be provided by the error message as follows. $ zpool import pool: zol-0.6.2-173 id: 1165955789558693437 state: ONLINE status: Errata #2 detected. action: The pool can not be imported with this version of ZFS due to an active asynchronous destroy. Revert to an earlier version and allow the destroy to complete before updating. see: http://zfsonlinux.org/msg/ZFS-8000-ER config: ... Pools affected by the damaged dsl_scan_phys_t can be detected prior to an upgrade by running the following command as root: zdb -dddd poolname 1 \| grep -P '^\t\tscan = ' \| sed -e 's;scan = ;;' \| wc -w Note that `poolname` must be replaced with the name of the pool you wish to check. A value of 25 indicates the dsl_scan_phys_t has been damaged. A value of 24 indicates that the dsl_scan_phys_t is normal. A value of 0 indicates that there has never been a scrub run on the pool. The regression caused by the change to zbookmark_t never made it into a tagged release, Gentoo backports, Ubuntu, Debian, Fedora, or EPEL stable respositorys. Only those using the HEAD version directly from Github after the 0.6.2 but before the 0.6.3 tag are affected. This patch does have one limitation that should be mentioned. It will not detect errata #2 on a pool unless errata #1 is also present. It expected this will not be a significant problem because pools impacted by errata #2 have a high probably of being impacted by errata #1. End users can ensure they do no hit this unlikely case by waiting for all asynchronous destroy operations to complete before updating ZoL. The presence of any background destroys on any imported pools can be checked by running `zpool get freeing` as root. This will display a non-zero value for any pool with an active asynchronous destroy. Lastly, it is expected that no user data has been lost as a result of this erratum. Original-patch-by: Tim Chase <tim@chase2k.com> Reworked-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #2094	2014-02-21 12:10:40 -08:00

1 2 3 4 5 ...

811 Commits