freebsd-nq

Author	SHA1	Message	Date
Serapheim Dimitropoulos	1c44a5c97f	hdr_recl calls zthr_wakeup() on destroyed zthr There exists a race condition were hdr_recl() calls zthr_wakeup() on a destroyed zthr. The timeline is the following: [1] hdr_recl() runs first and goes intro zthr_wakeup() because arc_initialized is set. [2] arc_fini() is called by another thread, zeroes that flag, destroying the zthr, and goes into buf_init(). [3] hdr_recl() tries to enter the destroyed mutex and we blow up. This patch ensures that the ARC's zthrs are not offloaded any new work once arc_initialized is set and then destroys them after all of the ARC state has been deleted. Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #9047	2019-07-18 12:55:29 -07:00
Tomohiro Kusumi	f79121d114	Fix wrong comment on zcr_blksz_{min,max} These aren't tunable; illumos has this comment fixed in "3742 zfs comments need cleaner, more consistent style", so sync with that. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #9052	2019-07-18 12:48:46 -07:00
jdike	a649768a17	Fix lockdep recursive locking false positive in dbuf_destroy lockdep reports a possible recursive lock in dbuf_destroy. It is true that dbuf_destroy is acquiring the dn_dbufs_mtx on one dnode while holding it on another dnode. However, it is impossible for these to be the same dnode because, among other things,dbuf_destroy checks MUTEX_HELD before acquiring the mutex. This fix defines a class NESTED_SINGLE == 1 and changes that lock to call mutex_enter_nested with a subclass of NESTED_SINGLE. In order to make the userspace code compile, include/sys/zfs_context.h now defines mutex_enter_nested and NESTED_SINGLE. This is the lockdep report: [ 122.950921] ============================================ [ 122.950921] WARNING: possible recursive locking detected [ 122.950921] 4.19.29-4.19.0-debug-d69edad5368c1166 #1 Tainted: G O [ 122.950921] -------------------------------------------- [ 122.950921] dbu_evict/1457 is trying to acquire lock: [ 122.950921] 0000000083e9cbcf (&dn->dn_dbufs_mtx){+.+.}, at: dbuf_destroy+0x3c0/0xdb0 [zfs] [ 122.950921] but task is already holding lock: [ 122.950921] 0000000055523987 (&dn->dn_dbufs_mtx){+.+.}, at: dnode_evict_dbufs+0x90/0x740 [zfs] [ 122.950921] other info that might help us debug this: [ 122.950921] Possible unsafe locking scenario: [ 122.950921] CPU0 [ 122.950921] ---- [ 122.950921] lock(&dn->dn_dbufs_mtx); [ 122.950921] lock(&dn->dn_dbufs_mtx); [ 122.950921] * DEADLOCK * [ 122.950921] May be due to missing lock nesting notation [ 122.950921] 1 lock held by dbu_evict/1457: [ 122.950921] #0: 0000000055523987 (&dn->dn_dbufs_mtx){+.+.}, at: dnode_evict_dbufs+0x90/0x740 [zfs] [ 122.950921] stack backtrace: [ 122.950921] CPU: 0 PID: 1457 Comm: dbu_evict Tainted: G O 4.19.29-4.19.0-debug-d69edad5368c1166 #1 [ 122.950921] Hardware name: Supermicro H8SSL-I2/H8SSL-I2, BIOS 080011 03/13/2009 [ 122.950921] Call Trace: [ 122.950921] dump_stack+0x91/0xeb [ 122.950921] __lock_acquire+0x2ca7/0x4f10 [ 122.950921] lock_acquire+0x153/0x330 [ 122.950921] dbuf_destroy+0x3c0/0xdb0 [zfs] [ 122.950921] dbuf_evict_one+0x1cc/0x3d0 [zfs] [ 122.950921] dbuf_rele_and_unlock+0xb84/0xd60 [zfs] [ 122.950921] dnode_evict_dbufs+0x3a6/0x740 [zfs] [ 122.950921] dmu_objset_evict+0x7a/0x500 [zfs] [ 122.950921] dsl_dataset_evict_async+0x70/0x480 [zfs] [ 122.950921] taskq_thread+0x979/0x1480 [spl] [ 122.950921] kthread+0x2e7/0x3e0 [ 122.950921] ret_from_fork+0x27/0x50 Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jeff Dike <jdike@akamai.com> Closes #8984	2019-07-17 09:18:24 -07:00
Tomohiro Kusumi	c8802ba08d	Use zfsctl_snapshot_hold() wrapper zfs_refcount_() are to be wrapped by zfsctl_snapshot_() in this file. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #9039	2019-07-17 09:07:53 -07:00
Brian Behlendorf	3b03ff2276	Fix get_special_prop() build failure The cast of the size_t returned by strlcpy() to a uint64_t by the VERIFY3U can result in a build failure when CONFIG_FORTIFY_SOURCE is set. This is due to the additional hardening. Since the token is expected to always fit in strval the VERIFY3U has been removed. If somehow it doesn't, it will still be safely truncated. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #8999 Closes #9020	2019-07-16 14:14:12 -07:00
Serapheim Dimitropoulos	93e28d661e	Log Spacemap Project = Motivation At Delphix we've seen a lot of customer systems where fragmentation is over 75% and random writes take a performance hit because a lot of time is spend on I/Os that update on-disk space accounting metadata. Specifically, we seen cases where 20% to 40% of sync time is spend after sync pass 1 and ~30% of the I/Os on the system is spent updating spacemaps. The problem is that these pools have existed long enough that we've touched almost every metaslab at least once, and random writes scatter frees across all metaslabs every TXG, thus appending to their spacemaps and resulting in many I/Os. To give an example, assuming that every VDEV has 200 metaslabs and our writes fit within a single spacemap block (generally 4K) we have 200 I/Os. Then if we assume 2 levels of indirection, we need 400 additional I/Os and since we are talking about metadata for which we keep 2 extra copies for redundancy we need to triple that number, leading to a total of 1800 I/Os per VDEV every TXG. We could try and decrease the number of metaslabs so we have less I/Os per TXG but then each metaslab would cover a wider range on disk and thus would take more time to be loaded in memory from disk. In addition, after it's loaded, it's range tree would consume more memory. Another idea would be to just increase the spacemap block size which would allow us to fit more entries within an I/O block resulting in fewer I/Os per metaslab and a speedup in loading time. The problem is still that we don't deal with the number of I/Os going up as the number of metaslabs is increasing and the fact is that we generally write a lot to a few metaslabs and a little to the rest of them. Thus, just increasing the block size would actually waste bandwidth because we won't be utilizing our bigger block size. = About this patch This patch introduces the Log Spacemap project which provides the solution to the above problem while taking into account all the aforementioned tradeoffs. The details on how it achieves that can be found in the references sections below and in the code (see Big Theory Statement in spa_log_spacemap.c). Even though the change is fairly constraint within the metaslab and lower-level SPA codepaths, there is a side-change that is user-facing. The change is that VDEV IDs from VDEV holes will no longer be reused. To give some background and reasoning for this, when a log device is removed and its VDEV structure was replaced with a hole (or was compacted; if at the end of the vdev array), its vdev_id could be reused by devices added after that. Now with the pool-wide space maps recording the vdev ID, this behavior can cause problems (e.g. is this entry referring to a segment in the new vdev or the removed log?). Thus, to simplify things the ID reuse behavior is gone and now vdev IDs for top-level vdevs are truly unique within a pool. = Testing The illumos implementation of this feature has been used internally for a year and has been in production for ~6 months. For this patch specifically there don't seem to be any regressions introduced to ZTS and I have been running zloop for a week without any related problems. = Performance Analysis (Linux Specific) All performance results and analysis for illumos can be found in the links of the references. Redoing the same experiments in Linux gave similar results. Below are the specifics of the Linux run. After the pool reached stable state the percentage of the time spent in pass 1 per TXG was 64% on average for the stock bits while the log spacemap bits stayed at 95% during the experiment (graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png). Sync times per TXG were 37.6 seconds on average for the stock bits and 22.7 seconds for the log spacemap bits (related graph: sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result the log spacemap bits were able to push more TXGs, which is also the reason why all graphs quantified per TXG have more entries for the log spacemap bits. Another interesting aspect in terms of txg syncs is that the stock bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8, and 20% reach 9. The log space map bits reached sync pass 4 in 79% of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This emphasizes the fact that not only we spend less time on metadata but we also iterate less times to convergence in spa_sync() dirtying objects. [related graphs: stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png] Finally, the improvement in IOPs that the userland gains from the change is approximately 40%. There is a consistent win in IOPS as you can see from the graphs below but the absolute amount of improvement that the log spacemap gives varies within each minute interval. sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png = Porting to Other Platforms For people that want to port this commit to other platforms below is a list of ZoL commits that this patch depends on: Make zdb results for checkpoint tests consistent `db587941c5` Update vdev_is_spacemap_addressable() for new spacemap encoding `419ba59145` Simplify spa_sync by breaking it up to smaller functions `8dc2197b7b` Factor metaslab_load_wait() in metaslab_load() `b194fab0fb` Rename range_tree_verify to range_tree_verify_not_present `df72b8bebe` Change target size of metaslabs from 256GB to 16GB `c853f382db` zdb -L should skip leak detection altogether `21e7cf5da8` vs_alloc can underflow in L2ARC vdevs `7558997d2f` Simplify log vdev removal code `6c926f426a` Get rid of space_map_update() for ms_synced_length `425d3237ee` Introduce auxiliary metaslab histograms `928e8ad47d` Error path in metaslab_load_impl() forgets to drop ms_sync_lock `8eef997679` = References Background, Motivation, and Internals of the Feature - OpenZFS 2017 Presentation: youtu.be/jj2IxRkl5bQ - Slides: slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project Flushing Algorithm Internals & Performance Results (Illumos Specific) - Blogpost: sdimitro.github.io/post/zfs-lsm-flushing/ - OpenZFS 2018 Presentation: youtu.be/x6D2dHRjkxw - Slides: slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm Upstream Delphix Issues: DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320 DLPX-63385 Reviewed-by: Sean Eric Fagan <sef@ixsystems.com> Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8442	2019-07-16 10:11:49 -07:00
Tomohiro Kusumi	6993e01202	Drop redundant POSIX ACL check in zpl_init_acl() ZFS_ACLTYPE_POSIXACL has already been tested in zpl_init_acl(), so no need to test again on POSIX ACL access. Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #9009	2019-07-15 16:26:52 -07:00
Brian Behlendorf	64f3d39ae4	Export dnode symbols External consumers such as Lustre require access to the dnode interfaces in order to correctly manipulate dnodes. Reviewed-by: James Simmons <uja.ornl@yahoo.com> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #8994 Closes #9027	2019-07-15 16:11:55 -07:00
Tom Caputi	9949b856a0	Ensure dsl_destroy_head() decrypts objsets This patch corrects a small issue where the dsl_destroy_head() code that runs when the async_destroy feature is disabled would not properly decrypt the dataset before beginning processing. If the dataset is not able to be decrypted, the optimization code now simply does not run and the dataset is completely destroyed in the DSL sync task. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #9021	2019-07-15 16:08:42 -07:00
Tomohiro Kusumi	ff9630d1a8	Disable unused pathname::pn_path* (unneeded in Linux) struct pathname is originally from Solaris VFS, and it has been used in ZoL to merely call VOP from Linux VFS interface without API change, therefore pathname::pn_path* are unused and unneeded. Technically, struct pathname is a wrapper for C string in ZoL. Saves stack a bit on lookup and unlink. (#if0'd members instead of removing since comments refer to them.) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #9025	2019-07-15 13:57:56 -07:00
Brian Behlendorf	e5db313494	Linux 5.0 compat: SIMD compatibility Restore the SIMD optimization for 4.19.38 LTS, 4.14.120 LTS, and 5.0 and newer kernels. This is accomplished by leveraging the fact that by definition dedicated kernel threads never need to concern themselves with saving and restoring the user FPU state. Therefore, they may use the FPU as long as we can guarantee user tasks always restore their FPU state before context switching back to user space. For the 5.0 and 5.1 kernels disabling preemption and local interrupts is sufficient to allow the FPU to be used. All non-kernel threads will restore the preserved user FPU state. For 5.2 and latter kernels the user FPU state restoration will be skipped if the kernel determines the registers have not changed. Therefore, for these kernels we need to perform the additional step of saving and restoring the FPU registers. Invalidating the per-cpu global tracking the FPU state would force a restore but that functionality is private to the core x86 FPU implementation and unavailable. In practice, restricting SIMD to kernel threads is not a major restriction for ZFS. The vast majority of SIMD operations are already performed by the IO pipeline. The remaining cases are relatively infrequent and can be handled by the generic code without significant impact. The two most noteworthy cases are: 1) Decrypting the wrapping key for an encrypted dataset, i.e. `zfs load-key`. All other encryption and decryption operations will use the SIMD optimized implementations. 2) Generating the payload checksums for a `zfs send` stream. In order to avoid making any changes to the higher layers of ZFS all of the `*_get_ops()` functions were updated to take in to consideration the calling context. This allows for the fastest implementation to be used as appropriate (see kfpu_allowed()). The only other notable instance of SIMD operations being used outside a kernel thread was at module load time. This code was moved in to a taskq in order to accommodate the new kernel thread restriction. Finally, a few other modifications were made in order to further harden this code and facilitate testing. They include updating each implementations operations structure to be declared as a constant. And allowing "cycle" to be set when selecting the preferred ops in the kernel as well as user space. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8754 Closes #8793 Closes #8965	2019-07-12 09:31:20 -07:00
Nick Mattis	d230a65c3b	Fixes: #8934 Large kmem_alloc Large allocation over the spl_kmem_alloc_warn value was being performed. Switched to vmem_alloc interface as specified for large allocations. Changed the subsequent frees to match. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: nmattis <nickm970@gmail.com> Closes #8934 Closes #9011	2019-07-10 15:54:49 -07:00
Paul Dagnelie	f664f1ee7f	Decrease contention on dn_struct_rwlock Currently, sequential async write workloads spend a lot of time contending on the dn_struct_rwlock. This lock is responsible for protecting the entire block tree below it; this naturally results in some serialization during heavy write workloads. This can be resolved by having per-dbuf locking, which will allow multiple writers in the same object at the same time. We introduce a new rwlock, the db_rwlock. This lock is responsible for protecting the contents of the dbuf that it is a part of; when reading a block pointer from a dbuf, you hold the lock as a reader. When writing data to a dbuf, you hold it as a writer. This allows multiple threads to write to different parts of a file at the same time. Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Matt Ahrens matt@delphix.com Reviewed by: George Wilson george.wilson@delphix.com Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> External-issue: DLPX-52564 External-issue: DLPX-53085 External-issue: DLPX-57384 Closes #8946	2019-07-08 13:18:50 -07:00
Brad Lewis	cb70964221	8659 static dtrace probes unavailable on non-GPL modules ZFS tracing efforts are hampered by the inability to access zfs static probes(probes using DTRACE_PROBE macros). The probes are available via tracepoints for GPL modules only. The build could be modified to generate a function for each unique DTRACE_PROBE invocation. These could be then accessed via kprobes. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <matt@delphix.com> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: Brad Lewis <brad.lewis@delphix.com> Closes #8659 Closes #8663	2019-07-08 11:20:53 -07:00
Brian Behlendorf	1086f54219	Revert "Fail early on bio corruption confirmed on 5.2-rc1" This reverts commit `aa7aab6c45`. The change is not compatible with CentOS 6's 2.6.32 based kernel due to differnces in the bio layer. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #8961	2019-07-05 20:38:56 -07:00
Tom Caputi	52a83dc694	Remove VERIFY from dsl_dataset_crypt_stats() This patch fixes an issue where dsl_dataset_crypt_stats() would VERIFY that it was able to hold the encryption root. This function should instead silently continue without populating the related field in the nvlist, as is the convention for this code. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8976	2019-07-05 16:53:14 -07:00
Paul Dagnelie	fe0ea84812	Don't activate metaslabs with weight 0 We return ENOSPC in metaslab_activate if the metaslab has weight 0, to avoid activating a metaslab with no space available. For sanity checking, we also assert that there is no free space in the range tree in that case. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #8968	2019-07-05 16:45:20 -07:00
Paul Zuchowski	6dbca94f0c	Improve "Unable to automount" error message. Having the mountpoint and dataset name both in the message made it confusing to read. Additionally, convert this to a zfs_dbgmsg rather than sending it to the console. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Zuchowski <pzuchowski@datto.com> Closes #8959	2019-07-03 13:05:02 -07:00
Tomohiro Kusumi	aa7aab6c45	Fail early on bio corruption confirmed on 5.2-rc1 Unable to import zpool with "Large kmem_alloc" warning due to corrupted bio's with invalid # of page vectors. See #8867 for details. Fail early with ENOMEM. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #8867 Closes #8961	2019-07-03 13:03:22 -07:00
Brian Behlendorf	46db9d6161	Check b_freeze_cksum under ZFS_DEBUG_MODIFY conditional The b_freeze_cksum field can only have data when ZFS_DEBUG_MODIFY is set. Therefore, the EQUIV check must be wrapped accordingly. For the same reason the ASSERT in arc_buf_fill() in unsafe. However, since it's largely redundant it has simply been removed. Reviewed-by: George Wilson <gwilson@delphix.com> Reviewed-by: Allan Jude <allanjude@freebsd.org> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8979	2019-07-03 13:01:54 -07:00
Tomohiro Kusumi	df358db7c3	Don't use d_path() for automount mount point for chroot'd process Chroot'd process fails to automount snapshots due to realpath(3) failure in mount.zfs(8). Construct a mount point path from sb of the ctldir inode and dirent name, instead of from d_path(), so that chroot'd process doesn't get affected by its view of fs. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #8903 Closes #8966	2019-07-02 08:25:23 -07:00
George Wilson	681a85cb01	nopwrites on dmu_sync-ed blocks can result in a panic After device removal, performing nopwrites on a dmu_sync-ed block will result in a panic. This panic can show up in two ways: 1. an attempt to issue an IOCTL in vdev_indirect_io_start() 2. a failed comparison of zio->io_bp and zio->io_bp_orig in zio_done() To resolve both of these panics, nopwrites of blocks on indirect vdevs should be ignored and new allocations should be performed on concrete vdevs. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: George Wilson <gwilson@delphix.com> Closes #8957	2019-06-28 12:40:23 -07:00
Paul Dagnelie	679b0f2abf	Concurrent small allocation defeats large allocation With the new parallel allocators scheme, there is a possibility for a problem where two threads, allocating from the same allocator at the same time, conflict with each other. There are two primary cases to worry about. First, another thread working on another allocator activates the same metaslab that the first thread was trying to activate. This results in the first thread needing to go back and reselect a new metaslab, even though it may have waited a long time for this metaslab to load. Second, another thread working on the same allocator may have activated a different metaslab while the first thread was waiting for its metaslab to load. Both of these cases can cause the first thread to be significantly delayed in issuing its IOs. The second case can also cause metaslab load/unload churn; because the metaslab is loaded but not fully activated, we never set the selected_txg, which results in the metaslab being immediately unloaded again. This process can repeat many times, wasting disk and cpu resources. This is more likely to happen when the IO of the first thread is a larger one (like a ZIL write) and the other thread is doing a smaller write, because it is more likely to find an acceptable metaslab quickly. There are two primary changes. The first is to always proceed with the allocation when returning from metaslab_activate if we were preempted in either of the ways described in the previous section. The second change is to set the selected_txg before we do the call to activate so that even if the metaslab is not used for an allocation, we won't immediately attempt to unload it. Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> External-issue: DLPX-61314 Closes #8843	2019-06-26 11:00:12 -07:00
Alexander Motin	fc7546777b	Avoid extra taskq_dispatch() calls by DMU DMU sync code calls taskq_dispatch() for each sublist of os_dirty_dnodes and os_synced_dnodes. Since the number of sublists by default is equal to number of CPUs, it will dispatch equal, potentially large, number of tasks, waking up many CPUs to handle them, even if only one or few of sublists actually have any work to do. This change adds check for empty sublists to avoid this. Reviewed by: Sean Eric Fagan <sef@ixsystems.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #8909	2019-06-25 12:03:38 -07:00
loli10K	746d4a451e	Fix bp_embedded_type enum definition With the addition of BP_EMBEDDED_TYPE_REDACTED in `30af21b0` a couple of codepaths make wrong assumptions and could potentially result in errors. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Chris Dunlop <chris@onthe.net.au> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #8951	2019-06-24 18:02:17 -07:00
Matthew Ahrens	59ec30a329	Remove code for zfs remap The "zfs remap" command was disabled by `6e91a72fe3`, because it has little utility and introduced some tricky bugs. This commit removes the code for it, the associated ZFS_IOC_REMAP ioctl, and tests. Note that the ioctl and property will remain, but have no functionality. This allows older software to fail gracefully if it attempts to use these, and avoids a backwards incompatibility that would be introduced if we renumbered the later ioctls/props. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #8944	2019-06-24 16:44:01 -07:00
Tom Caputi	53864800f6	Fix error message on promoting encrypted dataset This patch corrects the error message reported when attempting to promote a dataset outside of its encryption root. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8905 Closes #8935	2019-06-24 16:42:52 -07:00
Don Brady	186898bbb5	OpenZFS 9425 - channel programs can be interrupted Problem Statement ================= ZFS Channel program scripts currently require a timeout, so that hung or long-running scripts return a timeout error instead of causing ZFS to get wedged. This limit can currently be set up to 100 million Lua instructions. Even with a limit in place, it would be desirable to have a sys admin (support engineer) be able to cancel a script that is taking a long time. Proposed Solution ================= Make it possible to abort a channel program by sending an interrupt signal.In the underlying txg_wait_sync function, switch the cv_wait to a cv_wait_sig to catch the signal. Once a signal is encountered, the dsl_sync_task function can install a Lua hook that will get called before the Lua interpreter executes a new line of code. The dsl_sync_task can resume with a standard txg_wait_sync call and wait for the txg to complete. Meanwhile, the hook will abort the script and indicate that the channel program was canceled. The kernel returns a EINTR to indicate that the channel program run was canceled. Porting notes: Added missing return value from cv_wait_sig() Authored by: Don Brady <don.brady@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Sara Hartse <sara.hartse@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Don Brady <don.brady@delphix.com> Signed-off-by: Don Brady <don.brady@delphix.com> OpenZFS-issue: https://www.illumos.org/issues/9425 OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/d0cb1fb926 Closes #8904	2019-06-22 16:51:46 -07:00
Matthew Ahrens	cb9e5b7e84	dn_struct_rwlock can not be held in dmu_tx_try_assign() The thread calling dmu_tx_try_assign() can't hold the dn_struct_rwlock while assigning the tx, because this can lead to deadlock. Specifically, if this dnode is already assigned to an earlier txg, this thread may need to wait for that txg to sync (the ERESTART case below). The other thread that has assigned this dnode to an earlier txg prevents this txg from syncing until its tx can complete (calling dmu_tx_commit()), but it may need to acquire the dn_struct_rwlock to do so (e.g. via dmu_buf_hold*()). This commit adds an assertion to dmu_tx_try_assign() to ensure that this deadlock is not inadvertently introduced. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #8929	2019-06-22 16:48:54 -07:00
Paul Dagnelie	a370182fed	Add SCSI_PASSTHROUGH to zvols to enable UNMAP support When exporting ZVOLs as SCSI LUNs, by default Windows will not issue them UNMAP commands. This reduces storage efficiency in many cases. We add the SCSI_PASSTHROUGH flag to the zvol's device queue, which lets the SCSI target logic know that it can handle SCSI commands. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: John Gallagher <john.gallagher@delphix.com> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #8933	2019-06-21 09:40:56 -07:00
Tomohiro Kusumi	9585497208	Prevent pointer to an out-of-scope local variable `show_str` could be a pointer to a local variable in stack which is out-of-scope by the time `return (snprintf(buf, buflen, "%s\n", show_str));` is called. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #8924 Closes #8940	2019-06-20 18:31:52 -07:00
Matthew Ahrens	accd6d9dc4	dedup=verify doesn't clear the blkptr's dedup flag The logic to handle strong checksum collisions where the data doesn't match is incorrect. It is not clearing the dedup bit of the blkptr, which can cause a panic later in zio_ddt_free() due to the dedup table not matching what is in the blkptr. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-48097 Closes #8936	2019-06-20 18:30:40 -07:00
Igor K	a64f8276c7	Update vdev_ops_t from illumos Align vdev_ops_t from illumos for better compatibility. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Igor Kozhukhov <igor@dilos.org> Closes #8925	2019-06-20 18:29:02 -07:00
Tom Caputi	da68988708	Allow unencrypted children of encrypted datasets When encryption was first added to ZFS, we made a decision to prevent users from creating unencrypted children of encrypted datasets. The idea was to prevent users from inadvertently leaving some of their data unencrypted. However, since the release of 0.8.0, some legitimate reasons have been brought up for this behavior to be allowed. This patch simply removes this limitation from all code paths that had checks for it and updates the tests accordingly. Reviewed-by: Jason King <jason.king@joyent.com> Reviewed-by: Sean Eric Fagan <sef@ixsystems.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8737 Closes #8870	2019-06-20 12:29:51 -07:00
Matthew Ahrens	050d720c43	Remove dedupditto functionality If dedup is in use, the `dedupditto` property can be set, causing ZFS to keep an extra copy of data that is referenced many times (>100x). The idea was that this data is more important than other data and thus we want to be really sure that it is not lost if the disk experiences a small amount of random corruption. ZFS (and system administrators) rely on the pool-level redundancy to protect their data (e.g. mirroring or RAIDZ). Since the user/sysadmin doesn't have control over what data will be offered extra redundancy by dedupditto, this extra redundancy is not very useful. The bulk of the data is still vulnerable to loss based on the pool-level redundancy. For example, if particle strikes corrupt 0.1% of blocks, you will either be saved by mirror/raidz, or you will be sad. This is true even if dedupditto saved another 0.01% of blocks from being corrupted. Therefore, the dedupditto functionality is rarely enabled (i.e. the property is rarely set), and it fulfills its promise of increased redundancy even more rarely. Additionally, this feature does not work as advertised (on existing releases), because scrub/resilver did not repair the extra (dedupditto) copy (see https://github.com/zfsonlinux/zfs/pull/8270). In summary, this seldom-used feature doesn't work, and even if it did it wouldn't provide useful data protection. It has a non-trivial maintenance burden (again see https://github.com/zfsonlinux/zfs/pull/8270). We should remove the dedupditto functionality. For backwards compatibility with the existing CLI, "zpool set dedupditto" will still "succeed" (exit code zero), but won't have any effect. For backwards compatibility with existing pools that had dedupditto enabled at some point, the code will still be able to understand dedupditto blocks and free them when appropriate. However, ZFS won't write any new dedupditto blocks. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Issue #8270 Closes #8310	2019-06-19 14:54:02 -07:00
Paul Dagnelie	30af21b025	Implement Redacted Send/Receive Redacted send/receive allows users to send subsets of their data to a target system. One possible use case for this feature is to not transmit sensitive information to a data warehousing, test/dev, or analytics environment. Another is to save space by not replicating unimportant data within a given dataset, for example in backup tools like zrepl. Redacted send/receive is a three-stage process. First, a clone (or clones) is made of the snapshot to be sent to the target. In this clone (or clones), all unnecessary or unwanted data is removed or modified. This clone is then snapshotted to create the "redaction snapshot" (or snapshots). Second, the new zfs redact command is used to create a redaction bookmark. The redaction bookmark stores the list of blocks in a snapshot that were modified by the redaction snapshot(s). Finally, the redaction bookmark is passed as a parameter to zfs send. When sending to the snapshot that was redacted, the redaction bookmark is used to filter out blocks that contain sensitive or unwanted information, and those blocks are not included in the send stream. When sending from the redaction bookmark, the blocks it contains are considered as candidate blocks in addition to those blocks in the destination snapshot that were modified since the creation_txg of the redaction bookmark. This step is necessary to allow the target to rehydrate data in the case where some blocks are accidentally or unnecessarily modified in the redaction snapshot. The changes to bookmarks to enable fast space estimation involve adding deadlists to bookmarks. There is also logic to manage the life cycles of these deadlists. The new size estimation process operates in cases where previously an accurate estimate could not be provided. In those cases, a send is performed where no data blocks are read, reducing the runtime significantly and providing a byte-accurate size estimate. Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Prashanth Sreenivasa <pks@delphix.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: Chris Williamson <chris.williamson@delphix.com> Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com> Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #7958	2019-06-19 09:48:12 -07:00
Alexander Motin	c1b5801bb5	Minimize aggsum_compare(&arc_size, arc_c) calls. For busy ARC situation when arc_size close to arc_c is desired. But then it is quite likely that aggsum_compare(&arc_size, arc_c) will need to flush per-CPU buckets to find exact comparison result. Doing that often in a hot path penalizes whole idea of aggsum usage there, since it replaces few simple atomic additions with dozens of lock acquisitions. Replacing aggsum_compare() with aggsum_upper_bound() in code increasing arc_p when ARC is growing (arc_size < arc_c) according to PMC profiles allows to save ~5% of CPU time in aggsum code during sequential write to 12 ZVOLs with 16KB block size on large dual-socket system. I suppose there some minor arc_p behavior change due to lower precision of the new code, but I don't think it is a big deal, since it should affect only very small window in time (aggsum buckets are flushed every second) and in ARC size (buckets are limited to 10 average ARC blocks per CPU). Reviewed-by: Chris Dunlop <chris@onthe.net.au> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Allan Jude <allanjude@freebsd.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #8901	2019-06-14 14:07:33 -07:00
Matthew Ahrens	53dce5acc6	panic in removal_remap test on 4K devices If the zfs_remove_max_segment tunable is changed to be not a multiple of the sector size, then the device removal code will malfunction and try to create mappings that are smaller than one sector, leading to a panic. On debug bits this assertion will fail in spa_vdev_copy_segment(): ASSERT3U(DVA_GET_ASIZE(&dst), ==, size); On nondebug, the system panics with a stack like: metaslab_free_concrete() metaslab_free_impl() metaslab_free_impl_cb() vdev_indirect_remap() free_from_removing_vdev() metaslab_free_impl() metaslab_free_dva() metaslab_free() Fortunately, the default for zfs_remove_max_segment is 1MB, so this can't occur by default. We hit it during this test because removal_remap.ksh changes zfs_remove_max_segment to 1KB. When testing on 4KB-sector disks, we hit the bug. This change makes the zfs_remove_max_segment tunable more robust, automatically rounding it up to a multiple of the sector size. We also turn some key assertions into VERIFY's so that similar bugs would be caught before they are encoded on disk (and thus avoid a panic-reboot-loop). Reviewed-by: Sean Eric Fagan <sef@ixsystems.com> Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-61342 Closes #8893	2019-06-13 13:12:39 -07:00
Matthew Ahrens	be89734a29	compress metadata in later sync passes Starting in sync pass 5 (zfs_sync_pass_dont_compress), we disable compression (including of metadata). Ostensibly this helps the sync passes to converge (i.e. for a sync pass to not need to allocate anything because it is 100% overwrites). However, in practice it increases the average number of sync passes, because when we turn compression off, a lot of block's size will change and thus we have to re-allocate (not overwrite) them. It also increases the number of 128KB allocations (e.g. for indirect blocks and spacemaps) because these will not be compressed. The 128K allocations are especially detrimental to performance on highly fragmented systems, which may have very few free segments of this size, and may need to load new metaslabs to satisfy 128K allocations. We should increase zfs_sync_pass_dont_compress. In practice on a highly fragmented system we see a few 5-pass txg's, a tiny number of 6-pass txg's, and no txg's with more than 6 passes. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-63431 Closes #8892	2019-06-13 13:10:18 -07:00
Alexander Motin	ae5c78e0b1	Move write aggregation memory copy out of vq_lock Memory copy is too heavy operation to do under the congested lock. Moving it out reduces congestion by many times to almost invisible. Since the original zio removed from the queue, and the child zio is not executed yet, I don't see why would the copy need protection. My guess it just remained like this from the time when lock was not dropped here, which was added later to fix lock ordering issue. Multi-threaded sequential write tests with both HDD and SSD pools with ZVOL block sizes of 4KB, 16KB, 64KB and 128KB all show major reduction of lock congestion, saving from 15% to 35% of CPU time and increasing throughput from 10% to 40%. Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #8890	2019-06-13 13:08:24 -07:00
Matthew Ahrens	d3230d761a	looping in metaslab_block_picker impacts performance on fragmented pools On fragmented pools with high-performance storage, the looping in metaslab_block_picker() can become the performance-limiting bottleneck. When looking for a larger block (e.g. a 128K block for the ZIL), we may search through many free segments (up to hundreds of thousands) to find one that is large enough to satisfy the allocation. This can take a long time (up to dozens of ms), and is done while holding the ms_lock, which other threads may spin waiting for. When this performance problem is encountered, profiling will show high CPU time in metaslab_block_picker, as well as in mutex_enter from various callers. The problem is very evident on a test system with a sync write workload with 8K writes to a recordsize=8k filesystem, with 4TB of SSD storage, 84% full and 88% fragmented. It has also been observed on production systems with 90TB of storage, 76% full and 87% fragmented. The fix is to change metaslab_df_alloc() to search only up to 16MB from the previous allocation (of this alignment). After that, we will pick a segment that is of the exact size requested (or larger). This reduces the number of iterations to a few hundred on fragmented pools (a ~100x improvement). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-62324 Closes #8877	2019-06-13 13:06:15 -07:00
Matthew Ahrens	3475724ea4	ztest: dmu_tx_assign() gets ENOSPC in spa_vdev_remove_thread() When running zloop, we occasionally see the following crash: dmu_tx_assign(tx, TXG_WAIT) == 0 (0x1c == 0) ASSERT at ../../module/zfs/vdev_removal.c:1507:spa_vdev_remove_thread()/sbin/ztest(+0x89c3)[0x55faf567b9c3] The error value 0x1c is ENOSPC. The transaction used by spa_vdev_remove_thread() should not be able to fail due to being out of space. i.e. we should not call dmu_tx_hold_space(). This will allow the removal thread to schedule its work even when the pool is low on space. The "slop space" will provide enough free space to sync out the txg. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-37853 Closes #8889	2019-06-13 08:48:42 -07:00
Tomohiro Kusumi	daddbdc7cc	Fix lockdep warning on insmod sysfs_attr_init() is required to make lockdep happy for dynamically allocated sysfs attributes. This fixed #8868 on Fedora 29 running kernel-debug. This requirement was introduced in 2.6.34. See include/linux/sysfs.h for what it actually does. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #8868 Closes #8884	2019-06-12 17:15:06 -07:00
Matthew Ahrens	d9b4bf0665	fat zap should prefetch when iterating When iterating over a ZAP object, we're almost always certain to iterate over the entire object. If there are multiple leaf blocks, we can realize a performance win by issuing reads for all the leaf blocks in parallel when the iteration begins. For example, if we have 10,000 snapshots, "zfs destroy -nv pool/fs@1%9999" can take 30 minutes when the cache is cold. This change provides a >3x performance improvement, by issuing the reads for all ~64 blocks of each ZAP object in parallel. Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-58347 Closes #8862	2019-06-12 13:13:09 -07:00
Matthew Ahrens	d9cd66e45f	Target ARC size can get reduced to arc_c_min Sometimes the target ARC size is reduced to arc_c_min, which impacts performance. We've seen this happen as part of the random_reads performance regression test, where the ARC size is reduced before the reads test starts which impacts how long it takes for system to reach good IOPS performance. We call arc_reduce_target_size when arc_reap_cb_check() returns TRUE, and arc_available_memory() is less than arc_c>>arc_shrink_shift. However, arc_available_memory() could easily be low, even when arc_c is low, because we can have tons of unused bufs in the abd kmem cache. This would be especially true just after the DMU requests a bunch of stuff be evicted from the ARC (e.g. due to "zpool export"). To fix this, the ARC should reduce arc_c by the requested amount, not all the way down to arc_size (or arc_c_min), which can be very small. Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-59431 Closes #8864	2019-06-12 13:06:55 -07:00
bnjf	10269e02f9	Fix typo in vdev_raidz_math.c Fix typo in vdev_raidz_math.c Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brad Forschinger <github@bnjf.id.au> Closes #8875 Closes #8880	2019-06-12 13:03:33 -07:00
Matthew Ahrens	5662fd5794	single-chunk scatter ABDs can be treated as linear Scatter ABD's are allocated from a number of pages. In contrast to linear ABD's, these pages are disjoint in the kernel's virtual address space, so they can't be accessed as a contiguous buffer. Therefore routines that need a linear buffer (e.g. abd_borrow_buf() and friends) must allocate a separate linear buffer (with zio_buf_alloc()), and copy the contents of the pages to/from the linear buffer. This can have a measurable performance overhead on some workloads. https://github.com/zfsonlinux/zfs/commit/87c25d567fb7969b44c7d8af63990e ("abd_alloc should use scatter for >1K allocations") increased the use of scatter ABD's, specifically switching 1.5K through 4K (inclusive) buffers from linear to scatter. For workloads that access blocks whose compressed sizes are in this range, that commit introduced an additional copy into the read code path. For example, the sequential_reads_arc_cached tests in the test suite were reduced by around 5% (this is doing reads of 8K-logical blocks, compressed to 3K, which are cached in the ARC). This commit treats single-chunk scattered buffers as linear buffers, because they are contiguous in the kernel's virtual address space. All single-page (4K) ABD's can be represented this way. Some multi-page ABD's can also be represented this way, if we were able to allocate a single "chunk" (higher-order "page" which represents a power-of-2 series of physically-contiguous pages). This is often the case for 2-page (8K) ABD's. Representing a single-entry scatter ABD as a linear ABD has the performance advantage of avoiding the copy (and allocation) in abd_borrow_buf_copy / abd_return_buf_copy. A performance increase of around 5% has been observed for ARC-cached reads (of small blocks which can take advantage of this), fixing the regression introduced by `87c25d567`. Note that this optimization is only possible because all physical memory is always mapped into the kernel's address space. This is not the case for HIGHMEM pages, so the optimization can not be made on 32-bit systems. Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #8580	2019-06-11 09:02:31 -07:00
Matthew Ahrens	b8738257c2	make zil max block size tunable We've observed that on some highly fragmented pools, most metaslab allocations are small (~2-8KB), but there are some large, 128K allocations. The large allocations are for ZIL blocks. If there is a lot of fragmentation, the large allocations can be hard to satisfy. The most common impact of this is that we need to check (and thus load) lots of metaslabs from the ZIL allocation code path, causing sync writes to wait for metaslabs to load, which can take a second or more. In the worst case, we may not be able to satisfy the allocation, in which case the ZIL will resort to txg_wait_synced() to ensure the change is on disk. To provide a workaround for this, this change adds a tunable that can reduce the size of ZIL blocks. External-issue: DLPX-61719 Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #8865	2019-06-10 11:48:42 -07:00
Alexander Motin	5a902f5aaa	Fix comparison signedness in arc_is_overflowing() When ARC size is very small, aggsum_lower_bound(&arc_size) may return negative values, that due to unsigned comparison caused delays, waiting for arc_adjust() to "fix" it by calling aggsum_value(&arc_size). Use of signed comparison there fixes the problem. Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #8873	2019-06-10 09:52:25 -07:00
Tom Caputi	c08c30ed13	Fix incorrect error message for raw receive This patch fixes an incorrect error message that comes up when doing a non-forcing, raw, incremental receive into a dataset that has a newer snapshot than the "from" snapshot. In this case, the current code prints a confusing message about an IVset guid mismatch. This functionality is supported by non-raw receives as an undocumented feature, but was never supported by the raw receive code. If this is desired in the future, we can probably figure out a way to make it work. Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Issue #8758 Closes #8863	2019-06-10 09:45:08 -07:00
Paul Dagnelie	893a6d62c1	Allow metaslab to be unloaded even when not freed from On large systems, the memory used by loaded metaslabs can become a concern. While range trees are a fairly efficient data structure, on heavily fragmented pools they can still consume a significant amount of memory. This problem is amplified when we fail to unload metaslabs that we aren't using. Currently, we only unload a metaslab during metaslab_sync_done; in order for that function to be called on a given metaslab in a given txg, we have to have dirtied that metaslab in that txg. If the dirtying was the result of an allocation, we wouldn't be unloading it (since it wouldn't be 8 txgs since it was selected), so in effect we only unload a metaslab during txgs where it's being freed from. We move the unload logic from sync_done to a new function, and call that function on all metaslabs in a given vdev during vdev_sync_done(). Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #8837	2019-06-06 19:10:43 -07:00
Tom Caputi	fd7a657bff	Reinstate raw receive check when truncating This patch re-adds a check that was removed in `369aa50`. The check confirms that a raw receive is not occuring before truncating an object's dn_maxblkid. At the time, it was believed that all cases that would hit this code path would be handled in other places, but that was not the case. Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8852 Closes #8857	2019-06-06 13:47:33 -07:00
Allan Jude	b7109a413c	l2arc_apply_transforms: Fix typo in comment Reviewed-by: Chris Dunlop <chris@onthe.net.au> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: Allan Jude <allanjude@freebsd.org> Closes #8822	2019-06-06 13:14:48 -07:00
Serapheim Dimitropoulos	cb020f0d86	Reduced IOPS when all vdevs are in the zfs_mg_fragmentation_threshold Historically while doing performance testing we've noticed that IOPS can be significantly reduced when all vdevs in the pool are hitting the zfs_mg_fragmentation_threshold percentage. Specifically in a hypothetical pool with two vdevs, what can happen is the following: Vdev A would go above that threshold and only vdev B would be used. Then vdev B would pass that threshold but vdev A would go below it (we've been freeing from A to allocate to B). The allocations would go back and forth utilizing one vdev at a time with IOPS taking a hit. Empirically, we've seen that our vdev selection for allocations is good enough that fragmentation increases uniformly across all vdevs the majority of the time. Thus we set the threshold percentage high enough to avoid hitting the speed bump on pools that are being pushed to the edge. We effectively disable its effect in the majority of the cases but we don't remove (at least for now) just in case we hit any weird behavior in the future. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8859	2019-06-06 13:08:41 -07:00
Tom Caputi	3ce85b5e60	Fix integer overflow of ZTOI(zp)->i_generation The ZFS on-disk format stores each inode's generation ID as a 64 bit number on disk and in-core. However, the Linux kernel's inode is only a 32 bit number. In most places, the code handles this correctly, but the cast is missing in zfs_rezget(). For many pools, this isn't an issue since the generation ID is computed as the current txg when the inode is created and many pools don't have more than 2^32 txgs. For the pools that have more txgs, this issue causes any inode with a high enough generation number to report IO errors after a call to "zfs rollback" while holding the file or directory open. This patch simply adds the missing cast. Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8858	2019-06-06 12:59:39 -07:00
Tomohiro Kusumi	1a6889947e	Drop objid argument in zfs_znode_alloc() (sync with OpenZFS) Since zfs_znode_alloc() already takes dmu_buf_t*, taking another uint64_t argument for objid is redundant. inode's ->i_ino does and needs to match znode's ->z_id. zfs_znode_alloc() in FreeBSD and illumos doesn't have this argument since vnode doesn't have vnode# in VFS (hence ->z_id exists). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com> Closes #8841	2019-06-05 14:18:46 -07:00
DeHackEd	df24bcf00a	Wait in 'S' state when send/recv pipe is blocking Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: DHE <git@dehacked.net> Closes #8733 Closes #8752	2019-06-03 20:54:43 -07:00
TulsiJain	a3c98d5728	Make zfs_async_block_max_blocks handle zero correctly Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: TulsiJain <tulsi.jain@delphix.com> Closes #8829 Closes #8289	2019-06-03 09:40:48 -07:00
Brian Behlendorf	2531ce3720	Revert "Report holes when there are only metadata changes" This reverts commit `ec4f9b8f30` which introduced a narrow race which can lead to lseek(, SEEK_DATA) incorrectly returning ENXIO. Resolve the issue by revering this change to restore the previous behavior which depends solely on checking the dirty list. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8816 Closes #8834	2019-05-30 17:13:18 -07:00
Tomohiro Kusumi	fe0c9f409a	Remove vn_set_fs_pwd()/vn_set_pwd() (no need to be at / during insmod) Per suggestion from @behlendorf in #8777, remove vn_set_fs_pwd() and vn_set_pwd() which are only used in zfs_ioctl.c:_init() while loading zfs.ko. The rest of initialization functions being called here after cwd set to / don't depend on cwd of the process except for spa_config_load(). spa_config_load() uses a relative path ".//etc/zfs/zpool.cache" when `rootdir` is non-NULL, which is "/etc/zfs/zpool.cache" given cwd is /, so just unconditionally use the absolute path without "./", so that `vn_set_pwd("/")` as well as the entire functions can be removed. This is also what FreeBSD does. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com> Closes #8826	2019-05-29 16:18:14 -07:00
Brian Behlendorf	1e724f4f34	Exclude log device ashift from normal class When opening a log device during import its allocation bias will not yet have been set by vdev_load(). This results in the log device's ashift being incorrectly applied to the maximum ashift of the vdevs in the normal class. Which in turn prevents the removal of any top-level devices due to the ashift check in the spa_vdev_remove_top_check() function. This issue is resolved by including vdev_islog in the check since it will be set correctly during vdev_open(). Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8735	2019-05-29 11:35:50 -07:00
madz	ec4afd27f1	Fix integer overflow in get_next_chunk() dn->dn_datablksz type is uint32_t and need to be casted to uint64_t to avoid an overflow when the record size is greater than 4 MiB. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olivier Mazouffre <olivier.mazouffre@ims-bordeaux.fr> Closes #8778 Closes #8797	2019-05-29 10:17:25 -07:00
loli10K	ab1a9705f8	Double-free of encryption wrapping key due to invalid pool properties This commits fixes a double-free in zfs_ioc_pool_create() triggered by specifying an unsupported combination of properties when creating a pool with encryption enabled. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #8791	2019-05-28 15:19:50 -07:00
Tomohiro Kusumi	841a7a98fc	Update descriptions for vnops These descriptions are not uptodate with the code. Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #8767	2019-05-25 14:29:10 -07:00
Tom Caputi	3b61ca3e57	Fix embedded bp accounting in count_block() Currently, count_block() does not correctly account for the possibility that the bp that is passed to it could be embedded. These blocks shouldn't be counted since the work of scanning these blocks in already handled when the containing block is scanned. This patch simply resolves this issue by returning early in this case. Reviewed by: Allan Jude <allanjude@freebsd.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Authored-by: Bill Sommerfeld <sommerfeld@alum.mit.edu> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8800 Closes #8766	2019-05-25 13:52:23 -07:00
Tomohiro Kusumi	bfd5a709e7	Linux 5.2 compat: Directly call wait_on_page_bit() wait_on_page_writeback() was made GPL only in torvalds/linux@19343b5bdd. Directly call wait_on_page_bit() without using wait_on_page_writeback() interface, given zfs_putpage() is the only caller for now. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com> Closes #8794	2019-05-25 13:42:09 -07:00
Tomohiro Kusumi	c3e5907fdb	Drop local definition of MOUNT_BUSY It's accessible via <sys/mntent.h>. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com> Closes #8765	2019-05-24 16:43:23 -07:00
loli10K	2e8c315fc6	Device removal panics on 32-bit systems The issue is caused by an incorrect usage of the sizeof() operator in vdev_obsolete_sm_object(): on 64-bit systems this is not an issue since both "uint64_t" and "uint64_t*" are 8 bytes in size. However on 32-bit systems pointers are 4 bytes long which is not supported by zap_lookup_impl(). Trying to remove a top-level vdev on a 32-bit system will cause the following failure: VERIFY3(0 == vdev_obsolete_sm_object(vd, &obsolete_sm_object)) failed (0 == 22) PANIC at vdev_indirect.c:833:vdev_indirect_sync_obsolete() Showing stack for process 1315 CPU: 6 PID: 1315 Comm: txg_sync Tainted: P OE 4.4.69+ #2 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014 c1abc6e7 0ae10898 00000286 d4ac3bc0 c14397bc da4cd7d8 d4ac3bf0 d4ac3bd0 d790e7ce d7911cc1 00000523 d4ac3d00 d790e7d7 d7911ce4 da4cd7d8 00000341 da4ce664 da4cd8c0 da33fa6e 49524556 28335946 3d3d2030 65647620 626f5f76 Call Trace: [<>] dump_stack+0x58/0x7c [<>] spl_dumpstack+0x23/0x27 [spl] [<>] spl_panic.cold.0+0x5/0x41 [spl] [<>] ? dbuf_rele+0x3e/0x90 [zfs] [<>] ? zap_lookup_norm+0xbe/0xe0 [zfs] [<>] ? zap_lookup+0x57/0x70 [zfs] [<>] ? vdev_obsolete_sm_object+0x102/0x12b [zfs] [<>] vdev_indirect_sync_obsolete+0x3e1/0x64d [zfs] [<>] ? txg_verify+0x1d/0x160 [zfs] [<>] ? dmu_tx_create_dd+0x80/0xc0 [zfs] [<>] vdev_sync+0xbf/0x550 [zfs] [<>] ? mutex_lock+0x10/0x30 [<>] ? txg_list_remove+0x9f/0x1a0 [zfs] [<>] ? zap_contains+0x4d/0x70 [zfs] [<>] spa_sync+0x9f1/0x1b10 [zfs] ... [<>] ? kthread_stop+0x110/0x110 This commit simply corrects the "integer_size" parameter used to lookup the vdev's ZAP object. Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #8790	2019-05-24 12:17:52 -07:00
loli10K	75c09c5060	Fix coverity defects: CID 186143 CID 186143: Memory - illegal accesses (USE_AFTER_FREE) This patch fixes an use-after-free in spa_import_progress_destroy() moving the kmem_free() call at the end of the function. Reviewed-by: Chris Dunlop <chris@onthe.net.au> Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #8788	2019-05-23 19:17:00 -07:00
Richard Elling	78fac8d925	Fix kstat state update during pool transition When reading kstats, the health (aka state) of the pool is stored into /proc/spl/kstat/zfs/POOLNAME/state via spa_state_to_name(). However, during import/export there is a case where the spa exists, but the root vdev does not exist. This fix checks that case and sets the state to "TRANSITIONING" Unfortunately, it is not easy to reproduce a test for this. It was detected randomly during ZTS runs while kstats were also being sampled regularly. After this change, further testing did not trip on the case and the TRANSITIONING state was collected at least once by the kstats. For posterity, the backtrace prior to this fix is: [Mon May 13 17:21:00 2019] RIP: 0010:spa_state_to_name+0x10/0xb0 [zfs] ... Mon May 13 17:21:00 2019] Call Trace: [Mon May 13 17:21:00 2019] spa_state_data+0x1a/0x40 [zfs] [Mon May 13 17:21:00 2019] kstat_seq_show+0x117/0x440 [spl] [Mon May 13 17:21:00 2019] seq_read+0xe5/0x430 [Mon May 13 17:21:00 2019] proc_reg_read+0x45/0x70 [Mon May 13 17:21:00 2019] __vfs_read+0x1b/0x40 [Mon May 13 17:21:00 2019] vfs_read+0x8e/0x130 [Mon May 13 17:21:00 2019] SyS_read+0x55/0xc0 [Mon May 13 17:21:00 2019] ? SyS_fcntl+0x5d/0xb0 [Mon May 13 17:21:00 2019] do_syscall_64+0x73/0x130 [Mon May 13 17:21:00 2019] entry_SYSCALL_64_after_hwframe+0x3d/0xa2 Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com> Closes #8746	2019-05-23 14:28:53 -07:00
Brian Behlendorf	bff2361aeb	Linux 5.2 compat: rw_tryupgrade() Commit torvalds/linux@46ad0840b has removed the architecture specific rwsem source and headers leaving only the generic version. As part of this change the RWSEM_ACTIVE_READ_BIAS and RWSEM_ACTIVE_WRITE_BIAS macros were moved to the private kernel/locking/rwsem.h header. This results in a build failure because these macros were required to implement the rw_tryupgrade() compatibility function. In practice, this isn't a major problem because there are only a few consumers of rw_tryupgrade() and because consumers of rw_tryupgrade should be written to retry using rw_enter(RW_WRITER). After auditing all of the callers only dmu_zfetch() was determined not to perform a retry. It has been updated in this commit to resolve this issue. That said, the rw_tryupgrade() functionality should be considered for possible removal in a future release due to the difficultly in supporting the interface. Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8730	2019-05-23 13:46:33 -07:00
Paul Dagnelie	e61b53475e	Fix incorrect assertion in dnode_dirty_l1range The db_dirtycnt of an EVICTING dbuf is always 0. However, it still appears in the dn_dbufs tree. If we call dnode_dirty_l1range on a range that contains an EVICTING dbuf, we will attempt to mark it dirty (which will fail because it's EVICTING, resulting in a new dbuf being created and dirtied). Later, in ZFS_DEBUG mode, we assert that all the dbufs in the range are dirty. If the EVICTING dbuf is still present, this will trip the assertion erroneously. Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Sara Hartse <sara.hartse@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #8745	2019-05-19 17:30:33 -07:00
Olaf Faaland	ca95f70dff	zpool import progress kstat When an import requires a long MMP activity check, or when the user requests pool recovery, the import make take a long time. The user may not know why, or be able to tell whether the import is progressing or is hung. Add a kstat which lists all imports currently being processed by the kernel (currently only one at a time is possible, but the kstat allows for more than one). The kstat is /proc/spl/kstat/zfs/import_progress. The kstat contents are as follows: pool_guid load_state multihost_secs max_txg pool_name 16667015954387398 3 15 0 tank3 load_state: the value of spa_load_state multihost_secs: seconds until the end of the multihost activity check; if over, or none required, this is 0 max_txg: current spa_load_max_txg, if rewind is occurring This could be used by outside tools, such as a pacemaker resource agent, to report import progress, or as a part of manual troubleshooting. The zpool import subcommand could also be modified to report this information. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #8696	2019-05-09 10:08:05 -07:00
Tomohiro Kusumi	b689de85e8	Add missing trailing '\n' in printk() messages These messages will want '\n' like any other regular printk() messages. Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com> Closes #8726	2019-05-08 16:43:55 -07:00
Tomohiro Kusumi	97aa3ba44f	Fix link count of root inode when snapdir is visible Given how zfs_getattr() is implemented, zfs_getattr_fast() (used by ->getattr() of zpl inodes) also needs to consider an additional link count if "snapdir" property is set to "visible". Without this, # of directories in root inode of each dataset doesn't match the link count when snapdir is visible. Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com> Closes #8727	2019-05-08 16:40:51 -07:00
Brian Behlendorf	a20f43b51b	Linux 5.0 compat: ASM_BUG macro The 5.0 kernel defines the macro ASM_BUG. In order to prevent a conflict and build failure rename ASM_BUG to ZFS_ASM_BUG. This is currently only an issue on aarch64 but all instances of ASM_BUG we're renamed to avoid any future conflict on x86_64. Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8725 Issue #8545	2019-05-08 10:18:40 -07:00
Brian Behlendorf	515ddf6504	Fix errant EFAULT during writes (#8719 ) Commit 98bb45e resolved a deadlock which could occur when handling a page fault in zfs_write(). This change added the uio_fault_disable field to the uio structure but failed to initialize it to B_FALSE. This uninitialized field would cause uiomove_iov() to call __copy_from_user_inatomic() instead of copy_from_user() resulting in unexpected EFAULTs. Resolve the issue by fully initializing the uio, and clearing the uio_fault_disable flags after it's used in zfs_write(). Additionally, reorder the uio_t field assignments to match the order the fields are declared in the structure. Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8640 Closes #8719	2019-05-08 10:04:04 -07:00
DeHackEd	1f02ecc5a5	Make zfs_special_class_metadata_reserve_pct into a parameter Exported and documented a new module parameter. Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: DHE <git@dehacked.net> Closes #8706	2019-05-07 15:34:42 -07:00
Brian Behlendorf	caf9dd209f	Fix send/recv lost spill block When receiving a DRR_OBJECT record the receive_object() function needs to determine how to handle a spill block associated with the object. It may need to be removed or kept depending on how the object was modified at the source. This determination is currently accomplished using a heuristic which takes in to account the DRR_OBJECT record and the existing object properties. This is a problem because there isn't quite enough information available to do the right thing under all circumstances. For example, when only the block size changes the spill block is removed when it should be kept. What's needed to resolve this is an additional flag in the DRR_OBJECT which indicates if the object being received references a spill block. The DRR_OBJECT_SPILL flag was added for this purpose. When set then the object references a spill block and it must be kept. Either it is update to date, or it will be replaced by a subsequent DRR_SPILL record. Conversely, if the object being received doesn't reference a spill block then any existing spill block should always be removed. Since previous versions of ZFS do not understand this new flag additional DRR_SPILL records will be inserted in to the stream. This has the advantage of being fully backward compatible. Existing ZFS systems receiving this stream will recreate the spill block if it was incorrectly removed. Updated ZFS versions will correctly ignore the additional spill blocks which can be identified by checking for the DRR_SPILL_UNMODIFIED flag. The small downside to this approach is that is may increase the size of the stream and of the received snapshot on previous versions of ZFS. Additionally, when receiving streams generated by previous unpatched versions of ZFS spill blocks may still be lost. OpenZFS-issue: https://www.illumos.org/issues/9952 FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=233277 Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8668	2019-05-07 15:18:44 -07:00
Tomohiro Kusumi	9c53e51616	Fix `zfs set atime\|relatime=off\|on` behavior on inherited datasets `zfs set atime\|relatime=off\|on` doesn't disable or enable the property on read for datasets whose property was inherited from parent, until a dataset is once unmounted and mounted again. (The properties start to work properly if a dataset is once unmounted and mounted again. The difference comes from regular mount process, e.g. via zpool import, uses mount options based on properties read from ondisk layout for each dataset, whereas `zfs set atime\|relatime=off\|on` just remounts a specified dataset.) -- # zpool create p1 <device> # zfs create p1/f1 # zfs set atime=off p1 # echo test > /p1/f1/test # sync # zfs list NAME USED AVAIL REFER MOUNTPOINT p1 176K 18.9G 25.5K /p1 p1/f1 26K 18.9G 26K /p1/f1 # zfs get atime NAME PROPERTY VALUE SOURCE p1 atime off local p1/f1 atime off inherited from p1 # stat /p1/f1/test \| grep Access \| tail -1 Access: 2019-04-26 23:32:33.741205192 +0900 # cat /p1/f1/test test # stat /p1/f1/test \| grep Access \| tail -1 Access: 2019-04-26 23:32:50.173231861 +0900 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ changed by read(2) -- The problem is that zfsvfs::z_atime which was probably intended to keep incore atime state just gets updated by a callback function of "atime" property change, atime_changed_cb(), and never used for anything else. Since now that all file read and atime update use a common function zpl_iter_read_common() -> file_accessed(), and whether to update atime via ->dirty_inode() is determined by atime_needs_update(), atime_needs_update() needs to return false once atime is turned off. It currently continues to return true on `zfs set atime=off`. Fix atime_changed_cb() by setting or dropping SB_NOATIME in VFS super block depending on a new atime value, so that atime_needs_update() works as expected after property change. The same problem applies to "relatime" except that a self contained relatime test is needed. This is because relatime_need_update() is based on a mount option flag MNT_RELATIME, which doesn't exist in datasets with inherited "relatime" property via `zfs set relatime=...`, hence it needs its own relatime test zfs_relatime_need_update(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #8674 Closes #8675	2019-05-07 10:06:30 -07:00
Tomohiro Kusumi	a762893269	Fix typo/etc in module/zfs/zfs_ctldir.c Drop duplicated phrases in comments. Also drop an obsolete comment "Perform a mount of the associated...", as all it does now is get objid from DMU and lookup incore inode. Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: Richard Laager <rlaager@wiktel.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #8707	2019-05-05 15:57:23 -07:00
Tomohiro Kusumi	de3e0b914b	Linux 5.0 compat: Use totalhigh_pages() Linux kernel commit ca79b0c211af63fa3276f0e3fd7dd9ada2439839 "mm: convert totalram_pages and totalhigh_pages variables to atomic" replaced `totalhigh_pages` with an inline function `totalhigh_pages()`. This broke compilation on IA32, etc, as ZoL uses `totalhigh_pages` on archs with highmem. Confirmed on Fedora 30 (5.0.9-301.fc30.i686). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #8677 Closes #8701	2019-05-04 16:40:48 -07:00
John Gallagher	1eacf2b3b0	Improve rate at which new zvols are processed The kernel function which adds new zvols as disks to the system, add_disk(), briefly opens and closes the zvol as part of its work. Closing a zvol involves waiting for two txgs to sync. This, combined with the fact that the taskq processing new zvols is single threaded, makes this processing new zvols slow. Waiting for these txgs to sync is only necessary if the zvol has been written to, which is not the case during add_disk(). This change adds tracking of whether a zvol has been written to so that we can skip the txg_wait_synced() calls when they are unnecessary. This change also fixes the flags passed to blkdev_get_by_path() by vdev_disk_open() to be FMODE_READ \| FMODE_WRITE \| FMODE_EXCL instead of just FMODE_EXCL. The flags were being incorrectly calculated because we were using the wrong version of vdev_bdev_mode(). Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: John Gallagher <john.gallagher@delphix.com> Closes #8526 Closes #8615	2019-05-04 16:39:10 -07:00
Matthew Ahrens	8d9f616511	Reword comment in lz4_compress_zfs The comment in lz4_compress_zfs could be more clear and specific. It also contains needlessly strong language. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes: #8702 Closes: #8703	2019-05-02 16:46:04 -07:00
Tom Caputi	fa24166074	Add feature check for 'zpool resilver' command The 'zpool resilver' command requires that the resilver_defer feature is active on the pool. Unfortunately, the check for this was left out of the original patch. This commit simply corrects this so that the command properly returns an error in this case. Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Igor Kozhukhov <igor@dilos.org> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8700	2019-05-02 16:42:31 -07:00
Tomohiro Kusumi	f0ce0436aa	Correct snprintf() size argument The size argument of snprintf(3) in glibc and snprintf() in Linux kernel includes trailing \0, as snprintf(3) man page explains it as "write at most size bytes (including the trailing null byte ('\0'))", i.e. snprintf() can just take buffer size. e.g. For snprintf() in module/zfs/zfs_ctldir.c, a buffer size is MAXPATHLEN, and a caller is passing MAXPATHLEN to snprintf(), so size should just be `path_len` to do what the caller is trying to do. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #8692	2019-04-30 19:41:12 -07:00
Brian Behlendorf	c12ea77865	Linux 5.0 compat: Remove incorrect ASSERT Not all block devices, notably scsi_debug, set a root_blkg on the request queue. Remove this assertion and allow the the existing call to blkg_tryget() to gracefully handle the NULL (which it does). Reviewed-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8678	2019-04-29 18:20:21 -07:00
Tomohiro Kusumi	126d0fa733	Use SEEK_{SET,CUR,END} for file seek "whence" Use either SEEK_* or 0,1,2..., but not both. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #8656	2019-04-25 10:17:27 -07:00
Tom Caputi	f4c594da94	Fixes for the DMU free throttle This patch fixes 2 issues with the DMU free throttle implemented in dmu_free_long_range(). The first issue is that get_next_chunk() was calculating the number of L1 blocks the free would dirty incorrectly. In some cases involving extremely large files, this code would greatly overestimate the number of effected L1 blocks, causing excessive calls to txg_wait_open(). This patch corrects the calculation. The second issue is that the free throttle uses the total number of free'd blocks in all (open, quiescing, and syncing) txgs to determine whether to throttle. This causes large frees (such as those created by the first issue) to cause 4 txg syncs before any further frees were allowed to proceed. This patch ensures that the accounting is done entirely in a per-txg fashion, so that frees from a given txg don't affect those that immediately follow it. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8655	2019-04-25 10:16:24 -07:00
Tomohiro Kusumi	34d343c2a8	Drop unused ZNODE_STATS and ZNODE_STAT_ADD() Unused since 5649246dd3("Remove znode move functionality"), and ZNODE_STAT_ADD() will never be needed. Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #8636	2019-04-19 12:05:15 -07:00
Tomohiro Kusumi	a35c12073c	Fix incorrect "[UNUSED]" comments These aren't unused. `flag` in zfs_create() also isn't to indicate large file. Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Richard Laager <rlaager@wiktel.com> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com> Closes #8635	2019-04-19 12:03:32 -07:00
cfzhu	5090f72743	Code improvement and bug fixes for QAT support 1. Support QAT when ZFS is root file-system: When ZFS module is loaded before QAT started, the QAT can be started again in post-process, e.g.: echo 0 > /sys/module/zfs/parameters/zfs_qat_compress_disable echo 0 > /sys/module/zfs/parameters/zfs_qat_encrypt_disable echo 0 > /sys/module/zfs/parameters/zfs_qat_checksum_disable 2. Verify alder checksum of the de-compress result 3. Allocate Digest, IV and AAD buffer in physical contiguous memory by QAT_PHYS_CONTIG_ALLOC. 4. Update the documentation for zfs_qat_compress_disable, zfs_qat_checksum_disable, zfs_qat_encrypt_disable. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Weigang Li <weigang.li@intel.com> Signed-off-by: Chengfeix Zhu <chengfeix.zhu@intel.com> Closes #8323 Closes #8610	2019-04-16 12:38:36 -07:00
Richard Laager	fcf21f8fcb	Update a comment to match the code GRUB supports large_blocks. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #8626	2019-04-16 10:02:33 -07:00
Richard Laager	393363c5ec	Consistently captialize GUID for features Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #8626	2019-04-16 10:01:51 -07:00
Richard Laager	612c4930dd	Fix the spelling of deferred Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Laager <rlaager@wiktel.com> Closes #8626	2019-04-16 10:01:45 -07:00
Tom Caputi	c2c6eadf29	Fix issues with truncated files in raw sends When receiving a raw send stream only reallocated objects whose contents were not freed by the standard indicators should call dmu_free_long_range(). Furthermore, if calling dmu_free_long_range() is required then the objects current block size must be used and not the new block size. Two additional test cases were added to provided realistic test coverage for processing reallocated objects which are part of a raw receive. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8528 Closes #8607	2019-04-15 15:28:48 -07:00
Tom Caputi	7dcd318832	Cleanup nits from `ab7615d92` This patch simply up cleans up a nit and corrects an error message issue that were introduced in the Multiple DVA scrub patch. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8619	2019-04-14 11:03:06 -07:00
Brian Behlendorf	b92f5d9f82	Fix issue in receive_object() during reallocation When receiving an object to a previously allocated interior slot the new object should be "allocated" by setting DMU_NEW_OBJECT, not "reallocated" with dnode_reallocate(). For resilience verify the slot is free as required in case the stream is malformed. Add a test case to generate more realistic incremental send streams that force reallocation to occur during the receive. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8067 Closes #8614	2019-04-12 14:28:04 -07:00
Brian Behlendorf	3fa93bb8d3	Fix TXG_MASK cstyle Fix style issue for 'tx->tx_txg&TXG_MASK'. There should be white space around the '&' character. Split the dnode_reallocate() ASSERT to make it more readable to clearly separate the checks. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8606	2019-04-12 11:30:59 -07:00
Jorgen Lundman	48ed0f9da0	Always call rw_init in zio_crypt_key_unwrap The error path in zio_crypt_key_unwrap would call zio_crypt_key_destroy which calls rw_destroy(&key->zk_salt_lock); which has not yet been initialized. We move the rw_init() call to the start of zio_crypt_key_unwrap instead. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #8604 Closes #8605	2019-04-10 15:39:40 -07:00
Tim Chase	8cb34421e0	Avoid stack overwrite in zfs_setattr_dir() The bulk[] array index, count, must be reset per-iteration in order to not overwrite the stack. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Chris Dunlop <chris@onthe.net.au> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #8072 Closes #8597 Closes #8601	2019-04-10 15:38:21 -07:00
Brian Behlendorf	d93d4b1acd	Revert "Fix issues with truncated files in raw sends" This partially reverts commit `5dbf8b4ed`. This change resolved the issues observed with truncated files in raw sends. However, the required changes to dnode_allocate() introduced a regression for non-raw streams which needs to be understood. The additional debugging improvements from the original patch were not reverted. Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #7378 Issue #8528 Issue #8540 Issue #8565 Close #8584	2019-04-05 17:32:56 -07:00
Matthew Ahrens	944a37248a	predictive prefetch disabled on new pools until export/reboot When a pool is initially created (by `zpool create`), predictive prefetch is inadvertently disabled, until the pool is export/import-ed, or the machine is rebooted. When device removal was introduced, we added some code to disable predictive prefetching until indirect vdevs have been loaded. This resulted in the "default state" of prefetch being disabled, until we proactively enable it after indirect vdevs are loaded. Unfortunately this resulted in a few bugs where in some code paths we neglect to enable predictive prefetch. The first of these was fixed by `20507534d4` This commit fixes another case where we also need to explicitly enable predictive prefetch, when the pool is initially created. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #8577	2019-04-05 10:12:02 -07:00
Don Brady	b4ddec7af6	features.kernel layout should match features.pool The features.kernel layout should match features.pool. Reviewed-by: Sara Hartse <sara.hartse@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@delphix.com> Closes #8566	2019-04-04 19:00:55 -07:00
Sara Hartse	a887d653b3	Restrict kstats and print real pointers There are several places where we use zfs_dbgmsg and %p to print pointers. In the Linux kernel, these values obfuscated to prevent information leaks which means the pointers aren't very useful for debugging crash dumps. We decided to restrict the permissions of dbgmsg (and some other kstats while we were at it) and print pointers with %px in zfs_dbgmsg as well as spl_dumpstack Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: John Gallagher <john.gallagher@delphix.com> Signed-off-by: sara hartse <sara.hartse@delphix.com> Closes #8467 Closes #8476	2019-04-04 18:57:06 -07:00
Brian Behlendorf	f4e35b165c	Fix txg_wait_open() load average inflation Callers of txg_wait_open() which set should_quiesce=B_TRUE should be accounted for as iowait time. Otherwise, the caller is understood to be idle and cv_wait_sig() is used to prevent incorrectly inflating the system load average. Similarly txg_wait_wait() has been updated to use cv_wait_io() to be accounted against iowait. Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8550 Closes #8558	2019-04-04 09:44:46 -07:00
Brian Behlendorf	1b939560be	Add TRIM support UNMAP/TRIM support is a frequently-requested feature to help prevent performance from degrading on SSDs and on various other SAN-like storage back-ends. By issuing UNMAP/TRIM commands for sectors which are no longer allocated the underlying device can often more efficiently manage itself. This TRIM implementation is modeled on the `zpool initialize` feature which writes a pattern to all unallocated space in the pool. The new `zpool trim` command uses the same vdev_xlate() code to calculate what sectors are unallocated, the same per- vdev TRIM thread model and locking, and the same basic CLI for a consistent user experience. The core difference is that instead of writing a pattern it will issue UNMAP/TRIM commands for those extents. The zio pipeline was updated to accommodate this by adding a new ZIO_TYPE_TRIM type and associated spa taskq. This new type makes is straight forward to add the platform specific TRIM/UNMAP calls to vdev_disk.c and vdev_file.c. These new ZIO_TYPE_TRIM zios are handled largely the same way as ZIO_TYPE_READs or ZIO_TYPE_WRITEs. This makes it possible to largely avoid changing the pipieline, one exception is that TRIM zio's may exceed the 16M block size limit since they contain no data. In addition to the manual `zpool trim` command, a background automatic TRIM was added and is controlled by the 'autotrim' property. It relies on the exact same infrastructure as the manual TRIM. However, instead of relying on the extents in a metaslab's ms_allocatable range tree, a ms_trim tree is kept per metaslab. When 'autotrim=on', ranges added back to the ms_allocatable tree are also added to the ms_free tree. The ms_free tree is then periodically consumed by an autotrim thread which systematically walks a top level vdev's metaslabs. Since the automatic TRIM will skip ranges it considers too small there is value in occasionally running a full `zpool trim`. This may occur when the freed blocks are small and not enough time was allowed to aggregate them. An automatic TRIM and a manual `zpool trim` may be run concurrently, in which case the automatic TRIM will yield to the manual TRIM. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Contributions-by: Saso Kiselkov <saso.kiselkov@nexenta.com> Contributions-by: Tim Chase <tim@chase2k.com> Contributions-by: Chunwei Chen <tuxoko@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8419 Closes #598	2019-03-29 09:13:20 -07:00
Tom Caputi	5dbf8b4edd	Fix issues with truncated files in raw sends This patch fixes a few issues with raw receives involving truncated files: * dnode_reallocate() now calls dnode_set_blksz() instead of dnode_setdblksz(). This ensures that any remaining dbufs with blkid 0 are resized along with their containing dnode upon reallocation. * One of the calls to dmu_free_long_range() in receive_object() needs to check that the object it is about to free some contents or hasn't been completely removed already by a previous call to dmu_free_long_object() in the same function. * The same call to dmu_free_long_range() in the previous point needs to ensure it uses the object's current block size and not the new block size. This ensures the blocks of the object that are supposed to be freed are completely removed and not simply partially zeroed out. This patch also adds handling for DRR_OBJECT_RANGE records to dprintf_drr() for debugging purposes. Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7378 Closes #8528	2019-03-27 11:30:48 -07:00
Igor K	c048ddaf33	Fix vd_path and error in spa_vdev_remove() Make a local copy of the vd_path and preserve the removal error for use in spa_history_log_internal(). This is required because after spa_vdev_exit() there is nothing preventing the vdev state from changing. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Igor Kozhukhov <igor@dilos.org> Closes #8522	2019-03-22 13:25:07 -07:00
Roman Strashkin	234234ca4d	Panic when running 'zpool split' Added missing remove of detachable VDEV from txg's DTL list to avoid use-after-free for the split VDEV Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Roman Strashkin <roman.strashkin@nexenta.com> Closes #5565 Closes #7856	2019-03-22 13:11:36 -07:00
George Wilson	2efea7c82c	ZFS Reads may result in unneccesary calls to zil_commit ZFS supports O_RSYNC for read operations and when specified will ensure the same level of data integrity that O_DSYNC and O_SYNC provides for writes. O_RSYNC by itself has no effect so it must be combined with either O_DSYNC or O_SYNC. However, many platforms don't support O_RSYNC and have mapped O_SYNC to mean O_RSYNC within ZFS. This is incorrect and causes unnecessary calls to zil_commit. Only platforms which support O_RSYNC should implement the zil_commit functionality in the read code path. Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Wilson <george.wilson@delphix.com> Closes #8523	2019-03-22 13:09:11 -07:00
Olaf Faaland	060f0226e6	MMP interval and fail_intervals in uberblock When Multihost is enabled, and a pool is imported, uberblock writes include ub_mmp_delay to allow an importing node to calculate the duration of an activity test. This value, is not enough information. If zfs_multihost_fail_intervals > 0 on the node with the pool imported, the safe minimum duration of the activity test is well defined, but does not depend on ub_mmp_delay: zfs_multihost_fail_intervals * zfs_multihost_interval and if zfs_multihost_fail_intervals == 0 on that node, there is no such well defined safe duration, but the importing host cannot tell whether mmp_delay is high due to I/O delays, or due to a very large zfs_multihost_interval setting on the host which last imported the pool. As a result, it may use a far longer period for the activity test than is necessary. This patch renames ub_mmp_sequence to ub_mmp_config and uses it to record the zfs_multihost_interval and zfs_multihost_fail_intervals values, as well as the mmp sequence. This allows a shorter activity test duration to be calculated by the importing host in most situations. These values are also added to the multihost_history kstat records. It calculates the activity test duration differently depending on whether the new fields are present or not; for importing pools with only ub_mmp_delay, it uses (zfs_multihost_interval + ub_mmp_delay) * zfs_multihost_import_intervals Which results in an activity test duration less sensitive to the leaf count. In addition, it makes a few other improvements: * It updates the "sequence" part of ub_mmp_config when MMP writes in between syncs occur. This allows an importing host to detect MMP on the remote host sooner, when the pool is idle, as it is not limited to the granularity of ub_timestamp (1 second). * It issues writes immediately when zfs_multihost_interval is changed so remote hosts see the updated value as soon as possible. * It fixes a bug where setting zfs_multihost_fail_intervals = 1 results in immediate pool suspension. * Update tests to verify activity check duration is based on recorded tunable values, not tunable values on importing host. * Update tests to verify the expected number of uberblocks have valid MMP fields - fail_intervals, mmp_interval, mmp_seq (sequence number), that sequence number is incrementing, and that uberblock values match tunable settings. Reviewed-by: Andreas Dilger <andreas.dilger@whamcloud.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #7842	2019-03-21 12:47:57 -07:00
Jorgen Lundman	d10b2f1d35	Mutex leak in dsl_dataset_hold_obj() In addition to dsl_dataset_evict_async() releasing a hold, there is an error case in dsl_dataset_hold_obj() which had missed 4 additional release calls. This was introduced in `a1d477c24`. openzfsonosx-commit: https://github.com/openzfsonosx/zfs/commit/63ff7f1c Authored by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8517	2019-03-21 10:36:58 -07:00
cfzhu	45001b949c	QAT: Allocate digest_buffer using QAT_PHYS_CONTIG_ALLOC() If the buffer 'digest_buffer' is allocated in the qat_checksum() stack, it can't ensure that the address is physically contiguous, and the DMA result of the buffer may be handled incorrectly. Using QAT_PHYS_CONTIG_ALLOC() ensures a physically contiguous allocation. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Chengfei, Zhu <chengfeix.zhu@intel.com> Closes #8323 Closes #8521	2019-03-21 10:35:18 -07:00
Brian Behlendorf	ec4f9b8f30	Report holes when there are only metadata changes Update the dirty check in dmu_offset_next() such that dnode's are only considered dirty for the purpose or reporting holes when there are pending data blocks or frees to be synced. This ensures that when there are only metadata updates to be synced (atime) that holes are reported. Reviewed-by: Debabrata Banerjee <dbanerje@akamai.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6958 Closes #8505	2019-03-21 10:30:15 -07:00
Julian Heuking	304d469dcd	Add missing dmu_zfetch_fini() in dnode_move_impl() As it turns out, on the Windows platform when rw_init() is called (rather its bedrock call ExInitializeResourceLite) it is placed on an active-list of locks, and is removed at rw_destroy() time. dnode_move() has logic to copy over the old-dnode to new-dnode, including calling dmu_zfetch_init(new-dnode). But due to the missing dmu_zfetch_fini(old-dnode), kmem will call dnode_dest() to release the memory (and in debug builds fill pattern 0xdeadbeef) over the Windows active-lock's prev/next list pointers, making Windows sad. But on other platforms, the contents of dmu_zfetch_fini() is one call to list_destroy() and one to rw_destroy(), which is effectively a no-op call and is not required. This commit is mostly for "correctness" and can be skipped there. Porting Notes: * This leak exists on Linux but currently can never happen because the dnode_move() functionality is not supported. openzfsonosx-commit: openzfsonosx/zfs@d95fe517 Authored by: Julian Heuking <JulianH@beckhoff.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #8519	2019-03-20 15:06:55 -07:00
Brian Behlendorf	ca6c7a94c9	Fix l2arc_evict() destroy race When destroying an arc_buf_hdr_t its identity cannot be discarded until it is entirely undiscoverable. This not only includes being unhashed, but also being removed from the l2arc header list. Discarding the header's identify prematurely renders the hash lock useless because it will always hash to bucket zero. This change resolves a race with l2arc_evict() by discarding the identity after it has been removed from the l2arc header list. This ensures either the header is not on the list or contains the correct identify. Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7688 Closes #8144	2019-03-15 14:17:38 -07:00
Tom Caputi	ab7615d92c	Multiple DVA Scrubbing Fix Currently, there is an issue in the sequential scrub code which prevents self healing from working in some cases. The scrub code will split up all DVA copies of a bp and issue each of them separately. The problem is that, since each of the DVAs is no longer associated with the others, the self healing code doesn't have the opportunity to repair problems that show up in one of the DVAs with the data from the others. This patch fixes this issue by ensuring that all IOs issued by the sequential scrub code include all DVAs. Initially, only the first DVA of each is attempted. If an issue arises, the IO is retried with all available copies, giving the self healing code a chance to correct the issue. To test this change, this patch also adds the ability for zinject to specify individual DVAs to inject read errors into. We then add a new test case that utilizes this functionality to ensure scrubs and self-healing reads can handle and transparently fix issues with individual copies of blocks. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8453	2019-03-15 14:14:31 -07:00
Tony Hutter	2bbec1c910	Make zpool status counters match error events count The number of IO and checksum events should match the number of errors seen in zpool status. Previously there was a mismatch between the two counts because zpool status would only count unrecovered errors, while zpool events would get an event for all errors (recovered or not). This lead to situations where disks could be faulted for "too many errors", while at the same time showing zero errors in zpool status. This fixes the zpool status error counters to increment at the same times we post the error events. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #4851 Closes #7817	2019-03-14 18:21:53 -07:00
Tom Caputi	04a3b0796c	Fix memory leaks in zfsvfs_create_impl() This patch simply fixes some small memory leaks that can happen during error handling in zfsvfs_create_impl(). If the function fails, it frees all the memory / references it created. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8490	2019-03-14 18:14:36 -07:00
Tom Caputi	eaed840542	Better user experience for errata 4 This patch attempts to address some user concerns that have arisen since errata 4 was introduced. * The errata warning has been made less scary for users without any encrypted datasets. * The errata warning now clears itself without a pool reimport if the bookmark_v2 feature is enabled and no encrypted datasets exist. * It is no longer possible to create new encrypted datasets without enabling the bookmark_v2 feature, thus helping to ensure that the errata is resolved. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Issue ##8308 Closes #8504	2019-03-14 16:48:30 -07:00
Alexander Motin	1af240f3b5	Add separate aggregation limit for non-rotating media Before sequential scrub patches ZFS never aggregated I/Os above 128KB. Sequential scrub bumped that to 1MB, supposedly to reduce number of head seeks for spinning disks. But for SSDs it makes little to no sense, especially on FreeBSD, where due to MAXPHYS limitation device will likely still see bunch of 128KB I/Os instead of one large. Having more strict aggregation limit for SSDs allows to avoid allocation of large memory buffer and copy to/from it, that is a serious problem when throughput reaches gigabytes per second. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Closes #8494	2019-03-13 12:00:10 -07:00
Tom Caputi	f00ab3f22c	Detect and prevent mixed raw and non-raw sends Currently, there is an issue in the raw receive code where raw receives are allowed to happen on top of previously non-raw received datasets. This is a problem because the source-side dataset doesn't know about how the blocks on the destination were encrypted. As a result, any MAC in the objset's checksum-of-MACs tree that is a parent of both blocks encrypted on the source and blocks encrypted by the destination will be incorrect. This will result in authentication errors when we decrypt the dataset. This patch fixes this issue by adding a new check to the raw receive code. The code now maintains an "IVset guid", which acts as an identifier for the set of IVs used to encrypt a given snapshot. When a snapshot is raw received, the destination snapshot will take this value from the DRR_BEGIN payload. Non-raw receives and normal "zfs snap" operations will cause ZFS to generate a new IVset guid. When a raw incremental stream is received, ZFS will check that the "from" IVset guid in the stream matches that of the "from" destination snapshot. If they do not match, the code will error out the receive, preventing the problem. This patch requires an on-disk format change to add the IVset guids to snapshots and bookmarks. As a result, this patch has errata handling and a tunable to help affected users resolve the issue with as little interruption as possible. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8308	2019-03-13 11:00:43 -07:00
Tom Caputi	579ce7c5ae	Add bookmark v2 on-disk feature This patch adds the bookmark v2 feature to the on-disk format. This feature will be needed for the upcoming redacted sends and for an upcoming fix that for raw receives. The feature is not currently used by any code and thus this change is a no-op, aside from the fact that the user can now enable the feature. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Issue #8308	2019-03-13 10:58:39 -07:00
Tom Caputi	369aa501d1	Fix handling of maxblkid for raw sends Currently, the receive code can create an unreadable dataset from a correct raw send stream. This is because it is currently impossible to set maxblkid to a lower value without freeing the associated object. This means truncating files on the send side to a non-0 size could result in corruption. This patch solves this issue by adding a new 'force' flag to dnode_new_blkid() which will allow the raw receive code to force the DMU to accept the provided maxblkid even if it is a lower value than the existing one. For testing purposes the send_encrypted_files.ksh test has been extended to include a variety of truncated files and multiple snapshots. It also now leverages the xattrtest command to help ensure raw receives correctly handle xattrs. Reviewed-by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #8168 Closes #8487	2019-03-13 10:52:01 -07:00
Justin Gottula	cffa8372f4	Fix most zfs_arc_* mod params not actually being modifiable at runtime Most of the zfs_arc_* module parameters do not have their values used by the ARC code directly. Instead, there is a function, arc_tuning_update, which is called during module initialization and periodically thereafter, whose job is to fetch the module parameter values, clamp/ limit them appropriately, and then assign those values to a separate set of internal variables that are actually referenced by the ARC code. Commit `3ec34e55` featured an overhaul of arc_reclaim_thread, which is the former location where the post-init-time calls to arc_tuning_update would occur. The rework split the work previously done by the arc_reclaim_thread into a pair of replacement threads; and unfortunately, the call to arc_tuning_update fell through the cracks and was lost in the reorganization. This meant that changing almost any ARC-related zfs module parameter via /sys/module/zfs/parameters/ would result in the module parameter value itself appearing to change; however the modification would not actually propagate to the ARC code and have any real effect. This commit reinstates the post-init-time call to arc_tuning_update. It is now called during arc_adjust_cb_check; this should be equivalent to its former call location in arc_reclaim_thread. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Justin Gottula <justin@jgottula.com> Closes #8405 Closes #8463	2019-03-12 15:03:59 -07:00
Alek P	4c0883fb4a	Avoid retrieving unused snapshot props This patch modifies the zfs_ioc_snapshot_list_next() ioctl to enable it to take input parameters that alter the way looping through the list of snapshots is performed. The idea here is to restrict functions that throw away some of the snapshots returned by the ioctl to a range of snapshots that these functions actually use. This improves efficiency and execution speed for some rollback and send operations. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Alek Pinchuk <apinchuk@datto.com> Closes #8077	2019-03-12 13:13:22 -07:00
Brian Behlendorf	dd785b5b86	Fix vdev_initialize_restart / removal race Resolve a vdev_initialize crash uncovered by ztest. Similar to when starting a new initialization verify that a removal is not in progress. Additionally, do not restart when the thread already exists. This check is now congruent with the POOL_INITIALIZE_DO handling in spa_vdev_initialize_impl(). Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8477	2019-03-12 10:39:47 -07:00
Olaf Faaland	3d31aad83e	MMP writes rotate over leaves Instead of choosing a leaf vdev quasi-randomly, by starting at the root vdev and randomly choosing children, rotate over leaves to issue MMP writes. This fixes an issue in a pool whose top-level vdevs have different numbers of leaves. The issue is that the frequency at which individual leaves are chosen for MMP writes is based not on the total number of leaves but based on how many siblings the leaves have. For example, in a pool like this: root-vdev +------+---------------+ vdev1 vdev2 \| \| \| +------+-----+-----+----+ disk1 disk2 disk3 disk4 disk5 disk6 vdev1 and vdev2 will each be chosen 50% of the time. Every time vdev1 is chosen, disk1 will be chosen. However, every time vdev2 is chosen, disk2 is chosen 20% of the time. As a result, disk1 will be sent 5x as many MMP writes as disk2. This may create wear issues in the case of SSDs. It also reduces the effectiveness of MMP as it depends on the writes being evenly distributed for the case where some devices fail or are partitioned. The new code maintains a list of leaf vdevs in the pool. MMP records the last leaf used for an MMP write in mmp->mmp_last_leaf. To choose the next leaf, MMP starts at mmp->mmp_last_leaf and traverses the list, continuing from the head if the tail is reached. It stops when a suitable leaf is found or all leaves have been examined. Added a test to verify MMP write distribution is even. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Kash Pande <kash@tripleback.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #7953	2019-03-12 10:37:06 -07:00
George Wilson	b1b94e9644	zfs does not honor NFS sync write semantics The linux kernel's nfsd implementation use RWF_SYNC to determine if the write is synchronous or not. This flag is used to set the kernel's I/O control block flags. Unfortunately, ZFS was not updated to inspect these flags so NFS sync writes were not being honored. This change maps the IOCB_* flags to the ZFS equivalent. Reviewed-by: Don Brady <don.brady@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Wilson <george.wilson@delphix.com> Closes #8474 Closes #8452 Closes #8486	2019-03-11 09:13:37 -07:00
mzhivich	1118f99449	Fix lockdep between ds_lock and dd_lock in dsl_dataset_namelen() Booting debug kernel found an inconsistent lock dependency between dataset's ds_lock and its directory's dd_lock. [ 32.215336] ====================================================== [ 32.221859] WARNING: possible circular locking dependency detected [ 32.221861] 4.14.90+ #8 Tainted: G O [ 32.221862] ------------------------------------------------------ [ 32.221863] dynamic_kernel_/4667 is trying to acquire lock: [ 32.221864] (&ds->ds_lock){+.+.}, at: [<ffffffffc10a4bde>] dsl_dataset_check_quota+0x9e/0x8a0 [zfs] [ 32.221941] but task is already holding lock: [ 32.221941] (&dd->dd_lock){+.+.}, at: [<ffffffffc10cd8e9>] dsl_dir_tempreserve_space+0x3b9/0x1290 [zfs] [ 32.221983] which lock already depends on the new lock. [ 32.221983] the existing dependency chain (in reverse order) is: [ 32.221984] -> #1 (&dd->dd_lock){+.+.}: [ 32.221992] __mutex_lock+0xef/0x14c0 [ 32.222049] dsl_dir_namelen+0xd4/0x2d0 [zfs] [ 32.222093] dsl_dataset_namelen+0x2f1/0x430 [zfs] [ 32.222142] verify_dataset_name_len+0xd/0x40 [zfs] [ 32.222184] dmu_objset_find_dp_impl+0x5f5/0xef0 [zfs] [ 32.222226] dmu_objset_find_dp_cb+0x40/0x60 [zfs] [ 32.222235] taskq_thread+0x969/0x1460 [spl] [ 32.222238] kthread+0x2fb/0x400 [ 32.222241] ret_from_fork+0x3a/0x50 [ 32.222241] -> #0 (&ds->ds_lock){+.+.}: [ 32.222246] lock_acquire+0x14f/0x390 [ 32.222248] __mutex_lock+0xef/0x14c0 [ 32.222291] dsl_dataset_check_quota+0x9e/0x8a0 [zfs] [ 32.222355] dsl_dir_tempreserve_space+0x5d2/0x1290 [zfs] [ 32.222392] dmu_tx_assign+0xa61/0xdb0 [zfs] [ 32.222436] zfs_create+0x4e6/0x11d0 [zfs] [ 32.222481] zpl_create+0x194/0x340 [zfs] [ 32.222484] lookup_open+0xa86/0x16f0 [ 32.222486] path_openat+0xe56/0x2490 [ 32.222488] do_filp_open+0x17f/0x260 [ 32.222490] do_sys_open+0x195/0x310 [ 32.222491] SyS_open+0xbf/0xf0 [ 32.222494] do_syscall_64+0x191/0x4f0 [ 32.222496] entry_SYSCALL_64_after_hwframe+0x42/0xb7 [ 32.222497] other info that might help us debug this: [ 32.222497] Possible unsafe locking scenario: [ 32.222498] CPU0 CPU1 [ 32.222498] ---- ---- [ 32.222499] lock(&dd->dd_lock); [ 32.222500] lock(&ds->ds_lock); [ 32.222502] lock(&dd->dd_lock); [ 32.222503] lock(&ds->ds_lock); [ 32.222504] * DEADLOCK * [ 32.222505] 3 locks held by dynamic_kernel_/4667: [ 32.222506] #0: (sb_writers#9){.+.+}, at: [<ffffffffaf68933c>] mnt_want_write+0x3c/0xa0 [ 32.222511] #1: (&type->i_mutex_dir_key#8){++++}, at: [<ffffffffaf652cde>] path_openat+0xe2e/0x2490 [ 32.222515] #2: (&dd->dd_lock){+.+.}, at: [<ffffffffc10cd8e9>] dsl_dir_tempreserve_space+0x3b9/0x1290 [zfs] The issue is caused by dsl_dataset_namelen() holding ds_lock, followed by acquiring dd_lock on ds->ds_dir in dsl_dir_namelen(). However, ds->ds_dir should not be protected by ds_lock, so releasing it before call to dsl_dir_namelen() prevents the lockdep issue Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Chris Dunlop <chris@onthe.net.au> Signed-off-by: Michael Zhivich <mzhivich@akamai.com> Closes #8413	2019-03-11 09:11:04 -07:00
Paul Zuchowski	a73e8fdb93	Stack overflow in recursive bpobj_iterate_impl The function bpobj_iterate_impl overflows the stack when bpobjs are deeply nested. Rewrite the function to eliminate the recursion. Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Zuchowski <pzuchowski@datto.com> Closes #7674 Closes #7675 Closes #7908	2019-03-06 09:50:55 -08:00
Brian Behlendorf	96ebc5a1a4	Fix race in vdev_initialize_thread Before allowing new allocations to the metaslab we need to ensure that any issued initializing writes have been synced. Otherwise, it's possible for metaslab_block_alloc() to allocate a range which is about to be overwritten by an initializing IO. Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8461	2019-03-06 09:17:53 -08:00
Olaf Faaland	8133679ff0	Do not resume a pool if multihost is enabled When multihost is enabled, and a pool is suspended, return EINVAL in response to "zpool clear <pool>". The pool may have been imported on another host while I/O was suspended. Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #6933 Closes #8460	2019-02-28 17:56:19 -08:00
Matthew Ahrens	87c25d567f	abd_alloc should use scatter for >1K allocations abd_alloc() normally does scatter allocations, thus solving the problem that ABD originally set out to: the bulk of ZFS's allocations are single pages, which are faster to allocate and free, and don't suffer from internal fragmentation (and the inability to reclaim memory because some buffers in the slab are still allocated). However, the current code does linear allocations for 4KB and smaller allocations, defeating the purpose of ABD. Scatter ABD's use at least one page each, so sub-page allocations waste some space when allocated as scatter (e.g. 2KB scatter allocation wastes half of each page). Using linear ABD's for small allocations means that they will be put on slabs which contain many allocations. This can improve memory efficiency, but it also makes it much harder for ARC evictions to actually free pages, because all the buffers on one slab need to be freed in order for the slab (and underlying pages) to be freed. Typically, 512B and 1KB kmem caches have 16 buffers per slab, so it's possible for them to actually waste more memory than scatter (one page per buf = wasting 3/4 or 7/8th; one buf per slab = wasting 15/16th). Spill blocks are typically 512B and are heavily used on systems running selinux with the default dnode size and the `xattr=sa` property set. By default we will use linear allocations for 512B and 1KB, and scatter allocations for larger (1.5KB and up). Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: DHE <git@dehacked.net> Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #8455	2019-02-28 17:52:55 -08:00
Brian Behlendorf	6af7ba417e	Fix overly broad spa config lock The spa_txg_history_init_io() and spa_txg_history_fini_io() were mistakenly taking SCL_ALL when only SCL_CONFIG is required to access the vdev stats. This could result in a deadlock which was observed when running ztest. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #8445	2019-02-27 10:49:22 -08:00
Serapheim Dimitropoulos	8eef997679	Error path in metaslab_load_impl() forgets to drop ms_sync_lock Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <guss80@gmail.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8444	2019-02-25 11:08:52 -08:00
loli10K	c44a3ec059	zvol: allow rename of in use ZVOL dataset While ZFS allow renaming of in use ZVOLs at the DSL level without issues the ZVOL layer does not correctly update the renamed dataset if the device node is open (zv->zv_open_count > 0): trying to access the stale dataset name, for instance during a zfs receive, will cause the following failure: VERIFY3(zv->zv_objset->os_dsl_dataset->ds_owner == zv) failed ((null) == ffff8800dbb6fc00) PANIC at zvol.c:1255:zvol_resume() Showing stack for process 1390 CPU: 0 PID: 1390 Comm: zfs Tainted: P O 3.16.0-4-amd64 #1 Debian 3.16.51-3 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 0000000000000000 ffffffff8151ea00 ffffffffa0758a80 ffff88028aefba30 ffffffffa0417219 ffff880037179220 ffffffff00000030 ffff88028aefba40 ffff88028aefb9e0 2833594649524556 6f5f767a3e2d767a 6f3e2d7465736a62 Call Trace: [<0>] ? dump_stack+0x5d/0x78 [<0>] ? spl_panic+0xc9/0x110 [spl] [<0>] ? mutex_lock+0xe/0x2a [<0>] ? zfs_refcount_remove_many+0x1ad/0x250 [zfs] [<0>] ? rrw_exit+0xc8/0x2e0 [zfs] [<0>] ? mutex_lock+0xe/0x2a [<0>] ? dmu_objset_from_ds+0x9a/0x250 [zfs] [<0>] ? dmu_objset_hold_flags+0x71/0xc0 [zfs] [<0>] ? zvol_resume+0x178/0x280 [zfs] [<0>] ? zfs_ioc_recv_impl+0x88b/0xf80 [zfs] [<0>] ? zfs_refcount_remove_many+0x1ad/0x250 [zfs] [<0>] ? zfs_ioc_recv+0x1c2/0x2a0 [zfs] [<0>] ? dmu_buf_get_user+0x13/0x20 [zfs] [<0>] ? __alloc_pages_nodemask+0x166/0xb50 [<0>] ? zfsdev_ioctl+0x896/0x9c0 [zfs] [<0>] ? handle_mm_fault+0x464/0x1140 [<0>] ? do_vfs_ioctl+0x2cf/0x4b0 [<0>] ? __do_page_fault+0x177/0x410 [<0>] ? SyS_ioctl+0x81/0xa0 [<0>] ? async_page_fault+0x28/0x30 [<0>] ? system_call_fast_compare_end+0x10/0x15 Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6263 Closes #8371	2019-02-22 15:38:42 -08:00
loli10K	0c637f3100	zpool reports 16E expandsize on disks with oddball number of sectors The issue is caused by a small discrepancy in how userland creates the partition layout and the kernel estimates available space: * zpool command: subtract 9M from the usable device size, then align to 1M boundary. 9M is the sum of 1M "start" partition alignment + 8M EFI "reserved" partition. * kernel module: subtract 10M from the device size. 10M is the sum of 1M "start" partition alignment + 1m "end" partition alignment + 8M EFI "reserved" partition. For devices where the number of sectors is not a multiple of the alignment size the zpool command will create a partition layout which reserves less than 1M after the 8M EFI "reserved" partition: Disk /dev/sda: 1024 MiB, 1073739776 bytes, 2097148 sectors Units: sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disklabel type: gpt Disk identifier: 49811D40-16F4-4E41-84A9-387703950D7F Device Start End Sectors Size Type /dev/sda1 2048 2078719 2076672 1014M Solaris /usr & Apple ZFS /dev/sda9 2078720 2095103 16384 8M Solaris reserved 1 When the kernel module vdev_open() the device its max_asize ends up being slightly smaller than asize: this results in a huge number (16E) reported by metaslab_class_expandable_space(). This change prevents bdev_max_capacity() from returing a size smaller than bdev_capacity(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed by: Sara Hartse <sara.hartse@delphix.com> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #1468 Closes #8391	2019-02-22 15:36:34 -08:00
lidongyang	8d9e51c084	Fix dnode_hold_impl() soft lockup Soft lockups could happen when multiple threads trying to get zrl on the same dnode handle in order to allocate and initialize the dnode marked as DN_SLOT_ALLOCATED. Don't loop from beginning when we can't get zrl, otherwise we would increase the zrl refcount and nobody can actually lock it. Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Li Dongyang <dongyangli@ddn.com> Closes #8433	2019-02-22 09:48:37 -08:00
Tomohiro Kusumi	9abbee4912	Don't enter zvol's rangelock for read bio with size 0 The SCST driver (SCSI target driver implementation) and possibly others may issue read bio's with a length of zero bytes. Although this is unusual, such bio's issued under certain condition can cause kernel oops, due to how rangelock is implemented. rangelock_add_reader() is not made to handle overlap of two (or more) ranges from read bio's with the same offset when one of them has size of 0, even though they conceptually overlap. Allowing them to enter rangelock results in kernel oops by dereferencing invalid pointer, or assertion failure on AVL tree manipulation with debug enabled kernel module. For example, this happens when read bio whose (offset, size) is (0, 0) enters rangelock followed by another read bio with (0, 4096) when (0, 0) rangelock is still locked, when there are no pending write bio's. It can also happen with reverse order, which is (0, N) followed by (0, 0) when (0, N) is still locked. More details mentioned in #8379. Kernel Oops on ->make_request_fn() of ZFS volume https://github.com/zfsonlinux/zfs/issues/8379 Prevent this by returning bio with size 0 as success without entering rangelock. This has been done for write bio after checking flusher bio case (though not for the same reason), but not for read bio. Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com> Closes #8379 Closes #8401	2019-02-20 10:14:36 -08:00
Serapheim Dimitropoulos	928e8ad47d	Introduce auxiliary metaslab histograms This patch introduces 3 new histograms per metaslab. These histograms track segments that have made it to the metaslab's space map histogram (and are part of the spacemap) but have not yet reached the ms_allocatable tree on loaded metaslab's because these metaslab's are currently syncing and haven't gone through metaslab_sync_done() yet. The histograms help when we decide whether to load an unloaded metaslab in-order to allocate from it. When calculating the weight of an unloaded metaslab traditionally, we look at the highest bucket of its spacemap's histogram. The problem is that we are not guaranteed to be able to allocated that segment when we load the metaslab because it may still be at the freeing, freed, or defer trees. The new histograms are used when we try to calculate an unloaded metaslab's weight to deal with this issue by removing segments that have would not be in the allocatable tree at runtime. Note, that this method of dealing with this is not completely accurate as adjacent segments are not always consolidated in the space map histogram of a metaslab. In addition and to make things deterministic, we always reset the weight of unloaded metaslabs based on their space map weight (instead of doing that on a need basis). Thus, every time a metaslab is loaded and its weight is reset again (from the weight based on its space map to the one based on its allocatable range tree) we expect (and assert) that this change in weight can only get better if it doesn't stay the same. Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matt Ahrens <mahrens@delphix.com> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8358	2019-02-20 09:59:56 -08:00
loli10K	bb1be77a35	Prevent user accounting on readonly pool Trying to mount a dataset from a readonly pool could inadvertently start the user accounting upgrade task, leading to the following failure: VERIFY3(tx->tx_threads == 2) failed (0 == 2) PANIC at txg.c:680:txg_wait_synced() Showing stack for process 2541 CPU: 2 PID: 2541 Comm: z_upgrade Tainted: P O 3.16.0-4-amd64 #1 Debian 3.16.51-3 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 Call Trace: [<0>] ? dump_stack+0x5d/0x78 [<0>] ? spl_panic+0xc9/0x110 [spl] [<0>] ? dnode_next_offset+0x1d4/0x2c0 [zfs] [<0>] ? dmu_object_next+0x77/0x130 [zfs] [<0>] ? dnode_rele_and_unlock+0x4d/0x120 [zfs] [<0>] ? txg_wait_synced+0x91/0x220 [zfs] [<0>] ? dmu_objset_id_quota_upgrade_cb+0x10f/0x140 [zfs] [<0>] ? dmu_objset_upgrade_task_cb+0xe3/0x170 [zfs] [<0>] ? taskq_thread+0x2cc/0x5d0 [spl] [<0>] ? wake_up_state+0x10/0x10 [<0>] ? taskq_thread_should_stop.part.3+0x70/0x70 [spl] [<0>] ? kthread+0xbd/0xe0 [<0>] ? kthread_create_on_node+0x180/0x180 [<0>] ? ret_from_fork+0x58/0x90 [<0>] ? kthread_create_on_node+0x180/0x180 This patch updates both functions responsible for checking if we can perform user accounting to verify the pool is not readonly. Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #8424	2019-02-19 18:41:18 -08:00
Sara Hartse	f545b6ae00	Delay injection can cause indefinitely hung zios If we hit the (NSEC_TO_TICK(diff) == 0) condition in zio_delay_interrupt, zio_interrupt is never called and the zio does not progress. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: sara hartse <sara.hartse@delphix.com> Closes #8404	2019-02-15 14:44:56 -08:00
Tim Chase	638dd5f44e	zio_deadman_impl() fix and enhancement Add the zio_deadman_log_all tunable to print all zios in zio_deadman_impl(). Also, in all cases, display the depth of the zio relative to the original parent zio. This is meant to be used by developers to gain diagnostic information for hangs which don't involve fully set-up zio trees or are otherwise stuck or hung in an early stage. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #8362	2019-02-15 12:44:24 -08:00
Paul Zuchowski	9c5e88b1de	zfs should optionally send holds Add -h switch to zfs send command to send dataset holds. If holds are present in the stream, zfs receive will create them on the target dataset, unless the zfs receive -h option is used to skip receive of holds. Reviewed-by: Alek Pinchuk <apinchuk@datto.com> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Paul Zuchowski <pzuchowski@datto.com> Closes #7513	2019-02-15 12:41:38 -08:00
Tomohiro Kusumi	2d76ab9e42	Fix obsolete comment on rangelock `5d43cc9a59` renamed it to rangelock_enter(). Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com> Closes #8408	2019-02-14 15:13:58 -08:00
Alek P	65282ee9e0	Freeing throttle should account for holes Deletion throttle currently does not account for holes in a file. This means that it can activate when it shouldn't. To fix it we switch the throttle to be based on the number of L1 blocks we will have to dirty when freeing Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alek Pinchuk <apinchuk@datto.com> Closes #7725 Closes #7888	2019-02-12 12:01:08 -08:00
Alek P	dcec0a12c8	port async unlinked drain from illumos-nexenta This patch is an async implementation of the existing sync zfs_unlinked_drain() function. This function is called at mount time and is responsible for freeing znodes that we didn't get to freeing before. We don't have to hold mounting of the dataset until the unlinked list is fully drained as is done now. Since we can process the unlinked set asynchronously this results in a better user experience when mounting a dataset with entries in the unlinked set. Reviewed by: Jorgen Lundman <lundman@lundman.net> Reviewed by: Tom Caputi <tcaputi@datto.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Signed-off-by: Alek Pinchuk <apinchuk@datto.com> Closes #8142	2019-02-12 10:41:15 -08:00
Serapheim Dimitropoulos	425d3237ee	Get rid of space_map_update() for ms_synced_length Initially, metaslabs and space maps used to be the same thing in ZFS. Later, we started differentiating them by referring to the space map as the on-disk state of the metaslab, making the metaslab a higher-level concept that is metadata that deals with space accounting. Today we've managed to split that code furthermore, with the space map being its own on-disk data structure used in areas of ZFS besides metaslabs (e.g. the vdev-wide space maps used for zpool checkpoint or vdev removal features). This patch refactors the space map code to further split the space map code from the metaslab code. It does so by getting rid of the idea that the space map can have a different in-core and on-disk length (sm_length vs smp_length) which is something that is only used for the metaslab code, and other consumers of space maps just have to deal with. Instead, this patch introduces changes that move the old in-core length of the metaslab's space map to the metaslab structure itself (see ms_synced_length field) while making the space map code only care about the actual space map's length on-disk. The result of this is that space map consumers no longer have to deal with syncing two different lengths for the same structure (e.g. space_map_update() goes away) while metaslab specific behavior stays within the metaslab code. Specifically, the ms_synced_length field keeps track of the amount of data metaslab_load() can read from the metaslab's space map while working concurrently with metaslab_sync() that may be appending to that same space map. As a side note, the patch also adds a few comments around the metaslab code documenting some assumptions and expected behavior. Reviewed-by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com> Closes #8328	2019-02-12 10:38:11 -08:00

1 2 3 4 5 ...

2168 Commits