freebsd-dev

Author	SHA1	Message	Date
Pavel Zakharov	38a19edd34	OpenZFS 9189 - Add debug to vdev_label_read_config when txg check fails These changes were added to help debug issue #9187. Essentially, in the original bug, vdev_validate() seems to fails in vdev_label_read_config() and prints "failed reading config". This could happen because either: 1. The labels are actually corrupt and zio_wait() fails for all of them 2. The labels were discarded because they didn't pass the txg check. Beyond 9187, having debug info when case 2 happens could be useful in other scenarios, such as zpool import. Authored by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Approved by: Matt Ahrens <mahrens@delphix.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://illumos.org/issues/9189 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f6af1b7 Closes #7533	2018-05-14 14:32:49 -04:00
Pavel Zakharov	db7d07e14b	OpenZFS 9191 - dump vdev tree to zfs_dbgmsg when spa load fails due to missing log devices Add vdev_print_tree() in spa_check_for_missing_logs() when some log devices are missing to ease debugging Authored by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://illumos.org/issues/9191 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c5c02e5 Closes #7531	2018-05-14 14:30:52 -04:00
Pavel Zakharov	a11c7aaec9	OpenZFS 9187 - racing condition between vdev label and spa_last_synced_txg in vdev_validate ztest failed with uncorrectable IO error despite having the fix for 7163. Both sides of the mirror have CANT_OPEN_BAD_LABEL, which also distinguishes it from that issue. Definitely seems like a racing condition between the vdev_validate and spa_sync: 1. Thread A (spa_sync): vdev label is updated to latest txg 2. Thread B (vdev_validate): vdev label's txg is compared to spa_last_synced_txg and is ahead. 3. Thread A (spa_sync): spa_last_synced_txg is updated to latest txg. Solution: do not check txg in vdev_validate unless config lock is held. Authored by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://illumos.org/issues/9187 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/805fda72 Closes #7529	2018-05-14 14:28:09 -04:00
Brian Behlendorf	b669ab83bb	Ignore *.o.ur-safe build artifacts Generated when building on Ubuntu 18.04. Also ignore the new dynamically generated zfs-mount-generator.8 man page, and the module/.cache.mk file. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7534	2018-05-13 18:59:02 -07:00
Olaf Faaland	bc5f51c5de	module param callbacks check for initialized spa Callbacks provided for module parameters are executed both after the module is loaded, when a user alters it via sysfs, e.g echo bar > /sys/modules/zfs/parameters/foo as well as when the module is loaded with an argument, e.g. modprobe zfs foo=bar In the latter case, the init functions likely have not run yet, including spa_init() which initializes the namespace lock so it is safe to use. Instead of immediately taking the namespace lock and attemping to iterate over initialized spa structures, check whether spa_mode_global is nonzero. This is set by spa_init() after it has initialized the namespace lock. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tim Chase <tim@chase2k.com> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #7496 Closes #7521	2018-05-11 12:46:07 -07:00
Tim Chase	d1043e2f6d	Unify behavior of deadman parameters The zfs_deadman_failmode, zfs_deadman_ziotime_ms and zfs_deadman_synctime_ms paramaters are stored per-pool. However, only the zfs_deadman_failmode updates the per-pool state when it's change. This patch gives adds the same behavior to the other two for consistency. Also, in all 3 three cases, only update the per-pool parameters if spa_init() has actually been called in order to avoid panicking when trying to take a lock on the spa_namespace_lock mutex. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #7499	2018-05-08 21:45:47 -07:00
Tim Chase	a0ad7ca54e	Clear vdev_faulted Clear vdev_faulted if ZPOOL_CONFIG_AUX_STATE is not set to "external" ZoL supports "zpool export -f" (force fault), which can be combined with "-t" (temporary fault; don't persist across export/import) and causes a MOS configuration to be set with ZPOOL_CONFIG_FAULTED=1 and without ZFS_CONFIG_AUX_STATE set at all. In this case, the previously-offlined vdev should be imported in an on-line state and. Clearing the "vdev_faulted" flag causes the import to treat the device as on-line. Typically, resilver will catch it up based on its DTL. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #7459	2018-05-08 21:39:50 -07:00
Pavel Zakharov	6cb8e5306d	OpenZFS 9075 - Improve ZFS pool import/load process and corrupted pool recovery Some work has been done lately to improve the debugability of the ZFS pool load (and import) process. This includes: 7638 Refactor spa_load_impl into several functions 8961 SPA load/import should tell us why it failed 7277 zdb should be able to print zfs_dbgmsg's To iterate on top of that, there's a few changes that were made to make the import process more resilient and crash free. One of the first tasks during the pool load process is to parse a config provided from userland that describes what devices the pool is composed of. A vdev tree is generated from that config, and then all the vdevs are opened. The Meta Object Set (MOS) of the pool is accessed, and several metadata objects that are necessary to load the pool are read. The exact configuration of the pool is also stored inside the MOS. Since the configuration provided from userland is external and might not accurately describe the vdev tree of the pool at the txg that is being loaded, it cannot be relied upon to safely operate the pool. For that reason, the configuration in the MOS is read early on. In the past, the two configurations were compared together and if there was a mismatch then the load process was aborted and an error was returned. The latter was a good way to ensure a pool does not get corrupted, however it made the pool load process needlessly fragile in cases where the vdev configuration changed or the userland configuration was outdated. Since the MOS is stored in 3 copies, the configuration provided by userland doesn't have to be perfect in order to read its contents. Hence, a new approach has been adopted: The pool is first opened with the untrusted userland configuration just so that the real configuration can be read from the MOS. The trusted MOS configuration is then used to generate a new vdev tree and the pool is re-opened. When the pool is opened with an untrusted configuration, writes are disabled to avoid accidentally damaging it. During reads, some sanity checks are performed on block pointers to see if each DVA points to a known vdev; when the configuration is untrusted, instead of panicking the system if those checks fail we simply avoid issuing reads to the invalid DVAs. This new two-step pool load process now allows rewinding pools accross vdev tree changes such as device replacement, addition, etc. Loading a pool from an external config file in a clustering environment also becomes much safer now since the pool will import even if the config is outdated and didn't, for instance, register a recent device addition. With this code in place, it became relatively easy to implement a long-sought-after feature: the ability to import a pool with missing top level (i.e. non-redundant) devices. Note that since this almost guarantees some loss of data, this feature is for now restricted to a read-only import. Porting notes (ZTS): * Fix 'make dist' target in zpool_import * The maximum path length allowed by tar is 99 characters. Several of the new test cases exceeded this limit resulting in them not being included in the tarball. Shorten the names slightly. * Set/get tunables using accessor functions. * Get last synced txg via the "zfs_txg_history" mechanism. * Clear zinject handlers in cleanup for import_cache_device_replaced and import_rewind_device_replaced in order that the zpool can be exported if there is an error. * Increase FILESIZE to 8G in zfs-test.sh to allow for a larger ext4 file system to be created on ZFS_DISK2. Also, there's no need to partition ZFS_DISK2 at all. The partitioning had already been disabled for multipath devices. Among other things, the partitioning steals some space from the ext4 file system, makes it difficult to accurately calculate the paramters to parted and can make some of the tests fail. * Increase FS_SIZE and FILE_SIZE in the zpool_import test configuration now that FILESIZE is larger. * Write more data in order that device evacuation take lonnger in a couple tests. * Use mkdir -p to avoid errors when the directory already exists. * Remove use of sudo in import_rewind_config_changed. Authored by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org> Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Tim Chase <tim@chase2k.com> OpenZFS-issue: https://illumos.org/issues/9075 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/619c0123 Closes #7459	2018-05-08 21:35:27 -07:00
Pavel Zakharov	afd2f7b711	OpenZFS 8962 - zdb should work on non-idle pools Currently `zdb` consistently fails to examine non-idle pools as it fails during the `spa_load()` process. The main problem seems to be that `spa_load_verify()` fails as can be seen below: $ sudo zdb -d -G dcenter zdb: can't open 'dcenter': I/O error ZFS_DBGMSG(zdb): spa_open_common: opening dcenter spa_load(dcenter): LOADING disk vdev '/dev/dsk/c4t11d0s0': best uberblock found for spa dcenter. txg 40824950 spa_load(dcenter): using uberblock with txg=40824950 spa_load(dcenter): UNLOADING spa_load(dcenter): RELOADING spa_load(dcenter): LOADING disk vdev '/dev/dsk/c3t10d0s0': best uberblock found for spa dcenter. txg 40824952 spa_load(dcenter): using uberblock with txg=40824952 spa_load(dcenter): FAILED: spa_load_verify failed [error=5] spa_load(dcenter): UNLOADING This change makes `spa_load_verify()` a dryrun when ran from `zdb`. This is done by creating a global flag in zfs and then setting it in `zdb`. Authored by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andy Stormont <astormont@racktopsystems.com> Approved by: Dan McDonald <danmcd@joyent.com> Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Tim Chase <tim@chase2k.com> OpenZFS-issue: https://illumos.org/issues/8962 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/180ad792 Closes #7459	2018-05-08 21:32:57 -07:00
Pavel Zakharov	4a0ee12af8	OpenZFS 8961 - SPA load/import should tell us why it failed Problem ======= When we fail to open or import a storage pool, we typically don't get any additional diagnostic information, just "no pool found" or "can not import". While there may be no additional user-consumable information, we should at least make this situation easier to debug/diagnose for developers and support. For example, we could start by using `zfs_dbgmsg()` to log each thing that we try when importing, and which things failed. E.g. "tried uberblock of txg X from label Y of device Z". Also, we could log each of the stages that we go through in `spa_load_impl()`. Solution ======== Following the cleanup to `spa_load_impl()`, debug messages have been added to every point of failure in that function. Additionally, debug messages have been added to strategic places, such as `vdev_disk_open()`. Authored by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Approved by: Dan McDonald <danmcd@joyent.com> Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Tim Chase <tim@chase2k.com> OpenZFS-issue: https://illumos.org/issues/8961 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/418079e0 Closes #7459	2018-05-08 21:30:10 -07:00
Paul Dagnelie	ca0845d59e	OpenZFS 9256 - zfs send space estimation off by > 10% on some datasets Authored by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Richard Lowe <richlowe@richlowe.net> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> Porting Notes: * Added tuning to man page. * Test case changes dropped, default behavior unchanged. OpenZFS-issue: https://www.illumos.org/issues/9256 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/32356b3c56 Closes #7470	2018-05-08 08:59:24 -07:00
LOLi	4ceb8dd6fd	Fix 'zpool create -t <tempname>' Creating a pool with a temporary name fails when we also specify custom dataset properties: this is because we mistakenly call zfs_set_prop_nvlist() on the "real" pool name which, as expected, cannot be found because the SPA is present in the namespace with the temporary name. Fix this by specifying the correct pool name when setting the dataset properties. Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #7502 Closes #7509	2018-05-07 21:11:58 -07:00
Brian Behlendorf	c02c1becce	ZTS: Re-enable MMP tests Commit `7fab6361` inadvertently disabled the MMP test cases by creating and not removing an /etc/hostid file in the new zpool_split_props test case. When the file exists the ZTS skips the entire MMP test group rather than modify what may be a system which is already configured. Update the test case to remove the file. Additionally, because the MMP tests were disabled a regression slipped in as part of commit `9eb7b46ed0`. Fix it. Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7514	2018-05-07 21:08:33 -07:00
Paul Dagnelie	64c1dcefe3	OpenZFS 9421, 9422 - zdb show possibly leaked objects 9421 zdb should detect and print out the number of "leaked" objects 9422 zfs diff and zdb should explicitly mark objects that are on the deleted queue It is possible for zfs to "leak" objects in such a way that they are not freed, but are also not accessible via the POSIX interface. As the only way to know that this is happened is to see one of them directly in a zdb run, or by noting unaccounted space usage, zdb should be enhanced to count these objects and return failure if some are detected. We have access to the delete queue through the zfs_get_deleteq function; we should call it in dump_znode to determine if the object is on the delete queue. This is not the most efficient possible method, but it is the simplest to implement, and should suffice for the common case where there few objects on the delete queue. Also zfs diff and zdb currently traverse every single dnode in a dataset and tries to figure out the path of the object by following it's parent. When an object is placed on the delete queue, for all practical purposes it's already discarded, it's parent might not exist anymore, and another object might now have the object number that belonged to the parent. While all of the above makes sense, when trying to figure out the path of an object that is on the delete queue, we can run into issues where either it is impossible to determine the path because the parent is gone, or another dnode has taken it's place and thus we are returned a wrong path. We should therefore avoid trying to determine the path of an object on the delete queue and mark the object itself as being on the delete queue to avoid confusion. To achieve this, we currently have two ideas: 1. When putting an object on the delete queue, change it's parent object number to a known constant that means NULL. 2. When displaying objects, first check if it is present on the delete queue. Authored by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Matt Ahrens <mahrens@delphix.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://illumos.org/issues/9421 OpenZFS-issue: https://illumos.org/issues/9422 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/45ae0dd9ca Closes #7500	2018-05-04 10:50:24 -07:00
Matthew Ahrens	5e097c67f1	OpenZFS 9443 - panic when scrub a v10 pool While expanding stored pools, we ran into a panic using an old pool. Steps to reproduce: $ sudo zpool create -o version=2 test c2t1d0 $ sudo cp /etc/passwd /test/foo $ sudo zpool attach test c2t1d0 c2t2d0 We'll get this panic: ffffff000fc0e5e0 unix:real_mode_stop_cpu_stage2_end+b27c () ffffff000fc0e6f0 unix:trap+dc8 () ffffff000fc0e700 unix:cmntrap+e6 () ffffff000fc0e860 zfs:dsl_scan_visitds+1ff () ffffff000fc0ea20 zfs:dsl_scan_visit+fe () ffffff000fc0ea80 zfs:dsl_scan_sync+1b3 () ffffff000fc0eb60 zfs:spa_sync+435 () ffffff000fc0ec20 zfs:txg_sync_thread+23f () ffffff000fc0ec30 unix:thread_start+8 () The problem is a bad trap accessing a NULL pointer. We're looking for the dp_origin_snap of a dsl_pool_t, but version 2 didn't have that. The system will go into a reboot loop at this point, and the dump won't be accessible except by removing the cache file from within the recovery environment. This impacts any sort of scrub or resilver on version <11 pools, e.g.: $ zpool create -o version=10 test c2t1d0 $ zpool scrub test Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Igor Kozhukhov <igor@dilos.org> Approved by: Dan McDonald <danmcd@joyent.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/9443 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/010eed29 Closes #7501	2018-05-04 10:47:10 -07:00
Tom Caputi	be9a5c355c	Add support for decryption faults in zinject This patch adds the ability for zinject to trigger decryption and authentication faults in the ZIO and ARC layers. This functionality is exposed via the new "decrypt" error type, which may be provided for "data" object types. This patch also refactors some of the core encryption / decryption functions so that they have consistent prototypes, handle errors consistently, and do not have unused arguments. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7474	2018-05-02 15:36:20 -07:00
Brian Behlendorf	9464b9591e	RHEL 7.5 compat: FMODE_KABI_ITERATE As of RHEL 7.5 the mainline fops.iterate() method was added to the file_operations structure and is correctly detected by the configure script. Normally this is what we want, but in order to maintain KABI compatibility the RHEL change additionally does the following: * Requires that callers intending to use this extended interface set the FMODE_KABI_ITERATE flag on the file structure when opening the directory. * Adds the fops.iterate() method to the end of the structure, without removing fops.readdir(). This change updates the configure check to ignore the RHEL 7.5+ variant of fops.iterate() when detected. Instead fallback to the fops.readdir() interface which will be available. Finally, add the 'zpl_' prefix to the directory context wrappers to avoid colliding with the kernel provided symbols when both the fops.iterate() and fops.readdir() are provided by the kernel. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7460 Closes #7463	2018-05-02 15:01:24 -07:00
Brian Behlendorf	bc8a6a60e9	Fix inst_num overflow in qat_crypt.c This patch fixes the same issue which was previously addressed in 6051. The variable "inst_num" was of the incorrect type and "atomic_inc_32_nv()" could cause an overflow damaging its neighbor. Cast the return value of atomic_inc_32_nv() to Cpa32U. Fix a few types for num_inst for clarity. Reviewed-by: Weigang Li <weigang.li@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7468	2018-05-01 20:44:24 -07:00
Tom Caputi	2c24b5b148	Fix issues found with zfs diff Two deadlocks / ASSERT failures were introduced in `a2c2ed1b` which would occur whenever arc_buf_fill() failed to decrypt a block of data. This occurred because the call to arc_buf_destroy() which was responsible for cleaning up the newly created buffer would attempt to take out the hdr lock that it was already holding. This was resolved by calling the underlying functions directly without retaking the lock. In addition, the dmu_diff() code did not properly ensure that keys were loaded and mapped before begining dataset traversal. It turns out that this code does not need to look at any encrypted values, so the code was altered to perform raw IO only. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7354 Closes #7456	2018-05-01 11:24:20 -07:00
Tomohiro Kusumi	d6133fc500	Silence compile-time warning on unused variable ASSERT3U() could be NOP which then leads to having unused pointer spa. metaslab.c: In function 'metaslab_condense': metaslab.c:2075:9: warning: unused variable 'spa' [-Wunused-variable] spa_t spa = msp->ms_group->mg_vd->vdev_spa; Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com> Closes #7489	2018-05-01 11:15:54 -07:00
loli10K	85ce3f4fd1	Adopt pyzfs from ClusterHQ This commit introduces several changes: * Update LICENSE and project information * Give a good PEP8 talk to existing Python source code * Add RPM/DEB packaging for pyzfs * Fix some outstanding issues with the existing pyzfs code caused by changes in the ABI since the last time the code was updated * Integrate pyzfs Python unittest with the ZFS Test Suite * Add missing libzfs_core functions: lzc_change_key, lzc_channel_program, lzc_channel_program_nosync, lzc_load_key, lzc_receive_one, lzc_receive_resumable, lzc_receive_with_cmdprops, lzc_receive_with_header, lzc_reopen, lzc_send_resume, lzc_sync, lzc_unload_key, lzc_remap Note: this commit slightly changes zfs_ioc_unload_key() ABI. This allow to differentiate the case where we tried to unload a key on a non-existing dataset (ENOENT) from the situation where a dataset has no key loaded: this is consistent with the "change" case where trying to zfs_ioc_change_key() from a dataset with no key results in EACCES. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #7230	2018-05-01 10:33:35 -07:00
Alexander Motin	20507534d4	OpenZFS 9434 - Speculative prefetch is blocked by device removal code Device removal code does not set spa_indirect_vdevs_loaded for pools that never experienced device removal. At least one visual consequence of it is completely blocked speculative prefetcher. This patch sets the variable in such situations. Authored by: Alexander Motin <mav@FreeBSD.org> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tim Chase <tim@chase2k.com> Approved by: Matt Ahrens <mahrens@delphix.com> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/9434 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/16127b627b Closes #7480	2018-04-30 13:05:55 -07:00
Matthew Ahrens	964c2d69a9	OpenZFS 9236 - nuke spa_dbgmsg We should use zfs_dbgmsg instead of spa_dbgmsg. Or at least, metaslab_condense() should call zfs_dbgmsg because it's important and rare enough to always log. It's possible that the message in zio_dva_allocate() would be too high-frequency for zfs_dbgmsg. Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Richard Lowe <richlowe@richlowe.net> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> Patch Notes: * Removed ZFS_DEBUG_SPA from zfs-module-parameters.5 OpenZFS-issue: https://www.illumos.org/issues/9236 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/cfaba7f668 Closes #7467	2018-04-30 10:19:48 -07:00
Mark Wright	089500e792	Fix CONFIG_GCC_PLUGIN_RANDSTRUCT build Fix build errors with gcc 7.3.0 on Gentoo with kernel 4.16.3 built with CONFIG_GCC_PLUGIN_RANDSTRUCT=y such as: module/zfs/vdev_indirect.c:296:2: error: positional initialization of field in ‘struct’ declared with ‘designated_init’ attribute [-Werror=designated-init] vdev_indirect_map_free, ^~~~~~~~~~~~~~~~~~~~~~ Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Mark Wright <gienah@gentoo.org> Closes #7464	2018-04-20 09:53:25 -07:00
Chunwei Chen	599b864813	Fix ENOSPC in "Handle zap_add() failures in ..." Commit `cc63068` caused ENOSPC error when copy a large amount of files between two directories. The reason is that the patch limits zap leaf expansion to 2 retries, and return ENOSPC when failed. The intent for limiting retries is to prevent pointlessly growing table to max size when adding a block full of entries with same name in different case in mixed mode. However, it turns out we cannot use any limit on the retry. When we copy files from one directory in readdir order, we are copying in hash order, one leaf block at a time. Which means that if the leaf block in source directory has expanded 6 times, and you copy those entries in that block, by the time you need to expand the leaf in destination directory, you need to expand it 6 times in one go. So any limit on the retry will result in error where it shouldn't. Note that while we do use different salt for different directories, it seems that the salt/hash function doesn't provide enough randomization to the hash distance to prevent this from happening. Since `cc63068` has already been reverted. This patch adds it back and removes the retry limit. Also, as it turn out, failing on zap_add() has a serious side effect for mzap_upgrade(). When upgrading from micro zap to fat zap, it will call zap_add() to transfer entries one at a time. If it hit any error halfway through, the remaining entries will be lost, causing those files to become orphan. This patch add a VERIFY to catch it. Reviewed-by: Sanjeev Bagewadi <sanjeev.bagewadi@gmail.com> Reviewed-by: Richard Yao <ryao@gentoo.org> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Albert Lee <trisk@forkgnu.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #7401 Closes #7421	2018-04-18 14:19:50 -07:00
Tom Caputi	b0ee5946aa	Fix issues with raw sends of spill blocks This patch fixes 2 issues in how spill blocks are processed during raw sends. The first problem is that compressed spill blocks were using the logical length rather than the physical length to determine how much data to dump into the send stream. The second issue is a typo that caused the spill record's object number to be used where the objset's ID number was required. Both issues have been corrected, and the payload_size is now printed in zstreamdump for future debugging. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7378 Closes #7432	2018-04-17 11:19:03 -07:00
Tom Caputi	e14a32b1c8	Fix object reclaim when using large dnodes Currently, when the receive_object() code wants to reclaim an object, it always assumes that the dnode is the legacy 512 bytes, even when the incoming bonus buffer exceeds this length. This causes a buffer overflow if --enable-debug is not provided and triggers an ASSERT if it is. This patch resolves this issue and adds an ASSERT to ensure this can't happen again. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7097 Closes #7433	2018-04-17 11:13:57 -07:00
Matthew Ahrens	0c03d21ac9	assertion in arc_release() during encrypted receive In the existing code, when doing a raw (encrypted) zfs receive, we call arc_convert_to_raw() from open context. This creates a race condition between arc_release()/arc_change_state() and writing out the block from syncing context (arc_write_ready/done()). This change makes it so that when we are doing a raw (encrypted) zfs receive, we save the crypt parameters (salt, iv, mac) of dnode blocks in the dbuf_dirty_record_t, and call arc_convert_to_raw() from syncing context when writing out the block of dnodes. Additionally, we can eliminate dr_raw and associated setters, and instead know that dnode blocks are always raw when doing a zfs receive (see the new field os_raw_receive). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #7424 Closes #7429	2018-04-17 11:06:54 -07:00
Matthew Ahrens	7f96cc23ac	OpenZFS 9192 - explicitly pass good_writes to vdev_uberblock/label_sync Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Richard Lowe <richlowe@richlowe.net> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> Currently vdev_label_sync and vdev_uberblock_sync take a zio_t and assume that its io_private is a pointer to the good_writes count. They should instead accept this argument explicitly. OpenZFS-issue: https://www.illumos.org/issues/9192 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3f4c0b602d Closes #7446	2018-04-17 10:45:47 -07:00
Matt Ahrens	d830d4795a	OpenZFS 9280 - Assertion failure while running removal_with_ganging test with 4K devices Authored by: Matt Ahrens <Matt.Ahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Approved by: Garrett D'Amore <garrett@damore.org> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/9280 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/243952c Closes #7445	2018-04-17 10:44:50 -07:00
megari	d68ac65eb6	Revert "OpenZFS 9036 - zfs: duplicate 'const' declaration specifier" This reverts commit `cbb8933215`. The original change in OpenZFS 9036 did remove duplicate 'const' specifiers, but the ZoL port had already done what should have been done in OpenZFS 9036, which is to make the pointers themselves const. The port of the change to ZoL ended up doing an unnecessary removal of the constness of the pointers. Undo that. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tim Chase <tim@chase2k.com> Signed-off-by: Ari Sundholm <ari@tuxera.com> Closes #7444	2018-04-16 12:44:40 -07:00
Pavel Zakharov	9eb7b46ed0	OpenZFS 7638 - Refactor spa_load_impl into several functions Authored by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tim Chase <tim@chase2k.com> Approved by: Dan McDonald <danmcd@joyent.com> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/7638 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1fd3785ff6 Closes #7437	2018-04-16 12:24:23 -07:00
Tim Chase	5284f43a1e	Avoid Linux hung task message in ZTHR Use an interruptible to avoid Linux hung task message in ZTHR and to prevent inflating the load average. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #7440 Closes #7441	2018-04-15 15:12:28 -07:00
Toomas Soome	5e567da987	OpenZFS 9213 - zfs: sytem typo Authored by: Toomas Soome <tsoome@me.com> Reviewed by: C Fraire <cfraire@me.com> Reviewed by: Andy Fiddaman <omnios@citrus-it.co.uk> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Approved by: Joshua M. Clulow <josh@sysmgr.org> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Porting Notes: * The additional instances of this typo addressed in the OpenZFS patch were already resolved. OpenZFS-issue: https://illumos.org/issues/9213 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/edc8ef7d92 Closes #7436	2018-04-15 10:59:13 -07:00
Toomas Soome	cbb8933215	OpenZFS 9036 - zfs: duplicate 'const' declaration specifier Authored by: Toomas Soome <tsoome@me.com> Reviewed by: Yuri Pankov <yuripv@yuripv.net> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://illumos.org/issues/9036 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f02c28e434 Closes #6900	2018-04-14 12:40:52 -07:00
Prakash Surya	eecdd8e884	OpenZFS 9084 - spa_*_ashift must ignore spare devices It's possible for the following assertion to be tripped when running ztest: assertion failed for thread 0xf09fca40, thread-id 549: spa->spa_max_ashift == spa->spa_min_ashift (0xc == 0x9), file ../../../uts/common/fs/zfs/vdev_removal.c, line 965 > $c libc.so.1`_lwp_kill+7(ebdde6c0, ebdde6c0, a9, fee7865e) libc.so.1`_assfail+0x214(ebddea28, fed7ac3c, 3c5, fef62000) libc.so.1`assfail3+0xde(fed7b130, c, 0, fed812cb, 9, 0) libzpool.so.1`spa_vdev_copy_impl+0x26b(89a4b40, ebddef74, ebddef68, 8992dc0, ebe10a00, fef073c0) libzpool.so.1`spa_vdev_remove_thread+0x6cd(87450c0, 0, 0, fee8f43a) libc.so.1`_thrp_setup+0x8c(f09fca40) libc.so.1`_lwp_start(f09fca40, 0, 0, 0, 0, 0) > ::spa -v ADDR STATE NAME 08723000 ACTIVE ztest ADDR STATE AUX DESCRIPTION 087466c0 HEALTHY - root 087450c0 HEALTHY - /rpool/tmp/ztest.0a 08745640 HEALTHY - indirect 08745bc0 HEALTHY - /rpool/tmp/ztest.2a 08746140 HEALTHY - /rpool/tmp/ztest.3a - - - spares 08744b40 HEALTHY - /rpool/tmp/ztest.spares.0 Authored by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Ported-by: Tim Chase <tim@chase2k.com> OpenZFS-issue: https://www.illumos.org/issues/9084 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/18acba7 Closes #6900	2018-04-14 12:40:52 -07:00
Serapheim Dimitropoulos	4bf8108ede	OpenZFS 9080 - recursive enter of vdev_indirect_rwlock from vdev_indirect_remap() Authored by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://illumos.org/issues/9080 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/bdfded42e6 Closes #6900	2018-04-14 12:40:47 -07:00
Serapheim Dimitropoulos	9d5b524597	OpenZFS 9079 - race condition in starting and ending condensing thread for indirect vdevs The timeline of the race condition is the following: [1] Thread A is about to finish condesing the first vdev in spa_condense_indirect_thread(), so it calls the spa_condense_indirect_complete_sync() sync task which sets the spa_condensing_indirect field to NULL. Waiting for the sync task to finish, thread A sleeps until the txg is done. When this happens, thread A will acquire spa_async_lock and set spa_condense_thread to NULL. [2] While thread A waits for the txg to finish, thread B which is running spa_sync() checks whether it should condense the second vdev in vdev_indirect_should_condense() by checking the spa_condensing_indirect field which was set to NULL by spa_condense_indirect_thread() from thread A. So it goes on and tries to spawn a new condensing thread in spa_condense_indirect_start_sync() and the aforementioned assertions fails because thread A has not set spa_condense_thread to NULL (which is basically the last thing it does before returning). The main issue here is that we rely on both spa_condensing_indirect and spa_condense_thread to signify whether a condensing thread is running. Ideally we would only use one throughout the codebase. In addition, for managing spa_condense_thread we currently use spa_async_lock which basically tights condensing to scrubing when it comes to pausing and resuming those actions during spa export. This commit introduces the ZTHR infrastructure, which is basically threads created during spa_load()/spa_create() and exist until we export or destroy the pool. ZTHRs sleep the majority of the time, until they are notified to wake up and do some predefined type of work. In the context of the current bug, a zthr to does the condensing of indirect mappings replacing the older code that used bare kthreads. When a pool is created, the condensing zthr is spawned but sleeps right away, until it is awaken by a signal from spa_sync(). If an existing pool is loaded, the condensing zthr looks if there is anything to condense before going to sleep, in case we were condensing mappings in the pool before it got exported. The benefits of this solution are the following: - The current bug is fixed - spa_condensing_indirect is the sole indicator of whether we are currently condensing or not - condensing is more decoupled from the spa_async_thread related functionality. As a final note, this commit also sets up the path on upstreaming other features that use the ZTHR code like zpool checkpoint and fast clone deletion. Authored by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org> Ported-by: Tim Chase <tim@chase2k.com> OpenZFS-issue: https://illumos.org/issues/9079 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/3dc606ee Closes #6900	2018-04-14 12:23:53 -07:00
Brian Behlendorf	4589f3ae4c	Optimize possible split block search space Remove duplicate segment copies to minimize the possible search space for reconstruction. Once reduced an accurate assessment can be made regarding the difficulty in reconstructing the block. Also, ztest will now run zdb with zfs_reconstruct_indirect_combinations_max set to 1000000 in an attempt to avoid checksum errors. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6900	2018-04-14 12:22:43 -07:00
Matthew Ahrens	9e052db462	OpenZFS 9290 - device removal reduces redundancy of mirrors Mirrors are supposed to provide redundancy in the face of whole-disk failure and silent damage (e.g. some data on disk is not right, but ZFS hasn't detected the whole device as being broken). However, the current device removal implementation bypasses some of the mirror's redundancy. Note that in no case is incorrect data returned, but we might get a checksum error when we should have been able to find the right data. There are two underlying problems: 1. When we remove a mirror device, we only read one side of the mirror. Since we can't verify the checksum, this side may be silently bad, but the good data is on the other side of the mirror (which we didn't read). This can cause the removal to "bake in" the busted data – all copies of the data in the new location are the same, busted version, while we left the good version behind. The fix for this is to read and copy both sides of the mirror. If the old and new vdevs are mirrors, we will read both sides of the old mirror, and write each copy to the corresponding side of the new mirror. (If the old and new vdevs have a different number of children, we will do this as best as possible.) Even though we aren't verifying checksums, this ensures that as long as there's a good copy of the data, we'll have a good copy after the removal, even if there's silent damage to one side of the mirror. If we're removing a mirror that has some silent damage, we'll have exactly the same damage in the new location (assuming that the new location is also a mirror). 2. When we read from an indirect vdev that points to a mirror vdev, we only consider one copy of the data. This can lead to reduced effective redundancy, because we might read a bad copy of the data from one side of the mirror, and not retry the other, good side of the mirror. Note that the problem is not with the removal process, but rather after the removal has completed (having copied correct data to both sides of the mirror), if one side of the new mirror is silently damaged, we encounter the problem when reading the relocated data via the indirect vdev. Also note that the problem doesn't occur when ZFS knows that one side of the mirror is bad, e.g. when a disk entirely fails or is offlined. The impact is that reads (from indirect vdevs that point to mirrors) may return a checksum error even though the good data exists on one side of the mirror, and scrub doesn't repair all data on the mirror (if some of it is pointed to via an indirect vdev). The fix for this is complicated by "split blocks" - one logical block may be split into two (or more) pieces with each piece moved to a different new location. In this case we need to read all versions of each split (one from each side of the mirror), and figure out which combination of versions results in the correct checksum, and then repair the incorrect versions. This ensures that we supply the same redundancy whether you use device removal or not. For example, if a mirror has small silent errors on all of its children, we can still reconstruct the correct data, as long as those errors are at sufficiently-separated offsets (specifically, separated by the largest block size - default of 128KB, but up to 16MB). Porting notes: * A new indirect vdev check was moved from dsl_scan_needs_resilver_cb() to dsl_scan_needs_resilver(), which was added to ZoL as part of the sequential scrub work. * Passed NULL for zfs_ereport_post_checksum()'s zbookmark_phys_t parameter. The extra parameter is unique to ZoL. * When posting indirect checksum errors the ABD can be passed directly, zfs_ereport_post_checksum() is not yet ABD-aware in OpenZFS. Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Tim Chase <tim@chase2k.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Ported-by: Tim Chase <tim@chase2k.com> OpenZFS-issue: https://illumos.org/issues/9290 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/591 Closes #6900	2018-04-14 12:21:39 -07:00
Matthew Ahrens	a1d477c24c	OpenZFS 7614, 9064 - zfs device evacuation/removal OpenZFS 7614 - zfs device evacuation/removal OpenZFS 9064 - remove_mirror should wait for device removal to complete This project allows top-level vdevs to be removed from the storage pool with "zpool remove", reducing the total amount of storage in the pool. This operation copies all allocated regions of the device to be removed onto other devices, recording the mapping from old to new location. After the removal is complete, read and free operations to the removed (now "indirect") vdev must be remapped and performed at the new location on disk. The indirect mapping table is kept in memory whenever the pool is loaded, so there is minimal performance overhead when doing operations on the indirect vdev. The size of the in-memory mapping table will be reduced when its entries become "obsolete" because they are no longer used by any block pointers in the pool. An entry becomes obsolete when all the blocks that use it are freed. An entry can also become obsolete when all the snapshots that reference it are deleted, and the block pointers that reference it have been "remapped" in all filesystems/zvols (and clones). Whenever an indirect block is written, all the block pointers in it will be "remapped" to their new (concrete) locations if possible. This process can be accelerated by using the "zfs remap" command to proactively rewrite all indirect blocks that reference indirect (removed) vdevs. Note that when a device is removed, we do not verify the checksum of the data that is copied. This makes the process much faster, but if it were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be possible to copy the wrong data, when we have the correct data on e.g. the other side of the mirror. At the moment, only mirrors and simple top-level vdevs can be removed and no removal is allowed if any of the top-level vdevs are raidz. Porting Notes: * Avoid zero-sized kmem_alloc() in vdev_compact_children(). The device evacuation code adds a dependency that vdev_compact_children() be able to properly empty the vdev_child array by setting it to NULL and zeroing vdev_children. Under Linux, kmem_alloc() and related functions return a sentinel pointer rather than NULL for zero-sized allocations. * Remove comment regarding "mpt" driver where zfs_remove_max_segment is initialized to SPA_MAXBLOCKSIZE. Change zfs_condense_indirect_commit_entry_delay_ticks to zfs_condense_indirect_commit_entry_delay_ms for consistency with most other tunables in which delays are specified in ms. * ZTS changes: Use set_tunable rather than mdb Use zpool sync as appropriate Use sync_pool instead of sync Kill jobs during test_removal_with_operation to allow unmount/export Don't add non-disk names such as "mirror" or "raidz" to $DISKS Use $TEST_BASE_DIR instead of /tmp Increase HZ from 100 to 1000 which is more common on Linux removal_multiple_indirection.ksh Reduce iterations in order to not time out on the code coverage builders. removal_resume_export: Functionally, the test case is correct but there exists a race where the kernel thread hasn't been fully started yet and is not visible. Wait for up to 1 second for the removal thread to be started before giving up on it. Also, increase the amount of data copied in order that the removal not finish before the export has a chance to fail. * MMP compatibility, the concept of concrete versus non-concrete devices has slightly changed the semantics of vdev_writeable(). Update mmp_random_leaf_impl() accordingly. * Updated dbuf_remap() to handle the org.zfsonlinux:large_dnode pool feature which is not supported by OpenZFS. * Added support for new vdev removal tracepoints. * Test cases removal_with_zdb and removal_condense_export have been intentionally disabled. When run manually they pass as intended, but when running in the automated test environment they produce unreliable results on the latest Fedora release. They may work better once the upstream pool import refectoring is merged into ZoL at which point they will be re-enabled. Authored by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Alex Reece <alex@delphix.com> Reviewed-by: George Wilson <george.wilson@delphix.com> Reviewed-by: John Kennedy <john.kennedy@delphix.com> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Richard Laager <rlaager@wiktel.com> Reviewed by: Tim Chase <tim@chase2k.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Garrett D'Amore <garrett@damore.org> Ported-by: Tim Chase <tim@chase2k.com> Signed-off-by: Tim Chase <tim@chase2k.com> OpenZFS-issue: https://www.illumos.org/issues/7614 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/f539f1eb Closes #6900	2018-04-14 12:16:17 -07:00
Seth Forshee	93b43af10d	Allow mounting datasets more than once Currently mounting an already mounted zfs dataset results in an error, whereas it is typically allowed with other filesystems. This causes some bad interactions with mount namespaces. Take this sequence for example: - Create a dataset - Create a snapshot of the dataset - Create a clone of the snapshot - Create a new mount namespace - Rename the original dataset The rename results in unmounting and remounting the clone in the original mount namespace, however the remount fails because the dataset is still mounted in the new mount namespace. (Note that this means the mount in the new mount namespace is never being unmounted, so perhaps the unmount/remount of the clone isn't actually necessary.) The problem here is a result of the way mounting is implemented in the kernel module. Since it is not mounting block devices it uses mount_nodev() instead of the usual mount_bdev(). However, mount_nodev() is written for filesystems for which each mount is a new instance (i.e. a new super block), and zfs should be able to detect when a mount request can be satisfied using an existing super block. Change zpl_mount() to call sget() directly with it's own test callback. Passing the objset_t object as the fs data allows checking if a superblock already exists for the dataset, and in that case we just need to return a new reference for the sb's root dentry. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Alek Pinchuk <apinchuk@datto.com> Signed-off-by: Seth Forshee <seth.forshee@canonical.com> Closes #5796 Closes #7207	2018-04-13 10:44:05 -07:00
beren12	7403d0743e	Fix zfs_arc_max minimum tuning When setting `zfs_arc_max` its minimum value is allowed to be 64 MiB. There was an off-by-1 error which can matter on tiny systems. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Zubrzycki <github@mid-earth.net> Closes #7417	2018-04-12 10:47:32 -07:00
Tom Caputi	edc1e713c2	Fix race in dnode_check_slots_free() Currently, dnode_check_slots_free() works by checking dn->dn_type in the dnode to determine if the dnode is reclaimable. However, there is a small window of time between dnode_free_sync() in the first call to dsl_dataset_sync() and when the useraccounting code is run when the type is set DMU_OT_NONE, but the dnode is not yet evictable, leading to crashes. This patch adds the ability for dnodes to track which txg they were last dirtied in and adds a check for this before performing the reclaim. This patch also corrects several instances when dn_dirty_link was treated as a list_node_t when it is technically a multilist_node_t. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7147 Closes #7388	2018-04-10 11:15:05 -07:00
Giuseppe Di Natale	10f88c5cd5	Linux compat 4.16: blk_queue_flag_{set,clear} queue_flag_{set,clear}_unlocked are now private interfaces in the Linux kernel (https://github.com/torvalds/linux/commit/8a0ac14). Use blk_queue_flag_{set,clear} interfaces which were introduced as of https://github.com/torvalds/linux/commit/8814ce8. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov> Closes #7410	2018-04-10 10:32:14 -07:00
Tony Hutter	4f301661df	Revert "Handle zap_add() failures in mixed ... " This reverts commit `cc63068e95`. Under certain circumstances this change can result in an ENOSPC error when adding new files to a directory. See #7401 for full details. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Issue #7401 Cloes #7416	2018-04-09 14:24:46 -07:00
Brian Behlendorf	3b0d99289a	Fix 'zfs send/recv' hang with 16M blocks When using 16MB blocks the send/recv queue's aren't quite big enough. This change leaves the default 16M queue size which a good value for most pools. But it additionally ensures that the queue sizes are at least twice the allowed zfs_max_recordsize. Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7365 Closes #7404	2018-04-08 19:41:15 -07:00
Matthew Ahrens	5c27ec1088	Fixes for SNPRINTF_BLKPTR with encrypted BP's mdb doesn't have dmu_ot[], so we need a different mechanism for its SNPRINTF_BLKPTR() to determine if the BP is encrypted vs authenticated. Additionally, since it already relies on BP_IS_ENCRYPTED (etc), SNPRINTF_BLKPTR might as well figure out the "crypt_type" on its own, rather than making the caller do so. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #7390	2018-04-06 13:30:26 -07:00
Olaf Faaland	0ba106e75c	Fix divide-by-zero in mmp_delay_update() vdev_count_leaves() in the denominator may return 0, caught by Coverity. Introduced by * `533ea04` Update mmp_delay on sync or skipped, failed write Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #7391	2018-04-06 13:29:11 -07:00
Olaf Faaland	533ea0415b	Update mmp_delay on sync or skipped, failed write When an MMP write is skipped, or fails, and time since mts->mmp_last_write is already greater than mts->mmp_delay, increase mts->mmp_delay. The original code only updated mts->mmp_delay when a write succeeded, but this results in the write(s) after delays and failed write(s) reporting an ub_mmp_delay which is too low. Update mmp_last_write and mmp_delay if a txg sync was successful. At least one uberblock was written, thus extending the time we can be sure the pool will not be imported by another host. Do not allow mmp_delay to go below (MSEC2NSEC(zfs_multihost_interval) / vdev_count_leaves()) so that a period of frequent successful MMP writes, e.g. due to frequent txg syncs, does not result in an import activity check so short it is not reliable based on mmp thread writes alone. Remove unnecessary local variable, start. We do not use the start time of the loop iteration. Add a debug message in spa_activity_check() to allow verification of the import_delay value and to prove the activity check occurred. Alter the tests that import pools and attempt to detect an activity check. Calculate the expected duration of spa_activity_check() based on module parameters at the time the import is performed, rather than a fixed time set in mmp.cfg. The fixed time may be wrong. Also, use the default zfs_multihost_interval value so the activity check is longer and easier to recognize. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #7330	2018-04-04 16:38:44 -07:00
Tony Hutter	21a4f5cc86	Fedora 28: Fix misc bounds check compiler warnings Fix a bunch of (mostly) sprintf/snprintf truncation compiler warnings that show up on Fedora 28 (GCC 8.0.1). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #7361 Closes #7368	2018-04-04 10:16:47 -07:00
LOLi	1724eb62de	Fix spa reference leak in zfs_ioc_pool_scan zfs_ioc_pool_scan leaks a spa reference when zc->zc_flags is not a valid pool_scrub_cmd_t: this could happen if the userland binaries and ZFS kernel module differ in version and would prevent the pool from being exported. Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #7380	2018-04-03 17:31:30 -07:00
Tim Chase	10adee27ce	Remove ASSERT() in l2arc_apply_transforms() The ASSERT was erroneously copied from the next section of code. The buffer's size should be expanded from "psize" to "asize" if necessary. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #7375	2018-03-31 15:14:21 -07:00
Tom Caputi	a2c2ed1bd4	Decryption error handling improvements Currently, the decryption and block authentication code in the ZIO / ARC layers is a bit inconsistent with regards to the ereports that are produces and the error codes that are passed to calling functions. This patch ensures that all of these errors (which begin as ECKSUM) are converted to EIO before they leave the ZIO or ARC layer and that ereports are correctly generated on each decryption / authentication failure. In addition, this patch fixes a bug in zio_decrypt() where ECKSUM never gets written to zio->io_error. Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7372	2018-03-31 11:12:51 -07:00
Tom Caputi	4515b1d01c	Encrypted dnode blocks should be prefetched raw Encrypted dnode blocks are always initially read as raw data and converted to decrypted data when an encrypted bonus buffer is needed. This allows the DMU to be used for things like fetching the DMU master node without requiring keys to be loaded. However, dbuf_issue_final_prefetch() does not currently read the data as raw. The end result of this is that prefetched dnode blocks are read twice from disk: once decrypted and then again as raw data. This patch corrects the issue by adding the flag when appropriate. Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7362	2018-03-31 11:11:48 -07:00
LOLi	77d8a0f1a4	Fix hung z_zvol tasks during 'zfs receive' During a receive operation zvol_create_minors_impl() can wait needlessly for the prefetch thread because both share the same tasks queue. This results in hung tasks: <3>INFO: task z_zvol:5541 blocked for more than 120 seconds. <3> Tainted: P O 3.16.0-4-amd64 <3>"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. The first z_zvol:5541 (zvol_task_cb) is waiting for the long running traverse_prefetch_thread:260 root@linux:~# cat /proc/spl/taskq taskq act nthr spwn maxt pri mina spl_system_taskq/0 1 2 0 64 100 1 active: [260]traverse_prefetch_thread [zfs](0xffff88003347ae40) wait: 5541 spl_delay_taskq/0 0 1 0 4 100 1 delay: spa_deadman [zfs](0xffff880039924000) z_zvol/1 1 1 0 1 120 1 active: [5541]zvol_task_cb [zfs](0xffff88001fde6400) pend: zvol_task_cb [zfs](0xffff88001fde6800) This change adds a dedicated, per-pool, prefetch taskq to prevent the traverse code from monopolizing the global (and limited) system_taskq by inappropriately scheduling long running tasks on it. Reviewed-by: Albert Lee <trisk@forkgnu.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #6330 Closes #6890 Closes #7343	2018-03-30 12:10:01 -07:00
Andriy Gapon	5e00213e43	OpenZFS 9164 - assert: newds == os->os_dsl_dataset Authored by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Don Brady <don.brady@delphix.com> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Richard Lowe <richlowe@richlowe.net> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> Porting Notes: * Re-enabled and tweaked the zpool_upgrade_007_pos test case to successfully run in under 5 minutes. OpenZFS-issue: https://www.illumos.org/issues/9164 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0e776dc06a Closes #6112 Closes #7336	2018-03-30 12:00:40 -07:00
Tom Caputi	32dce2da0c	Resolve QAT issues with incompressible data Currently, when ZFS wants to accelerate compression with QAT, it passes a destination buffer of the same size as the source buffer. Unfortunately, if the data is incompressible, QAT can actually "compress" the data to be larger than the source buffer. When this happens, the QAT driver will return a FAILED error code and print warnings to dmesg. This patch fixes these issues by providing the QAT driver with an additional buffer to work with so that even completely incompressible source data will not cause an overflow. This patch also resolves an error handling issue where incompressible data attempts compression twice: once by QAT and once in software. To fix this issue, a new (and fake) error code CPA_STATUS_INOMPRESSIBLE has been added so that the calling code can correctly account for the difference between a hardware failure and data that simply cannot be compressed. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Weigang Li <weigang.li@intel.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7338	2018-03-29 17:40:34 -07:00
Tom Caputi	13a2ff2727	Fix ASSERT in dsl_scan_fini() and cleanup comments This patch fixes an issue where dsl_scan_prefetch_cb() might add more prefetch I/Os to the prefetch queue after prefetching has been completed. This was happening because that code was checking scn->scn_suspending instead of scn->scn_prefetch_stop. This occasionally triggered an ASSERT during ztest runs in dsl_scan_fini() when the code attempted to destroy an AVL tree that still had entires in it. This patch also includes a number of spelling corrections and comment cleanups throughout dsl_scan.c Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7353	2018-03-28 18:30:44 -07:00
Brian Behlendorf	b2ab468dde	Fix mmap / libaio deadlock Calling uiomove() in mappedread() under the page lock can result in a deadlock if the user space page needs to be faulted in. Resolve the issue by dropping the page lock before the uiomove(). The inode range lock protects against concurrent updates via zfs_read() and zfs_write(). Reviewed-by: Albert Lee <trisk@forkgnu.org> Reviewed-by: Chunwei Chen <david.chen@nutanix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7335 Closes #7339	2018-03-28 10:19:22 -07:00
Allan Jude	5152a74088	OpenZFS 9321 - arc_loan_compressed_buf() can increment arc_loaned_bytes by the wrong value Authored by: Allan Jude <allanjude@freebsd.org> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Garrett D'Amore <garrett@damore.org> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/9321 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/92b05f3a18 Closes #7333	2018-03-26 20:40:15 -07:00
Tom Caputi	157ef7f6a5	Don't count embedded bps in read stats Currently, ZFS tracks statistics about calls to arc_read() via the /proc/spl/kstat/zfs/<pool>/reads file for debugging. Unfortunately, this file currently counts embedded bps as disk reads since they are technically processed by the ZIO layer. This pollutes the log since the ARC will never cache embedded bps. This patch corrects this issue by preventing the logging of embedded bp reads. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7334	2018-03-23 21:35:19 -07:00
Matthew Ahrens	2fd92c3d6c	enable zfs_dbgmsg() by default, without dprintf() zfs_dbgmsg() should record a message by default. As a general principal, these messages shouldn't be too verbose. Furthermore, the amount of memory used is limited to 4MB (by default). dprintf() should only record a message if this is a debug build, and ZFS_DEBUG_DPRINTF is set in zfs_flags. This flag is not set by default (even on debug builds). These messages are extremely verbose, and sometimes nontrivial to compute. SET_ERROR() should only record a message if ZFS_DEBUG_SET_ERROR is set in zfs_flags. This flag is not set by default (even on debug builds). This brings our behavior in line with illumos. Note that the message format is unchanged (including file, line, and function, even though these are not recorded on illumos). Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #7314	2018-03-21 15:37:32 -07:00
Tom Caputi	8d9e7c8fbe	Fix spelling errors in comments This patch simply corrects some spelling / grammar errors in the QAT and encryption code comments. No functional changes Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7319	2018-03-21 08:42:13 -07:00
Brian Behlendorf	a76f3d0437	Fix deadlock in IO pipeline In vdev_queue_aggregate() the zio_execute() bypass should not be called under the vdev queue lock. This can result in a deadlock as shown in the stack traces below. Drop the vdev queue lock then walk the parents of the aggregate IO to determine the list of component IOs to be bypassed. This can be done safely without holding the io_lock since the new aggregate IO has not yet been returned and its parents cannot change. --- THREAD 1 --- arc_read() zio_nowait() zio_vdev_io_start() vdev_queue_io() <--- mutex_enter(vq->vq_lock) vdev_queue_io_to_issue() vdev_queue_aggregate() zio_execute() zio_vdev_io_assess() zio_wait_for_children() <- mutex_enter(zio->io_lock) --- THREAD 2 --- (inverse order) arc_read() zio_change_priority() <- mutex_enter(zio->zio_lock) vdev_queue_change_io_priority() <- mutex_enter(vq->vq_lock) Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7307	2018-03-16 16:46:06 -07:00
Tom Caputi	17dd88352e	Allow QAT to handle non page-aligned buffers This patch adds some handling to the QAT acceleration functions that allows them to handle buffers that are not aligned with the page cache. At the moment this never happens since callers only happen to work with page-aligned buffers, but this code should prevent headaches if this isn't always true in the future. This patch also adds some cleanups to align the QAT compression code with the encryption and checksumming code. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Weigang Li <weigang.li@intel.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7305	2018-03-16 16:02:49 -07:00
Olaf Faaland	cec3a0a1bb	Report pool suspended due to MMP When the pool is suspended, record whether it was due to an I/O error or due to MMP writes failing to succeed within the required time. Change spa_suspended from uint8_t to zio_suspend_reason_t to store the reason. When userspace queries pool status via spa_tryimport(), report the reason the pool was suspended in a new key, ZPOOL_CONFIG_SUSPENDED_REASON. In libzfs, when interpreting the returned config nvlist, report suspension due to MMP with a new pool status enum value, ZPOOL_STATUS_IO_FAILURE_MMP. In status_callback(), which generates and emits the message when 'zpool status' is executed, add a case to print an appropriate message for the new pool status enum value. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #7296	2018-03-15 10:56:55 -07:00
Tom Caputi	3874220932	SHA256 QAT acceleration This patch enables acceleration of SHA256 checksums using Intel Quick Assist Technology. This patch also fixes up and refactors some of the code from QAT encryption to make the behavior consistent. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chengfeix Zhu <chengfeix.zhu@intel.com> Signed-off-by: Weigang Li <weigang.li@intel.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7295	2018-03-15 10:53:58 -07:00
Paul Zuchowski	1a2342784a	receive_spill does not byte swap spill contents In zfs receive, the function receive_spill should account for spill block endian conversion as a defensive measure. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Paul Zuchowski <pzuchowski@datto.com> Closes #7300	2018-03-15 10:29:51 -07:00
Brian Behlendorf	de4f8d5d26	OpenZFS 9188 - increase size of dbuf cache to reduce indirect block decompression With compressed ARC (bug 6950) we use up to 25% of our CPU to decompress indirect blocks, under a workload of random cached reads. To reduce this decompression cost, we would like to increase the size of the dbuf cache so that more indirect blocks can be stored uncompressed. If we are caching entire large files of recordsize=8K, the indirect blocks use 1/64th as much memory as the data blocks (assuming they have the same compression ratio). We suggest making the dbuf cache be 1/32nd of all memory, so that in this scenario we should be able to keep all the indirect blocks decompressed in the dbuf cache. (We want it to be more than the 1/64th that the indirect blocks would use because we need to cache other stuff in the dbuf cache as well.) In real world workloads, this won't help as dramatically as the example above, but we think it's still worth it because the risk of decreasing performance is low. The potential negative performance impact is that we will be slightly reducing the size of the ARC (by ~3%). Porting Notes: * Added modules options to zfs-module-parameters.5 man page. * Preserved scaling based on target ARC size rather than max ARC size. Authored by: George Wilson <george.wilson@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/9188 OpenZFS-commit: https://github.com/openzfs/openzfs/pull/564 Upstream bug: DLPX-46942 Closes #7273	2018-03-13 10:52:48 -07:00
Brian Behlendorf	a6cc97566c	Add kernel module auto-loading Historically a dynamic misc minor number was registered for the /dev/zfs device in order to prevent minor number collisions. This was fine but it prevented us from being able to use the kernel module auto-loaded which requires a known reserved value. Resolve this issue by adding a configure test to find an available misc minor number which can then be used in MODULE_ALIAS_MISCDEV at build time. By adding this alias the zfs kmod is added to the list of known static-nodes and the systemd-tmpfiles-setup-dev service will create a /dev/zfs character device at boot time. This in turn allows us to update the 90-zfs.rules file to make it aware this is a static node. The upshot of this is that whenever a process (zpool, zfs, zed) opens the /dev/zfs the kmods will be automatic loaded. This even works for unprivileged users so there is no longer a need to manually load the modules at boot time. As an additional bonus the zed now no longer needs to start after the zfs-import.service since it will trigger the module load. In the unlikely event the minor number we selected conflicts with another out of tree unregistered minor number the code falls back to dynamically allocating it. In this case the modules again must be manually loaded. Note that due to the change in the method of registering the minor number the zimport.sh test case may incorrectly fail when the static node for the installed packages is created instead of the dynamic one. This issue will only transiently impact zimport.sh for this single commit when we transition and are mixing and matching methods. Reviewed-by: Fabian Grünbichler <f.gruenbichler@proxmox.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> TEST_ZIMPORT_SKIP="yes" Closes #7287	2018-03-13 10:45:55 -07:00
Tim Chase	02638a30ef	Add zfs_scan_ignore_errors tunable When it's set, a DTL range will be cleared even if its scan/scrub had errors. This allows to work around resilver/scrub upon import when the pool has errors. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tim Chase <tim@chase2k.com> Closes #7293	2018-03-13 10:43:14 -07:00
Chunwei Chen	9470cbd4f9	Fix race in trace point in zrl_add_impl We hit an illegal memory access in the zrlock trace point. The problem is that zrl->zr_owner and zrl->zr_caller are assigned locklessly. And if zrl->zr_owner got assigned a longer string between when __string() calculate the strlen, and when __assign_str() does strcpy. The copy will overflow the buffer. == For example: Initial condition: zrl->zr_owner = A zrl->zr_caller = "abc" Thread A Thread B ------------------------------------------------- if (zrl->zr_owner == A) { DTRACE_PROBE2() { __string() { strlen(zrl->zr_caller) -> 3 allocate buf[4] } zrl->zr_owner = B zrl->zr_caller = "abcd" __assign_str() { strcpy(buf, zrl->zr_caller) <- buffer overflow == Dereferencing zrl->zr_owner->pid may also be problematic, in that the zrl->zr_owner got changed to other task, and that task exits, freeing the task_struct. This should be very unlikely, as the other task need to zrl_remove and exit between the dereferencing zr->zr_owner and zr->zr_owner->pid. Nevertheless, we'll deal with it as well. To fix the zrl->zr_caller issue, instead of copy the string content, we just copy the pointer, this is safe because it always points to __func__, which is static. As for the zrl->zr_owner issue, we pass in curthread instead of using zrl->zr_owner. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #7291	2018-03-12 11:27:02 -07:00
Brian Behlendorf	b7eec00f9f	Fix MMP write frequency for large pools When a single pool contains more vdevs than the CONFIG_HZ for for the kernel the mmp thread will not delay properly. Switch to using cv_timedwait_sig_hires() to handle higher resolution delays. This issue was reported on Arch Linux where HZ defaults to only 100 and this could be fairly easily reproduced with a reasonably large pool. Most distribution kernels set CONFIG_HZ=250 or CONFIG_HZ=1000 and thus are unlikely to be impacted. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7205 Closes #7289	2018-03-12 11:26:05 -07:00
Olaf Faaland	743253df70	Hold SCL_VDEV when counting leaves A config lock should be held while vdev_count_leaves() walks the tree, otherwise the pointers reference may become invalid during the walk. SCL_VDEV is a minimal lock provided for such uses cases. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #7286	2018-03-09 15:42:54 -08:00
Olaf Faaland	ebed90a598	Handle zio_resume and mmp => off When multihost is disabled on a pool, and the pool is resumed via zpool clear, within a single cycle of the mmp thread's loop (e.g. while it's in the cv_timedwait call), both mmp_last_write and mmp_delay should be updated. The original code mistakenly treated the two cases as if they could not occur at the same time. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #7286	2018-03-09 15:42:11 -08:00
Tom Caputi	cf63739191	QAT support for AES-GCM This patch adds support for acceleration of AES-GCM encryption with Intel Quick Assist Technology. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chengfeix Zhu <chengfeix.zhu@intel.com> Signed-off-by: Weigang Li <weigang.li@intel.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7282	2018-03-09 13:37:15 -08:00
Wolfgang Bumiller	0e85048f53	Take user namespaces into account in policy checks Change file related checks to use user namespaces and make sure involved uids/gids are mappable in the current namespace. Note that checks without file ownership information will still not take user namespaces into account, as some of these should be handled via 'zfs allow' (otherwise root in a user namespace could issue commands such as `zpool export`). This also adds an initial user namespace regression test for the setgid bit loss, with a user_ns_exec helper usable in further tests. Additionally, configure checks for the required user namespace related features are added for: * ns_capable * kuid/kgid_has_mapping() * user_ns in cred_t Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com> Closes #6800 Closes #7270	2018-03-07 15:40:42 -08:00
Olaf Faaland	d2160d0538	Record skipped MMP writes in multihost_history Once per pass through the MMP thread's loop, the vdev tree is walked to find a suitable leaf to write the next MMP block to. If no such leaf is found, the thread sleeps for a while and resumes at the top of the loop. Add an entry to multihost_history when no leaf can be found, and record the reason in the error column. The error code for such entries is a bitfield, displayed in hex: 0x1 At least one vdev (interior or leaf) was not writeable. 0x2 At least one writeable leaf vdev was found, but it had a pending MMP write. timestamp = the time in seconds since the epoch when no leaf could be found originally. duration = the time (in ns) during which no MMP block was written for this reason. This does not include the preceeding inter-write period nor the following inter-write period. vdev_guid = the number of sequential cycles of the MMP thread looop when this occurred. Sample output, truncated to fit: For records of skipped MMP writes the right-most column, vdev_path, is reported as "-". id txg timestamp error duration mmp_delay vdev_guid ... 936 11 1520036441 0 146264 891422313 1740883117838 ... 937 11 1520036441 0 163956 888356657 7320395061548 ... 938 11 1520036442 0 130690 885314969 7320395061548 ... 939 11 1520036442 0 2001068577 882296582 1740883117838 ... 940 11 1520036443 0 161806 882296582 7320395061548 ... 941 11 1520036443 0x2 0 998020546 1 ... 942 11 1520036444 0 136585 998020546 7320395061548 ... 943 11 1520036444 0x2 0 998020257 1 ... 944 11 1520036445 5 2002662964 994160219 1740883117838 ... 945 11 1520036445 0x2 998073118 994160219 3 ... 946 11 1520036447 0 247136 994160219 7320395061548 ... Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #7212	2018-03-06 15:15:15 -08:00
Olaf Faaland	14c240cede	Detect long config lock acquisition in mmp If something holds the config lock as a writer for too long, MMP will fail to issue MMP writes in a timely manner. This will result either in the pool being suspended, or in an extreme case, in the pool not being protected. If the time to acquire the config lock exceeds 1/10 of the minimum zfs_multihost_interval, report it in the zfs debug log. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #7212	2018-03-06 15:14:39 -08:00
Nasf-Fan	2705ebf0a7	Misc fixes and cleanup for project quota 1) The Coverity Scan reports some issues for the project quota patch, including: 1.1) zfs_prop_get_userquota() directly uses the const quota type value as the condition check by wrong. 1.2) dmu_objset_userquota_get_ids() may cause dnode::dn_newgid to be overwritten by dnode::dn->dn_oldprojid. 2) This patch fixes related issues. It also enhances the logic for zfs_project_item_alloc() to avoid buffer overflow. 3) Skip project quota ability check if does not change project quota related things (id or flag). Otherwise, it will cause chattr (for other non project quota flags) operation failed if project quota disabled. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Fan Yong <fan.yong@intel.com> Closes #7251 Closes #7265	2018-03-05 12:56:27 -08:00
Giuseppe Di Natale	dd3e1e3083	Linux 4.16 compat: get_disk_and_module() As of https://github.com/torvalds/linux/commit/fb6d47a, get_disk() is now get_disk_and_module(). Add a configure check to determine if we need to use get_disk_and_module(). Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov> Closes #7264	2018-03-05 12:44:35 -08:00
Tony Hutter	80d52c3919	Change checksum & IO delay ratelimit values Change checksum & IO delay ratelimit thresholds from 5/sec to 20/sec. This allows zed to actually trigger if a bunch of these events arrive in a short period of time (zed has a threshold of 10 events in 10 sec). Previously, if you had, say, 100 checksum errors in 1 sec, it would get ratelimited to 5/sec which wouldn't trigger zed to fault the drive. Also, convert the checksum and IO delay thresholds to module params for easy testing. Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #7252	2018-03-04 17:34:51 -08:00
chrisrd	5666a994f2	Increment zil_itx_needcopy_bytes properly In zil_lwb_commit() with TX_WRITE, we copy the log write record (lrw) into the log write block (lwb) and send it off using zil_lwb_add_txg(). If we also have WR_NEED_COPY, we additionally copy the lwr's data into the lwb to be sent off. If the lwr + data doesn't fit into the lwb, we send the lrw and as much data as will fit (dnow bytes), then go back and do the same with the remaining data. Each time through this loop we're sending dnow data bytes. I.e. zil_itx_needcopy_bytes should be incremented by dnow. Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Closes #6988 Closes #7176	2018-03-02 10:01:53 -08:00
Tom Caputi	095495e008	Raw DRR_OBJECT records must write raw data `b1d21733` made it possible for empty metadnode blocks to be compressed to a hole, fixing a bug that would cause invalid metadnode MACs when a send stream attempted to free objects and allowing the blocks to be reclaimed when they were no longer needed. However, this patch also introduced a race condition; if a txg sync occurred after a DRR_OBJECT_RANGE record was received but before any objects were added, the metadnode block would be compressed to a hole and lose all of its encryption parameters. This would cause subsequent DRR_OBJECT records to fail when they attempted to write their data into an unencrypted block. This patch defers the DRR_OBJECT_RANGE handling to receive_object() so that the encryption parameters are set with each object that is written into that block. Reviewed-by: Kash Pande <kash@tripleback.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7215 Closes #7236	2018-02-27 09:04:05 -08:00
Brian Behlendorf	3673d03285	Fix more cstyle warnings This patch contains no functional changes. It is solely intended to resolve cstyle warnings in order to facilitate moving the spl source code in to the zfs repository. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #687	2018-02-24 10:05:37 -08:00
chrisrd	e9a7729008	Fix free memory calculation on v3.14+ Provide infrastructure to auto-configure to enum and API changes in the global page stats used for our free memory calculations. arc_free_memory has been broken since an API change in Linux v3.14: 2016-07-28 v4.8 599d0c95 mm, vmscan: move LRU lists to node 2016-07-28 v4.8 75ef7184 mm, vmstat: add infrastructure for per-node vmstats These commits moved some of global_page_state() into global_node_page_state(). The API change was particularly egregious as, instead of breaking the old code, it silently did the wrong thing and we continued using global_page_state() where we should have been using global_node_page_state(), thus indexing into the wrong array via NR_SLAB_RECLAIMABLE et al. There have been further API changes along the way: 2017-07-06 v4.13 385386cf mm: vmstat: move slab statistics from zone to node counters 2017-09-06 v4.14 c41f012a mm: rename global_page_state to global_zone_page_state ...and various (incomplete, as it turns out) attempts to accomodate these changes in ZoL: 2017-08-24 `2209e409` Linux 4.8+ compatibility fix for vm stats 2017-09-16 `787acae0` Linux 3.14 compat: IO acct, global_page_state, etc 2017-09-19 661907e6 Linux 4.14 compat: IO acct, global_page_state, etc The config infrastructure provided here resolves these issues going back to the original API change in v3.14 and is robust against further Linux changes in this area. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Chris Dunlop <chris@onthe.net.au> Closes #7170	2018-02-23 08:50:06 -08:00
Olaf Faaland	7088545d01	Report duration and error in mmp_history entries After an MMP write completes, update the relevant mmp_history entry with the time between submission and completion, and the error status of the write. [faaland1@toss3a zfs]$ cat /proc/spl/kstat/zfs/pool/multihost 39 0 0x01 100 8800 69147946270893 72723903122926 id txg timestamp error duration mmp_delay vdev_guid 10607 1166 1518985089 0 138301 637785455 4882... 10608 1166 1518985089 0 136154 635407747 1151... 10609 1166 1518985089 0 803618560 633048078 9740... 10610 1166 1518985090 0 144826 633048078 4882... 10611 1166 1518985090 0 164527 666187671 1151... Where duration = gethrtime_in_done_fn - gethrtime_at_submission, and error = zio->io_error. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #7190	2018-02-22 15:34:34 -08:00
Olaf Faaland	0d398b2564	Do not initiate MMP writes while pool is suspended While the pool is suspended on host A, it may be imported on host B. If host A continued to write MMP blocks, it would be blindly overwriting MMP blocks written by host B, and the blocks written by host A would have outdated txg information. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Olaf Faaland <faaland1@llnl.gov> Closes #7182	2018-02-22 09:14:46 -08:00
Tom Caputi	f8478fc2ca	Fix bounds check in zio_crypt_do_objset_hmacs The current bounds check in zio_crypt_do_objset_hmacs() does not properly handle the possible sizes of the objset_phys_t and can therefore read outside the buffer's memory. If that memory happened to match what the check was actually looking for, the objset would fail to be owned, complaining that the MAC was invalid. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7210	2018-02-22 08:50:14 -08:00
Tomohiro Kusumi	378c6ed549	Fix function name typos vn_init() and vn_fini() had been renamed by `12ff95ff` in 2011. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com> Closes #686	2018-02-21 15:12:10 -08:00
Tomohiro Kusumi	68386b0503	Staticize kstat_default_update() This is only used via ->ks_update of `kstat_t *`. This isn't exported nor do headers have its prototype. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com> Closes #686	2018-02-21 15:11:38 -08:00
Toomas Soome	09302a4ca8	OpenZFS 9035 - zfs: this statement may fall through Authored by: Toomas Soome <tsoome@me.com> Reviewed by: Yuri Pankov <yuripv@yuripv.net> Reviewed by: Andy Fiddaman <omnios@citrus-it.co.uk> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/9035 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/46ac8fdfc5 Closes #7206	2018-02-21 14:55:34 -08:00
Tom Caputi	b0918402dc	Raw receive should change key atomically Currently, raw zfs sends transfer the encrypted master keys and objset_phys_t encryption parameters in the DRR_BEGIN payload of each send file. Both of these are processed as soon as they are read in dmu_recv_stream(), meaning that the new keys are set before the new snapshot is received. In addition to the fact that this changes the user's keys for the dataset earlier than they might expect, the keys were never reset to what they originally were in the event that the receive failed. This patch splits the processing into objset handling and key handling, the later of which is moved to dmu_recv_end() so that they key change can be done atomically. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7200	2018-02-21 12:31:03 -08:00
Tom Caputi	4a385862b7	Prevent raw zfs recv -F if dataset is unencrypted The current design of ZFS encryption only allows a dataset to have one DSL Crypto Key at a time. As a result, it is important that the zfs receive code ensures that only one key can be in use at a time for a given DSL Directory. zfs receive -F complicates this, since the new dataset is received as a clone of the existing one so that an atomic switch can be done at the end. To prevent confusion about which dataset is actually encrypted a check was added to ensure that encrypted datasets cannot use zfs recv -F to completely replace existing datasets. Unfortunately, the check did not take into account unencrypted datasets being overriden by encrypted ones as a case. Along the same lines, the code also failed to ensure that raw recieves could not be done on top of existing unencrypted datasets, which causes amny problems since the new stream cannot be decrypted. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7199	2018-02-21 12:30:11 -08:00
Tom Caputi	b1d217338a	Raw receives must compress metadnode blocks Currently, the DMU relies on ZIO layer compression to free LO dnode blocks that no longer have objects in them. However, raw receives disable all compression, meaning that these blocks can never be freed. In addition to the obvious space concerns, this could also cause incremental raw receives to fail to mount since the MAC of a hole is different from that of a completely zeroed block. This patch corrects this issue by adding a special case in zio_write_compress() which will attempt to compress these blocks to a hole even if ZIO_FLAG_RAW_ENCRYPT is set. This patch also removes the zfs_mdcomp_disable tunable, since tuning it could cause these same issues. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7198	2018-02-21 12:28:52 -08:00
Tom Caputi	5121c4fb0c	Remove unnecessary txg syncs from receive_object() `1b66810b` introduced serveral changes which improved the reliability of zfs sends when large dnodes were involved. However, these fixes required adding a few calls to txg_wait_synced() in the DRR_OBJECT handling code. Although most of them are currently necessary, this patch allows the code to continue without waiting in some cases where it doesn't have to. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7197	2018-02-21 12:26:51 -08:00
Tom Caputi	478b3150de	Add omitted set for os->os_next_write_raw This one line patch adds adds a set to os->os_next_write_raw that was omitted when the code was updated in `1b66810`. Without it, the code (in some instances) could attempt to write raw encrypted data as regular unencrypted data without the keys being loaded, triggering an ASSERT in zio_encrypt(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7196	2018-02-21 12:24:37 -08:00
Tom Caputi	163a8c28dd	ZIL claiming should not start user accounting Currently, ZIL claiming dirties objsets which causes dsl_pool_sync() to attempt to perform user accounting on them. This causes problems for encrypted datasets that were raw received before the system went offline since they cannot perform user accounting until they have their keys loaded. This triggers an ASSERT in zio_encrypt(). Since encryption was added, the code now depends on the fact that data should only be written when objsets are owned. This patch adds a check in dmu_objset_do_userquota_updates() to ensure that useraccounting is only done when the objsets are actually owned for write. As part of this work, the zfsvfs and zvol code was updated so that it no longer lies about owning objsets readonly. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #6916 Closes #7163	2018-02-20 16:27:31 -08:00
Don Brady	cbce581353	Fix coverity defects: zfs channel programs CID 173243, 173245: Memory - corruptions (OVERRUN) Added size argument to lcompat_sprintf() to avoid use of INT_MAX CID 173244: Integer handling issues (OVERFLOW_BEFORE_WIDEN) Added cast to uint64_t to avoid a 32 bit overflow warning CID 173242: Integer handling issues (CONSTANT_EXPRESSION_RESULT) Conditionally removed unused luai_numisnan() floating point check CID 173241: Resource leaks (RESOURCE_LEAK) Added missing close(fd) on error path CID 173240: (UNINIT) Fixed uninitialized variable in get_special_prop() CID 147560: Null pointer dereferences (NULL_RETURNS) Cleaned up bad code merge in dsl_dataset_promote_check() CID 28475: Memory - illegal accesses (OVERRUN) Fixed lcompat_sprintf() to use a size paramater CID 28418, 28422: Error handling issues (CHECKED_RETURN) Added function result cast to (void) to avoid warning CID 23935, 28411, 28412: Memory - corruptions (ARRAY_VS_SINGLETON) Added casts to avoid exposing result as an array Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Don Brady <don.brady@delphix.com> Closes #7181	2018-02-20 11:19:42 -08:00
Tom Caputi	7b30ee6baf	Project dnode should be protected by local MAC This patch corrects a small security issue with `9c5167d1`. When the project dnode was added to the objset_phys_t, it was not included in the local MAC for cryptographic protection, allowing an attacker to modify this data without the consent of the key holder. This patch does represent an on-disk format change for anyone using project dnodes on an encrypted dataset. Signed-off-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7177	2018-02-20 09:41:07 -08:00
Don Brady	62d5c55313	Address objtool check failures in lua module The use of void __attribute__((noreturn)) in kernel builds was causing lots of warnings if CONFIG_STACK_VALIDATION is active. For now we just remove this attribute to achieve clean builds for the Lua module. There was no significant increase in the time to run the full channel_program ZTS tests. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Don Brady <don.brady@delphix.com> Closes #7173	2018-02-15 09:53:04 -08:00
George Wilson	ddc751d56b	OpenZFS 8857 - zio_remove_child() panic due to already destroyed parent zio PROBLEM ======= It's possible for a parent zio to complete even though it has children which have not completed. This can result in the following panic: > $C ffffff01809128c0 vpanic() ffffff01809128e0 mutex_panic+0x58(fffffffffb94c904, ffffff597dde7f80) ffffff0180912950 mutex_vector_enter+0x347(ffffff597dde7f80) ffffff01809129b0 zio_remove_child+0x50(ffffff597dde7c58, ffffff32bd901ac0, ffffff3373370908) ffffff0180912a40 zio_done+0x390(ffffff32bd901ac0) ffffff0180912a70 zio_execute+0x78(ffffff32bd901ac0) ffffff0180912b30 taskq_thread+0x2d0(ffffff33bae44140) ffffff0180912b40 thread_start+8() > ::status debugging crash dump vmcore.2 (64-bit) from batfs0390 operating system: 5.11 joyent_20170911T171900Z (i86pc) image uuid: (not set) panic message: mutex_enter: bad mutex, lp=ffffff597dde7f80 owner=ffffff3c59b39480 thread=ffffff0180912c40 dump content: kernel pages only The problem is that dbuf_prefetch along with l2arc can create a zio tree which confuses the parent zio and allows it to complete with while children still exist. Here's the scenario: zio tree: pio \|--- lio The parent zio, pio, has entered the zio_done stage and begins to check its children to see there are still some that have not completed. In zio_done(), the children are checked in the following order: zio_wait_for_children(zio, ZIO_CHILD_VDEV, ZIO_WAIT_DONE) zio_wait_for_children(zio, ZIO_CHILD_GANG, ZIO_WAIT_DONE) zio_wait_for_children(zio, ZIO_CHILD_DDT, ZIO_WAIT_DONE) zio_wait_for_children(zio, ZIO_CHILD_LOGICAL, ZIO_WAIT_DONE) If pio, finds any child which has not completed then it stops executing and goes to sleep. Each call to zio_wait_for_children() will grab the io_lock while checking the particular child. In this scenario, the pio has completed the first call to zio_wait_for_children() to check for any ZIO_CHILD_VDEV children. Since the only zio in the zio tree right now is the logical zio, lio, then it completes that call and prepares to check the next child type. In the meantime, the lio completes and in its callback creates a child vdev zio, cio. The zio tree looks like this: zio tree: pio \|--- lio \|--- cio The lio then grabs the parent's io_lock and removes itself. zio tree: pio \|--- cio The pio continues to run but has already completed its check for ZIO_CHILD_VDEV and will erroneously complete. When the child zio, cio, completes it will panic the system trying to reference the parent zio which has been destroyed. SOLUTION ======== The fix is to rework the zio_wait_for_children() logic to accept a bitfield for all the children types that it's interested in checking. The io_lock will is held the entire time we check all the children types. Since the function now accepts a bitfield, a simple ZIO_CHILD_BIT() macro is provided to allow for the conversion between a ZIO_CHILD type and the bitfield used by the zio_wiat_for_children logic. Authored by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Youzhong Yang <youzhong@gmail.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Dan McDonald <danmcd@omniti.com> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/8857 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/862ff6d99c Issue #5918 Closes #7168	2018-02-14 15:30:09 -08:00
Nasf-Fan	9c5167d19f	Project Quota on ZFS Project quota is a new ZFS system space/object usage accounting and enforcement mechanism. Similar as user/group quota, project quota is another dimension of system quota. It bases on the new object attribute - project ID. Project ID is a numerical value to indicate to which project an object belongs. An object only can belong to one project though you (the object owner or privileged user) can change the object project ID via 'chattr -p' or 'zfs project [-s] -p' explicitly. The object also can inherit the project ID from its parent when created if the parent has the project inherit flag (that can be set via 'chattr +P' or 'zfs project -s [-p]'). By accounting the spaces/objects belong to the same project, we can know how many spaces/objects used by the project. And if we set the upper limit then we can control the spaces/objects that are consumed by such project. It is useful when multiple groups and users cooperate for the same project, or a user/group needs to participate in multiple projects. Support the following commands and functionalities: zfs set projectquota@project zfs set projectobjquota@project zfs get projectquota@project zfs get projectobjquota@project zfs get projectused@project zfs get projectobjused@project zfs projectspace zfs allow projectquota zfs allow projectobjquota zfs allow projectused zfs allow projectobjused zfs unallow projectquota zfs unallow projectobjquota zfs unallow projectused zfs unallow projectobjused chattr +/-P chattr -p project_id lsattr -p This patch also supports tree quota based on the project quota via "zfs project" commands set as following: zfs project [-d\|-r] <file\|directory ...> zfs project -C [-k] [-r] <file\|directory ...> zfs project -c [-0] [-d\|-r] [-p id] <file\|directory ...> zfs project [-p id] [-r] [-s] <file\|directory ...> For "df [-i] $DIR" command, if we set INHERIT (project ID) flag on the $DIR, then the proejct [obj]quota and [obj]used values for the $DIR's project ID will be shown as the total/free (avail) resource. Keep the same behavior as EXT4/XFS does. Reviewed-by: Andreas Dilger <andreas.dilger@intel.com> Reviewed-by Ned Bass <bass6@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Fan Yong <fan.yong@intel.com> TEST_ZIMPORT_POOLS="zol-0.6.1 zol-0.6.2 master" Change-Id: Ib4f0544602e03fb61fd46a849d7ba51a6005693c Closes #6290	2018-02-13 14:54:54 -08:00
sanjeevbagewadi	918dbe35b5	mmp should use a fixed tag for spa_config locks mmp_write_uberblock() and mmp_write_done() should the same tag for spa_config_locks. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Sanjeev Bagewadi <sanjeev.bagewadi@gmail.com> Closes #6530 Closes #7155	2018-02-12 11:30:38 -08:00
Andriy Gapon	1334283225	OpenZFS 8520 - lzc_rollback 8520 lzc_rollback_to should support rolling back to origin 7198 libzfs should gracefully handle EINVAL from lzc_rollback lzc_rollback_to() should support rolling back to a clone's origin. The current checks in zfs_ioc_rollback() would not allow that because the origin snapshot belongs to a different filesystem. The overly restrictive check was in introduced in 7600, but it was not a regression as none of the existing tools provided a way to rollback to the origin. Authored by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/8520 OpenZFS-issue: https://www.illumos.org/issues/7198 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/78a5a1a25a Closes #7150	2018-02-09 10:27:58 -08:00
sanjeevbagewadi	cc63068e95	Handle zap_add() failures in mixed case mode With "casesensitivity=mixed", zap_add() could fail when the number of files/directories with the same name (varying in case) exceed the capacity of the leaf node of a Fatzap. This results in a ASSERT() failure as zfs_link_create() does not expect zap_add() to fail. The fix is to handle these failures and rollback the transactions. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed-by: Chunwei Chen <david.chen@nutanix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Sanjeev Bagewadi <sanjeev.bagewadi@gmail.com> Closes #7011 Closes #7054	2018-02-09 10:15:53 -08:00
Chunwei Chen	f108a49236	Fix zle_decompress out of bound access Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #7099	2018-02-09 10:08:05 -08:00
Chunwei Chen	d3190c5f29	Fix zdb -c traverse stop on damaged objset root If a corruption happens to be on a root block of an objset, zdb -c will not correctly report the error, and it will not traverse the datasets that come after. This is because traverse_visitbp, which does the callback and reset error for TRAVERSE_HARD, is skipped when traversing zil is failed in traverse_impl. Here's example of what 'zdb -eLcc' command looks like on a pool with damaged objset root: == before patch: Traversing all blocks to verify checksums ... Error counts: errno count block traversal size 379392 != alloc 33987072 (unreachable 33607680) bp count: 172 ganged count: 0 bp logical: 1678336 avg: 9757 bp physical: 130560 avg: 759 compression: 12.85 bp allocated: 379392 avg: 2205 compression: 4.42 bp deduped: 0 ref>1: 0 deduplication: 1.00 SPA allocated: 33987072 used: 0.80% additional, non-pointer bps of type 0: 71 Dittoed blocks on same vdev: 101 == after patch: Traversing all blocks to verify checksums ... zdb_blkptr_cb: Got error 52 reading <54, 0, -1, 0> -- skipping Error counts: errno count 52 1 block traversal size 33963520 != alloc 33987072 (unreachable 23552) bp count: 447 ganged count: 0 bp logical: 36093440 avg: 80745 bp physical: 33699840 avg: 75391 compression: 1.07 bp allocated: 33963520 avg: 75981 compression: 1.06 bp deduped: 0 ref>1: 0 deduplication: 1.00 SPA allocated: 33987072 used: 0.80% additional, non-pointer bps of type 0: 76 Dittoed blocks on same vdev: 115 == Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: loli10K <ezomori.nozomu@gmail.com> Signed-off-by: Chunwei Chen <david.chen@nutanix.com> Closes #7099	2018-02-09 10:05:25 -08:00
Brian Behlendorf	18f57327e0	Linux 4.16 compat: inode_set_iversion() A new interface was added to manipulate the version field of an inode. Add a inode_set_iversion() wrapper for older kernels and use the new interface when available. The i_version field was dropped from the trace point due to the switch to an atomic64_t i_version type. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Chunwei Chen <david.chen@nutanix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7148	2018-02-08 21:25:19 -08:00
Serapheim Dimitropoulos	5b72a38d68	OpenZFS 8677 - Open-Context Channel Programs Authored by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Chris Williamson <chris.williamson@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Don Brady <don.brady@delphix.com> We want to be able to run channel programs outside of synching context. This would greatly improve performance for channel programs that just gather information, as they won't have to wait for synching context anymore. === What is implemented? This feature introduces the following: - A new command line flag in "zfs program" to specify our intention to run in open context. (The -n option) - A new flag/option within the channel program ioctl which selects the context. - Appropriate error handling whenever we try a channel program in open-context that contains zfs.sync* expressions. - Documentation for the new feature in the manual pages. === How do we handle zfs.sync functions in open context? When such a function is found by the interpreter and we are running in open context we abort the script and we spit out a descriptive runtime error. For example, given the script below ... arg = ... fs = arg["argv"][1] err = zfs.sync.destroy(fs) msg = "destroying " .. fs .. " err=" .. err return msg if we run it in open context, we will get back the following error: Channel program execution failed: [string "channel program"]:3: running functions from the zfs.sync submodule requires passing sync=TRUE to lzc_channel_program() (i.e. do not specify the "-n" command line argument) stack traceback: [C]: in function 'destroy' [string "channel program"]:3: in main chunk === What about testing? We've introduced new wrappers for all channel program tests that run each channel program as both (startard & open-context) and expect the appropriate behavior depending on the program using the zfs.sync module. OpenZFS-issue: https://www.illumos.org/issues/8677 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/17a49e15 Closes #6558	2018-02-08 16:05:57 -08:00
Serapheim Dimitropoulos	8d103d8856	OpenZFS 8604 - Simplify snapshots unmounting code Authored by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Andy Stormont <astormont@racktopsystems.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Don Brady <don.brady@delphix.com> Every time we want to unmount a snapshot (happens during snapshot deletion or renaming) we unnecessarily iterate through all the mountpoints in the VFS layer (see zfs_get_vfs). The current patch completely gets rid of that code and changes the approach while keeping the behavior of that code path the same. Specifically, it puts a hold on the dataset/snapshot and gets its vfs resource reference directly, instead of linearly searching for it. If that reference exists we attempt to amount it. With the above change, it became obvious that the nvlist manipulations that we do (add_boolean and add_nvlist) take a significant amount of time ensuring uniqueness of every new element even though they don't have too. Thus, we updated the patch so those nvlists are not trying to enforce the uniqueness of their elements. A more complete analysis of the problem solved by this patch can be found below: https://sdimitro.github.io/post/snap-unmount-perf/ OpenZFS-issue: https://www.illumos.org/issues/8604 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/126118fb	2018-02-08 15:29:44 -08:00
Don Brady	fc5d4b6737	Increase code coverage for Lua libraries Add test coverage for lua libraries Remove dead code in Lua implementation Signed-off-by: Don Brady <don.brady@delphix.com>	2018-02-08 15:29:38 -08:00
Chris Williamson	234c91c508	OpenZFS 8600 - ZFS channel programs - snapshot Authored by: Chris Williamson <chris.williamson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Don Brady <don.brady@delphix.com> ZFS channel programs should be able to create snapshots. In addition to the base snapshot functionality, this entails extra logic to handle edge cases which were formerly not possible, such as creating then destroying a snapshot in the same transaction sync. OpenZFS-issue: https://www.illumos.org/issues/8600 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/68089b8b	2018-02-08 15:29:24 -08:00
Brad Lewis	af07368986	OpenZFS 8592 - ZFS channel programs - rollback Authored by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Chris Williamson <chris.williamson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Don Brady <don.brady@delphix.com> ZFS channel programs should be able to perform a rollback. OpenZFS-issue: https://www.illumos.org/issues/8592 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d46b5ed6	2018-02-08 15:29:14 -08:00
Chris Williamson	475eca4908	OpenZFS 8605 - zfs channel programs fix zfs.exists Authored by: Chris Williamson <chris.williamson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Don Brady <don.brady@delphix.com> zfs.exists() in channel programs doesn't return any result, and should have a man page entry. This patch corrects zfs.exists so that it returns a value indicating if the dataset exists or not. It also adds documentation about it in the man page. OpenZFS-issue: https://www.illumos.org/issues/8605 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1e85e111	2018-02-08 15:28:52 -08:00
Chris Williamson	d99a015343	OpenZFS 7431 - ZFS Channel Programs Authored by: Chris Williamson <chris.williamson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> Ported-by: Don Brady <don.brady@delphix.com> Ported-by: John Kennedy <john.kennedy@delphix.com> OpenZFS-issue: https://www.illumos.org/issues/7431 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/dfc11533 Porting Notes: * The CLI long option arguments for '-t' and '-m' don't parse on linux * Switched from kmem_alloc to vmem_alloc in zcp_lua_alloc * Lua implementation is built as its own module (zlua.ko) * Lua headers consumed directly by zfs code moved to 'include/sys/lua/' * There is no native setjmp/longjump available in stock Linux kernel. Brought over implementations from illumos and FreeBSD * The get_temporary_prop() was adapted due to VFS platform differences * Use of inline functions in lua parser to reduce stack usage per C call * Skip some ZFS Test Suite ZCP tests on sparc64 to avoid stack overflow	2018-02-08 15:28:18 -08:00
WHR	8824a7f133	OpenZFS 8966 - Source file zfs_acl.c, function zfs_aclset_common contains a use after end of the lifetime of a local variable Authored by: WHR <msl0000023508@gmail.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: George Melikov <mail@gmelikov.ru> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Richard Lowe <richlowe@richlowe.net> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/8966 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c95549fcdc Closes #7141	2018-02-08 10:33:45 -08:00
Richard Elling	6b810d04bd	Remove deprecated zfs_arc_p_aggressive_disable zfs_arc_p_aggressive_disable is no more. This PR removes docs and module parameters for zfs_arc_p_aggressive_disable. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Richard Elling <Richard.Elling@RichardElling.com> Closes #7135	2018-02-07 11:54:20 -08:00
Brian Behlendorf	5461eefe50	Fix cstyle warnings This patch contains no functional changes. It is solely intended to resolve cstyle warnings in order to facilitate moving the spl source code in to the zfs repository. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #681	2018-02-07 11:49:38 -08:00
Tom Caputi	2b84817f66	Adjust ARC prefetch tunables to match docs Currently, the ARC exposes 2 tunables (zfs_arc_min_prefetch_ms and zfs_arc_min_prescient_prefetch_ms) which are documented to be specified in milliseconds. However, the code actually uses the values as though they were in seconds. This patch adjusts the code to match the names and documentation of the tunables. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed by: George Melikov <mail@gmelikov.ru> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7126	2018-02-05 16:57:53 -08:00
wli5	f6b58faaa6	Bug fix in qat_compress.c for vmalloc addr check Remove the unused vmalloc address check, and function mem_to_page will handle the non-vmalloc address when map it to a physical address. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Weigang Li <weigang.li@intel.com> Closes #7125	2018-02-05 10:26:27 -08:00
Brian Behlendorf	0d23f5e2e4	Fix hash_lock / keystore.sk_dk_lock lock inversion The keystore.sk_dk_lock should not be held while performing I/O. Drop the lock when reading from disk and update the code so they the first successful caller adds the key. Improve error handling in spa_keystore_create_mapping_impl(). Reviewed by: Thomas Caputi <tcaputi@datto.com> Reviewed-by: RageLtMan <rageltman@sempervictus> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7112 Closes #7115	2018-02-04 14:07:13 -08:00
Tom Caputi	1b66810bad	Change os->os_next_write_raw to work per txg Currently, os_next_write_raw is a single boolean used for determining whether or not the next call to dmu_objset_sync() should write out the objset_phys_t as a raw buffer. Since the boolean is not associated with a txg, the work simply happens during the next txg, which is not necessarily the correct one. In the current implementation this issue was misdiagnosed, resulting in a small hack in dmu_objset_sync() which seemed to resolve the problem. This patch changes os_next_write_raw to be an array of booleans, one for each txg in TXG_OFF and removes the hack. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #6864	2018-02-02 11:44:53 -08:00
Tom Caputi	047116ac76	Raw sends must be able to decrease nlevels Currently, when a raw zfs send file includes a DRR_OBJECT record that would decrease the number of levels of an existing object, the object is reallocated with dmu_object_reclaim() which creates the new dnode using the old object's nlevels. For non-raw sends this doesn't really matter, but raw sends require that nlevels on the receive side match that of the send side so that the checksum-of-MAC tree can be properly maintained. This patch corrects the issue by freeing the object completely before allocating it again in this case. This patch also corrects several issues with dnode_hold_impl() and related functions that prevented dnodes (particularly multi-slot dnodes) from being reallocated properly due to the fact that existing dnodes were not being fully cleaned up when they were freed. This patch adds a test to make sure that zfs recv functions properly with incremental streams containing dnodes of different sizes. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6821 Closes #6864	2018-02-02 11:43:11 -08:00
Tom Caputi	d53bd7f524	Fix recovery import (-F) with encrypted pool When performing zil_claim() at pool import time, it is important that encrypted datasets set os_next_write_raw before writing to the zil_header_t. This prevents the code from attempting to re-authenticate the objset_phys_t when it writes it out, which is unnecessary because the zil_header_t is not protected by either objset MAC and impossible since the keys aren't loaded yet. Unfortunately, one of the code paths did not set this flag, which causes failed ASSERTs during 'zpool import -F'. This patch corrects this issue. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #6864 Closes #6916	2018-02-02 11:39:36 -08:00
Tom Caputi	ae76f45cda	Encryption Stability and On-Disk Format Fixes The on-disk format for encrypted datasets protects not only the encrypted and authenticated blocks themselves, but also the order and interpretation of these blocks. In order to make this work while maintaining the ability to do raw sends, the indirect bps maintain a secure checksum of all the MACs in the block below it along with a few other fields that determine how the data is interpreted. Unfortunately, the current on-disk format erroneously includes some fields which are not portable and thus cannot support raw sends. It is not possible to easily work around this issue due to a separate and much smaller bug which causes indirect blocks for encrypted dnodes to not be compressed, which conflicts with the previous bug. In addition, the current code generates incompatible on-disk formats on big endian and little endian systems due to an issue with how block pointers are authenticated. Finally, raw send streams do not currently include dn_maxblkid when sending both the metadnode and normal dnodes which are needed in order to ensure that we are correctly maintaining the portable objset MAC. This patch zero's out the offending fields when computing the bp MAC and ensures that these MACs are always calculated in little endian order (regardless of the host system's byte order). This patch also registers an errata for the old on-disk format, which we detect by adding a "version" field to newly created DSL Crypto Keys. We allow datasets without a version (version 0) to only be mounted for read so that they can easily be migrated. We also now include dn_maxblkid in raw send streams to ensure the MAC can be maintained correctly. This patch also contains minor bug fixes and cleanups. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #6845 Closes #6864 Closes #7052	2018-02-02 11:37:16 -08:00
Tom Caputi	a73c94934f	Change movaps to movups in AES-NI code Currently, the ICP contains accelerated assembly code to be used specifically on CPUs with AES-NI enabled. This code makes heavy use of the movaps instruction which assumes that it will be provided aes keys that are 16 byte aligned. This assumption seems to hold on Illumos, but on Linux some kernel options such as 'slub_debug=P' will violate it. This patch changes all instances of this instruction to movups which is the same except that it can handle unaligned memory. This patch also adds a few flags which were accidentally never given to the assembly compiler, resulting in objtool warnings. Reviewed by: Gvozden Neskovic <neskovic@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Nathaniel R. Lewis <linux.robotdude@gmail.com> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #7065 Closes #7108	2018-01-31 15:17:56 -08:00
Brian Behlendorf	f90a30ad1b	Fix txg_sync_thread hang in scan_exec_io() When scn->scn_maxinflight_bytes has not been initialized it's possible to hang on the condition variable in scan_exec_io(). This issue was uncovered by ztest and is only possible when deduplication is enabled through the following call path. txg_sync_thread() spa_sync() ddt_sync_table() ddt_sync_entry() dsl_scan_ddt_entry() dsl_scan_scrub_cb() dsl_scan_enqueuei() scan_exec_io() cv_wait() Resolve the issue by always initializing scn_maxinflight_bytes to a reasonable minimum value. This value will be recalculated in dsl_scan_sync() to pick up changes to zfs_scan_vdev_limit and the addition/removal of vdevs. Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7098	2018-01-31 09:33:33 -08:00
LOLi	63f88c12b4	Fix style issues in man pages and commands help * Remove 'zfs snap' from zfs help message (OpenZFS sync) * Update zfs(8) to suggest 'snap' can be used as an alias for 'snapshot' * Enforce 80 columns limit in help messages * Remove zfs_disable_dup_eviction from zfs-module-parameters(5) * Expose zfs_scan_max_ext_gap as a kernel module parameter. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tom Caputi <tcaputi@datto.com> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Closes #7087	2018-01-29 15:05:03 -08:00
Giuseppe Di Natale	5e021f56d3	Add dbuf hash and dbuf cache kstats Introduce kstats about the dbuf hash and dbuf cache to make it easier to inspect state. This should help with debugging and understanding of these portions of the codebase. Correct format of dbuf kstat file. Introduce a dbc column to dbufs kstat to indicate if a dbuf is in the dbuf cache. Introduce field filtering in the dbufstat python script. Introduce a no header option to the dbufstat python script. Introduce a test case to test basic mru->mfu list movement in the ARC. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov> Closes #6906	2018-01-29 10:24:52 -08:00
Prakash Surya	0735ecb334	OpenZFS 8997 - ztest assertion failure in zil_lwb_write_issue PROBLEM ======= When `dmu_tx_assign` is called from `zil_lwb_write_issue`, it's possible for either `ERESTART` or `EIO` to be returned. If `ERESTART` is returned, this will cause an assertion to fail directly in `zil_lwb_write_issue`, where the code assumes the return value is `EIO` if `dmu_tx_assign` returns a non-zero value. This can occur if the SPA is suspended when `dmu_tx_assign` is called, and most often occurs when running `zloop`. If `EIO` is returned, this can cause assertions to fail elsewhere in the ZIL code. For example, `zil_commit_waiter_timeout` contains the following logic: lwb_t *nlwb = zil_lwb_write_issue(zilog, lwb); ASSERT3S(lwb->lwb_state, !=, LWB_STATE_OPENED); In this case, if `dmu_tx_assign` returned `EIO` from within `zil_lwb_write_issue`, the `lwb` variable passed in will not be issued to disk. Thus, it's `lwb_state` field will remain `LWB_STATE_OPENED` and this assertion will fail. `zil_commit_waiter_timeout` assumes that after it calls `zil_lwb_write_issue`, the `lwb` will be issued to disk, and doesn't handle the case where this is not true; i.e. it doesn't handle the case where `dmu_tx_assign` returns `EIO`. SOLUTION ======== This change modifies the `dmu_tx_assign` function such that `txg_how` is a bitmask, rather than of the `txg_how_t` enum type. Now, the previous `TXG_WAITED` semantics can be used via `TXG_NOTHROTTLE`, along with specifying either `TXG_NOWAIT` or `TXG_WAIT` semantics. Previously, when `TXG_WAITED` was specified, `TXG_NOWAIT` semantics was automatically invoked. This was not ideal when using `TXG_WAITED` within `zil_lwb_write_issued`, leading the problem described above. Rather, we want to achieve the semantics of `TXG_WAIT`, while also preventing the `tx` from being penalized via the dirty delay throttling. With this change, `zil_lwb_write_issued` can acheive the semtantics that it requires by passing in the value `TXG_WAIT \| TXG_NOTHROTTLE` to `dmu_tx_assign`. Further, consumers of `dmu_tx_assign` wishing to achieve the old `TXG_WAITED` semantics can pass in the value `TXG_NOWAIT \| TXG_NOTHROTTLE`. Authored by: Prakash Surya <prakash.surya@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> Porting Notes: - Additionally updated `zfs_tmpfile` to use `TXG_NOTHROTTLE` OpenZFS-issue: https://www.illumos.org/issues/8997 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/19ea6cb0f9 Closes #7084	2018-01-26 20:19:46 -08:00
Brian Behlendorf	8fb1ede146	Extend deadman logic The intent of this patch is extend the existing deadman code such that it's flexible enough to be used by both ztest and on production systems. The proposed changes include: * Added a new `zfs_deadman_failmode` module option which is used to dynamically control the behavior of the deadman. It's loosely modeled after, but independant from, the pool failmode property. It can be set to wait, continue, or panic. * wait - Wait for the "hung" I/O (default) * continue - Attempt to recover from a "hung" I/O * panic - Panic the system * Added a new `zfs_deadman_ziotime_ms` module option which is analogous to `zfs_deadman_synctime_ms` except instead of applying to a pool TXG sync it applies to zio_wait(). A default value of 300s is used to define a "hung" zio. * The ztest deadman thread has been re-enabled by default, aligned with the upstream OpenZFS code, and then extended to terminate the process when it takes significantly longer to complete than expected. * The -G option was added to ztest to print the internal debug log when a fatal error is encountered. This same option was previously added to zdb in commit `fa603f82`. Update zloop.sh to unconditionally pass -G to obtain additional debugging. * The FM_EREPORT_ZFS_DELAY event which was previously posted when the deadman detect a "hung" pool has been replaced by a new dedicated FM_EREPORT_ZFS_DEADMAN event. * The proposed recovery logic attempts to restart a "hung" zio by calling zio_interrupt() on any outstanding leaf zios. We may want to further restrict this to zios in either the ZIO_STAGE_VDEV_IO_START or ZIO_STAGE_VDEV_IO_DONE stages. Calling zio_interrupt() is expected to only be useful for cases when an IO has been submitted to the physical device but for some reasonable the completion callback hasn't been called by the lower layers. This shouldn't be possible but has been observed and may be caused by kernel/driver bugs. * The 'zfs_deadman_synctime_ms' default value was reduced from 1000s to 600s. * Depending on how ztest fails there may be no cache file to move. This should not be considered fatal, collect the logs which are available and carry on. * Add deadman test cases for spa_deadman() and zio_wait(). * Increase default zfs_deadman_checktime_ms to 60s. Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed by: Thomas Caputi <tcaputi@datto.com> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6999	2018-01-25 13:40:38 -08:00
Andriy Gapon	1b18c6d791	OpenZFS 8731 - ASSERT3U(nui64s, <=, UINT16_MAX) fails for large blocks Authored by: Andriy Gapon <avg@FreeBSD.org> Approved by: Dan McDonald <danmcd@joyent.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/8731 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4c08500788 Closes #7079	2018-01-25 10:02:11 -08:00
Giuseppe Di Natale	cf232b53d5	Revert "Remove wrong ASSERT in annotate_ecksum" This reverts commit `093911f194`. Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov> Closes #7079	2018-01-25 10:01:02 -08:00
Brian Behlendorf	23602fdb39	Add cv_timedwait_io() Add missing helper function cv_timedwait_io(), it should be used when waiting on IO with a specified timeout. Reviewed-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #674	2018-01-24 11:33:47 -08:00
Alexander Motin	916384729e	OpenZFS 8835 - Speculative prefetch in ZFS not working for misaligned reads In case of misaligned I/O sequential requests are not detected as such due to overlaps in logical block sequence: dmu_zfetch(fffff80198dd0ae0, 27347, 9, 1) dmu_zfetch(fffff80198dd0ae0, 27355, 9, 1) dmu_zfetch(fffff80198dd0ae0, 27363, 9, 1) dmu_zfetch(fffff80198dd0ae0, 27371, 9, 1) dmu_zfetch(fffff80198dd0ae0, 27379, 9, 1) dmu_zfetch(fffff80198dd0ae0, 27387, 9, 1) This patch makes single block overlap to be counted as a stream hit, improving performance up to several times. Authored by: Alexander Motin <mav@FreeBSD.org> Approved by: Gordon Ross <gwr@nexenta.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Allan Jude <allanjude@freebsd.org> Reviewed by: Gvozden Neskovic <neskovic@gmail.com> Reviewed by: George Melikov <mail@gmelikov.ru> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/8835 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/aab6dd482a Closes #7062	2018-01-19 09:31:29 -08:00
Brian Behlendorf	31864e3d8c	OpenZFS 8652 - Tautological comparisons with ZPROP_INVAL usr/src/uts/common/sys/fs/zfs.h Change ZPROP_INVAL and ZPROP_CONT from macros to enum values. Clang and GCC both prefer to use unsigned ints to store enums. That was causing tautological comparison warnings (and likely eliminating error handling code at compile time) whenever a zfs_prop_t or zpool_prop_t was compared to ZPROP_INVAL or ZPROP_CONT. Making the error flags be explicity enum values forces the enum types to be signed. ZPROP_INVAL was also compared against two different enum types. I had to change its name to ZPOOL_PROP_INVAL whenever its compared to a zpool_prop_t. There are still some places where ZPROP_INVAL or ZPROP_CONT is compared to a plain int, in code that doesn't know whether the int is storing a zfs_prop_t or a zpool_prop_t. usr/src/uts/common/fs/zfs/spa.c s/ZPROP_INVAL/ZPOOL_PROP_INVAL/ Authored by: Alan Somers <asomers@gmail.com> Approved by: Gordon Ross <gwr@nexenta.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Igor Kozhukhov <igor@dilos.org> Reviewed by: George Melikov <mail@gmelikov.ru> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/8652 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c2de80dc74 Closes #7061	2018-01-19 09:22:37 -08:00
John L. Hammond	51d1b58ef3	Emit an error message before MMP suspends pool In mmp_thread(), emit an MMP specific error message before calling zio_suspend() so that the administrator will understand why the pool is being suspended. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: John L. Hammond <john.hammond@intel.com> Closes #7048	2018-01-17 12:24:42 -08:00
Sean Eric Fagan	43cb30b3ce	OpenZFS 8959 - Add notifications when a scrub is paused or resumed Authored by: Sean Eric Fagan <sef@ixsystems.com> Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Gordon Ross <gwr@nexenta.com> Ported-by: Giuseppe Di Natale <dinatale2@llnl.gov> Porting Notes: - Brought #defines in eventdefs.h in line with ZFS on Linux format. - Updated zfs-events.5 with the new events. OpenZFS-issue: https://www.illumos.org/issues/8959 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c862b93eea Closes #7049	2018-01-17 10:31:00 -08:00
Andriy Gapon	6a2185660d	OpenZFS 8930 - zfs_zinactive: do not remove the node if the filesystem is readonly Authored by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Approved by: Gordon Ross <gwr@nexenta.com> Ported-by: Brian Behlendorf <behlendorf1@llnl.gov> OpenZFS-issue: https://www.illumos.org/issues/8930 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/93c618e0f4 Closes #7029	2018-01-11 13:50:08 -08:00
Brian Behlendorf	fed90353d7	Support -fsanitize=address with --enable-asan When --enable-asan is provided to configure then build all user space components with fsanitize=address. For kernel support use the Linux KASAN feature instead. https://github.com/google/sanitizers/wiki/AddressSanitizer When using gcc version 4.8 any test case which intentionally generates a core dump will fail when using --enable-asan. The default behavior is to disable core dumps and only newer versions allow this behavior to be controled at run time with the ASAN_OPTIONS environment variable. Additionally, this patch includes some build system cleanup. * Rules.am updated to set the minimum AM_CFLAGS, AM_CPPFLAGS, and AM_LDFLAGS. Any additional flags should be added on a per-Makefile basic. The --enable-debug and --enable-asan options apply to all user space binaries and libraries. * Compiler checks consolidated in always-compiler-options.m4 and renamed for consistency. * -fstack-check compiler flag was removed, this functionality is provided by asan when configured with --enable-asan. * Split DEBUG_CFLAGS in to DEBUG_CFLAGS, DEBUG_CPPFLAGS, and DEBUG_LDFLAGS. * Moved default kernel build flags in to module/Makefile.in and split in to ZFS_MODULE_CFLAGS and ZFS_MODULE_CPPFLAGS. These flags are set with the standard ccflags-y kbuild mechanism. * -Wframe-larger-than checks applied only to binaries or libraries which include source files which are built in both user space and kernel space. This restriction is relaxed for user space only utilities. * -Wno-unused-but-set-variable applied only to libzfs and libzpool. The remaining warnings are the result of an ASSERT using a variable when is always declared. * -D_POSIX_PTHREAD_SEMANTICS and -D__EXTENSIONS__ dropped because they are Solaris specific and thus not needed. * Ensure $GDB is defined as gdb by default in zloop.sh. Signed-off-by: DHE <git@dehacked.net> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7027	2018-01-10 10:49:27 -08:00
Alex Zhuravlev	3910184d9e	Use zap_count instead of cached z_size for unlink As a performance optimization Lustre does not strictly update the SA_ZPL_SIZE when adding/removing from non-directory entries. This results in entries which cannot be removed through the ZPL layer even though the ZAP is empty and safe to remove. Resolve this issue by checking the zap_count() directly instead on relying on the cached SA_ZPL_SIZE. Micro-benchmarks show no significant performance impact due to the additional overhead of using zap_count(). Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Alex Zhuravlev <alexey.zhuravlev@intel.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #7019	2018-01-09 16:16:07 -08:00
DHE	460f239e69	Fix -fsanitize=address memory leak kmem_alloc(0, ...) in userspace returns a leakable pointer. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: DHE <git@dehacked.net> Issue #6941	2018-01-09 12:26:25 -08:00
Brian Behlendorf	0873bb6337	Fix ARC hit rate When the compressed ARC feature was added in commit `d3c2ae1` the method of reference counting in the ARC was modified. As part of this accounting change the arc_buf_add_ref() function was removed entirely. This would have be fine but the arc_buf_add_ref() function served a second undocumented purpose of updating the ARC access information when taking a hold on a dbuf. Without this logic in place a cached dbuf would not migrate its associated arc_buf_hdr_t to the MFU list. This would negatively impact the ARC hit rate, particularly on systems with a small ARC. This change reinstates the missing call to arc_access() from dbuf_hold() by implementing a new arc_buf_access() function. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6171 Closes #6852 Closes #6989	2018-01-08 09:52:36 -08:00
Prakash Surya	2fe61a7ecc	OpenZFS 8909 - 8585 can cause a use-after-free kernel panic Authored by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: John Kennedy <jwk404@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Igor Kozhukhov <igor@dilos.org> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Robert Mustacchi <rm@joyent.com> Ported-by: Prakash Surya <prakash.surya@delphix.com> PROBLEM ======= There's a race condition that exists if `zil_free_lwb` races with either `zil_commit_waiter_timeout` and/or `zil_lwb_flush_vdevs_done`. Here's an example panic due to this bug: > ::status debugging crash dump vmcore.0 (64-bit) from ip-10-110-205-40 operating system: 5.11 dlpx-5.2.2.0_2017-12-04-17-28-32b6ba51fb (i86pc) image uuid: 4af0edfb-e58e-6ed8-cafc-d3e9167c7513 panic message: BAD TRAP: type=e (#pf Page fault) rp=ffffff0010555970 addr=60 occurred in module "zfs" due to a NULL pointer dereference dump content: kernel pages only > $c zio_shrink+0x12() zil_lwb_write_issue+0x30d(ffffff03dcd15cc0, ffffff03e0730e20) zil_commit_waiter_timeout+0xa2(ffffff03dcd15cc0, ffffff03d97ffcf8) zil_commit_waiter+0xf3(ffffff03dcd15cc0, ffffff03d97ffcf8) zil_commit+0x80(ffffff03dcd15cc0, 9a9) zfs_write+0xc34(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0) fop_write+0x5b(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0) write+0x250(42, fffffd7ff4832000, 2000) sys_syscall+0x177() If there's an outstanding lwb that's in `zil_commit_waiter_timeout` waiting to timeout, waiting on it's waiter's CV, we must be sure not to call `zil_free_lwb`. If we end up calling `zil_free_lwb`, then that LWB may be freed and can result in a use-after-free situation where the stale lwb pointer stored in the `zil_commit_waiter_t` structure of the thread waiting on the waiter's CV is used. A similar situation can occur if an lwb is issued to disk, and thus in the `LWB_STATE_ISSUED` state, and `zil_free_lwb` is called while the disk is servicing that lwb. In this situation, the lwb will be freed by `zil_free_lwb`, which will result in a use-after-free situation when the lwb's zio completes, and `zil_lwb_flush_vdevs_done` is called. This race condition is prevented in `zil_close` by calling `zil_commit` before `zil_free_lwb` is called, which will ensure all outstanding (i.e. all lwb's in the `LWB_STATE_OPEN` and/or `LWB_STATE_ISSUED` states) reach the `LWB_STATE_DONE` state before the lwb's are freed (`zil_commit` will not return untill all the lwb's are `LWB_STATE_DONE`). Further, this race condition is prevented in `zil_sync` by only calling `zil_free_lwb` for lwb's that do not have their `lwb_buf` pointer set. All lwb's not in the `LWB_STATE_DONE` state will have a non-null value for this pointer; the pointer is only cleared in `zil_lwb_flush_vdevs_done`, at which point the lwb's state will be changed to `LWB_STATE_DONE`. This race is present in `zil_suspend`, leading to this bug. At first glance, it would appear as though this would not be true because `zil_suspend` will call `zil_commit`, just like `zil_close`, but the problem is that `zil_suspend` will set the zilog's `zl_suspend` field prior to calling `zil_commit`. Further, in `zil_commit`, if `zl_suspend` is set, `zil_commit` will take a special branch of logic and use `txg_wait_synced` instead of performing the normal `zil_commit` logic. This call to `txg_wait_synced` might be good enough for the data to reach disk safely before it returns, but it does not ensure that all outstanding lwb's reach the `LWB_STATE_DONE` state before it returns. This is because, if there's an lwb "stuck" in `zil_commit_waiter_timeout`, waiting for it's lwb to timeout, it will maintain a non-null value for it's `lwb_buf` field and thus `zil_sync` will not free that lwb. Thus, even though the lwb's data is already on disk, the lwb will be left lingering, waiting on the CV, and will eventually timeout and be issued to disk even though the write is unnecessary. So, after `zil_commit` is called from `zil_suspend`, we incorrectly assume that there are not outstanding lwb's, and proceed to free all lwb's found on the zilog's lwb list. As a result, we free the lwb that will later be used `zil_commit_waiter_timeout`. SOLUTION ======== The solution to this, is to ensure all outstanding lwb's complete before calling `zil_free_lwb` via `zil_destroy` in `zil_suspend`. This patch accomplishes this goal by forcing the normal `zil_commit` logic when called from `zil_sync`. Now, `zil_suspend` will call `zil_commit_impl` which will always use the normal logic of waiting/issuing lwb's to disk before it returns. As a result, any lwb's outstanding when `zil_commit_impl` is called will be guaranteed to reach the `LWB_STATE_DONE` state by the time it returns. Further, no new lwb's will be created via `zil_commit` since the zilog's `zl_suspend` flag will be set. This will force all new callers of `zil_commit` to use `txg_wait_synced` instead of creating and issuing new lwb's. Thus, all lwb's left on the zilog's lwb list when `zil_destroy` is called will be in the `LWB_STATE_DONE` state, and we'll avoid this race condition. OpenZFS-issue: https://www.illumos.org/issues/8909 OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ece62b6f8d Closes #6940	2017-12-28 10:18:04 -08:00
lidongyang	823d48bfb1	Call commit callbacks from the tail of the list Our zfs backed Lustre MDT had soft lockups while under heavy metadata workloads while handling transaction callbacks from osd_zfs. The problem is zfs is not taking advantage of the fast path in Lustre's trans callback handling, where Lustre will skip the calls to ptlrpc_commit_replies() when it already saw a higher transaction number. This patch corrects this, it also has a positive impact on metadata performance on Lustre with osd_zfs, plus some cleanup in the headers. A similar issue for ext4/ldiskfs is described on: https://jira.hpdd.intel.com/browse/LU-6527 Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Li Dongyang <dongyang.li@anu.edu.au> Closes #6986	2017-12-22 10:19:51 -08:00
Tony Hutter	c9821f1ccc	Linux 4.15 compat: timer updates Use timer_setup() macro and new timeout function definition. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov> Closes #670 Closes #671	2017-12-21 10:56:32 -08:00
Tom Caputi	a8b2e30685	Support re-prioritizing asynchronous prefetches When sequential scrubs were merged, all calls to arc_read() (including prefetch IOs) were given ZIO_PRIORITY_ASYNC_READ. Unfortunately, this behaves badly with an existing issue where prefetch IOs cannot be re-prioritized after the issue. The result is that synchronous reads end up in the same vdev_queue as the scrub IOs and can have (in some workloads) multiple seconds of latency. This patch incorporates 2 changes. The first ensures that all scrub IOs are given ZIO_PRIORITY_SCRUB to allow the vdev_queue code to differentiate between these I/Os and user prefetches. Second, this patch introduces zio_change_priority() to provide the missing capability to upgrade a zio's priority. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tom Caputi <tcaputi@datto.com> Closes #6921 Closes #6926	2017-12-21 09:13:06 -08:00
Brian Behlendorf	bbffb59efc	Fix multihost stale cache file import When the multihost property is enabled it should be impossible to import an active pool even using the force (-f) option. This patch prevents a forced import from succeeding when importing with a stale cache file. The root cause of the problem is that the kernel modules trusted the hostid provided in configuration. This is always correct when the configuration is generated by scanning for the pool. However, when using an existing cache file the hostid could be stale which would result in the activity check being skipped. Resolve the issue by always using the hostid read from the label configuration where the best uberblock was found. Reviewed-by: Olaf Faaland <faaland1@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6933 Closes #6971	2017-12-18 10:28:27 -08:00

1 2 3 4 5 ...

2398 Commits