freebsd-skq

Author	SHA1	Message	Date
Alexander Motin	4e77060898	9236 nuke spa_dbgmsg We should use zfs_dbgmsg instead of spa_dbgmsg. Or at least, metaslab_condense() should call zfs_dbgmsg because it's important and rare enough to always log. It's possible that the message in zio_dva_allocate() would be too high-frequency for zfs_dbgmsg. illumos/illumos-gate@21f7c81cc1 Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Matthew Ahrens <mahrens@delphix.com>	2018-07-31 00:42:31 +00:00
Alexander Motin	3741bf4def	9192 explicitly pass good_writes to vdev_uberblock/label_sync Currently vdev_label_sync and vdev_uberblock_sync take a zio_t and assume that its io_private is a pointer to the good_writes count. They should instead accept this argument explicitly. illumos/illumos-gate@a3b5583021 Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Matthew Ahrens <mahrens@delphix.com>	2018-07-31 00:34:39 +00:00
Alexander Motin	d7479e295b	9290 device removal reduces redundancy of mirrors Mirrors are supposed to provide redundancy in the face of whole-disk failure and silent damage (e.g. some data on disk is not right, but ZFS hasn't detected the whole device as being broken). However, the current device removal implementation bypasses some of the mirror's redundancy. illumos/illumos-gate@3a4b1be953 Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Reviewed by: Sara Hartse <sara.hartse@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Tim Chase <tim@chase2k.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Matthew Ahrens <mahrens@delphix.com>	2018-07-31 00:13:04 +00:00
Alexander Motin	df0ce4e5a8	9112 Improve allocation performance on high-end systems On high-end systems running async sequential write workloads, especially NUMA systems with flash or NVMe storage, one significant performance bottleneck is selecting a metaslab to do allocations from. This process can be parallelized, providing significant performance increases for these workloads. illumos/illumos-gate@f78cdc34af Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Alexander Motin <mav@FreeBSD.org> Approved by: Gordon Ross <gwr@nexenta.com> Author: Paul Dagnelie <pcd@delphix.com>	2018-07-30 23:53:25 +00:00
Alexander Motin	af8040b47a	9238 ZFS Spacemap Encoding V2 The current space map encoding has the following disadvantages: [1] Assuming 512 sector size each entry can represent at most 16MB for a segment. This makes the encoding very inefficient for large regions of space. [2] As vdev-wide space maps have started to be used by new features (i.e. device removal, zpool checkpoint) we've started imposing limits in the vdevs that can be used with them based on the maximum addressable offset (currently 64PB for a top-level vdev). The new remains backwards compatible with the old one. The introduced two-word entry format, besides extending the limits imposed by the single-entry layout, also includes a vdev field and some extra padding after its prefix. The extra padding after the prefix should is reserved for future usage (e.g. new prefixes for future encodings or new fields for flags). The new vdev field not only makes the space maps more self-descriptive, but also opens the doors for pool-wide space maps. One final important note is that the number of bits used for vdevs is reduced to 24 bits for blkptrs. That was decided as we don't know of any setups that use more than 16M vdevs for the time being and we wanted to fit the vdev field in the space map. In addition that gives us some extra bits in dva_t. illumos/illumos-gate@17f11284b4 Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <gwilson@zfsmail.com> Approved by: Gordon Ross <gwr@nexenta.com> Author: Serapheim Dimitropoulos <serapheim@delphix.com>	2018-07-30 22:56:24 +00:00
Alexander Motin	05cdb6cd3b	9189 Add debug to vdev_label_read_config when txg check fails illumos/illumos-gate@b6bf6e1540 Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Approved by: Matt Ahrens <mahrens@delphix.com> Author: Pavel Zakharov <pavel.zakharov@delphix.com>	2018-07-30 21:59:44 +00:00
Alexander Motin	ec66f8a6e3	9284 arc_reclaim_thread has 2 jobs `arc_reclaim_thread()` calls `arc_adjust()` after calling `arc_kmem_reap_now()`; `arc_adjust()` signals `arc_get_data_buf()` to indicate that we may no longer be `arc_is_overflowing()`. The problem is, `arc_kmem_reap_now()` can take several seconds to complete, has no impact on `arc_is_overflowing()`, but due to how the code is structured, can impact how long the ARC will remain in the `arc_is_overflowing()` state. The fix is to use seperate threads to: 1. keep `arc_size` under `arc_c`, by calling `arc_adjust()`, which improves `arc_is_overflowing()` 2. keep enough free memory in the system, by calling `arc_kmem_reap_now()` plus `arc_shrink()`, which improves `arc_available_memory()`. illumos/illumos-gate@de753e34f9 Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Dan McDonald <danmcd@joyent.com> Reviewed by: Tim Kordas <tim.kordas@joyent.com> Approved by: Garrett D'Amore <garrett@damore.org> Author: Brad Lewis <brad.lewis@delphix.com>	2018-07-30 19:44:14 +00:00
Alexander Motin	a622bac50f	9280 Assertion failure while running removal_with_ganging test with 4K devices illumos/illumos-gate@243952c7ee Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> Author: Matt Ahrens <Matt.Ahrens@delphix.com>	2018-03-28 23:12:03 +00:00
Alexander Motin	a1773e8b7a	9188 increase size of dbuf cache to reduce indirect block decompression illumos/illumos-gate@268bbb2a2f With compressed ARC (6950) we use up to 25% of our CPU to decompress indirect blocks, under a workload of random cached reads. To reduce this decompression cost, we would like to increase the size of the dbuf cache so that more indirect blocks can be stored uncompressed. If we are caching entire large files of recordsize=8K, the indirect blocks use 1/64th as much memory as the data blocks (assuming they have the same compression ratio). We suggest making the dbuf cache be 1/32nd of all memory, so that in this scenario we should be able to keep all the indirect blocks decompressed in the dbuf cache. (We want it to be more than the 1/64th that the indirect blocks would use because we need to cache other stuff in the dbuf cache as well.) In real world workloads, this won't help as dramatically as the example above, but we think it's still worth it because the risk of decreasing performance is low. The potential negative performance impact is that we will be slightly reducing the size of the ARC (by ~3%). Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com> Reviewed by: Allan Jude <allanjude@freebsd.org> Reviewed by: Igor Kozhukhov <igor@dilos.org> Approved by: Garrett D'Amore <garrett@damore.org> Author: George Wilson <george.wilson@delphix.com>	2018-03-28 22:57:02 +00:00
Alexander Motin	bc1a4f2e3c	9321 arc_loan_compressed_buf() can increment arc_loaned_bytes by the wrong value illumos/illumos-gate@9be12bd737 arc_loan_compressed_buf() increments arc_loaned_bytes by psize unconditionally In the case of zfs_compressed_arc_enabled=0, when the buf is returned via arc_return_buf(), if ARC_BUF_COMPRESSED(buf) is false, then arc_loaned_bytes is decremented by lsize, not psize. Switch to using arc_buf_size(buf), instead of psize, which will return psize or lsize, depending on the result of ARC_BUF_COMPRESSED(buf). Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> Author: Allan Jude <allanjude@freebsd.org>	2018-03-28 22:43:55 +00:00
Alexander Motin	d32e3273d2	9235 rename zpool_rewind_policy_t to zpool_load_policy_t illumos/illumos-gate@5dafeea3eb We want to be able to pass various settings during import/open of a pool, which are not only related to rewind. Instead of adding a new policy and duplicate a bunch of code, we should just rename rewind_policy to a more generic term like load_policy. For instance, we'd like to set spa->spa_import_flags from the nvlist, rather from a flags parameter passed to spa_import as in some cases we want those flags not only for the import case, but also for the open case. One such flag could be ZFS_IMPORT_MISSING_LOG (as used in zdb) which would allow zfs to open a pool when logs are missing. Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Pavel Zakharov <pavel.zakharov@delphix.com>	2018-03-28 22:16:51 +00:00
Alexander Motin	f5b017d914	9191 dump vdev tree to zfs_dbgmsg when spa load fails due to missing log devices illumos/illumos-gate@ccef24b493 Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Pavel Zakharov <pavel.zakharov@delphix.com>	2018-03-28 22:08:57 +00:00
Alexander Motin	64f15aa65b	9187 racing condition between vdev label and spa_last_synced_txg in vdev_validate illumos/illumos-gate@d1de72cfa2 ztest failed with uncorrectable IO error despite having the fix for #7163. Both sides of the mirror have CANT_OPEN_BAD_LABEL, which also distinguishes it from that issue. Definitely seems like a racing condition between the vdev_validate and spa_sync: 1. Thread A (spa_sync): vdev label is updated to latest txg 2. Thread B (vdev_validate): vdev label's txg is compared to spa_last_synced_txg and is ahead. 3. Thread A (spa_sync): spa_last_synced_txg is updated to latest txg. Solution: do not check txg in vdev_validate unless config lock is held. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Pavel Zakharov <pavel.zakharov@delphix.com>	2018-03-28 22:06:12 +00:00
Alexander Motin	bce155015b	Add files missed from r331695.	2018-03-28 21:00:34 +00:00
Alexander Motin	b8a528e092	9166 zfs storage pool checkpoint illumos/illumos-gate@8671400134 The idea of Storage Pool Checkpoint (aka zpool checkpoint) deals with exactly that. It can be thought of as a “pool-wide snapshot” (or a variation of extreme rewind that doesn’t corrupt your data). It remembers the entire state of the pool at the point that it was taken and the user can revert back to it later or discard it. Its generic use case is an administrator that is about to perform a set of destructive actions to ZFS as part of a critical procedure. She takes a checkpoint of the pool before performing the actions, then rewinds back to it if one of them fails or puts the pool into an unexpected state. Otherwise, she discards it. With the assumption that no one else is making modifications to ZFS, she basically wraps all these actions into a “high-level transaction”. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>	2018-03-28 18:12:06 +00:00
Alexander Motin	6027cbd474	9213 zfs: sytem typo illumos/illumos-gate@edc8ef7d92 Reviewed by: C Fraire <cfraire@me.com> Reviewed by: Andy Fiddaman <omnios@citrus-it.co.uk> Approved by: Joshua M. Clulow <josh@sysmgr.org> Author: Toomas Soome <tsoome@me.com>	2018-03-23 02:28:37 +00:00
Alexander Motin	057fec9070	9084 spa_*_ashift must ignore spare devices illumos/illumos-gate@b037f3dbd6 Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Prakash Surya <prakash.surya@delphix.com>	2018-03-23 02:22:16 +00:00
Alexander Motin	ba31d3992e	8484 Implement aggregate sum and use for arc counters In pursuit of improving performance on multi-core systems, we should implements fanned out counters and use them to improve the performance of some of the arc statistics. These stats are updated extremely frequently, and can consume a significant amount of CPU time. Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Paul Dagnelie <pcd@delphix.com>	2018-03-23 00:20:42 +00:00
Andriy Gapon	695c9b6645	9164 assert: newds == os->os_dsl_dataset illumos/illumos-gate@5f5913bb83 `5f5913bb83` https://www.illumos.org/issues/9164 This issue has been reported by Alan Somers as https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225877 dmu_objset_refresh_ownership() first disowns a dataset (and releases it) and then owns it again. There is an assert that the new dataset object is the same as the old dataset object. When running ZFS Test Suite on FreeBSD we see this panic from zpool_upgrade_007_pos test: panic: solaris assert: newds == os->os_dsl_dataset (0xfffff80045f4c000 == 0xfffff80021ab4800) I see that the old dataset has dsl_dataset_evict_async() pending in ds_dbu.dbu_tqent and its ds_dbuf is NULL. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Don Brady <don.brady@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Andriy Gapon <avg@FreeBSD.org>	2018-03-15 08:46:49 +00:00
Andriy Gapon	dcf8a50d8f	8984 fix for 6764 breaks ACL inheritance illumos/illumos-gate@e9bacc6d1a `e9bacc6d1a` https://www.illumos.org/issues/8984 Consider a directory configured as: drwx-ws---+ 2 henson cpp 3 Jan 23 12:35 dropbox/ user:henson:rwxpdDaARWcC--:f-i----:allow owner@:--------------:f-i----:allow group@:--------------:f-i----:allow everyone@:--------------:f-i----:allow owner@:rwxpdDaARWcC--:-di----:allow group:cpp:-wx-----------:-------:allow owner@:rwxpdDaARWcC--:-------:allow A new file created in this directory ends up looking like: rw-r--r-+ 1 astudent cpp 0 Jan 23 12:39 testfile user:henson:rw-pdDaARWcC--:------I:allow owner@:--------------:------I:allow group@:--------------:------I:allow everyone@:--------------:------I:allow owner@:rw-p--aARWcCos:-------:allow group@:r-----a-R-c--s:-------:allow everyone@:r-----a-R-c--s:-------:allow with extraneous group@ and everyone@ entries allowing read access that shouldn't exist. Per Albert Lee on the zfs mailing list: "aclinherit=passthrough/passthrough-x should still ignore the requested mode when an inheritable ACE for owner@ group@, or everyone@ is present in the parent directory. It appears there was an oversight in my fix for https://www.illumos.org/issues/6764 which made calling zfs_acl_chmod from zfs_acl_inherit unconditional. I think the parent ACL check for aclinherit=passthrough needs to be reintroduced in zfs_acl_inherit." We have a large number of faculty who use dropbox directories like the example to have students submit projects. All of these directories are now allowing Reviewed by: Sam Zaydel <szaydel@racktopsystems.com> Reviewed by: Paul B. Henson <henson@acm.org> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Approved by: Matthew Ahrens <mahrens@delphix.com> Author: Dominik Hassler <hadfl@omniosce.org>	2018-03-07 13:47:01 +00:00
Alexander Motin	4195015764	9080 recursive enter of vdev_indirect_rwlock from vdev_indirect_remap() illumos/illumos-gate@bdfded42e6 A scenario came up where a callback executed by vdev_indirect_remap() on a vdev, calls vdev_indirect_remap() on the same vdev and tries to reacquire vdev_indirect_rwlock that was already acquired from the first call to vdev_indirect_remap(). The specific scenario, is that we want to remap a block pointer that is snapshoted but its dataset's remap_deadlist is not cached. So in order to add it we issue a read through a vdev_indirect_remap() on the same vdev, which brings up the aforementioned issue. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org> Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>	2018-02-22 03:52:03 +00:00
Alexander Motin	9a64996e42	Missed pieces of r329799.	2018-02-22 03:23:43 +00:00
Alexander Motin	e7245cfc4c	9079 race condition in starting and ending condesing thread for indirect vdevs illumos/illumos-gate@667ec66f1b The timeline of the race condition is the following: [1] Thread A is about to finish condesing the first vdev in spa_condense_indirect_thread(), so it calls the spa_condense_indirect_complete_sync() sync task which sets the spa_condensing_indirect field to NULL. Waiting for the sync task to finish, thread A sleeps until the txg is done. When this happens, thread A will acquire spa_async_lock and set spa_condense_thread to NULL. [2] While thread A waits for the txg to finish, thread B which is running spa_sync() checks whether it should condense the second vdev in vdev_indirect_should_condense() by checking the spa_condensing_indirect field which was set to NULL by spa_condense_indirect_thread() from thread A. So it goes on and tries to spawn a new condensing thread in spa_condense_indirect_start_sync() and the aforementioned assertions fails because thread A has not set spa_condense_thread to NULL (which is basically the last thing it does before returning). The main issue here is that we rely on both spa_condensing_indirect and spa_condense_thread to signify whether a condensing thread is running. Ideally we would only use one throughout the codebase. In addition, for managing spa_condense_thread we currently use spa_async_lock which basically tights condensing to scrubing when it comes to pausing and resuming those actions during spa export. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org> Author: Serapheim Dimitropoulos <serapheim@delphix.com>	2018-02-22 03:22:27 +00:00
Alexander Motin	250699c304	r329793 \| mav \| 2018-02-22 04:21:03 +0200 (чт, 22 февр. 2018) \| 58 lines 9075 Improve ZFS pool import/load process and corrupted pool recovery illumos/illumos-gate@6f7938128a Some work has been done lately to improve the debugability of the ZFS pool load (and import) process. This includes: https://www.illumos.org/issues/7638: Refactor spa_load_impl into several functions https://www.illumos.org/issues/8961: SPA load/import should tell us why it failed https://www.illumos.org/issues/7277: zdb should be able to print zfs_dbgmsg's To iterate on top of that, there's a few changes that were made to make the import process more resilient and crash free. One of the first tasks during the pool load process is to parse a config provided from userland that describes what devices the pool is composed of. A vdev tree is generated from that config, and then all the vdevs are opened. The Meta Object Set (MOS) of the pool is accessed, and several metadata objects that are necessary to load the pool are read. The exact configuration of the pool is also stored inside the MOS. Since the configuration provided from userland is external and might not accurately describe the vdev tree of the pool at the txg that is being loaded, it cannot be relied upon to safely operate the pool. For that reason, the configuration in the MOS is read early on. In the past, the two configurations were compared together and if there was a mismatch then the load process was aborted and an error was returned. The latter was a good way to ensure a pool does not get corrupted, however it made the pool load process needlessly fragile in cases where the vdev configuration changed or the userland configuration was outdated. Since the MOS is stored in 3 copies, the configuration provided by userland doesn't have to be perfect in order to read its contents. Hence, a new approach has been adopted: The pool is first opened with the untrusted userland configuration just so that the real configuration can be read from the MOS. The trusted MOS configuration is then used to generate a new vdev tree and the pool is re-opened. When the pool is opened with an untrusted configuration, writes are disabled to avoid accidentally damaging it. During reads, some sanity checks are performed on block pointers to see if each DVA points to a known vdev; when the configuration is untrusted, instead of panicking the system if those checks fail we simply avoid issuing reads to the invalid DVAs. This new two-step pool load process now allows rewinding pools accross vdev tree changes such as device replacement, addition, etc. Loading a pool from an external config file in a clustering environment also becomes much safer now since the pool will import even if the config is outdated and didn't, for instance, register a recent device addition. With this code in place, it became relatively easy to implement a long-sought-after feature: the ability to import a pool with missing top level (i.e. non-redundant) devices. Note that since this almost guarantees some loss Of data, this feature is for now restricted to a read-only import. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org> Author: Pavel Zakharov <pavel.zakharov@delphix.com>	2018-02-22 02:25:09 +00:00
Alexander Motin	dc763b80f4	8942 zfs promote .../%recv should be an error illumos/illumos-gate@add927f8c8 Reported on the ZFSonLinux https://github.com/zfsonlinux/zfs/issues/4843, fixed by https://github.com/zfsonlinux/zfs/pull/6339: If we are in the middle of an incremental zfs receive, the child .../%recv will exist. If you concurrently run zfs promote .../%recv, it will "work", but then zfs gets confused. For example, there's no obvious way to destroy the containing filesystem (because it is now a clone of its invisible child). Attempting to do this promote should be an error. We could fix this by having zfs_ioc_promote() check if zc_name contains a %, similar to zfs_ioc_rename(). Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: loli10K <ezomori.nozomu@gmail.com>	2018-02-22 01:30:03 +00:00
Alexander Motin	63be747cb1	8477 Assertion failed in vdev_state_dirty(): spa_writeable(spa) illumos/illumos-gate@f4c1745bd6 Illumos 4080 allows "zpool clear" to work on readonly pools: i don't think this is the intended behaviour, we shouldn't be allowed to clear readonly pools. Probably. A fix is already in the ZFS on Linux repository to addess this issue: `92e43c1718` Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: loli10K <ezomori.nozomu@gmail.com>	2018-02-22 00:59:49 +00:00
Alexander Motin	106c279da7	8408 dsl_props_set_sync_impl() does not handle nested nvlists correctly illumos/illumos-gate@85723e5eec When iterating over the input nvlist in dsl_props_set_sync_impl() when we don't preserve the nvpair name before looking up ZPROP_VALUE, so when we later go to process it nvpair_name() is always "value" instead of the actual property name. This results in a couple of bugs in the recv code: - received properties are not restored correctly when failing to receive an incremental send stream - received properties are not completely replaced by the new ones when successfully receiving an incremental send stream This was discovered on ZFS on Linux (fixed in `5f1346c299`) Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: loli10K <ezomori.nozomu@gmail.com>	2018-02-22 00:53:59 +00:00
Alexander Motin	f84197c93c	9036 zfs: duplicate 'const' declaration specifier illumos/illumos-gate@f02c28e434 Reviewed by: Yuri Pankov <yuripv@yuripv.net> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Toomas Soome <tsoome@me.com>	2018-02-22 00:48:59 +00:00
Alexander Motin	0537abbeb5	9035 zfs: this statement may fall through illumos/illumos-gate@46ac8fdfc5 Reviewed by: Yuri Pankov <yuripv@yuripv.net> Reviewed by: Andy Fiddaman <omnios@citrus-it.co.uk> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Toomas Soome <tsoome@me.com>	2018-02-22 00:46:24 +00:00
Alexander Motin	38fc65e027	8962 zdb should work on non-idle pools illumos/illumos-gate@e144c4e6c9 Currently `zdb` consistently fails to examine non-idle pools as it fails during the `spa_load()` process. The main problem seems to be that `spa_load_verify()` fails as can be seen below: $ sudo zdb -d -G dcenter zdb: can't open 'dcenter': I/O error ZFS_DBGMSG(zdb): spa_open_common: opening dcenter spa_load(dcenter): LOADING disk vdev '/dev/dsk/c4t11d0s0': best uberblock found for spa dcenter. txg 40824950 spa_load(dcenter): using uberblock with txg=40824950 spa_load(dcenter): UNLOADING spa_load(dcenter): RELOADING spa_load(dcenter): LOADING disk vdev '/dev/dsk/c3t10d0s0': best uberblock found for spa dcenter. txg 40824952 spa_load(dcenter): using uberblock with txg=40824952 spa_load(dcenter): FAILED: spa_load_verify failed [error=5] spa_load(dcenter): UNLOADING This change makes `spa_load_verify()` a dryrun when ran from `zdb`. This is done by creating a global flag in zfs and then setting it in `zdb`. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andy Stormont <astormont@racktopsystems.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Pavel Zakharov <pavel.zakharov@delphix.com>	2018-02-22 00:09:15 +00:00
Alexander Motin	843b26d023	8961 SPA load/import should tell us why it failed illumos/illumos-gate@3ee8c80c74 When we fail to open or import a storage pool, we typically don't get any additional diagnostic information, just "no pool found" or "can not import". While there may be no additional user-consumable information, we should at least make this situation easier to debug/diagnose for developers and support. For example, we could start by using `zfs_dbgmsg()` to log each thing that we try when importing, and which things failed. E.g. "tried uberblock of txg X from label Y of device Z". Also, we could log each of the stages that we go through in `spa_load_impl()`. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Pavel Zakharov <pavel.zakharov@delphix.com>	2018-02-21 23:42:30 +00:00
Alexander Motin	223661a5b7	7638 Refactor spa_load_impl into several functions illumos/illumos-gate@1fd3785ff6 spa_load_impl has grown out of proportions. It is currently over 700 lines long and makes it very hard to follow or debug the import process even for experienced ZFS developers. The objective is to split it up in a series of well commented functions. Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Pavel Zakharov <pavel.zakharov@delphix.com>	2018-02-21 23:25:11 +00:00
Alexander Motin	ffaf1cfabc	9018 Replace kmem_cache_reap_now() with kmem_cache_reap_soon() illumos/illumos-gate@36a64e6284 To prevent kmem_cache reaping from blocking other system resources, turn kmem_cache_reap_now() (which blocks) into kmem_cache_reap_soon(). Callers to kmem_cache_reap_soon() should use kmem_cache_reap_active(), which exploits #9017's new taskq_empty(). Reviewed by: Bryan Cantrill <bryan@joyent.com> Reviewed by: Dan McDonald <danmcd@joyent.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Yuri Pankov <yuripv@yuripv.net> Author: Tim Kordas <tim.kordas@joyent.com>	2018-02-21 22:14:19 +00:00
Alexander Motin	81ef5e369c	8809 libzpool should leverage work done in libfakekernel illumos/illumos-gate@f06dce2c1f Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Gordon Ross <gordon.w.ross@gmail.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Andrew Stormont <astormont@racktopsystems.com>	2018-02-21 21:04:46 +00:00
Alexander Motin	ed2ac05a27	8969 Cannot boot from RAIDZ with parity > 1 illumos/illumos-gate@0fb055e81f At present it is possible to boot from a root pool that is on RAIDZ but not one that is on RAIDZ2 or RAIDZ3. This is because, at the time the pool version is checked to ensure support for dual/triple parity, the uberblock has not yet been loaded into the SPA and therefore the code determines that the pool version is too old and returns ENOTSUP. Reviewed by: Igor Kozhukhov <igor@dilos.org> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Andy Stormont <astormont@racktopsystems.com> Reviewed by: Toomas Soome <tsoome@me.com> Approved by: Gordon Ross <gwr@nexenta.com> Author: Andy Fiddaman <omnios@citrus-it.co.uk>	2018-02-21 18:09:07 +00:00
Andriy Gapon	86b66fca8b	8520 7198 lzc_rollback_to should support rolling back to origin illumos/illumos-gate@95643f75d2 `95643f75d2` https://www.illumos.org/issues/8520 lzc_rollback_to() should support rolling back to a clone's origin. The current checks in zfs_ioc_rollback() would not allow that because the origin snapshot belongs to a different filesystem. The overly restrictive check was introduced in 7600, but it was not a regression as none of the existing tools provided a way to rollback to the origin. https://www.illumos.org/issues/7198 EINVAL is returned when a dataset does not have any snapshots, so there is nothing to roll back to. Although the code in zfs_do_rollback checks for that condition in advance, it's still possible that the snapshot(s) gets removed after the check and before the rollback sync task is executed. At the moment zfs command would crash when that happens. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Andriy Gapon <avg@FreeBSD.org>	2018-02-21 15:10:33 +00:00
Andriy Gapon	02a28ad842	8997 ztest assertion failure in zil_lwb_write_issue illumos/illumos-gate@f864f99efe `f864f99efe` https://www.illumos.org/issues/8997 When dmu_tx_assign is called from zil_lwb_write_issue, it's possible for either ERESTART or EIO to be returned. If ERESTART is returned, this will cause an assertion to fail directly in zil_lwb_write_issue, where the code assumes the return value is EIO if dmu_tx_assign returns a non-zero value. This can occur if the SPA is suspended when dmu_tx_assign is called, and most often occurs when running zloop. If EIO is returned, this can cause assertions to fail elsewhere in the ZIL code. For example, zil_commit_waiter_timeout contains the following logic: lwb_t *nlwb = zil_lwb_write_issue(zilog, lwb); ASSERT3S(lwb->lwb_state, !=, LWB_STATE_OPENED); In this case, if dmu_tx_assign returned EIO from within zil_lwb_write_issue, the lwb variable passed in will not be issued to disk. Thus, it's lwb_state field will remain LWB_STATE_OPENED and this assertion will fail. zil_commit_waiter_timeout assumes that after it calls zil_lwb_write_issue, the lwb will be issued to disk, and doesn't handle the case where this is not true; i.e. it doesn't handle the case where dmu_tx_assign returns EIO. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Approved by: Robert Mustacchi <rm@joyent.com> Author: Prakash Surya <prakash.surya@delphix.com>	2018-02-21 14:33:00 +00:00
Andriy Gapon	d03529fcb8	8731 ASSERT3U(nui64s, <=, UINT16_MAX) fails for large blocks illumos/illumos-gate@a6c1eb3c08 `a6c1eb3c08` https://www.illumos.org/issues/8731 annotate_ecksum() asserts that nui64s, calculated as nui64s = size / sizeof (uint64_t), is not greater than UINT16_MAX. This restriction is needed because histograms of incorrectly set and cleared bits have 16 bit counters and if the buffer consists of too many 64-bit words, then a counter can potentially overflow producing an incorrect result. When the largest buffer size was 128KB the greatest value of nui64s was 16K, well within the limit. But now we have support for large buffers and for buffer sizes of 512KB and above the restriction is violated. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Andriy Gapon <avg@FreeBSD.org>	2018-02-21 14:30:34 +00:00
Andriy Gapon	771a769243	8966 Source file zfs_acl.c, function zfs_aclset_common contains a use after end of the lifetime of a local variable illumos/illumos-gate@82693e09cc `82693e09cc` https://www.illumos.org/issues/8966 Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Approved by: Richard Lowe <richlowe@richlowe.net> Author: WHR <msl0000023508@gmail.com>	2018-02-21 14:12:29 +00:00
Alexander Motin	79a23a6944	7614 zfs device evacuation/removal illumos/illumos-gate@5cabbc6b49 https://www.illumos.org/issues/7614: This project allows top-level vdevs to be removed from the storage pool with “zpool remove”, reducing the total amount of storage in the pool. This operation copies all allocated regions of the device to be removed onto other devices, recording the mapping from old to new location. After the removal is complete, read and free operations to the removed (now “indirect”) vdev must be remapped and performed at the new location on disk. The indirect mapping table is kept in memory whenever the pool is loaded, so there is minimal performance overhead when doing operations on the indirect vdev. The size of the in-memory mapping table will be reduced when its entries become “obsolete” because they are no longer used by any block pointers in the pool. An entry becomes obsolete when all the blocks that use it are freed. An entry can also become obsolete when all the snapshots that reference it are deleted, and the block pointers that reference it have been “remapped” in all filesystems/zvols (and clones). Whenever an indirect block is written, all the block pointers in it will be “remapped” to their new (concrete) locations if possible. This process can be accelerated by using the “zfs remap” command to proactively rewrite all indirect blocks that reference indirect (removed) vdevs. Note that when a device is removed, we do not verify the checksum of the data that is copied. This makes the process much faster, but if it were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be possible to copy the wrong data, when we have the correct data on e.g. the other side of the mirror. Therefore, mirror and raidz devices can not be removed. Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Richard Laager <rlaager@wiktel.com> Reviewed by: Tim Chase <tim@chase2k.com> Approved by: Garrett D'Amore <garrett@damore.org> Author: Prashanth Sreenivasa <pks@delphix.com>	2018-02-18 01:21:52 +00:00
Andriy Gapon	6fd4145b31	8857 zio_remove_child() panic due to already destroyed parent zio illumos/illumos-gate@d6e1c446d7 `d6e1c446d7` https://www.illumos.org/issues/8857 I had an OS panic on one of our servers: ffffff01809128c0 vpanic() ffffff01809128e0 mutex_panic+0x58(fffffffffb94c904, ffffff597dde7f80) ffffff0180912950 mutex_vector_enter+0x347(ffffff597dde7f80) ffffff01809129b0 zio_remove_child+0x50(ffffff597dde7c58, ffffff32bd901ac0, ffffff3373370908) ffffff0180912a40 zio_done+0x390(ffffff32bd901ac0) ffffff0180912a70 zio_execute+0x78(ffffff32bd901ac0) ffffff0180912b30 taskq_thread+0x2d0(ffffff33bae44140) ffffff0180912b40 thread_start+8() It panicked here: http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/ zio.c#430 pio->io_lock is DEAD, thus a panic. Further analysis shows the "pio" (parent zio of "cio") has already been destroyed. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Youzhong Yang <youzhong@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: George Wilson <george.wilson@delphix.com>	2018-02-15 14:34:18 +00:00
Alexander Motin	2d224f6116	8835 Speculative prefetch in ZFS not working for misaligned reads illumos/illumos-gate@5cb8d943bc https://www.illumos.org/issues/8835: Sequential reads not aligned to block size are not detected by ZFS prefetcher as sequential, killing prefetch and severely hurting performance. It is caused by dmu_zfetch() in case of misaligned sequential accesses being called with overlap of one block. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Allan Jude <allanjude@freebsd.org> Approved by: Gordon Ross <gwr@nexenta.com> Author: Alexander Motin <mav@FreeBSD.org>	2018-01-22 05:55:43 +00:00
Alexander Motin	1e2bad5ea0	8652 Tautological comparisons with ZPROP_INVAL illumos/illumos-gate@4ae5f5f06c https://www.illumos.org/issues/8652: Clang and GCC prefer to use unsigned ints to store enums. With Clang, that causes tautological comparison warnings when comparing a zfs_prop_t or zpool_prop_t variable to the macro ZPROP_INVAL. It's likely that error handling code is being silently removed as a result. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Igor Kozhukhov <igor@dilos.org> Approved by: Gordon Ross <gwr@nexenta.com> Author: Alan Somers <asomers@gmail.com>	2018-01-22 04:48:14 +00:00
Alexander Motin	ee700ae0c6	8959 Add notifications when a scrub is paused or resumed illumos/illumos-gate@301fd1d6f2 Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Gordon Ross <gwr@nexenta.com> Author: Sean Eric Fagan <sef@ixsystems.com>	2018-01-22 04:27:05 +00:00
Alexander Motin	1536455b31	8856 arc_cksum_is_equal() doesn't take into account ABD-logic illumos/illumos-gate@01a059ee0c https://www.illumos.org/issues/8856: arc_cksum_is_equal() calls zio_push_transform() that requires abd_t* (second arg), but a void* is passed. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Gordon Ross <gwr@nexenta.com> Author: Roman Strashkin <roman.strashkin@nexenta.com>	2018-01-22 04:21:55 +00:00
Alexander Motin	dbee84cc27	8930 zfs_zinactive: do not remove the node if the filesystem is readonly illumos/illumos-gate@93c618e0f4 https://www.illumos.org/issues/8930: We normally remove an unlinked node when its last user goes away and the node becomes inactive. However, we should not do that if the filesystem is mounted read-only including the case where it has its readonly property set. The node will remain on the unlinked queue, so it will not be leaked. One particular scenario is when we receive an incremental stream into a mounted read-only filesystem and that stream contains an unlinked file (still on the unlinked queue). If that file is opened before the receive and some time later after the receive it becomes inactive we would remove it and, thus, modify the read-only filesystem. As a result, the filesystem would diverge from its source and further incremental receives would not be possible (without forcing a rollback). Another related scenario, that may or may not be possible depending on an OS / VFS policy, is when an open file is unlinked, then the filesystem is remounted read-only, and then the file is closed. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Gordon Ross <gwr@nexenta.com> Author: Andriy Gapon <avg@FreeBSD.org>	2018-01-21 23:42:45 +00:00
Alexander Motin	ebd7699264	8909 8585 can cause a use-after-free kernel panic illumos/illumos-gate@94ddd0900a https://www.illumos.org/issues/8909: There's a race condition that exists if `zil_free_lwb` races with either `zil_commit_waiter_timeout` and/or `zil_lwb_flush_vdevs_done`. Here's an example panic due to this bug: > ::status debugging crash dump vmcore.0 (64-bit) from ip-10-110-205-40 operating system: 5.11 dlpx-5.2.2.0_2017-12-04-17-28-32b6ba51fb (i86pc) image uuid: 4af0edfb-e58e-6ed8-cafc-d3e9167c7513 panic message: BAD TRAP: type=e (#pf Page fault) rp=ffffff0010555970 addr=60 occurred in module "zfs" due to a NULL pointer dereference dump content: kernel pages only > $c zio_shrink+0x12() zil_lwb_write_issue+0x30d(ffffff03dcd15cc0, ffffff03e0730e20) zil_commit_waiter_timeout+0xa2(ffffff03dcd15cc0, ffffff03d97ffcf8) zil_commit_waiter+0xf3(ffffff03dcd15cc0, ffffff03d97ffcf8) zil_commit+0x80(ffffff03dcd15cc0, 9a9) zfs_write+0xc34(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0) fop_write+0x5b(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0) write+0x250(42, fffffd7ff4832000, 2000) sys_syscall+0x177() If there's an outstanding lwb that's in `zil_commit_waiter_timeout` waiting to timeout, waiting on it's waiter's CV, we must be sure not to call `zil_free_lwb`. If we end up calling `zil_free_lwb`, then that LWB may be freed and can result in a use-after-free situation where the stale lwb pointer stored in the `zil_commit_waiter_t` structure of the thread waiting on the waiter's CV is used. A similar situation can occur if an lwb is issued to disk, and thus in the `LWB_STATE_ISSUED` state, and `zil_free_lwb` is called while the disk is servicing that lwb. In this situation, the lwb will be freed by `zil_free_lwb`, which will result in a use-after-free situation when the lwb's zio completes, and `zil_lwb_flush_vdevs_done` is called. This race condition is prevented in `zil_close` by calling `zil_commit` before `zil_free_lwb` is called, which will ensure all outstanding (i.e. all lwb's in the `LWB_STATE_OPEN` and/or `LWB_STATE_ISSUED` states) reach the `LWB_STATE_DONE` state before the lwb's are freed (`zil_commit` will not return untill all the lwb's are `LWB_STATE_DONE`). Further, this race condition is prevented in `zil_sync` by only calling `zil_free_lwb` for lwb's that do not have their `lwb_buf` pointer set. All lwb's not in the `LWB_STATE_DONE` state will have a non-null value for this pointer; the pointer is only cleared in `zil_lwb_flush_vdevs_done`, at which point the lwb's state will be changed to `LWB_STATE_DONE`. This race is present in `zil_suspend`, leading to this bug. At first glance, it would appear as though this would not be true because `zil_suspend` will call `zil_commit`, just like `zil_close`, but the problem is that `zil_suspend` will set the zilog's `zl_suspend` field prior to calling `zil_commit`. Further, in `zil_commit`, if `zl_suspend` is set, `zil_commit` will take a special branch of logic and use `txg_wait_synced` instead of performing the normal `zil_commit` logic. This call to `txg_wait_synced` might be good enough for the data to reach disk safely before it returns, but it does not ensure that all outstanding lwb's reach the `LWB_STATE_DONE` state before it returns. This is because, if there's an lwb "stuck" in `zil_commit_waiter_timeout`, waiting for it's lwb to timeout, it will maintain a non-null value for it's `lwb_buf` field and thus `zil_sync` will not free that lwb. Thus, even though the lwb's data is already on disk, the lwb will be left lingering, waiting on the CV, and will eventually timeout and be issued to disk even though the write is unnesseary. So, after `zil_commit` is called from `zil_suspend`, we incorrectly assume that there are not outstanding lwb's, and proceed to free all lwb's found on the zilog's lwb list. As a result, we free the lwb that will later be used `zil_commit_waiter_timeout`. Reviewed by: John Kennedy <jwk404@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Igor Kozhukhov <igor@dilos.org> Approved by: Robert Mustacchi <rm@joyent.com> Author: Prakash Surya <prakash.surya@delphix.com>	2018-01-21 23:12:38 +00:00
Alexander Motin	d216de62d1	8603 rename zilog's "zl_writer_lock" to "zl_issuer_lock" illumos/illumos-gate@cf07d3da99 https://www.illumos.org/issues/8603: To help make the ZIL's code more understandable, it was suggested that the zilog_t's "zl_writer_lock" field should be renamed to "zl_issuer_lock".	2018-01-21 23:04:25 +00:00
Alexander Motin	619fd3c317	8677 Open-Context Channel Programs illumos/illumos-gate@a3b2868063 https://www.illumos.org/issues/8677 We want to be able to run channel programs outside of synching context. This would greatly improve performance of channel program that just gather information, as we won't have to wait for synching context anymore. This feature should introduce the following: - A new command line flag in "zfs program" to specify our intention to run in open context. - A new flag/option within the channel program ioctl which selects the context. - Appropriate error handling whenever we try a channel program in open-context that contains zfs.sync* expressions. - Documentation for the new feature in the manual pages. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Chris Williamson <chris.williamson@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Serapheim Dimitropoulos <serapheim@delphix.com>	2018-01-21 19:26:38 +00:00
Mark Johnston	6ff83134aa	8880 improve DTrace error checking illumos/illumos-gate@2cf374268f `2cf374268f` https://www.illumos.org/issues/8880 Reviewed by: Tim Kordas <tim.kordas@joyent.com> Reviewed by: Bryan Cantrill <bryan@joyent.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Approved by: Dan McDonald <danmcd@joyent.com> Author: Jerry Jelinek <jerry.jelinek@joyent.com>	2017-12-12 00:51:39 +00:00

1 2 3 4 5 ...

497 Commits