freebsd-skq

Author	SHA1	Message	Date
Alexander Motin	903252254b	9523 Large alloc in zdb can cause trouble 16MB alloc in zdb_embedded_block() can cause cores in certain situations (clang, gcc55). OsX commit: `ced236a5da` FreeBSD commit: https://svnweb.freebsd.org/base?view=revision&revision=326150 illumos/illumos-gate@03a4c2f4bf Reviewed by: Igor Kozhukhov <igor@dilos.org> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Jorgen Lundman <lundman@lundman.net>	2018-08-02 20:37:53 +00:00
Alexander Motin	abb5258c34	9487 Free objects when receiving full stream as clone All objects after the last written or freed object are not supposed to exist after receiving the stream. We should free them accordingly, as if a freeobjects record for them had been included in the stream. zfsonlinux/zfs@48fbb9ddbf illumos/illumos-gate@7864b8192b Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Paul Dagnelie <pcd@delphix.com>	2018-08-02 20:29:58 +00:00
Alexander Motin	fdbc73edd4	9464 txg_kick() fails to see that we are quiescing, forcing transactions to their next stages without leaving them accumulate changes Ideally we would like txg_kick() to get triggered only when we are sure that we are not syncing AND not quiescing any txg. This way we can kick an open TXG to the quiescing state when we are sure that there is nothing going on and we would benefit from the different states running concurrently. illumos/illumos-gate@fa41d87de9 Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Approved by: Dan McDonald <danmcd@joyent.com> Author: Serapheim Dimitropoulos <serapheim@delphix.com>	2018-08-02 20:17:37 +00:00
Alexander Motin	58d5696e1c	9442 decrease indirect block size of spacemaps Updates to indirect blocks of spacemaps can contribute significantly to write inflation. Therefore we want to reduce the indirect block size of spacemaps from 128K to 16K. illumos/illumos-gate@221813c13b Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Albert Lee <trisk@forkgnu.org> Reviewed by: Igor Kozhukhov <igor@dilos.org> Approved by: Dan McDonald <danmcd@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com>	2018-08-02 19:37:13 +00:00
Alexander Motin	cf8fc527c9	9512 zfs remap poolname@snapname coredumps Only filesystems and volumes are valid "zfs remap" parameters: when passed a snapshot name zfs_remap_indirects() does not handle the EINVAL returned from libzfs_core, which results in failing an assertion and consequently crashing. illumos/illumos-gate@0b2e825398 Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: John Wren Kennedy <john.kennedy@delphix.com> Reviewed by: Sara Hartse <sara.hartse@delphix.com> Approved by: Matt Ahrens <mahrens@delphix.com> Author: loli10K <ezomori.nozomu@gmail.com>	2018-08-02 19:09:13 +00:00
Alexander Motin	2395a7f52d	8115 parallel zfs mount Mounting of filesystems in "filesystem/local" is done using `zfs mount -a`, which mounts each filesystems serially. The bottleneck for each mount is the I/O done to load metadata for each filesystem. As such, mounting filesystems using a parallel algorithm should be a big win, and bring down the runtime of "filesystem/local"'s start method. illumos/illumos-gate@591e0e133f Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Approved by: Matt Ahrens <mahrens@delphix.com> Author: Sebastien Roy <seb@delphix.com>	2018-08-01 18:28:17 +00:00
Alexander Motin	2a44a23db2	9426 metaslab size can exceed offset addressable by spacemap metaslab size can exceed offset addressable by spacemap. The vdev can address up to 2^63 * SPA_MAXBLOCKSIZE (512). A metaslab can address up to 2^47 * 2^vdev_ashift. Therefore we may need to increase the number of metaslabs so that the maximum metaslab size is capped at the amount that can be addressed by the spacemap. This should happen in vdev_metaslab_set_size(). illumos/illumos-gate@b4bf0cf045 Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Don Brady <don.brady@delphix.com>	2018-08-01 03:19:30 +00:00
Alexander Motin	03a8ca5bd7	9328 zap code can take advantage of c99 9329 panic in zap_leaf_lookup() due to concurrent zapification illumos/illumos-gate@bf26014c55 Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com>	2018-08-01 03:03:15 +00:00
Alexander Motin	82104b77fb	9403 assertion failed in arc_buf_destroy() when concurrently reading block with checksum error This assertion (VERIFY) failure was reported when reading a block. Turns out the problem is that if we get an i/o error (ECKSUM in this case), and there are multiple concurrent ARC reads of the same block (from different clones), then the ARC will put multiple buf's on the same ANON hdr, which isn't supposed to happen, and then causes a panic when we try to arc_buf_destroy() the buf. illumos/illumos-gate@fa98e487a9 Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Matt Ahrens <mahrens@delphix.com> Author: Matthew Ahrens <mahrens@delphix.com>	2018-07-31 23:44:13 +00:00
Alexander Motin	b7b330afa0	9443 panic when scrub a v10 pool illumos/illumos-gate@bb1f424574 Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Igor Kozhukhov <igor@dilos.org> Approved by: Dan McDonald <danmcd@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com>	2018-07-31 22:59:48 +00:00
Alexander Motin	5c1f905282	9237 "zpool add" fails for very large pools illumos/illumos-gate@ae5ee1bd5e Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Approved by: Matt Ahrens <mahrens@delphix.com> Author: Alan Somers <asomers@gmail.com>	2018-07-31 21:42:31 +00:00
Alexander Motin	b3e8109985	9421 zdb should detect and print out the number of "leaked" objects 9422 zfs diff and zdb should explicitly mark objects that are on the deleted queue illumos/illumos-gate@20b5dafb42 Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Matt Ahrens <mahrens@delphix.com> Author: Paul Dagnelie <pcd@delphix.com>	2018-07-31 21:31:24 +00:00
Alexander Motin	2dcf06914e	9194 mechanism to override ashift at pool creation time illumos/illumos-gate@93a1902e51 Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Allan Jude <allanjude@freebsd.org> Reviewed by: Igor Kozhukhov <igor@dilos.org> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com>	2018-07-31 21:26:03 +00:00
Alexander Motin	4f73289938	9434 Speculative prefetch is blocked by device removal code illumos/illumos-gate@47b8d4b884 Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Approved by: Matt Ahrens <mahrens@delphix.com> Author: Alexander Motin <mav@FreeBSD.org>	2018-07-31 19:30:08 +00:00
Alexander Motin	5343ce649b	Add new files missed in r336991.	2018-07-31 19:23:27 +00:00
Alexander Motin	59ef839e67	9102 zfs should be able to initialize storage devices The first access to a disk block can incur a performance penalty on some platforms (e.g. AWS's EBS, VMware VMDKs). Therefore it is recommended that volumes be "thick provisioned", where supported by the platform (VMware). Thick provisioning is time consuming and often is ignored. If the thick provision step is omitted, customers will see suboptimal performance until we have written to all parts of the LUN. ZFS should be able to initialize any unused storage to remove any first-write penalty that exists. illumos/illumos-gate@094e47e980 Reviewed by: John Wren Kennedy <john.kennedy@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net>	2018-07-31 18:49:07 +00:00
Alexander Motin	86b3990c6e	9256 zfs send space estimation off by > 10% on some datasets illumos/illummos-gate@df477c0afa Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Paul Dagnelie <pcd@delphix.com>	2018-07-31 01:00:46 +00:00
Alexander Motin	150ac7bcc5	9337 zfs get all is slow due to uncached metadata This project's goal is to make read-heavy channel programs and zfs(1m) administrative commands faster by caching all the metadata that they will need in the dbuf layer. This will prevent the data from being evicted, so that any future call to i.e. zfs get all won't have to go to disk (very much). illumos/illumos-gate@adb52d9262 Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Thomas Caputi <tcaputi@datto.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Matthew Ahrens <mahrens@delphix.com>	2018-07-31 00:56:41 +00:00
Alexander Motin	4e77060898	9236 nuke spa_dbgmsg We should use zfs_dbgmsg instead of spa_dbgmsg. Or at least, metaslab_condense() should call zfs_dbgmsg because it's important and rare enough to always log. It's possible that the message in zio_dva_allocate() would be too high-frequency for zfs_dbgmsg. illumos/illumos-gate@21f7c81cc1 Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Matthew Ahrens <mahrens@delphix.com>	2018-07-31 00:42:31 +00:00
Alexander Motin	3741bf4def	9192 explicitly pass good_writes to vdev_uberblock/label_sync Currently vdev_label_sync and vdev_uberblock_sync take a zio_t and assume that its io_private is a pointer to the good_writes count. They should instead accept this argument explicitly. illumos/illumos-gate@a3b5583021 Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Matthew Ahrens <mahrens@delphix.com>	2018-07-31 00:34:39 +00:00
Alexander Motin	d7479e295b	9290 device removal reduces redundancy of mirrors Mirrors are supposed to provide redundancy in the face of whole-disk failure and silent damage (e.g. some data on disk is not right, but ZFS hasn't detected the whole device as being broken). However, the current device removal implementation bypasses some of the mirror's redundancy. illumos/illumos-gate@3a4b1be953 Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Reviewed by: Sara Hartse <sara.hartse@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Tim Chase <tim@chase2k.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Matthew Ahrens <mahrens@delphix.com>	2018-07-31 00:13:04 +00:00
Alexander Motin	df0ce4e5a8	9112 Improve allocation performance on high-end systems On high-end systems running async sequential write workloads, especially NUMA systems with flash or NVMe storage, one significant performance bottleneck is selecting a metaslab to do allocations from. This process can be parallelized, providing significant performance increases for these workloads. illumos/illumos-gate@f78cdc34af Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Alexander Motin <mav@FreeBSD.org> Approved by: Gordon Ross <gwr@nexenta.com> Author: Paul Dagnelie <pcd@delphix.com>	2018-07-30 23:53:25 +00:00
Alexander Motin	af8040b47a	9238 ZFS Spacemap Encoding V2 The current space map encoding has the following disadvantages: [1] Assuming 512 sector size each entry can represent at most 16MB for a segment. This makes the encoding very inefficient for large regions of space. [2] As vdev-wide space maps have started to be used by new features (i.e. device removal, zpool checkpoint) we've started imposing limits in the vdevs that can be used with them based on the maximum addressable offset (currently 64PB for a top-level vdev). The new remains backwards compatible with the old one. The introduced two-word entry format, besides extending the limits imposed by the single-entry layout, also includes a vdev field and some extra padding after its prefix. The extra padding after the prefix should is reserved for future usage (e.g. new prefixes for future encodings or new fields for flags). The new vdev field not only makes the space maps more self-descriptive, but also opens the doors for pool-wide space maps. One final important note is that the number of bits used for vdevs is reduced to 24 bits for blkptrs. That was decided as we don't know of any setups that use more than 16M vdevs for the time being and we wanted to fit the vdev field in the space map. In addition that gives us some extra bits in dva_t. illumos/illumos-gate@17f11284b4 Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <gwilson@zfsmail.com> Approved by: Gordon Ross <gwr@nexenta.com> Author: Serapheim Dimitropoulos <serapheim@delphix.com>	2018-07-30 22:56:24 +00:00
Alexander Motin	6f4662e465	9286 want refreservation=auto When a ZFS volume is created with zfs create -V (but without -s), the refreservation property is set to a value that is volsize plus the maximum size of metadata. If refreservation is ever set to another value, it is impossible to set it back to the automatically determined value. There are other cases where refreservation may be wrong. These include receiving a volume that was sent without properties and zfs clone. We need: zfs set refreservation=auto <volume> zfs clone -o refreservation=auto <volume> Each one would use the same function used by zfs create -V to determine the proper value for refreservation. illumos/illumos-gate@1c10ae76c0 Reviewed by: Allan Jude <allanjude@freebsd.org> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Andy Stormont <astormont@racktopsystems.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Mike Gerdts <mike.gerdts@joyent.com>	2018-07-30 22:10:15 +00:00
Alexander Motin	05cdb6cd3b	9189 Add debug to vdev_label_read_config when txg check fails illumos/illumos-gate@b6bf6e1540 Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Approved by: Matt Ahrens <mahrens@delphix.com> Author: Pavel Zakharov <pavel.zakharov@delphix.com>	2018-07-30 21:59:44 +00:00
Alexander Motin	ec66f8a6e3	9284 arc_reclaim_thread has 2 jobs `arc_reclaim_thread()` calls `arc_adjust()` after calling `arc_kmem_reap_now()`; `arc_adjust()` signals `arc_get_data_buf()` to indicate that we may no longer be `arc_is_overflowing()`. The problem is, `arc_kmem_reap_now()` can take several seconds to complete, has no impact on `arc_is_overflowing()`, but due to how the code is structured, can impact how long the ARC will remain in the `arc_is_overflowing()` state. The fix is to use seperate threads to: 1. keep `arc_size` under `arc_c`, by calling `arc_adjust()`, which improves `arc_is_overflowing()` 2. keep enough free memory in the system, by calling `arc_kmem_reap_now()` plus `arc_shrink()`, which improves `arc_available_memory()`. illumos/illumos-gate@de753e34f9 Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Dan McDonald <danmcd@joyent.com> Reviewed by: Tim Kordas <tim.kordas@joyent.com> Approved by: Garrett D'Amore <garrett@damore.org> Author: Brad Lewis <brad.lewis@delphix.com>	2018-07-30 19:44:14 +00:00
Alexander Motin	a622bac50f	9280 Assertion failure while running removal_with_ganging test with 4K devices illumos/illumos-gate@243952c7ee Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> Author: Matt Ahrens <Matt.Ahrens@delphix.com>	2018-03-28 23:12:03 +00:00
Alexander Motin	a1773e8b7a	9188 increase size of dbuf cache to reduce indirect block decompression illumos/illumos-gate@268bbb2a2f With compressed ARC (6950) we use up to 25% of our CPU to decompress indirect blocks, under a workload of random cached reads. To reduce this decompression cost, we would like to increase the size of the dbuf cache so that more indirect blocks can be stored uncompressed. If we are caching entire large files of recordsize=8K, the indirect blocks use 1/64th as much memory as the data blocks (assuming they have the same compression ratio). We suggest making the dbuf cache be 1/32nd of all memory, so that in this scenario we should be able to keep all the indirect blocks decompressed in the dbuf cache. (We want it to be more than the 1/64th that the indirect blocks would use because we need to cache other stuff in the dbuf cache as well.) In real world workloads, this won't help as dramatically as the example above, but we think it's still worth it because the risk of decreasing performance is low. The potential negative performance impact is that we will be slightly reducing the size of the ARC (by ~3%). Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com> Reviewed by: Allan Jude <allanjude@freebsd.org> Reviewed by: Igor Kozhukhov <igor@dilos.org> Approved by: Garrett D'Amore <garrett@damore.org> Author: George Wilson <george.wilson@delphix.com>	2018-03-28 22:57:02 +00:00
Alexander Motin	bc1a4f2e3c	9321 arc_loan_compressed_buf() can increment arc_loaned_bytes by the wrong value illumos/illumos-gate@9be12bd737 arc_loan_compressed_buf() increments arc_loaned_bytes by psize unconditionally In the case of zfs_compressed_arc_enabled=0, when the buf is returned via arc_return_buf(), if ARC_BUF_COMPRESSED(buf) is false, then arc_loaned_bytes is decremented by lsize, not psize. Switch to using arc_buf_size(buf), instead of psize, which will return psize or lsize, depending on the result of ARC_BUF_COMPRESSED(buf). Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> Author: Allan Jude <allanjude@freebsd.org>	2018-03-28 22:43:55 +00:00
Alexander Motin	d32e3273d2	9235 rename zpool_rewind_policy_t to zpool_load_policy_t illumos/illumos-gate@5dafeea3eb We want to be able to pass various settings during import/open of a pool, which are not only related to rewind. Instead of adding a new policy and duplicate a bunch of code, we should just rename rewind_policy to a more generic term like load_policy. For instance, we'd like to set spa->spa_import_flags from the nvlist, rather from a flags parameter passed to spa_import as in some cases we want those flags not only for the import case, but also for the open case. One such flag could be ZFS_IMPORT_MISSING_LOG (as used in zdb) which would allow zfs to open a pool when logs are missing. Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Pavel Zakharov <pavel.zakharov@delphix.com>	2018-03-28 22:16:51 +00:00
Alexander Motin	f5b017d914	9191 dump vdev tree to zfs_dbgmsg when spa load fails due to missing log devices illumos/illumos-gate@ccef24b493 Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Pavel Zakharov <pavel.zakharov@delphix.com>	2018-03-28 22:08:57 +00:00
Alexander Motin	64f15aa65b	9187 racing condition between vdev label and spa_last_synced_txg in vdev_validate illumos/illumos-gate@d1de72cfa2 ztest failed with uncorrectable IO error despite having the fix for #7163. Both sides of the mirror have CANT_OPEN_BAD_LABEL, which also distinguishes it from that issue. Definitely seems like a racing condition between the vdev_validate and spa_sync: 1. Thread A (spa_sync): vdev label is updated to latest txg 2. Thread B (vdev_validate): vdev label's txg is compared to spa_last_synced_txg and is ahead. 3. Thread A (spa_sync): spa_last_synced_txg is updated to latest txg. Solution: do not check txg in vdev_validate unless config lock is held. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Pavel Zakharov <pavel.zakharov@delphix.com>	2018-03-28 22:06:12 +00:00
Alexander Motin	bce155015b	Add files missed from r331695.	2018-03-28 21:00:34 +00:00
Alexander Motin	b8a528e092	9166 zfs storage pool checkpoint illumos/illumos-gate@8671400134 The idea of Storage Pool Checkpoint (aka zpool checkpoint) deals with exactly that. It can be thought of as a “pool-wide snapshot” (or a variation of extreme rewind that doesn’t corrupt your data). It remembers the entire state of the pool at the point that it was taken and the user can revert back to it later or discard it. Its generic use case is an administrator that is about to perform a set of destructive actions to ZFS as part of a critical procedure. She takes a checkpoint of the pool before performing the actions, then rewinds back to it if one of them fails or puts the pool into an unexpected state. Otherwise, she discards it. With the assumption that no one else is making modifications to ZFS, she basically wraps all these actions into a “high-level transaction”. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>	2018-03-28 18:12:06 +00:00
Alexander Motin	6027cbd474	9213 zfs: sytem typo illumos/illumos-gate@edc8ef7d92 Reviewed by: C Fraire <cfraire@me.com> Reviewed by: Andy Fiddaman <omnios@citrus-it.co.uk> Approved by: Joshua M. Clulow <josh@sysmgr.org> Author: Toomas Soome <tsoome@me.com>	2018-03-23 02:28:37 +00:00
Alexander Motin	057fec9070	9084 spa_*_ashift must ignore spare devices illumos/illumos-gate@b037f3dbd6 Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Prakash Surya <prakash.surya@delphix.com>	2018-03-23 02:22:16 +00:00
Alexander Motin	ba31d3992e	8484 Implement aggregate sum and use for arc counters In pursuit of improving performance on multi-core systems, we should implements fanned out counters and use them to improve the performance of some of the arc statistics. These stats are updated extremely frequently, and can consume a significant amount of CPU time. Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Paul Dagnelie <pcd@delphix.com>	2018-03-23 00:20:42 +00:00
Andriy Gapon	695c9b6645	9164 assert: newds == os->os_dsl_dataset illumos/illumos-gate@5f5913bb83 `5f5913bb83` https://www.illumos.org/issues/9164 This issue has been reported by Alan Somers as https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225877 dmu_objset_refresh_ownership() first disowns a dataset (and releases it) and then owns it again. There is an assert that the new dataset object is the same as the old dataset object. When running ZFS Test Suite on FreeBSD we see this panic from zpool_upgrade_007_pos test: panic: solaris assert: newds == os->os_dsl_dataset (0xfffff80045f4c000 == 0xfffff80021ab4800) I see that the old dataset has dsl_dataset_evict_async() pending in ds_dbu.dbu_tqent and its ds_dbuf is NULL. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Don Brady <don.brady@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Andriy Gapon <avg@FreeBSD.org>	2018-03-15 08:46:49 +00:00
Andriy Gapon	dcf8a50d8f	8984 fix for 6764 breaks ACL inheritance illumos/illumos-gate@e9bacc6d1a `e9bacc6d1a` https://www.illumos.org/issues/8984 Consider a directory configured as: drwx-ws---+ 2 henson cpp 3 Jan 23 12:35 dropbox/ user:henson:rwxpdDaARWcC--:f-i----:allow owner@:--------------:f-i----:allow group@:--------------:f-i----:allow everyone@:--------------:f-i----:allow owner@:rwxpdDaARWcC--:-di----:allow group:cpp:-wx-----------:-------:allow owner@:rwxpdDaARWcC--:-------:allow A new file created in this directory ends up looking like: rw-r--r-+ 1 astudent cpp 0 Jan 23 12:39 testfile user:henson:rw-pdDaARWcC--:------I:allow owner@:--------------:------I:allow group@:--------------:------I:allow everyone@:--------------:------I:allow owner@:rw-p--aARWcCos:-------:allow group@:r-----a-R-c--s:-------:allow everyone@:r-----a-R-c--s:-------:allow with extraneous group@ and everyone@ entries allowing read access that shouldn't exist. Per Albert Lee on the zfs mailing list: "aclinherit=passthrough/passthrough-x should still ignore the requested mode when an inheritable ACE for owner@ group@, or everyone@ is present in the parent directory. It appears there was an oversight in my fix for https://www.illumos.org/issues/6764 which made calling zfs_acl_chmod from zfs_acl_inherit unconditional. I think the parent ACL check for aclinherit=passthrough needs to be reintroduced in zfs_acl_inherit." We have a large number of faculty who use dropbox directories like the example to have students submit projects. All of these directories are now allowing Reviewed by: Sam Zaydel <szaydel@racktopsystems.com> Reviewed by: Paul B. Henson <henson@acm.org> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Approved by: Matthew Ahrens <mahrens@delphix.com> Author: Dominik Hassler <hadfl@omniosce.org>	2018-03-07 13:47:01 +00:00
Alexander Motin	34ff7cee7a	8940 Sending an intra-pool resumable send stream may result in EXDEV illumos/illumos-gate@544132fce3 "zfs send -t <token>" for an incremental send should be able to resume successfully when sending to the same pool: a subtle issue in zfs_iter_children() doesn't currently allow this. Because resuming from a token requires "guid" -> "dataset" mapping (guid_to_name()), we have to walk the whole hierarchy to find the right snapshots to send. When resuming an incremental send both source and destination live in the same pool and have the same guid: this is where zfs_iter_children() gets confused and picks up the wrong snapshot, so we end up trying to send an incremental "destination@snap1 -> source@snap2" stream instead of "source@snap1 -> source@snap2": this fails with an "Invalid cross-device link" (EXDEV) error. Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org> Author: loli10K <ezomori.nozomu@gmail.com>	2018-02-22 04:01:05 +00:00
Alexander Motin	4195015764	9080 recursive enter of vdev_indirect_rwlock from vdev_indirect_remap() illumos/illumos-gate@bdfded42e6 A scenario came up where a callback executed by vdev_indirect_remap() on a vdev, calls vdev_indirect_remap() on the same vdev and tries to reacquire vdev_indirect_rwlock that was already acquired from the first call to vdev_indirect_remap(). The specific scenario, is that we want to remap a block pointer that is snapshoted but its dataset's remap_deadlist is not cached. So in order to add it we issue a read through a vdev_indirect_remap() on the same vdev, which brings up the aforementioned issue. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org> Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>	2018-02-22 03:52:03 +00:00
Alexander Motin	9a64996e42	Missed pieces of r329799.	2018-02-22 03:23:43 +00:00
Alexander Motin	e7245cfc4c	9079 race condition in starting and ending condesing thread for indirect vdevs illumos/illumos-gate@667ec66f1b The timeline of the race condition is the following: [1] Thread A is about to finish condesing the first vdev in spa_condense_indirect_thread(), so it calls the spa_condense_indirect_complete_sync() sync task which sets the spa_condensing_indirect field to NULL. Waiting for the sync task to finish, thread A sleeps until the txg is done. When this happens, thread A will acquire spa_async_lock and set spa_condense_thread to NULL. [2] While thread A waits for the txg to finish, thread B which is running spa_sync() checks whether it should condense the second vdev in vdev_indirect_should_condense() by checking the spa_condensing_indirect field which was set to NULL by spa_condense_indirect_thread() from thread A. So it goes on and tries to spawn a new condensing thread in spa_condense_indirect_start_sync() and the aforementioned assertions fails because thread A has not set spa_condense_thread to NULL (which is basically the last thing it does before returning). The main issue here is that we rely on both spa_condensing_indirect and spa_condense_thread to signify whether a condensing thread is running. Ideally we would only use one throughout the codebase. In addition, for managing spa_condense_thread we currently use spa_async_lock which basically tights condensing to scrubing when it comes to pausing and resuming those actions during spa export. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org> Author: Serapheim Dimitropoulos <serapheim@delphix.com>	2018-02-22 03:22:27 +00:00
Alexander Motin	250699c304	r329793 \| mav \| 2018-02-22 04:21:03 +0200 (чт, 22 февр. 2018) \| 58 lines 9075 Improve ZFS pool import/load process and corrupted pool recovery illumos/illumos-gate@6f7938128a Some work has been done lately to improve the debugability of the ZFS pool load (and import) process. This includes: https://www.illumos.org/issues/7638: Refactor spa_load_impl into several functions https://www.illumos.org/issues/8961: SPA load/import should tell us why it failed https://www.illumos.org/issues/7277: zdb should be able to print zfs_dbgmsg's To iterate on top of that, there's a few changes that were made to make the import process more resilient and crash free. One of the first tasks during the pool load process is to parse a config provided from userland that describes what devices the pool is composed of. A vdev tree is generated from that config, and then all the vdevs are opened. The Meta Object Set (MOS) of the pool is accessed, and several metadata objects that are necessary to load the pool are read. The exact configuration of the pool is also stored inside the MOS. Since the configuration provided from userland is external and might not accurately describe the vdev tree of the pool at the txg that is being loaded, it cannot be relied upon to safely operate the pool. For that reason, the configuration in the MOS is read early on. In the past, the two configurations were compared together and if there was a mismatch then the load process was aborted and an error was returned. The latter was a good way to ensure a pool does not get corrupted, however it made the pool load process needlessly fragile in cases where the vdev configuration changed or the userland configuration was outdated. Since the MOS is stored in 3 copies, the configuration provided by userland doesn't have to be perfect in order to read its contents. Hence, a new approach has been adopted: The pool is first opened with the untrusted userland configuration just so that the real configuration can be read from the MOS. The trusted MOS configuration is then used to generate a new vdev tree and the pool is re-opened. When the pool is opened with an untrusted configuration, writes are disabled to avoid accidentally damaging it. During reads, some sanity checks are performed on block pointers to see if each DVA points to a known vdev; when the configuration is untrusted, instead of panicking the system if those checks fail we simply avoid issuing reads to the invalid DVAs. This new two-step pool load process now allows rewinding pools accross vdev tree changes such as device replacement, addition, etc. Loading a pool from an external config file in a clustering environment also becomes much safer now since the pool will import even if the config is outdated and didn't, for instance, register a recent device addition. With this code in place, it became relatively easy to implement a long-sought-after feature: the ability to import a pool with missing top level (i.e. non-redundant) devices. Note that since this almost guarantees some loss Of data, this feature is for now restricted to a read-only import. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org> Author: Pavel Zakharov <pavel.zakharov@delphix.com>	2018-02-22 02:25:09 +00:00
Alexander Motin	2e4bc6ee5c	9075 Improve ZFS pool import/load process and corrupted pool recovery illumos/illumos-gate@6f7938128a Some work has been done lately to improve the debugability of the ZFS pool load (and import) process. This includes: https://www.illumos.org/issues/7638: Refactor spa_load_impl into several functions https://www.illumos.org/issues/8961: SPA load/import should tell us why it failed https://www.illumos.org/issues/7277: zdb should be able to print zfs_dbgmsg's To iterate on top of that, there's a few changes that were made to make the import process more resilient and crash free. One of the first tasks during the pool load process is to parse a config provided from userland that describes what devices the pool is composed of. A vdev tree is generated from that config, and then all the vdevs are opened. The Meta Object Set (MOS) of the pool is accessed, and several metadata objects that are necessary to load the pool are read. The exact configuration of the pool is also stored inside the MOS. Since the configuration provided from userland is external and might not accurately describe the vdev tree of the pool at the txg that is being loaded, it cannot be relied upon to safely operate the pool. For that reason, the configuration in the MOS is read early on. In the past, the two configurations were compared together and if there was a mismatch then the load process was aborted and an error was returned. The latter was a good way to ensure a pool does not get corrupted, however it made the pool load process needlessly fragile in cases where the vdev configuration changed or the userland configuration was outdated. Since the MOS is stored in 3 copies, the configuration provided by userland doesn't have to be perfect in order to read its contents. Hence, a new approach has been adopted: The pool is first opened with the untrusted userland configuration just so that the real configuration can be read from the MOS. The trusted MOS configuration is then used to generate a new vdev tree and the pool is re-opened. When the pool is opened with an untrusted configuration, writes are disabled to avoid accidentally damaging it. During reads, some sanity checks are performed on block pointers to see if each DVA points to a known vdev; when the configuration is untrusted, instead of panicking the system if those checks fail we simply avoid issuing reads to the invalid DVAs. This new two-step pool load process now allows rewinding pools accross vdev tree changes such as device replacement, addition, etc. Loading a pool from an external config file in a clustering environment also becomes much safer now since the pool will import even if the config is outdated and didn't, for instance, register a recent device addition. With this code in place, it became relatively easy to implement a long-sought-after feature: the ability to import a pool with missing top level (i.e. non-redundant) devices. Note that since this almost guarantees some loss Of data, this feature is for now restricted to a read-only import. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org> Author: Pavel Zakharov <pavel.zakharov@delphix.com>	2018-02-22 02:21:03 +00:00
Alexander Motin	dc763b80f4	8942 zfs promote .../%recv should be an error illumos/illumos-gate@add927f8c8 Reported on the ZFSonLinux https://github.com/zfsonlinux/zfs/issues/4843, fixed by https://github.com/zfsonlinux/zfs/pull/6339: If we are in the middle of an incremental zfs receive, the child .../%recv will exist. If you concurrently run zfs promote .../%recv, it will "work", but then zfs gets confused. For example, there's no obvious way to destroy the containing filesystem (because it is now a clone of its invisible child). Attempting to do this promote should be an error. We could fix this by having zfs_ioc_promote() check if zc_name contains a %, similar to zfs_ioc_rename(). Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: loli10K <ezomori.nozomu@gmail.com>	2018-02-22 01:30:03 +00:00
Alexander Motin	c59ff16a1e	8941 zpool add: assertion failed in get_replication() with nested interior VDEVs illumos/illumos-gate@ac0215f4d6 When replacing a faulted device which was previously handled by a spare multiple levels of nested interior VDEVs will be present in the pool configuration: get_replication() needs to handle this situation gracefully to let zpool add new devices to the pool Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: loli10K <ezomori.nozomu@gmail.com>	2018-02-22 01:17:32 +00:00
Alexander Motin	63be747cb1	8477 Assertion failed in vdev_state_dirty(): spa_writeable(spa) illumos/illumos-gate@f4c1745bd6 Illumos 4080 allows "zpool clear" to work on readonly pools: i don't think this is the intended behaviour, we shouldn't be allowed to clear readonly pools. Probably. A fix is already in the ZFS on Linux repository to addess this issue: `92e43c1718` Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: loli10K <ezomori.nozomu@gmail.com>	2018-02-22 00:59:49 +00:00
Alexander Motin	106c279da7	8408 dsl_props_set_sync_impl() does not handle nested nvlists correctly illumos/illumos-gate@85723e5eec When iterating over the input nvlist in dsl_props_set_sync_impl() when we don't preserve the nvpair name before looking up ZPROP_VALUE, so when we later go to process it nvpair_name() is always "value" instead of the actual property name. This results in a couple of bugs in the recv code: - received properties are not restored correctly when failing to receive an incremental send stream - received properties are not completely replaced by the new ones when successfully receiving an incremental send stream This was discovered on ZFS on Linux (fixed in `5f1346c299`) Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: loli10K <ezomori.nozomu@gmail.com>	2018-02-22 00:53:59 +00:00
Alexander Motin	f84197c93c	9036 zfs: duplicate 'const' declaration specifier illumos/illumos-gate@f02c28e434 Reviewed by: Yuri Pankov <yuripv@yuripv.net> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Toomas Soome <tsoome@me.com>	2018-02-22 00:48:59 +00:00

1 2 3 4 5 ...

764 Commits