Commit Graph

271 Commits

Author SHA1 Message Date
Paul Zuchowski
bc67cba7c0
Fix zdb -R with 'b' flag
zdb -R :b fails due to the indirect block being compressed,
and the 'b' and 'd' flag not working in tandem when specified.
Fix the flag parsing code and create a zfs test for zdb -R
block display.  Also fix the zio flags where the dotted notation
for the vdev portion of DVA (i.e. 0.0:offset:length) fails.

Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #9640
Closes #9729
2020-02-10 14:00:05 -08:00
Ryan Moeller
e4c439021f ZTS: Move more tests out of common.run
These tests won't run on all platforms as currently implemented:
 * add_nested_replacing_spare (needs zed)
 * fault (needs zed)
 * mmp (needs multihost_history)
 * umount_unlink_drained (needs procfs)
 * zpool_expand (needs udev events and zed)
 * zpool_reopen (needs scsi_debug)
 * zvol_swap_003_pos (needs to modify vfstab)
 * zvol_swap_00[56]_pos (needs swaphigh/swaplen)

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes #9871
2020-01-27 13:29:25 -08:00
Ned Bass
a3403164d7 zdb: add support for object ranges for zdb -d
Allow a range of object identifiers to dump with -d. This may
be useful when dumping a large dataset and you want to break
it up into multiple phases, or to resume where a previous scan
left off. Object type selection flags are supported to reduce
the performance overhead of verbosely dumping unwanted objects,
and to reduce the amount of post-processing work needed to
filter out unwanted objects from zdb output.

This change extends existing syntax in a backward-compatible
way. That is, the base case of a range is to specify a single
object identifier to dump. Ranges and object identifiers can
be intermixed as command line parameters.

Usage synopsis:

    Object ranges take the form <start>:<end>[:<flags>]
        start    Starting object number
        end      Ending object number, or -1 for no upper bound
        flags    Optional flags to select object types:
         A    All objects (this is the default)
         d    ZFS directories
         f    ZFS files
         m    SPA space maps
         z    ZAPs
         -    Negate effect of next flag

Examples:

 # Dump all file objects
 zdb -dd tank/fish 0👎f

 # Dump all file and directory objects
 zdb -dd tank/fish 0👎fd

 # Dump all types except file and directory objects
 zdb -dd tank/fish 0👎A-f-d

 # Dump object IDs in a specific range
 zdb -dd tank/fish 1000:2000

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #9832
2020-01-24 11:00:46 -08:00
Ryan Moeller
dbfec5cc09 ZTS: Get xattr tests running on FreeBSD
This mostly involves reworking platform checks to make illumos the
exception (thanks to their unusual way of exposing xattrs). Other
platforms are able to take advantage of the recently added xattr
wrappers in libtest.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes #9872
2020-01-23 17:14:40 -08:00
Jason King
e2ef1cbf04 Support inheriting properties in channel programs
This adds support in channel programs to inherit properties analogous
to `zfs inherit` by adding `zfs.sync.inherit` and `zfs.check.inherit`
functions to the ZFS LUA API.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jason King <jason.king@joyent.com>
Closes #9738
2020-01-22 17:03:17 -08:00
Paul Zuchowski
f12e42cccf zdb -d should accept the numeric objset id
As an alternative to the dataset name, zdb now allows the decimal 
or hexadecimal objset ID to be specified.  When permanent errors
are reported as 2 hexadecimal numbers (objset ID : object ID) in 
zpool status; you can now use 'zdb <pool>[/objset ID] object' to
determine the names of the objset and object which have the error.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #9733
2020-01-16 09:22:49 -08:00
Tom Caputi
ba0ba69e50 Add 'zfs send --saved' flag
This commit adds the --saved (-S) to the 'zfs send' command.
This flag allows a user to send a partially received dataset,
which can be useful when migrating a backup server to new
hardware. This flag is compatible with resumable receives, so
even if the saved send is interrupted, it can be resumed.
The flag does not require any user / kernel ABI changes or any
new feature flags in the send stream format.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Reviewed-by: Christian Schwarz <me@cschwarz.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #9007
2020-01-10 10:16:58 -08:00
Ryan Moeller
798b2b8477 ZTS: Add sunos.run to dist data
This was missed when the file was introduced.

Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes #9815
2020-01-07 12:38:12 -08:00
Ryan Moeller
d7164b27be ZTS: Move dumpdev tests to sunos.run
Neither FreeBSD nor Linux support dumping to zvols.

DilOS still uses these tests, so the files are kept and the tests have
been relocated to sunos.run.

An `is_illumos` function was added to libtest.shlib to eliminate some
awkward platform checks.

A few functions that are not expected to be used outside of illumos
have been sanitized of extraneous FreeBSD adaptations.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes #9794
2020-01-03 09:08:23 -08:00
Tony Hutter
9fb2771aa5 Colorize zpool status output
If the ZFS_COLOR env variable is set, then use ANSI color
output in zpool status:

- Column headers are bold
- Degraded or offline pools/vdevs are yellow
- Non-zero error counters and faulted vdevs/pools are red
- The 'status:' and 'action:' sections are yellow if they're
  displaying a warning.

This also includes a new 'faketty' function in libtest.shlib that is
compatible with FreeBSD (code provided by @freqlabs).

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #9340
2019-12-19 16:26:07 -08:00
John Wren Kennedy
523fc80069 Tests for btree implementation used by range trees
Additional test cases for the btree implementation, see #9181.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Kennedy <john.kennedy@delphix.com>
Closes #9717
2019-12-19 11:53:54 -08:00
Tomohiro Kusumi
ddb4e69db5 Don't fail to apply umask for O_TMPFILE files
Apply umask to `mode` which will eventually be applied to inode.
This is needed since VFS doesn't apply umask for O_TMPFILE files.

(Note that zpl_init_acl() applies `ip->i_mode &= ~current_umask();`
only when POSIX ACL is used.)

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8997 
Closes #8998
2019-12-13 15:02:23 -08:00
Paul Zuchowski
f0bf435176 zio_decompress_data always ASSERTs successful decompression
This interferes with zdb_read_block trying all the decompression
algorithms when the 'd' flag is specified, as some are
expected to fail.  Also control the output when guessing
algorithms, try the more common compression types first, allow
specifying lsize/psize, and fix an uninitialized variable.

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #9612 
Closes #9630
2019-12-10 15:51:58 -08:00
Matthew Macy
f95704ca5e Disable EDONR on FreeBSD
FreeBSD uses its own crypto framework in-kernel which, at this time,
has no EDONR implementation.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Allan Jude <allanjude@freebsd.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matt Macy <mmacy@FreeBSD.org>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes #9664
2019-12-05 13:10:29 -08:00
jwpoduska
3c819a2c7d Prevent unnecessary resilver restarts
If a device is participating in an active resilver, then it will have a
non-empty DTL. Operations like vdev_{open,reopen,probe}() can cause the
resilver to be restarted (or deferred to be restarted later), which is
unnecessary if the DTL is still covered by the current scan range. This
is similar to the logic in vdev_dtl_should_excise() where the DTL can
only be excised if it's max txg is in the resilvered range.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Gallagher <john.gallagher@delphix.com>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: John Poduska <jpoduska@datto.com>
Issue #840 
Closes #9155
Closes #9378
Closes #9551
Closes #9588
2019-11-27 10:15:01 -08:00
Paul Zuchowski
894f6696b4 Add display of checksums to zdb -R
The function zdb_read_block (zdb -R) was always intended to have a :c 
flag which would read the DVA and length supplied by the user, and 
display the checksum. Since we don't know which checksum goes with 
the data, we should calculate and display them all.

For each checksum in the table, read in the data at the supplied 
DVA:length, calculate the checksum, and display it. Update the man 
page and create a zfs test for the new feature.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Kjeld Schouten <kjeld@schouten-lebbing.nl>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Closes #9607
2019-11-27 10:08:18 -08:00
loli10K
e35704647e Fix for ARC sysctls ignored at runtime
This change leverage module_param_call() to run arc_tuning_update()
immediately after the ARC tunable has been updated as suggested in
cffa8372 code review.

A simple test case is added to the ZFS Test Suite to prevent future
regressions in functionality.

Reviewed-by: Matt Macy <mmacy@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #9487  
Closes #9489
2019-10-26 15:22:19 -07:00
Ryan Moeller
fdf5593232 ZTS: Move more tests to linux.run
Tests that rely on special filesystems that are specific to Linux
should only be run on Linux.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes #9512
2019-10-25 20:03:46 -07:00
Ryan Moeller
28f7427ab1 ZTS: Move tmpfile tests to linux.run
O_TMPFILE is not available on FreeBSD.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #9503
2019-10-24 10:49:19 -07:00
Ryan Moeller
4c6225b688 ZTS: Consistency pass for .ksh extensions
* Use .ksh extension for ksh scripts, not .sh
* Remove .ksh extension from tests in common.run

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ryan Moeller <ryan@iXsystems.com>
Closes #9502
2019-10-24 10:47:47 -07:00
Ryan Moeller
5e74ac51c7 Move platform independent tests to a shared runfile
Tests that aren't limited to running on Linux can be moved to a common
runfile to be shared with other platforms.

The test runner and wrapper script are enhanced to allow specifying
multiple runfiles as a comma-separated list. The default runfiles are
now "common.run,PLATFORM.run" where PLATFORM is determined at run time.

Sections in runfiles that share a path with another runfile can append
a colon separator and an identifier to the path in the section
name, ie `[tests/functional/atime:Linux]`, to avoid overriding the tests
specified by other runfiles.

Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Ryan Moeller <ryan@ixsystems.com>
Closes #9391
2019-10-09 10:39:26 -07:00
Tom Caputi
bb61cc3185 Fix encryption hierarchy issues with zfs recv -d
Currently, the recv_fix_encryption_hierarchy() function accepts
'destsnap' as one of its parameters. Originally, this was intended
to be the top-level dataset of a receive (whether or not the
receive was recursive). Unfortunately, this parameter actually is
simply the input that is passed in from the command line. When
the user specifies 'zfs recv -d', this string is actually only the
name of the receiving pool since the rest of the name is derived
from the send stream. This causes the function to fail, leaving
some datasets with an invalid encryption hierarchy.

This patch resolves this problem by passing in the top_zfs variable
instead. In order to make this work, this patch also includes some
changes that ensure the value is always present when we need it.

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #9273
Closes #9309
2019-09-25 17:02:32 -07:00
loli10K
fcd37b622b Device removal of indirect vdev panics the kernel
This commit fixes a NULL pointer dereference triggered in
spa_vdev_remove_top_check() by trying to "zpool remove" an indirect
vdev.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes #9327
2019-09-16 10:46:59 -07:00
Tom Caputi
637f0c6019 Fix clone handling with encryption roots
Currently, spa_keystore_change_key_sync_impl() does not recurse
into clones when updating encryption roots for either a call to
'zfs promote' or 'zfs change-key'. This can cause children of
these clones to end up in a state where they point to the wrong
dataset as the encryption root. It can also trigger ASSERTs in
some cases where the code checks reference counts on wrapping
keys. This patch fixes this issue by ensuring that this function
properly recurses into clones during processing.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes #9267 
Closes #9294
2019-09-16 10:07:33 -07:00
John Gallagher
e60e158eff Add subcommand to wait for background zfs activity to complete
Currently the best way to wait for the completion of a long-running
operation in a pool, like a scrub or device removal, is to poll 'zpool
status' and parse its output, which is neither efficient nor convenient.

This change adds a 'wait' subcommand to the zpool command. When invoked,
'zpool wait' will block until a specified type of background activity
completes. Currently, this subcommand can wait for any of the following:

 - Scrubs or resilvers to complete
 - Devices to initialized
 - Devices to be replaced
 - Devices to be removed
 - Checkpoints to be discarded
 - Background freeing to complete

For example, a scrub that is in progress could be waited for by running

    zpool wait -t scrub <pool>

This also adds a -w flag to the attach, checkpoint, initialize, replace,
remove, and scrub subcommands. When used, this flag makes the operations
kicked off by these subcommands synchronous instead of asynchronous.

This functionality is implemented using a new ioctl. The type of
activity to wait for is provided as input to the ioctl, and the ioctl
blocks until all activity of that type has completed. An ioctl was used
over other methods of kernel-userspace communiction primarily for the
sake of portability.

Porting Notes:
This is ported from Delphix OS change DLPX-44432. The following changes
were made while porting:

 - Added ZoL-style ioctl input declaration.
 - Reorganized error handling in zpool_initialize in libzfs to integrate
   better with changes made for TRIM support.
 - Fixed check for whether a checkpoint discard is in progress.
   Previously it also waited if the pool had a checkpoint, instead of
   just if a checkpoint was being discarded.
 - Exposed zfs_initialize_chunk_size as a ZoL-style tunable.
 - Updated more existing tests to make use of new 'zpool wait'
   functionality, tests that don't exist in Delphix OS.
 - Used existing ZoL tunable zfs_scan_suspend_progress, together with
   zinject, in place of a new tunable zfs_scan_max_blks_per_txg.
 - Added support for a non-integral interval argument to zpool wait.

Future work:
ZoL has support for trimming devices, which Delphix OS does not. In the
future, 'zpool wait' could be extended to add the ability to wait for
trim operations to complete.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: John Gallagher <john.gallagher@delphix.com>
Closes #9162
2019-09-13 18:09:06 -07:00
Brian Behlendorf
25f06d677a
Fix /etc/hostid on root pool deadlock
Accidentally introduced by dc04a8c which now takes the SCL_VDEV lock
as a reader in zfs_blkptr_verify().  A deadlock can occur if the
/etc/hostid file resides on a dataset in the same pool.  This is
because reading the /etc/hostid file may occur while the caller is
holding the SCL_VDEV lock as a writer.  For example, to perform a
`zpool attach` as shown in the abbreviated stack below.

To resolve the issue we cache the system's hostid when initializing
the spa_t, or when modifying the multihost property.  The cached
value is then relied upon for subsequent accesses.

Call Trace:
    spa_config_enter+0x1e8/0x350 [zfs]
    zfs_blkptr_verify+0x33c/0x4f0 [zfs] <--- trying read lock
    zio_read+0x6c/0x140 [zfs]
    ...
    vfs_read+0xfc/0x1e0
    kernel_read+0x50/0x90
    ...
    spa_get_hostid+0x1c/0x38 [zfs]
    spa_config_generate+0x1a0/0x610 [zfs]
    vdev_label_init+0xa0/0xc80 [zfs]
    vdev_create+0x98/0xe0 [zfs]
    spa_vdev_attach+0x14c/0xb40 [zfs] <--- grabbed write lock

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9256 
Closes #9285
2019-09-10 13:42:30 -07:00
Paul Dagnelie
475aa97cab Prevent metaslab_sync panic due to spa_final_dirty_txg
If a pool enables the SPACEMAP_HISTOGRAM feature shortly before being
exported, we can enter a situation that causes a kernel panic. Any metaslabs
that are loaded during the final dirty txg and haven't already been condensed
will cause metaslab_sync to proceed after the final dirty txg so that the
condense can be performed, which there are assertions to prevent. Because of
the nature of this issue, there are a number of ways we can enter this
state. Rather than try to prevent each of them one by one, potentially missing
some edge cases, we instead cut it off at the point of intersection; by
preventing metaslab_sync from proceeding if it would only do so to perform a
condense and we're past the final dirty txg, we preserve the utility of the
existing asserts while preventing this particular issue.

Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9185
Closes #9186
Closes #9231
Closes #9253
2019-08-30 09:28:31 -07:00
Brian Behlendorf
07a328dde4
ZTS: Temporarily disable several upgrade tests
Until issues #9185 and #9186 have been resolved the following zpool
upgrade tests are being disabled to prevent CI failures.

  zpool_upgrade_002_pos,
  zpool_upgrade_003_pos,
  zpool_upgrade_004_pos,
  zpool_upgrade_007_pos,
  zpool_upgrade_008_pos

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #9185
Issue #9186
Closes #9225
2019-08-28 14:52:08 -07:00
Chunwei Chen
035e96118b Fix zil replay panic when TX_REMOVE followed by TX_CREATE
If TX_REMOVE is followed by TX_CREATE on the same object id, we need to
make sure the object removal is completely finished before creation. The
current implementation relies on dnode_hold_impl with
DNODE_MUST_BE_ALLOCATED returning ENOENT. While this check seems to work
fine before, in current version it does not guarantee the object removal
is completed.

We fix this by checking if DNODE_MUST_BE_FREE returns successful
instead. Also add test and remove dead code in dnode_hold_impl.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Chunwei Chen <david.chen@nutanix.com>
Closes #7151
Closes #8910
Closes #9123
Closes #9145
2019-08-28 10:42:02 -07:00
Paul Dagnelie
95f0144675 Add regression test for "zpool list -p"
Other than this test, zpool list -p is not well tested by any of the 
automated tests.  Add a test for zpool list -p.

Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9134
2019-08-25 18:33:03 -07:00
Brian Behlendorf
4302698be1
ZTS: Fix in-tree dbufstats test case
Commit a887d653 updated the dbufstats such that escalated privileges
are required.  Since all tests under cli_user are run with normal
privileges move this test case to a location where it will be run
required privileges.

Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Michael Niewöhner <foss@mniewoehner.de>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #9118 
Closes #9196
2019-08-22 17:37:48 -07:00
Paul Dagnelie
1a26cb6160 Add more refquota tests
It used to be possible for zfs receive (and other operations related 
to clone swap) to bypass refquotas. This can cause a number of issues, 
and there should be an automated test for it.

Added tests for rollback and receive not overriding refquota.

Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #9139
2019-08-19 15:06:53 -07:00
Serapheim Dimitropoulos
3b9edd7b17 Introduce getting holds and listing bookmarks through ZCP
Consumers of ZFS Channel Programs can now list bookmarks,
and get holds from datasets. A minor-refactoring was also
applied to distinguish between user and system properties
in ZCP.

Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Ported-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Dan Kimmel <dan.kimmel@delphix.com>

OpenZFS-issue: https://illumos.org/issues/8862
Closes #7902
2019-08-12 10:02:34 -07:00
Serapheim Dimitropoulos
8098465558 Test cancelling a removal in ZTS
This patch adds a new test that sanity checks cancelling a removal.

Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #9101
2019-08-05 10:50:20 -07:00
Brian Behlendorf
1e620c9872
Revert "Develop tests for issues #5866 and #8858"
This reverts commit 693c1fc478.  This
change resulted in a kmem leak being observed in existing code which
needs to be identified and addressed.

Reviewed-by: Paul Zuchowski <pzuchowski@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #8978
Closes #9090
2019-07-29 12:46:56 -07:00
Paul Zuchowski
693c1fc478 Develop tests for issues #5866 and #8858
Provide zfstest coverage for these two issues which
were a panic accessing extended attributes and
a problem comparing 64 bit and 32 bit generation
numbers.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Zuchowski <pzuchowski@datto.com>
Issue #5866
Issue #8858 
Closes #8978
2019-07-26 17:52:13 -07:00
Tomohiro Kusumi
9fb6abe5ad Implement secpolicy_vnode_setid_retain()
Don't unconditionally return 0 (i.e. retain SUID/SGID).
Test CAP_FSETID capability.

https://github.com/pjd/pjdfstest/blob/master/tests/chmod/12.t
which expects SUID/SGID to be dropped on write(2) by non-owner fails
without this. Most filesystems make this decision within VFS by using
a generic file write for fops.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #9035 
Closes #9043
2019-07-26 13:52:30 -07:00
Sara Hartse
37f03da8ba Fast Clone Deletion
Deleting a clone requires finding blocks are clone-only, not shared
with the snapshot. This was done by traversing the entire block tree
which results in a large performance penalty for sparsely
written clones.

This is new method keeps track of clone blocks when they are
modified in a "Livelist" so that, when it’s time to delete,
the clone-specific blocks are already at hand.

We see performance improvements because now deletion work is
proportional to the number of clone-modified blocks, not the size
of the original dataset.

Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Signed-off-by: Sara Hartse <sara.hartse@delphix.com>
Closes #8416
2019-07-26 10:54:14 -07:00
Tony Hutter
d84c5a120e Move some tests to cli_user/zpool_status
The tests in tests/functional/cli_root/zpool_status should all require
root. However, linux.run has "user =" specified for those tests, which
means they run as a normal user.  When I removed that line to run them
as root, the following tests did not pass:

zpool_status_003_pos
zpool_status_-c_disable
zpool_status_-c_homedir
zpool_status_-c_searchpath

These tests need to be run as a normal user.  To fix this, move these
tests to a new tests/functional/cli_user/zpool_status directory.

Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <guss80@gmail.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes #9057
2019-07-19 11:21:54 -07:00
Mike Gerdts
d45d7f08fa Add zfs create dryrun
Adds the ability to sanity check zfs create arguments and to see the
value of any additional properties that will local to the dataset.  For
example, automation that may need to adjust quota on a parent filesystem
before creating a volume may call `zfs create -nP -V <size> <volume>` to
obtain the value of refreservation.  This adds the following options to
zfs create:

- -n dry-run (no-op)
- -v verbose
- -P parseable (implies verbose)

Reviewed-by: Ryan Moeller <ryan@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Jerry Jelinek <jerry.jelinek@joyent.com>
Signed-off-by: Mike Gerdts <mike.gerdts@joyent.com>
Closes #8974
2019-07-16 11:19:24 -07:00
Serapheim Dimitropoulos
93e28d661e Log Spacemap Project
= Motivation

At Delphix we've seen a lot of customer systems where fragmentation
is over 75% and random writes take a performance hit because a lot
of time is spend on I/Os that update on-disk space accounting metadata.
Specifically, we seen cases where 20% to 40% of sync time is spend
after sync pass 1 and ~30% of the I/Os on the system is spent updating
spacemaps.

The problem is that these pools have existed long enough that we've
touched almost every metaslab at least once, and random writes
scatter frees across all metaslabs every TXG, thus appending to
their spacemaps and resulting in many I/Os. To give an example,
assuming that every VDEV has 200 metaslabs and our writes fit within
a single spacemap block (generally 4K) we have 200 I/Os. Then if we
assume 2 levels of indirection, we need 400 additional I/Os and
since we are talking about metadata for which we keep 2 extra copies
for redundancy we need to triple that number, leading to a total of
1800 I/Os per VDEV every TXG.

We could try and decrease the number of metaslabs so we have less
I/Os per TXG but then each metaslab would cover a wider range on
disk and thus would take more time to be loaded in memory from disk.
In addition, after it's loaded, it's range tree would consume more
memory.

Another idea would be to just increase the spacemap block size
which would allow us to fit more entries within an I/O block
resulting in fewer I/Os per metaslab and a speedup in loading time.
The problem is still that we don't deal with the number of I/Os
going up as the number of metaslabs is increasing and the fact
is that we generally write a lot to a few metaslabs and a little
to the rest of them. Thus, just increasing the block size would
actually waste bandwidth because we won't be utilizing our bigger
block size.

= About this patch

This patch introduces the Log Spacemap project which provides the
solution to the above problem while taking into account all the
aforementioned tradeoffs. The details on how it achieves that can
be found in the references sections below and in the code (see
Big Theory Statement in spa_log_spacemap.c).

Even though the change is fairly constraint within the metaslab
and lower-level SPA codepaths, there is a side-change that is
user-facing. The change is that VDEV IDs from VDEV holes will no
longer be reused. To give some background and reasoning for this,
when a log device is removed and its VDEV structure was replaced
with a hole (or was compacted; if at the end of the vdev array),
its vdev_id could be reused by devices added after that. Now
with the pool-wide space maps recording the vdev ID, this behavior
can cause problems (e.g. is this entry referring to a segment in
the new vdev or the removed log?). Thus, to simplify things the
ID reuse behavior is gone and now vdev IDs for top-level vdevs
are truly unique within a pool.

= Testing

The illumos implementation of this feature has been used internally
for a year and has been in production for ~6 months. For this patch
specifically there don't seem to be any regressions introduced to
ZTS and I have been running zloop for a week without any related
problems.

= Performance Analysis (Linux Specific)

All performance results and analysis for illumos can be found in
the links of the references. Redoing the same experiments in Linux
gave similar results. Below are the specifics of the Linux run.

After the pool reached stable state the percentage of the time
spent in pass 1 per TXG was 64% on average for the stock bits
while the log spacemap bits stayed at 95% during the experiment
(graph: sdimitro.github.io/img/linux-lsm/PercOfSyncInPassOne.png).

Sync times per TXG were 37.6 seconds on average for the stock
bits and 22.7 seconds for the log spacemap bits (related graph:
sdimitro.github.io/img/linux-lsm/SyncTimePerTXG.png). As a result
the log spacemap bits were able to push more TXGs, which is also
the reason why all graphs quantified per TXG have more entries for
the log spacemap bits.

Another interesting aspect in terms of txg syncs is that the stock
bits had 22% of their TXGs reach sync pass 7, 55% reach sync pass 8,
and 20% reach 9. The log space map bits reached sync pass 4 in 79%
of their TXGs, sync pass 7 in 19%, and sync pass 8 at 1%. This
emphasizes the fact that not only we spend less time on metadata
but we also iterate less times to convergence in spa_sync() dirtying
objects.
[related graphs:
stock- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGStock.png
lsm- sdimitro.github.io/img/linux-lsm/NumberOfPassesPerTXGLSM.png]

Finally, the improvement in IOPs that the userland gains from the
change is approximately 40%. There is a consistent win in IOPS as
you can see from the graphs below but the absolute amount of
improvement that the log spacemap gives varies within each minute
interval.
sdimitro.github.io/img/linux-lsm/StockVsLog3Days.png
sdimitro.github.io/img/linux-lsm/StockVsLog10Hours.png

= Porting to Other Platforms

For people that want to port this commit to other platforms below
is a list of ZoL commits that this patch depends on:

Make zdb results for checkpoint tests consistent
db587941c5

Update vdev_is_spacemap_addressable() for new spacemap encoding
419ba59145

Simplify spa_sync by breaking it up to smaller functions
8dc2197b7b

Factor metaslab_load_wait() in metaslab_load()
b194fab0fb

Rename range_tree_verify to range_tree_verify_not_present
df72b8bebe

Change target size of metaslabs from 256GB to 16GB
c853f382db

zdb -L should skip leak detection altogether
21e7cf5da8

vs_alloc can underflow in L2ARC vdevs
7558997d2f

Simplify log vdev removal code
6c926f426a

Get rid of space_map_update() for ms_synced_length
425d3237ee

Introduce auxiliary metaslab histograms
928e8ad47d

Error path in metaslab_load_impl() forgets to drop ms_sync_lock
8eef997679

= References

Background, Motivation, and Internals of the Feature
- OpenZFS 2017 Presentation:
youtu.be/jj2IxRkl5bQ
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemaps-project

Flushing Algorithm Internals & Performance Results
(Illumos Specific)
- Blogpost:
sdimitro.github.io/post/zfs-lsm-flushing/
- OpenZFS 2018 Presentation:
youtu.be/x6D2dHRjkxw
- Slides:
slideshare.net/SerapheimNikolaosDim/zfs-log-spacemap-flushing-algorithm

Upstream Delphix Issues:
DLPX-51539, DLPX-59659, DLPX-57783, DLPX-61438, DLPX-41227, DLPX-59320
DLPX-63385

Reviewed-by: Sean Eric Fagan <sef@ixsystems.com>
Reviewed-by: Matt Ahrens <matt@delphix.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #8442
2019-07-16 10:11:49 -07:00
Tomohiro Kusumi
ab5036df1c Fix race in parallel mount's thread dispatching algorithm
Strategy of parallel mount is as follows.

1) Initial thread dispatching is to select sets of mount points that
 don't have dependencies on other sets, hence threads can/should run
 lock-less and shouldn't race with other threads for other sets. Each
 thread dispatched corresponds to top level directory which may or may
 not have datasets to be mounted on sub directories.

2) Subsequent recursive thread dispatching for each thread from 1)
 is to mount datasets for each set of mount points. The mount points
 within each set have dependencies (i.e. child directories), so child
 directories are processed only after parent directory completes.

The problem is that the initial thread dispatching in
zfs_foreach_mountpoint() can be multi-threaded when it needs to be
single-threaded, and this puts threads under race condition. This race
appeared as mount/unmount issues on ZoL for ZoL having different
timing regarding mount(2) execution due to fork(2)/exec(2) of mount(8).
`zfs unmount -a` which expects proper mount order can't unmount if the
mounts were reordered by the race condition.

There are currently two known patterns of input list `handles` in
`zfs_foreach_mountpoint(..,handles,..)` which cause the race condition.

1) #8833 case where input is `/a /a /a/b` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list with two same top level directories.
 There is a race between two POSIX threads A and B,
  * ThreadA for "/a" for test1 and "/a/b"
  * ThreadB for "/a" for test0/a
 and in case of #8833, ThreadA won the race. Two threads were created
 because "/a" wasn't considered as `"/a" contains "/a"`.

2) #8450 case where input is `/ /var/data /var/data/test` after sorting.
 The problem is that libzfs_path_contains() can't correctly handle an
 input list containing "/".
 There is a race between two POSIX threads A and B,
  * ThreadA for "/" and "/var/data/test"
  * ThreadB for "/var/data"
 and in case of #8450, ThreadA won the race. Two threads were created
 because "/var/data" wasn't considered as `"/" contains "/var/data"`.
 In other words, if there is (at least one) "/" in the input list,
 the initial thread dispatching must be single-threaded since every
 directory is a child of "/", meaning they all directly or indirectly
 depend on "/".

In both cases, the first non_descendant_idx() call fails to correctly
determine "path1-contains-path2", and as a result the initial thread
dispatching creates another thread when it needs to be single-threaded.
Fix a conditional in libzfs_path_contains() to consider above two.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@gmail.com>
Closes #8450
Closes #8833
Closes #8878
2019-07-09 09:31:46 -07:00
Mike Gerdts
341166c843 OpenZFS 9318 - vol_volsize_to_reservation does not account for raidz skip blocks
When a volume is created in a pool with raidz vdevs and
volblocksize != 128k, the volume can reference more space than is
reserved with the automatically calculated refreservation.  There
are two deficiencies in vol_volsize_to_reservation that contribute
to this:

  1) Skip blocks may be added to keep each allocation a multiple
     of parity + 1. This is the dominating factor when volblocksize
     is close to 2^ashift.

  2) raidz deflation for 128 KB blocks is different for most other
     block sizes.

See "The theory of raidz space accounting" comment in
libzfs_dataset.c for a full explanation.

Authored by: Mike Gerdts <mike.gerdts@joyent.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Kody Kantor <kody.kantor@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@joyent.com>
Ported-by: Mike Gerdts <mike.gerdts@joyent.com>

Porting Notes:
* ZTS: wait for zvols to exist before writing
* ZTS: use log_must_busy with {zpool|zfs} destroy

OpenZFS-issue: https://www.illumos.org/issues/9318
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/b73ccab0
Closes #8973
2019-07-05 15:35:15 -07:00
George Wilson
681a85cb01 nopwrites on dmu_sync-ed blocks can result in a panic
After device removal, performing nopwrites on a dmu_sync-ed block
will result in a panic. This panic can show up in two ways:
1. an attempt to issue an IOCTL in vdev_indirect_io_start()
2. a failed comparison of zio->io_bp and zio->io_bp_orig in
   zio_done()
To resolve both of these panics, nopwrites of blocks on indirect
vdevs should be ignored and new allocations should be performed on
concrete vdevs.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: George Wilson <gwilson@delphix.com>
Closes #8957
2019-06-28 12:40:23 -07:00
Tom Caputi
765d1f0644 Add 'zfs umount -u' for encrypted datasets
This patch adds the ability for the user to unload keys for
datasets as they are being unmounted. This is analogous to
'zfs mount -l'.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alek Pinchuk <apinchuk@datto.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Closes: #8917
Closes: #8952
2019-06-28 12:38:37 -07:00
Matthew Ahrens
59ec30a329 Remove code for zfs remap
The "zfs remap" command was disabled by
6e91a72fe3, because it has little utility
and introduced some tricky bugs.  This commit removes the code for it,
the associated ZFS_IOC_REMAP ioctl, and tests.

Note that the ioctl and property will remain, but have no functionality.
This allows older software to fail gracefully if it attempts to use
these, and avoids a backwards incompatibility that would be introduced if
we renumbered the later ioctls/props.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #8944
2019-06-24 16:44:01 -07:00
Brian Behlendorf
8f12a4f8d2
Fix out-of-tree build failures
Resolve the incorrect use of srcdir and builddir references for
various files in the build system.  These have crept in over time
and went unnoticed because when building in the top level directory
srcdir and builddir are identical.

With this change it's again possible to build in a subdirectory.

    $ mkdir obj
    $ cd obj
    $ ../configure
    $ make

Reviewed-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8921 
Closes #8943
2019-06-24 09:32:47 -07:00
Don Brady
186898bbb5 OpenZFS 9425 - channel programs can be interrupted
Problem Statement
=================
ZFS Channel program scripts currently require a timeout, so that hung or
long-running scripts return a timeout error instead of causing ZFS to get
wedged. This limit can currently be set up to 100 million Lua instructions.
Even with a limit in place, it would be desirable to have a sys admin
(support engineer) be able to cancel a script that is taking a long time.

Proposed Solution
=================
Make it possible to abort a channel program by sending an interrupt signal.In
the underlying txg_wait_sync function, switch the cv_wait to a cv_wait_sig to
catch the signal. Once a signal is encountered, the dsl_sync_task function can
install a Lua hook that will get called before the Lua interpreter executes a
new line of code. The dsl_sync_task can resume with a standard txg_wait_sync
call and wait for the txg to complete.  Meanwhile, the hook will abort the
script and indicate that the channel program was canceled. The kernel returns
a EINTR to indicate that the channel program run was canceled.

Porting notes: Added missing return value from cv_wait_sig()

Authored by: Don Brady <don.brady@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Don Brady <don.brady@delphix.com>
Signed-off-by: Don Brady <don.brady@delphix.com>

OpenZFS-issue: https://www.illumos.org/issues/9425
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/d0cb1fb926
Closes #8904
2019-06-22 16:51:46 -07:00
Paul Dagnelie
30af21b025 Implement Redacted Send/Receive
Redacted send/receive allows users to send subsets of their data to 
a target system. One possible use case for this feature is to not 
transmit sensitive information to a data warehousing, test/dev, or 
analytics environment. Another is to save space by not replicating 
unimportant data within a given dataset, for example in backup tools 
like zrepl.

Redacted send/receive is a three-stage process. First, a clone (or 
clones) is made of the snapshot to be sent to the target. In this 
clone (or clones), all unnecessary or unwanted data is removed or
modified. This clone is then snapshotted to create the "redaction 
snapshot" (or snapshots). Second, the new zfs redact command is used 
to create a redaction bookmark. The redaction bookmark stores the 
list of blocks in a snapshot that were modified by the redaction 
snapshot(s). Finally, the redaction bookmark is passed as a parameter 
to zfs send. When sending to the snapshot that was redacted, the
redaction bookmark is used to filter out blocks that contain sensitive 
or unwanted information, and those blocks are not included in the send 
stream.  When sending from the redaction bookmark, the blocks it 
contains are considered as candidate blocks in addition to those 
blocks in the destination snapshot that were modified since the 
creation_txg of the redaction bookmark.  This step is necessary to 
allow the target to rehydrate data in the case where some blocks are 
accidentally or unnecessarily modified in the redaction snapshot.

The changes to bookmarks to enable fast space estimation involve 
adding deadlists to bookmarks. There is also logic to manage the 
life cycles of these deadlists.

The new size estimation process operates in cases where previously 
an accurate estimate could not be provided. In those cases, a send 
is performed where no data blocks are read, reducing the runtime 
significantly and providing a byte-accurate size estimate.

Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed-by: John Kennedy <john.kennedy@delphix.com>
Reviewed-by: George Wilson <george.wilson@delphix.com>
Reviewed-by: Chris Williamson <chris.williamson@delphix.com>
Reviewed-by: Pavel Zhakarov <pavel.zakharov@delphix.com>
Reviewed-by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes #7958
2019-06-19 09:48:12 -07:00
Tomohiro Kusumi
1608985a41 Add link count test for root inode
Add tests for
97aa3ba44("Fix link count of root inode when snapdir is visible")
as suggested in #8727.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Tomohiro Kusumi <kusumi.tomohiro@osnexus.com>
Closes #8732
2019-05-29 16:26:46 -07:00