Commit Graph

447 Commits

Author SHA1 Message Date
Mark Johnston
d6eb98610f Reference stdint.h types in ctf.5.
MFC after:	1 week
2019-07-17 16:31:50 +00:00
Alexander Motin
eb106113c2 MFV r348580: 9559 zfs diff handles files on delete queue in fromsnap poorly
illumos/illumos-gate@20633e304b

Reviewed by: Joshua M. Clulow <josh@sysmgr.org>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author:     Paul Dagnelie <pcd@delphix.com>
2019-06-03 20:40:32 +00:00
Alexander Motin
c4af53b8f1 MFV r348534: 9616 Bogus error when attempting to set property on read-only pool
illumos/illumos-gate@f62db44dbc

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Andrew Stormont <astormont@racktopsystems.com>
2019-06-03 17:19:05 +00:00
Leandro Lupori
ff7449d6f5 [PowerPC64] stand: fix build using clang 8 as compiler
This change fixes "stand" build issues when using clang 8
as compiler.

Submitted by:   alfredo.junior_eldorado.org.br
Reviewed by:    jhibbits
Differential Revision: https://reviews.freebsd.org/D20026
2019-05-20 19:21:35 +00:00
Baptiste Daroussin
c7851c5b7c Fix regression introduced in r344569
Reported by:	cy
Tested by:	cy
Submitted by:	Fatih Acar <fatih@gandi.net>
2019-02-27 07:55:53 +00:00
Sean Eric Fagan
50792eb553 Set process title during zfs send.
This adds a '-V' option to 'zfs send', which sets the process title once a
second to the progress information.

This code has been in FreeNAS for a long time now; this is just upstreaming
it here.  It was originially written by delphij.

Reviewed by:	mav
Obtained from:	iXsystems, Inc
Sponsored by:	iXsystems, Inc
Differential Revision:	https://reviews.freebsd.org/D19184
2019-02-26 19:23:22 +00:00
Baptiste Daroussin
0b858c82d8 Implement parallel mounting for ZFS filesystem
It was first implemented on Illumos and then ported to ZoL.
This patch is a port to FreeBSD of the ZoL version.
This patch also includes a fix for a race condition that was amended

With such patch Delphix has seen a huge decrease in latency of the mount phase
(https://github.com/openzfs/openzfs/commit/a3f0e2b569 for details).
With that current change Gandi has measured improvments that are on par with
those reported by Delphix.

Zol commits incorporated:
a10d50f999
e63ac16d25

Reviewed by:	avg, sef
Approved by:	avg, sef
Obtained from:	ZoL
MFC after:	1 month
Relnotes:	yes
Sponsored by:	Gandi.net
Differential Revision:	https://reviews.freebsd.org/D19098
2019-02-26 08:18:34 +00:00
Mark Johnston
9747bd8e02 MFV r344364:
9058 postmortem DTrace frequently broken under vmware

illumos/illumos-gate@793bd7e361

MFC after:	1 week
2019-02-20 17:10:30 +00:00
Andriy Gapon
30d6475b3f fix userland illumos taskq code to pass relative timeout to cv_timedwait
Unlike illumos, FreeBSD cv_timedwait requires a relative timeout.  That
applies both to the kernel illumos compatibility code and to the
userland "fake kernel" code.

MFC after:	2 weeks
Sponsored by:	Panzura
2019-02-20 13:19:08 +00:00
Andriy Gapon
f050611e7f MFV r342469: 9630 add lzc_rename and lzc_destroy to libzfs_core
illumos/illumos-gate@049ba636fa
049ba636fa

https://www.illumos.org/issues/9630
  Rename and destroy are very useful operations that deserve to be in
  libzfs_core.  And they are not hard to implement too.

MFC after:	2 weeks
Relnotes:	maybe
2018-12-26 10:37:41 +00:00
Mark Johnston
0b56e7a8e9 Disable the D subroutines msgsize() and msgdsize().
They are specific to illumos and the corresponding DIF subroutines are
already disabled on FreeBSD.

Reported by:	gnn
2018-08-10 19:23:20 +00:00
Matt Macy
648cfe57fd Performance optimization of AVL tree comparator functions
MFV:
commit ee36c709c3
Author: Gvozden Neskovic <neskovic@gmail.com>
Date:   Sat Aug 27 20:12:53 2016 +0200

    perf: 2.75x faster ddt_entry_compare()
        First 256bits of ddt_key_t is a block checksum, which are expected
    to be close to random data. Hence, on average, comparison only needs to
    look at first few bytes of the keys. To reduce number of conditional
    jump instructions, the result is computed as: sign(memcmp(k1, k2)).

    Sign of an integer 'a' can be obtained as: `(0 < a) - (a < 0)` := {-1, 0, 1} ,
    which is computed efficiently.  Synthetic performance evaluation of
    original and new algorithm over 1G random keys on 2.6GHz Intel(R) Xeon(R)
    CPU E5-2660 v3:

    old     6.85789 s
    new     2.49089 s

    perf: 2.8x faster vdev_queue_offset_compare() and vdev_queue_timestamp_compare()
        Compute the result directly instead of using conditionals

    perf: zfs_range_compare()
        Speedup between 1.1x - 2.5x, depending on compiler version and
    optimization level.

    perf: spa_error_entry_compare()
        `bcmp()` is not suitable for comparator use. Use `memcmp()` instead.

    perf: 2.8x faster metaslab_compare() and metaslab_rangesize_compare()
    perf: 2.8x faster zil_bp_compare()
    perf: 2.8x faster mze_compare()
    perf: faster dbuf_compare()
    perf: faster compares in spa_misc
    perf: 2.8x faster layout_hash_compare()
    perf: 2.8x faster space_reftree_compare()
    perf: libzfs: faster avl tree comparators
    perf: guid_compare()
    perf: dsl_deadlist_compare()
    perf: perm_set_compare()
    perf: 2x faster range_tree_seg_compare()
    perf: faster unique_compare()
    perf: faster vdev_cache _compare()
    perf: faster vdev_uberblock_compare()
    perf: faster fuid _compare()
    perf: faster zfs_znode_hold_compare()

    Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
    Signed-off-by: Richard Elling <richard.elling@gmail.com>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #5033
2018-08-10 06:42:08 +00:00
Alexander Motin
07ddc55096 MFV r337223:
9580 Add a hash-table on top of nvlist to speed-up operations

illumos/illumos-gate@2ec7644aab

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
2018-08-03 01:52:25 +00:00
Alexander Motin
0285589b38 MFV 337214:
9621 Make createtxg and guid properties public

illumos/illumos-gate@e8d4a73c86

Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Approved by: Robert Mustacchi <rm@joyent.com>
Author:     Josh Paetzel <josh@tcbug.org>
2018-08-03 00:24:27 +00:00
Alexander Motin
d49e9be14f MFV r337184: 9457 libzfs_import.c:add_config() has a memory leak
A memory leak occurs on lines 209 and 213 because the config is not freed
in the error case.  The interface to add_config() seems less than ideal -
it would be better if it copied any data necessary from the config and the
caller freed it.

illumos/illumos-gate@ddfe901b12

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author:     sara hartse <sara.hartse@delphix.com>
2018-08-02 21:25:32 +00:00
Alexander Motin
2bce9a5316 MFV r337182: 9330 stack overflow when creating a deeply nested dataset
Datasets that are deeply nested (~100 levels) are impractical. We just put
a limit of 50 levels to newly created datasets. Existing datasets should
work without a problem.

illumos/illumos-gate@5ac95da7d6

Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author:     Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
2018-08-02 21:19:35 +00:00
Alexander Motin
c423a5e6b8 MFV r337161: 9512 zfs remap poolname@snapname coredumps
Only filesystems and volumes are valid "zfs remap" parameters: when passed
a snapshot name zfs_remap_indirects() does not handle the EINVAL returned
from libzfs_core, which results in failing an assertion and consequently
crashing.

illumos/illumos-gate@0b2e825398

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author:     loli10K <ezomori.nozomu@gmail.com>
2018-08-02 19:13:45 +00:00
Alexander Motin
7fca1b93c4 Do not blindly include illumos kernel headers instead of user-space.
It is not needed now, and I doubt it much helped at all, creating more
confusions then good.
2018-08-02 18:55:55 +00:00
Alexander Motin
b59e9cd1c0 MFV r316926:
7955 libshare needs to initialize only those datasets being modified by the consumer

illumos/illumos-gate@8a981c3356
8a981c3356

https://www.illumos.org/issues/7955
  Libshare currently initializes all available filesystems when doing any
  libshare operation. This requires iterating through all the filesystem
  multiple times, which is a huge performance problem for sharing and
  unsharing operations.

Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Author: Daniel Hoffman <dj.hoffman@delphix.com>

For FreeBSD this is practically a NOP, just a diff reduction.
2018-08-01 21:51:49 +00:00
Alexander Motin
200c27a75d MFV r337014:
9421 zdb should detect and print out the number of "leaked" objects
9422 zfs diff and zdb should explicitly mark objects that are on the deleted queue

illumos/illumos-gate@20b5dafb42

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author:     Paul Dagnelie <pcd@delphix.com>
2018-07-31 22:50:50 +00:00
Alexander Motin
0021e1c10c MFV r336991, r337001:
9102 zfs should be able to initialize storage devices

The first access to a disk block can incur a performance penalty on some
platforms (e.g. AWS's EBS, VMware VMDKs). Therefore it is recommended that
volumes be "thick provisioned", where supported by the platform (VMware).
Thick provisioning is time consuming and often is ignored. If the thick
provision step is omitted, customers will see suboptimal performance until
we have written to all parts of the LUN. ZFS should be able to initialize
any unused storage to remove any first-write penalty that exists.

illumos/illumos-gate@094e47e980

Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author:     George Wilson <george.wilson@delphix.com>
2018-07-31 21:06:04 +00:00
Alexander Motin
194000fa21 MFV r336950: 9290 device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk failure
and silent damage (e.g. some data on disk is not right, but ZFS hasn't
detected the whole device as being broken). However, the current device
removal implementation bypasses some of the mirror's redundancy.

illumos/illumos-gate@3a4b1be953

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>
2018-07-31 00:25:39 +00:00
Alexander Motin
1960706625 MFV r336944: 9286 want refreservation=auto
When a ZFS volume is created with zfs create -V (but without -s), the
refreservation property is set to a value that is volsize plus the maximum
size of metadata. If refreservation is ever set to another value, it is
impossible to set it back to the automatically determined value. There are
other cases where refreservation may be wrong. These include receiving a
volume that was sent without properties and zfs clone.

We need:

zfs set refreservation=auto <volume>
zfs clone -o refreservation=auto <volume>

Each one would use the same function used by zfs create -V to determine the
proper value for refreservation.

illumos/illumos-gate@1c10ae76c0

Reviewed by: Allan Jude <allanjude@freebsd.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Mike Gerdts <mike.gerdts@joyent.com>
2018-07-30 22:39:30 +00:00
Sean Eric Fagan
69724399c4 This originated from ZFS On Linux, as
d4a72f2386

During scans (scrubs or resilvers), it sorts the blocks in each transaction
group by block offset; the result can be a significant improvement. (On my
test system just now, which I put some effort to introduce fragmentation into
the pool since I set it up yesterday, a scrub went from 1h2m to 33.5m with the
changes.) I've seen similar rations on production systems.

Approved by:	Alexander Motin
Obtained from:	ZFS On Linux
Relnotes:	Yes (improved scrub performance, with tunables)
Differential Revision:	https://reviews.freebsd.org/D15562
2018-06-08 17:38:28 +00:00
Sean Bruno
86784a0a22 Cleanup sundry clang warnings for code that is not upstream in illumos.
https://github.com/illumos/illumos-gate/edit/master/usr/src/lib/libzfs/common/libzfs_sendrecv.c

Patch our version of it to quiesce warnings until someone decides to sync
up our code:

libzfs_sendrecv.c:2555:30: warning: format specifies type 'unsigned long'
  but the argument has type 'uint64_t' (aka 'unsigned long long') [-Wformat]
                                sprintf(guidname, "%lu", thisguid);
                                                   ~~~   ^~~~~~~~
                                                   %llu
libzfs_sendrecv.c:2612:29: warning: format specifies type 'unsigned long'
  but the argument has type 'uint64_t' (aka 'unsigned long long') [-Wformat]
                        sprintf(guidname, "%lu", parent_fromsnap_guid);
                                           ~~~   ^~~~~~~~~~~~~~~~~~~~
                                           %llu
libzfs_sendrecv.c:2645:29: warning: format specifies type 'unsigned long'
  but the argument has type 'uint64_t' (aka 'unsigned long long') [-Wformat]
                        sprintf(guidname, "%lu", parent_fromsnap_guid);
                                           ~~~   ^~~~~~~~~~~~~~~~~~~~
                                           %llu

Reviewed by:	allanjude
Differential Revision:	https://reviews.freebsd.org/D15325
2018-05-06 16:22:02 +00:00
Andriy Gapon
81f187e576 allow ZFS pool to have temporary name for duration of current import
The change adds -t <name> option to zpool create and -t option to zpool
import in its form with an old name and a new name.  This allows to
import (or create) a pool under a name that's different from its real,
permanent name without affecting that name.  This is useful when working
with VM images or images of other physical systems if they happen to
have a ZFS pool with the same name as the host system.

The changes come from ZoL with some small tweaks.
The porting has been done by julian.

The change is being submitted to OpenZFS:
https://github.com/openzfs/openzfs/pull/600

Submitted by:	julian
Reviewed by:	smh
MFC after:	2 weeks
Sponsored by:	Panzura (porting)
Differential Revision: https://reviews.freebsd.org/D14972
2018-04-12 10:37:26 +00:00
Alexander Motin
5c4561f332 MFV r331706:
9235 rename zpool_rewind_policy_t to zpool_load_policy_t

illumos/illumos-gate@5dafeea3eb

We want to be able to pass various settings during import/open of a pool,
which are not only related to rewind. Instead of adding a new policy and
duplicate a bunch of code, we should just rename rewind_policy to a more
generic term like load_policy.

For instance, we'd like to set spa->spa_import_flags from the nvlist,
rather from a flags parameter passed to spa_import as in some cases we want
those flags not only for the import case, but also for the open case. One
such flag could be ZFS_IMPORT_MISSING_LOG (as used in zdb) which would
allow zfs to open a pool when logs are missing.

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
2018-03-28 22:29:06 +00:00
Alexander Motin
0b0c76bc58 MFV r331695, 331700: 9166 zfs storage pool checkpoint
illumos/illumos-gate@8671400134

The idea of Storage Pool Checkpoint (aka zpool checkpoint) deals with
exactly that.  It can be thought of as a “pool-wide snapshot” (or a
variation of extreme rewind that doesn’t corrupt your data).  It remembers
the entire state of the pool at the point that it was taken and the user
can revert back to it later or discard it.  Its generic use case is an
administrator that is about to perform a set of destructive actions to ZFS
as part of a critical procedure.  She takes a checkpoint of the pool before
performing the actions, then rewinds back to it if one of them fails or puts
the pool into an unexpected state.  Otherwise, she discards it.  With the
assumption that no one else is making modifications to ZFS, she basically
wraps all these actions into a “high-level transaction”.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
2018-03-28 22:01:27 +00:00
Conrad Meyer
e796cc77c5 libctf: Appease Coverity overrun warnings
Rather than zeroing and reading into the a smaller union member the full
union size, just zero and read directly into the union.

No functional change intended.

Reported by:	Coverity
Sponsored by:	Dell EMC Isilon
2018-03-26 21:57:44 +00:00
Alexander Motin
b8436536c9 MFV r331400: 8484 Implement aggregate sum and use for arc counters
In pursuit of improving performance on multi-core systems, we should
implements fanned out counters and use them to improve the performance of
some of the arc statistics. These stats are updated extremely frequently,
and can consume a significant amount of CPU time.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Paul Dagnelie <pcd@delphix.com>
2018-03-23 02:15:05 +00:00
Mark Johnston
df7165b88e Given hidden visibility to symbols referenced by the DOF section.
MFC after:	1 week
2018-03-19 19:32:05 +00:00
Conrad Meyer
a8c03de86d libdtrace: Fix another uninitialized dtt_flags UB
Like r331073, eliminate a UB by fully initializing the struct with a designated
initializer.  Note that the similar src_dtt is not fully used, so a similar
treatment was not absolutely required.  I chose to leave it alone.  It
wouldn't hurt to do the same thing, though.

Reported by:	Coverity
Sponsored by:	Dell EMC Isilon
2018-03-16 21:10:36 +00:00
Conrad Meyer
1ad2da031e libdtrace: Eliminate a minor UB by fully initializing parameter struct
The dtt_flags value is dereferenced by dt_type_pointer() and must be
initialized first.

Reported by:	Coverity
Sponsored by:	Dell EMC Isilon
2018-03-16 20:43:40 +00:00
Alan Somers
50b779bdc9 ZFS: fix adding vdevs to very large pools
r323791 changed the return value of zpool_read_label.  Error paths that
previously returned 0 began to return -1 instead.  However, not all error
paths initialized errno.  When adding vdevs to a very large pool, errno could
be prepopulated with ENOMEM, causing the operation to fail.  Fix the bug by
setting errno=ENOENT in the case that no ZFS label is found.

PR:		226096
Submitted by:	Nikita Kozlov
Reviewed by:	avg
MFC after:	3 weeks
Differential Revision:	https://reviews.freebsd.org/D13088
2018-03-02 21:26:48 +00:00
Alan Somers
7d3761dc72 Don't declare __assfail as static
It gets called by dmu_buf_init_user, which is inline but not static.  So it
needs global linkage itself.

Reported by:	GCC-6
MFC after:	17 days
X-MFC-With:	329722
2018-02-25 14:29:43 +00:00
Alexander Motin
03d54eb339 MFV r329807:
8940 Sending an intra-pool resumable send stream may result in EXDEV

illumos/illumos-gate@544132fce3

"zfs send -t <token>" for an incremental send should be able to resume
successfully when sending to the same pool: a subtle issue in
zfs_iter_children() doesn't currently allow this.

Because resuming from a token requires "guid" -> "dataset" mapping
(guid_to_name()), we have to walk the whole hierarchy to find the right
snapshots to send.
When resuming an incremental send both source and destination live in the
same pool and have the same guid: this is where zfs_iter_children() gets
confused and picks up the wrong snapshot, so we end up trying to send an
incremental "destination@snap1 -> source@snap2" stream instead of
"source@snap1 -> source@snap2": this fails with an "Invalid cross-device
link" (EXDEV) error.

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: loli10K <ezomori.nozomu@gmail.com>
2018-02-22 04:01:55 +00:00
Alexander Motin
1ea10a60f9 MFV r329793, r329795:
9075 Improve ZFS pool import/load process and corrupted pool recovery

illumos/illumos-gate@6f7938128a

Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:

https://www.illumos.org/issues/7638: Refactor spa_load_impl into several functions
https://www.illumos.org/issues/8961: SPA load/import should tell us why it failed
https://www.illumos.org/issues/7277: zdb should be able to print zfs_dbgmsg's

To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.

The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.

The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.

When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.

This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.

With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
Of data, this feature is for now restricted to a read-only import.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
2018-02-22 03:15:35 +00:00
Alexander Motin
613b0d87da 8942 zfs promote .../%recv should be an error
illumos/illumos-gate@add927f8c8

Reported on the ZFSonLinux https://github.com/zfsonlinux/zfs/issues/4843,
fixed by https://github.com/zfsonlinux/zfs/pull/6339:

If we are in the middle of an incremental zfs receive, the child .../%recv
will exist. If you concurrently run zfs promote .../%recv, it will "work",
but then zfs gets confused. For example, there's no obvious way to destroy
the containing filesystem (because it is now a clone of its invisible child).

Attempting to do this promote should be an error. We could fix this by
having zfs_ioc_promote() check if zc_name contains a %, similar to
zfs_ioc_rename().

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: loli10K <ezomori.nozomu@gmail.com>
2018-02-22 01:42:13 +00:00
Alexander Motin
b17bfcde3d 9018 Replace kmem_cache_reap_now() with kmem_cache_reap_soon()
illumos/illumos-gate@36a64e6284

To prevent kmem_cache reaping from blocking other system resources, turn
kmem_cache_reap_now() (which blocks) into kmem_cache_reap_soon(). Callers
to kmem_cache_reap_soon() should use kmem_cache_reap_active(), which
exploits #9017's new taskq_empty().

Reviewed by: Bryan Cantrill <bryan@joyent.com>
Reviewed by: Dan McDonald <danmcd@joyent.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Author: Tim Kordas <tim.kordas@joyent.com>

FreeBSD does not use taskqueue for kmem caches reaping, so this change
is less dramatic then it is on Illumos, just limiting reaping to 1 time
per second.  It may possibly be improved later, if needed.
2018-02-21 23:15:06 +00:00
Alexander Motin
24433f00ea MFV r329502: 7614 zfs device evacuation/removal
illumos/illumos-gate@5cabbc6b49

https://www.illumos.org/issues/7614:
This project allows top-level vdevs to be removed from the storage pool with
“zpool remove”, reducing the total amount of storage in the pool. This
operation copies all allocated regions of the device to be removed onto other
devices, recording the mapping from old to new location. After the removal is
complete, read and free operations to the removed (now “indirect”) vdev must
be remapped and performed at the new location on disk. The indirect mapping
table is kept in memory whenever the pool is loaded, so there is minimal
performance overhead when doing operations on the indirect vdev.

The size of the in-memory mapping table will be reduced when its entries
become “obsolete” because they are no longer used by any block pointers in
the pool. An entry becomes obsolete when all the blocks that use it are
freed. An entry can also become obsolete when all the snapshots that
reference it are deleted, and the block pointers that reference it have been
“remapped” in all filesystems/zvols (and clones). Whenever an indirect block
is written, all the block pointers in it will be “remapped” to their new
(concrete) locations if possible. This process can be accelerated by using
the “zfs remap” command to proactively rewrite all indirect blocks that
reference indirect (removed) vdevs.

Note that when a device is removed, we do not verify the checksum of the data
that is copied. This makes the process much faster, but if it were used on
redundant vdevs (i.e. mirror or raidz vdevs), it would be possible to copy
the wrong data, when we have the correct data on e.g. the other side of the
mirror. Therefore, mirror and raidz devices can not be removed.

Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Prashanth Sreenivasa <pks@delphix.com>
2018-02-21 16:51:02 +00:00
Andriy Gapon
cfb675a138 MFV r329718: 8520 7198 lzc_rollback_to should support rolling back to origin
illumos/illumos-gate@95643f75d2
95643f75d2

https://www.illumos.org/issues/8520
  lzc_rollback_to() should support rolling back to a clone's origin.
  The current checks in zfs_ioc_rollback() would not allow that because the
  origin snapshot belongs to a different filesystem.
  The overly restrictive check was introduced in 7600, but it was not a
  regression as none of the existing tools provided a way to rollback to the
  origin.

https://www.illumos.org/issues/7198
  EINVAL is returned when a dataset does not have any snapshots, so there is
  nothing to roll back to.
  Although the code in zfs_do_rollback checks for that condition in advance, it's
  still possible that the snapshot(s) gets removed after the check and before the
  rollback sync task is executed.
  At the moment zfs command would crash when that happens.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>
MFC after:	2 weeks
2018-02-21 15:12:14 +00:00
Alexander Motin
73c9b6d523 MFV r322231:
8430 dir_is_empty_readdir() doesn't properly handle error from fdopendir()

illumos/illumos-gate@ba6e7e6505
ba6e7e6505

https://www.illumos.org/issues/8430
  we should close dirfd if fdopendir() fails.

Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Sowrabha Gopal <sowrabha.gopal@delphix.com>
2018-02-21 02:21:22 +00:00
Alexander Motin
e5a4a83784 MFV r318941: 7446 zpool create should support efi system partition
illumos/illumos-gate@7855d95b30
7855d95b30

https://www.illumos.org/issues/7446
  Since we support whole-disk configuration for boot pool, we also will need
  whole disk support with UEFI boot and for this, zpool create should create efi-
  system partition.
  I have borrowed the idea from oracle solaris, and introducing zpool create -
  B switch to provide an way to specify that boot partition should be created.
  However, there is still an question, how big should the system partition be.
  For time being, I have set default size 256MB (thats minimum size for FAT32
  with 4k blocks). To support custom size, the set on creation "bootsize"
  property is created and so the custom size can be set as: zpool create B -
  o bootsize=34MB rpool c0t0d0
  After pool is created, the "bootsize" property is read only. When -B switch is
  not used, the bootsize defaults to 0 and is shown in zpool get output with
  value ''. Older zfs/zpool implementations are ignoring this property.
  https://www.illumos.org/rb/r/219/

Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Approved by: Dan McDonald <danmcd@kebe.com>
Author: Toomas Soome <tsoome@me.com>

This commit makes no sense for FreeBSD, that is why I blocked the option,
but it should be good to stay closer to upstream.
2018-02-21 00:18:57 +00:00
Alexander Motin
502b48823a MFV r316918: 7990 libzfs: snapspec_cb() does not need to call zfs_strdup()
illumos/illumos-gate@d8584ba6fb
d8584ba6fb

https://www.illumos.org/issues/7990
  The snapspec_cb() callback function in libzfs does not need to call zfs_strdup().

Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Marcel Telka <marcel@telka.sk>
2018-02-20 20:46:27 +00:00
Alexander Motin
67aa4cb2b9 MFV r316902: 7745 print error if lzc_* is called before libzfs_core_init
illumos/illumos-gate@7c13517fff
7c13517fff

https://www.illumos.org/issues/7745
  The problem is that consumers of `libZFS_Core` that forget to call
  `libzfs_core_init()` before calling any other function of the library
  are having a hard time realizing their mistake. The library's internal
  file descriptor is declared as global static, which is ok, but it is not
  initialized explicitly; therefore, it defaults to 0, which is a valid
  file descriptor. If `libzfs_core_init()`, which explicitly initializes
  the correct fd, is skipped, the ioctl functions return errors that do
  not have anything to do with `libZFS_Core`, where the problem is
  actually located.
  Even though assertions for that existed within `libZFS_Core` for debug
  builds, they were never enabled because the `-DDEBUG` flag was missing
  from the compiler flags.
  This patch applies the following changes:
  1. It adds `-DDEBUG` for debug builds of `libZFS_Core` and `libzfs`,
         to enable their assertions on debug builds.
  2. It corrects an assertion within `libzfs`, where a function had
         been spelled incorrectly (`zpool_prop_unsupported()`) and nobody
         knew because the `-DDEBUG` flag was missing, and the preprocessor
         was taking that part of the code away.
  3. The library's internal fd is initialized to `-1` and `VERIFY`
         assertions have been placed to check that the fd is not equal to
         `-1` before issuing any ioctl. It is important here to note, that
         the `VERIFY` assertions exist in both debug and non-debug builds.
  4. In `libzfs_core_fini` we make sure to never increment the
         refcount of our fd below 0, and also reset the fd to `-1` when no
         one refers to it. The reason for this, is for the rare case that
         the consumer closes all references but then calls one of the
         library's functions without using `libzfs_core_init()` first, and
         in the mean time, a previous call to `open()` decided to reuse
         our previous fd. This scenario would have passed our assertion in

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
2018-02-20 20:40:55 +00:00
Alexander Motin
499a9a917a MFV r316901:
7730 libzfs`add_config() leaks config nvl when reading spare/l2cache devices

illumos/illumos-gate@105686550e
105686550e

https://www.illumos.org/issues/7730
  antares:root:~# mdb /usr/sbin/zpool
  > ::sysbp _exit
  > ::run import
     pool: data
       id: 2093977168778024605
    state: ONLINE
   action: The pool can be imported using its name or numeric identifier.
   config:

          data        ONLINE
            c6t0d0    ONLINE
            c6t1d0    ONLINE
          cache
            c6t2d0
  mdb: stop on entry to _exit
  mdb: target stopped at:
  0xfee556ba:     nop
  mdb: You've got symbols!
  Loading modules: [ ld.so.1 libumem.so.1 libc.so.1 libtopo.so.1 libavl.so.1
  libnvpair.so.1 ]
  > ::findleaks -d
  BYTES             LEAKED VMEM_SEG CALLER
  4096                  10 fda7b000 MMAP
  8192                   1 fea8d000 MMAP
  8192                   1 fe76d000 MMAP
  8192                   1 fe66e000 MMAP
  4096                   1 fe570000 MMAP
  8192                   1 fe470000 MMAP
  4096                   1 fe372000 MMAP
  4096                   1 fe273000 MMAP

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Yuri Pankov <yuri.pankov@nexenta.com>
2018-02-20 20:37:01 +00:00
Alexander Motin
9f6122a6e7 MFV r316893:
7604 if volblocksize property is the default, it displays as "-" rather than 8K

illumos/illumos-gate@4d86c0eab2
4d86c0eab2

https://www.illumos.org/issues/7604
  If a zvol has the default setting for the "volblocksize" property, it is
  8KB. However, it is displayed as "-" (not present), rather than "8K".
  The problem was introduced by:
  commit 25228e830e86924a41243343b1de9daf2d7dd43a
      Author: Matthew Ahrens &lt;mahrens@delphix.com&gt;
      Date:   Thu Nov 17 14:37:24 2016 -0800
  7571 non-present readonly numeric ZFS props do not have default value
  which changed changed get_numeric_property() to indicate that readonly
  default properties are not present. However, zfs_prop_readonly() returns
  TRUE for both readonly and set-once properties (e.g. volblocksize).
  Amusingly, that commit essentially reverted:
  6900484 default volblocksize is no longer being reported correctly
  from November 2009. However, that change was not correct either; the
  correct solution is to only do this check for "truly readonly" (i.e. not
  setonce) properties.
  $ zfs list -t volume -o name,volblocksize
      NAME
  VOLBLOCK
      domain0/group-100/appdata_container-101/appdata_windows_timeflow-102/
  archive            -
      domain0/group-100/appdata_container-101/appdata_windows_timeflow-102/
  datafile           -
      domain0/group-100/appdata_container-101/appdata_windows_timeflow-102/
  external           -
      rpool/dump
  128K
      rpool/swap
  4K
      rpool/swap1
  ===============================================================================

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2018-02-20 20:34:04 +00:00
Alexander Motin
806141acfc MFV r316876: 7542 zfs_unmount failed with EZFS_UNSHARENFSFAILED
illumos/illumos-gate@09c9e6dc9b
09c9e6dc9b

https://www.illumos.org/issues/7542
  libshare keeps a cached copy of the sharetab listing in memory, which can
  become out of date if shares are destroyed or created while leaving a libzfs
  handle open. This results in a spurious unmounting failure when an NFS share
  exists but isn't in the stale libshare cache.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Amdur <matt.amdur@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Chris Williamson <chris.williamson@delphix.com>
2018-02-20 20:30:40 +00:00
Alexander Motin
6d03c5504b MFV r316875: 7336 vfork and O_CLOEXEC causes zfs_mount EBUSY
illumos/illumos-gate@873c4903a5
873c4903a5

https://www.illumos.org/issues/7336
  We can run into a problem where we call into zfs_mount, which in turn calls
  is_dir_empty, which opens the directory to try and make sure it's empty. The
  issue with the current approach is that it holds the directory open while it
  traverses it with readdir, which, due to subtle interaction with the Java JVM,
  vfork, and exec can cause a tricky race condition resulting in zfs_mount
  failures.
  The approach to resolving the issue in this patch is to drop the usage of
  readdir altogether, and instead rely on the fact that ZFS stores the number of
  entries contained in a directory using the st_size field of the stat structure.
  Thus, if the directory in question is a ZFS directory, we can check to see if
  it's empty by calling stat() and inspecting the st_size field of structure
  returned.
  ===============================================================================
  The root cause appears to be an interesting race between vfork, exec, and
  zfs_mount's usage of O_CLOEXEC when calling openat. Here's what is going on:
  1. We call zfs_mount, and this in turn calls openat to check if the directory
  is empty, which results in opening the directory we're trying to mount onto,
  and increment v_count.
  2. As we're in the middle of reading the directory, vfork is called by the JVM
  and proceeds to exec the jspawnhelper utility. As a result of the vfork, we
  take an additional hold on the directory, which increments v_count a second
  time. The semantics of vfork mean the parent process will wait for the child
  process to exit or exec before the parent can continue; at this point the
  parent is in the middle of zfs_mount, reading the directory to determine if
  it's empty or not.
  3. The child process exec-ing jspawnhelper gets to the relvm call within
  exec_args (which is called by exec_common). relvm is the function that releases
  the parent process, allowing the parent to proceed. The problem is, at this
  point of calling relvm, the child hasn't yet called close_exec which is
  responsible for closing the file descriptors inherited from the parent process

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Prakash Surya <prakash.surya@delphix.com>
2018-02-20 20:26:48 +00:00
Alexander Motin
c2e79753fe MFV r316873: 7233 dir_is_empty should open directory with CLOEXEC
illumos/illumos-gate@d420209d9c
d420209d9c

https://www.illumos.org/issues/7233
  This fixes a race where one thread is executing zfs_mount() while another
  thread forks and execs. If the fork occurs while the directory is open, the
  child process will inherit (but not necessarily close immediately) the open fd
  for the directory, preventing the mount.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Alex Reece <alex@delphix.com>
2018-02-20 20:17:19 +00:00