freebsd-nq/sys
Alexander Motin 1ea10a60f9 MFV r329793, r329795:
9075 Improve ZFS pool import/load process and corrupted pool recovery

illumos/illumos-gate@6f7938128a

Some work has been done lately to improve the debuggability of the ZFS pool
load (and import) process. This includes:

https://www.illumos.org/issues/7638: Refactor spa_load_impl into several functions
https://www.illumos.org/issues/8961: SPA load/import should tell us why it failed
https://www.illumos.org/issues/7277: zdb should be able to print zfs_dbgmsg's

Building on that work, a few changes were made to make the import process
more resilient and crash-free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.

The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared and, if there was
a mismatch, the load process was aborted and an error was returned.

That check was a good way to ensure a pool does not get corrupted; however, it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.
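
The two-step load described above can be sketched roughly as follows. This is
an illustrative model only, not the code from this change: every name prefixed
with sketch_ is hypothetical, and the "MOS" here is just a struct field
standing in for the on-disk copy of the configuration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_VDEVS 8

/* Hypothetical, simplified stand-in for a pool configuration. */
struct sketch_config {
	size_t nvdevs;
	unsigned long long guids[SKETCH_MAX_VDEVS];
};

/* Hypothetical pool handle; far simpler than the real in-kernel state. */
struct sketch_pool {
	struct sketch_config mos_config; /* config stored in the pool itself */
	struct sketch_config active;     /* vdev tree currently in use */
	bool is_open;
	bool config_trusted;
	bool writes_enabled;
};

/* Build the vdev tree from cfg and open it; writes only if trusted. */
void
sketch_pool_open(struct sketch_pool *p, const struct sketch_config *cfg,
    bool trusted)
{
	assert(p != NULL && cfg != NULL);
	p->active = *cfg;
	p->is_open = true;
	p->config_trusted = trusted;
	p->writes_enabled = trusted;
}

/* Read the authoritative configuration out of the (simulated) MOS. */
int
sketch_read_mos_config(const struct sketch_pool *p, struct sketch_config *out)
{
	assert(p != NULL && out != NULL);
	if (!p->is_open)
		return (-1);
	*out = p->mos_config;
	return (0);
}

/* Two-step load: open untrusted, read the MOS config, re-open trusted. */
int
sketch_pool_load(struct sketch_pool *p, const struct sketch_config *userland)
{
	struct sketch_config trusted;

	sketch_pool_open(p, userland, false);      /* step 1: untrusted */
	if (sketch_read_mos_config(p, &trusted) != 0)
		return (-1);
	sketch_pool_open(p, &trusted, true);       /* step 2: trusted */
	return (0);
}
```

In this model a stale userland configuration that is missing a device still
leads to a pool whose active vdev tree matches the configuration read from the
MOS, instead of an aborted load.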

When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.
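
As a rough illustration of the DVA check described above, the sketch below
validates each DVA of a block pointer against the number of known vdevs. All
names are hypothetical, a block pointer is assumed to carry three DVAs, and a
-1 return value stands in for the panic that the trusted case would trigger.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_DVAS_PER_BP 3

/* Hypothetical, simplified DVA: which vdev, and where on it. */
struct sketch_dva {
	unsigned int vdev_id;
	unsigned long long offset;
};

/* Hypothetical, simplified block pointer holding several DVAs. */
struct sketch_blkptr {
	struct sketch_dva dva[SKETCH_DVAS_PER_BP];
};

/* A DVA is "known" if its vdev id falls inside the current vdev tree. */
bool
sketch_dva_is_known(const struct sketch_dva *dva, unsigned int nvdevs)
{
	return (dva->vdev_id < nvdevs);
}

/*
 * Returns the number of DVAs that are safe to read and sets bit i of
 * *mask for each of them. With a trusted configuration an unknown vdev
 * is treated as fatal (-1 here, a panic in the text above); with an
 * untrusted configuration the invalid DVA is simply skipped.
 */
int
sketch_pick_readable_dvas(const struct sketch_blkptr *bp, unsigned int nvdevs,
    bool config_trusted, unsigned int *mask)
{
	int readable = 0;

	assert(bp != NULL && mask != NULL);
	*mask = 0;
	for (size_t i = 0; i < SKETCH_DVAS_PER_BP; i++) {
		if (sketch_dva_is_known(&bp->dva[i], nvdevs)) {
			*mask |= 1u << i;
			readable++;
		} else if (config_trusted) {
			return (-1);
		}
	}
	return (readable);
}
```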

This new two-step pool load process now allows rewinding pools across
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.

With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top-level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
of data, this feature is for now restricted to a read-only import.
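
The read-only restriction can be pictured as a simple policy check. The sketch
below is an assumption-laden model, not the actual implementation: the names
are hypothetical, and top-level devices are reduced to a present/missing flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum sketch_import_result {
	SKETCH_IMPORT_OK,
	SKETCH_IMPORT_REFUSED
};

/* Count the top-level devices that could not be opened. */
size_t
sketch_count_missing(const bool *toplevel_present, size_t n)
{
	size_t missing = 0;

	assert(toplevel_present != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (!toplevel_present[i])
			missing++;
	}
	return (missing);
}

/*
 * A complete pool may be imported in any mode. A pool with missing
 * top-level devices is only accepted for a read-only import.
 */
enum sketch_import_result
sketch_check_import(const bool *toplevel_present, size_t n, bool readonly)
{
	if (sketch_count_missing(toplevel_present, n) == 0)
		return (SKETCH_IMPORT_OK);
	return (readonly ? SKETCH_IMPORT_OK : SKETCH_IMPORT_REFUSED);
}
```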

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
2018-02-22 03:15:35 +00:00
amd64 Correct proper nouns in the Linuxulator 2018-02-22 02:24:17 +00:00
arm Adjust whitespace of things added in the past couple years to match the 2018-02-20 14:59:29 +00:00
arm64 vm_wait() rework. 2018-02-20 10:13:13 +00:00
bsm sys: further adoption of SPDX licensing ID tags. 2017-11-20 19:43:44 +00:00
cam Minor formatting nits. 2018-02-21 23:49:35 +00:00
cddl MFV r329793, r329795: 2018-02-22 03:15:35 +00:00
compat Correct proper nouns in the Linuxulator 2018-02-22 02:24:17 +00:00
conf MFV r329502: 7614 zfs device evacuation/removal 2018-02-21 16:51:02 +00:00
contrib Define CK_MD_TSO for the relevant arches (i386, amd64 and sparc64). 2018-02-16 17:50:06 +00:00
crypto ccp(4): Store IV in output buffer in GCM software fallback when requested 2018-01-27 07:41:31 +00:00
ddb Implement 'domainset', a cpuset based NUMA policy mechanism. This allows 2018-01-12 22:48:23 +00:00
dev Avoid grabbing locks when grabbing the vt(4) console for DDB. 2018-02-22 02:26:29 +00:00
dts Add a skeleton Clock Manager for RPi2/3, and use that from pwm 2018-01-22 07:10:30 +00:00
fs {ext2|ufs}_readdir: Avoid setting negative ncookies. 2018-02-06 22:38:19 +00:00
gdb sys/gdb: further adoption of SPDX licensing ID tags. 2017-11-27 15:16:59 +00:00
geom Fix a memory leak introduced in r328426. 2018-02-16 15:41:03 +00:00
gnu bwn(4): txpid2g/txpid5g[lh] are not defined after sromrev 7; the default 2018-02-13 17:43:54 +00:00
i386 Correct proper nouns in the Linuxulator 2018-02-22 02:24:17 +00:00
isa Add ISA PNP tables to ISA drivers. Fix a few incidental comments. 2018-01-29 00:22:30 +00:00
kern Fix the broken subqueue assignment for the cleanq. 2018-02-20 21:27:17 +00:00
kgssapi kgssapi: Remove trivial deadcode 2018-02-14 00:12:03 +00:00
libkern libkern: use nul for terminating char rather than 0 2018-02-13 19:17:48 +00:00
mips vm_wait() rework. 2018-02-20 10:13:13 +00:00
modules Add required header files. 2018-02-21 16:36:44 +00:00
net Allow route change requests to not specify the gateway. 2018-02-21 19:13:23 +00:00
net80211 net80211: sanitize input for ieee80211_output() 2017-12-30 00:40:34 +00:00
netgraph ng_pppoe(8): add support for user-supplied Host-Uniq tag. 2018-02-14 21:17:44 +00:00
netinet Reinitialize IP header length after checksum calculation. It is used 2018-02-10 10:13:17 +00:00
netinet6 Update the MTU in affected routes when IPv6 RA changes the MTU 2018-02-12 19:49:20 +00:00
netipsec Remove unused variables and sysctl declaration. 2018-02-19 12:20:51 +00:00
netpfil Remove duplicate #include <netinet/ip_var.h>. 2018-02-07 19:12:05 +00:00
netsmb Unsign some values related to allocation. 2018-01-22 02:08:10 +00:00
nfs Modernize nfssvc(2) registartion. 2018-02-08 20:09:42 +00:00
nfsclient style: Remove remaining deprecated MALLOC/FREE macros 2018-01-25 22:25:13 +00:00
nfsserver sys: general adoption of SPDX licensing ID tags. 2017-11-27 15:23:17 +00:00
nlm Use syscall_helper_register() to register syscalls and initialize though 2018-02-10 01:09:22 +00:00
ofed Import the mthca kernel side infiniband driver from Linux 4.9 and fix 2018-02-13 17:04:34 +00:00
opencrypto Move per-operation data out of the csession structure. 2018-01-26 23:21:50 +00:00
powerpc Add definition for the PowerPC A2. 2018-02-21 15:15:58 +00:00
riscv vm_wait() rework. 2018-02-20 10:13:13 +00:00
rpc Do pass removing some write-only variables from the kernel. 2017-12-25 04:48:39 +00:00
security Reduce duplication in __mac_*_(file|link)(2) implementation. 2018-02-15 18:57:22 +00:00
sparc64 Make v_wire_count a per-cpu counter(9) counter. This eliminates a 2018-02-12 22:53:00 +00:00
sys Use 'const int *' for sysentvec errno translation table 2018-02-22 01:59:59 +00:00
teken sys: general adoption of SPDX licensing ID tags. 2017-11-27 15:23:17 +00:00
tests
tools Avoid using \$. It's an unknown escape sequence. Some awks warn about 2018-01-28 05:13:08 +00:00
ufs Refactor fix in r329600 to do its check once in readsuper() rather 2018-02-21 19:56:19 +00:00
vm vm_wait() rework. 2018-02-20 10:13:13 +00:00
x86 Don't include DMAR map entry zone items in kernel dumps. 2018-02-18 16:03:50 +00:00
xdr sys: general adoption of SPDX licensing ID tags. 2017-11-27 15:23:17 +00:00
xen sys: general adoption of SPDX licensing ID tags. 2017-11-27 15:23:17 +00:00
Makefile Move sys/boot to stand. Fix all references to new location 2017-11-14 23:02:19 +00:00