Commit Graph

1685 Commits

Author SHA1 Message Date
Alexander Motin
0285589b38 MFV 337214:
9621 Make createtxg and guid properties public

illumos/illumos-gate@e8d4a73c86

Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Approved by: Robert Mustacchi <rm@joyent.com>
Author:     Josh Paetzel <josh@tcbug.org>
2018-08-03 00:24:27 +00:00
Alexander Motin
54b838a67c MFV r337212:
9465 ARC check for 'anon_size > arc_c/2' can stall the system

illumos/illumos-gate@abe1fd01ce

Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author:     Don Brady <don.brady@delphix.com>
2018-08-03 00:14:36 +00:00
Alexander Motin
904eaff3f7 MFV r337210: 9577 remove zfs_dbuf_evict_key tsd
The zfs_dbuf_evict_key TSD (thread-specific data) is not necessary - we can
instead pass a flag down in a few places to prevent recursive dbuf eviction.
Making this change has 3 benefits:

1. The code semantics are easier to understand.
2. On Linux, performance is improved, because creating/removing TSD values
(by setting to NULL vs non-NULL) is expensive, and we do it very often.
3. According to Nexenta, the current semantics can cause a deadlock when
concurrently calling dmu_objset_evict_dbufs() (which is rare today, but they
are working on a "parallel unmount" change that triggers this more easily)

illumos/illumos-gate@c2919acbea

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author:     Matthew Ahrens <mahrens@delphix.com>
2018-08-03 00:01:48 +00:00
Alexander Motin
63ddf13e0d MFV r337208: 9591 ms_shift can be incorrectly changed in MOS config for
indirect vdevs that have been historically expanded

illumos/illumos-gate@11f6a9680e

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author:     Serapheim Dimitropoulos <serapheim@delphix.com>
2018-08-02 23:56:07 +00:00
Alexander Motin
4d696665fd MFV r337206: 9338 moved dnode has incorrect dn_next_type
illumos/illumos-gate@c7fbe46df9

Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author:     Matthew Ahrens <mahrens@delphix.com>
2018-08-02 23:50:03 +00:00
Alexander Motin
b9b2457429 MFV r337204: 9439 ZFS double-free due to failure to dirty indirect block
illumos/illumos-gate@99a19144e8

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author:     Matthew Ahrens <mahrens@delphix.com>
2018-08-02 23:46:30 +00:00
Alexander Motin
050b16aea8 MFV r337200:
9438 Holes can lose birth time info if a block has a mix of birth times

Ultimately, the problem here is that when you truncate and write a file in
the same transaction group, the dbuf for the indirect block will be zeroed
out to deal with the truncation, and then written for the write. During
this process, we will lose hole birth time information for any holes in the
range. In the case where a dnode is being freed, we need to determine
whether the block should be converted to a higher-level hole in the zio
pipeline, and if so do it when the dnode is being synced out.

illumos/illumos-gate@738e2a3ce3

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author:     Paul Dagnelie <pcd@delphix.com>
2018-08-02 23:43:01 +00:00
Alexander Motin
7420e614fe Fix build after r337196 mismerge. 2018-08-02 23:40:28 +00:00
Alexander Motin
fc5b6d13a6 MFV r337197: 9456 ztest failure in zil_commit_waiter_timeout
illumos/illumos-gate@b6031810da

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author:     Prakash Surya <prakash.surya@delphix.com>
2018-08-02 23:25:49 +00:00
Alexander Motin
3a412471a4 MFV r337195: 9454 ::zfs_blkstats should count embedded blocks
illumos/illumos-gate@dec267e7ea

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author:     Matthew Ahrens <mahrens@delphix.com>
2018-08-02 23:23:10 +00:00
Alexander Motin
afc0e55ebb MFV r337193:
9424 ztest failure: "unprotected error in call to Lua API (Invalid value type 'f
unction' for key 'error')"

illumos/illumos-gate@fe3ba4d122

Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Don Brady <don.brady@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author:     Matthew Ahrens <mahrens@delphix.com>
2018-08-02 23:15:10 +00:00
Alexander Motin
01c0f8bef3 MFV r337190: 9486 reduce memory used by device removal on fragmented pools
In the most fragmented real-world cases, this reduces memory used by the
mapping from ~1GB to ~50MB of RAM per 1TB of storage removed. Less
fragmented cases will typically also see around 50-100MB of RAM per 1TB
of storage.

illumos/illumos-gate@cfd63e1b1b

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author:     Matthew Ahrens <mahrens@delphix.com>
2018-08-02 21:59:46 +00:00
Alexander Motin
2bce9a5316 MFV r337182: 9330 stack overflow when creating a deeply nested dataset
Datasets that are deeply nested (~100 levels) are impractical. We just put
a limit of 50 levels to newly created datasets. Existing datasets should
work without a problem.

illumos/illumos-gate@5ac95da7d6

Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author:     Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
2018-08-02 21:19:35 +00:00
Alexander Motin
3f586250d8 9539 Make zvol operations use _by_dnode routines
Continues what was started in 7801 add more by-dnode routines by fully
converting zvols to avoid unnecessary dnode_hold() calls. This saves a
small amount of CPU time and slightly improves latencies of operations
on zvols.

illumos/illumos-gate@8dfe5547fb

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Rick McNeal <rick.mcneal@nexenta.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author:     Richard Yao <richard.yao@prophetstor.com>
2018-08-02 21:07:04 +00:00
Alexander Motin
5463ca24bf MFV r337175: 9487 Free objects when receiving full stream as clone
All objects after the last written or freed object are not supposed to
exist after receiving the stream. We should free them accordingly, as if
a freeobjects record for them had been included in the stream.

zfsonlinux/zfs@48fbb9ddbf
illumos/illumos-gate@7864b8192b

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author:     Paul Dagnelie <pcd@delphix.com>
2018-08-02 20:33:13 +00:00
Alexander Motin
0485d5cba8 MFV r337171:
9464 txg_kick() fails to see that we are quiescing, forcing transactions
to their next stages without leaving them accumulate changes

Ideally we would like txg_kick() to get triggered only when we are sure
that we are not syncing AND not quiescing any txg. This way we can kick
an open TXG to the quiescing state when we are sure that there is nothing
going on and we would benefit from the different states running
concurrently.

illumos/illumos-gate@fa41d87de9

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Author:     Serapheim Dimitropoulos <serapheim@delphix.com>
2018-08-02 20:18:49 +00:00
Alexander Motin
aba9f1124f MFV r337167: 9442 decrease indirect block size of spacemaps
Updates to indirect blocks of spacemaps can contribute significantly to
write inflation.  Therefore we want to reduce the indirect block size of
spacemaps from 128K to 16K.

illumos/illumos-gate@221813c13b

Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Albert Lee <trisk@forkgnu.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Author:     Matthew Ahrens <mahrens@delphix.com>
2018-08-02 20:06:46 +00:00
Alexander Motin
3c09eaee95 MFV r337029:
9426 metaslab size can exceed offset addressable by spacemap

metaslab size can exceed offset addressable by spacemap. The vdev can
address up to 2^63 * SPA_MAXBLOCKSIZE (512). A metaslab can address up to
2^47 * 2^vdev_ashift. Therefore we may need to increase the number of
metaslabs so that the maximum metaslab size is capped at the amount that
can be addressed by the spacemap. This should happen in
vdev_metaslab_set_size().

illumos/illumos-gate@b4bf0cf045

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author:     Don Brady <don.brady@delphix.com>
2018-08-01 03:21:17 +00:00
Alexander Motin
e74aee0b25 MFV r337027:
9328 zap code can take advantage of c99
9329 panic in zap_leaf_lookup() due to concurrent zapification

illumos/illumos-gate@bf26014c55

Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author:     Matthew Ahrens <mahrens@delphix.com>
2018-08-01 03:07:33 +00:00
Alexander Motin
f12261798c MFV r337022:
9403 assertion failed in arc_buf_destroy() when concurrently reading block with checksum error

This assertion (VERIFY) failure was reported when reading a block. Turns out
the problem is that if we get an i/o error (ECKSUM in this case), and there
are multiple concurrent ARC reads of the same block (from different clones),
then the ARC will put multiple buf's on the same ANON hdr, which isn't
supposed to happen, and then causes a panic when we try to arc_buf_destroy()
the buf.

illumos/illumos-gate@fa98e487a9

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author:     Matthew Ahrens <mahrens@delphix.com>
2018-08-01 02:39:44 +00:00
Alexander Motin
a5c4d9335b MFV r337020:9443 panic when scrub a v10 pool
illumos/illumos-gate@bb1f424574

Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Author:     Matthew Ahrens <mahrens@delphix.com>
2018-07-31 23:00:58 +00:00
Alexander Motin
200c27a75d MFV r337014:
9421 zdb should detect and print out the number of "leaked" objects
9422 zfs diff and zdb should explicitly mark objects that are on the deleted queue

illumos/illumos-gate@20b5dafb42

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author:     Paul Dagnelie <pcd@delphix.com>
2018-07-31 22:50:50 +00:00
Alexander Motin
0021e1c10c MFV r336991, r337001:
9102 zfs should be able to initialize storage devices

The first access to a disk block can incur a performance penalty on some
platforms (e.g. AWS's EBS, VMware VMDKs). Therefore it is recommended that
volumes be "thick provisioned", where supported by the platform (VMware).
Thick provisioning is time consuming and often is ignored. If the thick
provision step is omitted, customers will see suboptimal performance until
we have written to all parts of the LUN. ZFS should be able to initialize
any unused storage to remove any first-write penalty that exists.

illumos/illumos-gate@094e47e980

Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author:     George Wilson <george.wilson@delphix.com>
2018-07-31 21:06:04 +00:00
Alexander Motin
25f15f404d MFV r336960: 9256 zfs send space estimation off by > 10% on some datasets
illumos/illummos-gate@df477c0afa

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author:     Paul Dagnelie <pcd@delphix.com>
2018-07-31 01:02:22 +00:00
Alexander Motin
9abeb6d79e MFV r336958: 9337 zfs get all is slow due to uncached metadata
This project's goal is to make read-heavy channel programs and zfs(1m)
administrative commands faster by caching all the metadata that they will
need in the dbuf layer. This will prevent the data from being evicted, so
that any future call to i.e. zfs get all won't have to go to disk (very
much).

illumos/illumos-gate@adb52d9262

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author:     Matthew Ahrens <mahrens@delphix.com>
2018-07-31 00:58:21 +00:00
Alexander Motin
d1cf4052d0 MFV r336955: 9236 nuke spa_dbgmsg
We should use zfs_dbgmsg instead of spa_dbgmsg.  Or at least,
metaslab_condense() should call zfs_dbgmsg because it's important and rare
enough to always log. It's possible that the message in zio_dva_allocate()
would be too high-frequency for zfs_dbgmsg.

illumos/illumos-gate@21f7c81cc1

Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>
2018-07-31 00:47:27 +00:00
Alexander Motin
656d46109d MFV r336952: 9192 explicitly pass good_writes to vdev_uberblock/label_sync
Currently vdev_label_sync and vdev_uberblock_sync take a zio_t and assume
that its io_private is a pointer to the good_writes count. They should
instead accept this argument explicitly.

illumos/illumos-gate@a3b5583021

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>
2018-07-31 00:37:45 +00:00
Alexander Motin
194000fa21 MFV r336950: 9290 device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk failure
and silent damage (e.g. some data on disk is not right, but ZFS hasn't
detected the whole device as being broken). However, the current device
removal implementation bypasses some of the mirror's redundancy.

illumos/illumos-gate@3a4b1be953

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>
2018-07-31 00:25:39 +00:00
Alexander Motin
9cd6f162c0 MFV r336948: 9112 Improve allocation performance on high-end systems
On high-end systems running async sequential write workloads, especially
NUMA systems with flash or NVMe storage, one significant performance
bottleneck is selecting a metaslab to do allocations from. This process
can be parallelized, providing significant performance increases for
these workloads.

illumos/illumos-gate@f78cdc34af

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Alexander Motin <mav@FreeBSD.org>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Paul Dagnelie <pcd@delphix.com>
2018-07-31 00:02:42 +00:00
Alexander Motin
6413a6d31f MFV r336946: 9238 ZFS Spacemap Encoding V2
The current space map encoding has the following disadvantages:
[1] Assuming 512 sector size each entry can represent at most 16MB for a segment.
This makes the encoding very inefficient for large regions of space.
[2] As vdev-wide space maps have started to be used by new features (i.e.
device removal, zpool checkpoint) we've started imposing limits in the
vdevs that can be used with them based on the maximum addressable offset
(currently 64PB for a top-level vdev).

The new remains backwards compatible with the old one. The introduced
two-word entry format, besides extending the limits imposed by the single-entry
layout, also includes a vdev field and some extra padding after its prefix.

The extra padding after the prefix should is reserved for future usage (e.g.
new prefixes for future encodings or new fields for flags). The new vdev field
not only makes the space maps more self-descriptive, but also opens the doors
for pool-wide space maps.

One final important note is that the number of bits used for vdevs is reduced
to 24 bits for blkptrs. That was decided as we don't know of any setups that
use more than 16M vdevs for the time being and
we wanted to fit the vdev field in the space map. In addition that gives us
some extra bits in dva_t.

illumos/illumos-gate@17f11284b4

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
2018-07-30 23:47:38 +00:00
Alexander Motin
eb235f2f8e MFV r336942: 9189 Add debug to vdev_label_read_config when txg check fails
illumos/illumos-gate@b6bf6e1540

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
2018-07-30 22:03:29 +00:00
Allan Jude
68232f2053 ZFS: Reserve DMU_BACKUP_FEATURE flags for Native Encryption and ZSTD 2018-07-24 04:38:11 +00:00
Sean Eric Fagan
1bbaf1401c Fix a couple of typos in r334844 noticed by Richard Kojedzinszky.
Submitted by:	Richard Kojedzinszky
Reviewed by:	sef
Approved by:	mav
2018-07-18 16:03:40 +00:00
Sean Eric Fagan
072ffd4b20 Fix up some missed and mis-merges from the sequential scan code
(r334844). Most of the changes involve moving some code around to
reduce conflicts with future merges.  One of the missing changes
included a notification on scrub cancellation.

Approved by:	mav
Sponsored by:	iXsystems Inc
2018-07-10 20:11:32 +00:00
Sean Eric Fagan
aad5531e71 This exposes ZFS user and group quotas via the normal
quatactl(2) mechanism.  (Read-only at this point, however.)
In particular, this is to allow rpc.rquotad query quotas
for NFS mounts, allowing users to see their quotas on the
hosts using the datasets.

The changes specifically:

* Add new RPC entry points for querying quotas.
* Changes the library routines to allow non-UFS quotas.
* Changes rquotad to check for quotas on mounted filesystems,
rather than being limited to entries in /etc/fstab
* Lastly, adds a VFS entry-point for ZFS to query quotas.

Note that this makes one unavoidable behavioural change: if quotas
are enabled, then they can be queried, as opposed to the current
method of checking for quotas being specified in fstab.  (With
ZFS, if there are user or group quotas, they're used, always.)

Reviewed by:	delphij, mav
Approved by:	mav
Sponsored by:	iXsystems Inc
Differential Revision:	https://reviews.freebsd.org/D15886
2018-07-05 22:56:13 +00:00
Sean Eric Fagan
69724399c4 This originated from ZFS On Linux, as
d4a72f2386

During scans (scrubs or resilvers), it sorts the blocks in each transaction
group by block offset; the result can be a significant improvement. (On my
test system just now, which I put some effort to introduce fragmentation into
the pool since I set it up yesterday, a scrub went from 1h2m to 33.5m with the
changes.) I've seen similar rations on production systems.

Approved by:	Alexander Motin
Obtained from:	ZFS On Linux
Relnotes:	Yes (improved scrub performance, with tunables)
Differential Revision:	https://reviews.freebsd.org/D15562
2018-06-08 17:38:28 +00:00
Benno Rice
b3b11d6400 Break recursion involving getnewvnode and zfs_rmnode.
When we're at our vnode limit, getnewvnode will call into the vnode LRU
cache to free up vnodes. If the vnode we try to recycle is a ZFS vnode we
end up, eventually, in zfs_rmnode. If the ZFS vnode we're recycling
represents something with extended attributes, zfs_rmnode will call
zfs_zget which will attempt to allocate another vnode. If the next vnode we
try to recycle is also a ZFS vnode representing something with extended
attributes we can recurse further. This ends up being unbounded and can end
up overflowing the stack.

In order to avoid this, restructure zfs_rmnode to simply add the extended
attribute directory's object ID to the unlinked set, thus not requiring the
allocation of a vnode. We then schedule a task that calls zfs_unlinked_drain
which will do the work of properly marking the vnodes for unlinking.
zfs_unlinked_drain is also called on mount so these will be cleaned up
there.

Reviewed by:	avg, mav
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D15342
2018-06-07 18:59:32 +00:00
Andriy Gapon
620b779158 fix zfs_getpages crash when called from sendfile, followup to r329363
It turns out that sendfile_swapin() has an optimization where it may
insert pointers to bogus_page into the page array that it passes to
VOP_GETPAGES.  That happens to work with buffer cache, because it
extensively uses bogus_page internally, so it has the necessary checks.
However, ZFS did not expect bogus_page as VOP_GETPAGES(9) does not
document such a (ab)use of bogus_page.

So, this commit adds checks and handling of bogus_page.

I expect that use of bogus_page with VOP_GETPAGES will get documented
sooner rather than later.

Reported by:	Andrew Reilly <areilly@bigpond.net.au>, delphij
Tested by:	Andrew Reilly <areilly@bigpond.net.au>
Requested by:	many
MFC after:	1 week
2018-05-25 07:29:52 +00:00
Andriy Gapon
873c2703d8 Fix 'zpool create -t <tempname>'
Creating a pool with a temporary name fails when we also specify custom
dataset properties: this is because we mistakenly call
zfs_set_prop_nvlist() on the "real" pool name which, as expected,
cannot be found because the SPA is present in the namespace with the
temporary name.

Fix this by specifying the correct pool name when setting the dataset
properties.

Author: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>

Obtained from:	ZFS on Linux, zfsonlinux/zfs@4ceb8dd6fd
MFC after:	1 week
2018-05-15 13:27:29 +00:00
Matt Macy
cbd92ce62e Eliminate the overhead of gratuitous repeated reinitialization of cap_rights
- Add macros to allow preinitialization of cap_rights_t.

- Convert most commonly used code paths to use preinitialized cap_rights_t.
  A 3.6% speedup in fstat was measured with this change.

Reported by:	mjg
Reviewed by:	oshogbo
Approved by:	sbruno
MFC after:	1 month
2018-05-09 18:47:24 +00:00
Jamie Gritton
0e5c6bd436 Make it easier for filesystems to count themselves as jail-enabled,
by doing most of the work in a new function prison_add_vfs in kern_jail.c
Now a jail-enabled filesystem need only mark itself with VFCF_JAIL, and
the rest is taken care of.  This includes adding a jail parameter like
allow.mount.foofs, and a sysctl like security.jail.mount_foofs_allowed.
Both of these used to be a static list of known filesystems, with
predefined permission bits.

Reviewed by:	kib
Differential Revision:	D14681
2018-05-04 20:54:27 +00:00
Ed Maste
3804f572a3 zfs_ioctl: avoid out-of-bound read
admbugs:	796
Submitted by:	Domagoj Stolfa <ds815@cam.ac.uk>
Reported by:	Ilja Van Sprundel <ivansprundel@ioactive.com>
Reviewed by:	avg
MFC after:	1 day
2018-05-04 00:56:41 +00:00
Alexander Motin
bbbac409fe 9433 Fix ARC hit rate
When the compressed ARC feature was added in commit d3c2ae1
the method of reference counting in the ARC was modified.  As
part of this accounting change the arc_buf_add_ref() function
was removed entirely.

This would have be fine but the arc_buf_add_ref() function
served a second undocumented purpose of updating the ARC access
information when taking a hold on a dbuf.  Without this logic
in place a cached dbuf would not migrate its associated
arc_buf_hdr_t to the MFU list.  This would negatively impact
the ARC hit rate, particularly on systems with a small ARC.

This change reinstates the missing call to arc_access() from
dbuf_hold() by implementing a new arc_buf_access() function.

Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2018-04-16 00:54:58 +00:00
Andriy Gapon
81f187e576 allow ZFS pool to have temporary name for duration of current import
The change adds -t <name> option to zpool create and -t option to zpool
import in its form with an old name and a new name.  This allows to
import (or create) a pool under a name that's different from its real,
permanent name without affecting that name.  This is useful when working
with VM images or images of other physical systems if they happen to
have a ZFS pool with the same name as the host system.

The changes come from ZoL with some small tweaks.
The porting has been done by julian.

The change is being submitted to OpenZFS:
https://github.com/openzfs/openzfs/pull/600

Submitted by:	julian
Reviewed by:	smh
MFC after:	2 weeks
Sponsored by:	Panzura (porting)
Differential Revision: https://reviews.freebsd.org/D14972
2018-04-12 10:37:26 +00:00
Mark Johnston
87c1cb45cd Correct a comment.
Submitted by:	Domagoj Stolfa
X-MFC with:	r332364
Sponsored by:	DARPA, AFRL
2018-04-10 14:07:02 +00:00
Mark Johnston
b32171ea5a Set zfs_arc_free_target to v_free_target.
Page daemon output is now regulated by a PID controller with a setpoint
of v_free_target. Moreover, the page daemon now wakes up regularly
rather than waiting for a wakeup from another thread. This means that
the free page count is unlikely to drop below the old
zfs_arc_free_target value, and as a result the ARC was not readily
freeing pages under memory pressure. Address the immediate problem by
updating zfs_arc_free_target to match the page daemon's new behaviour.

Reported and tested by:	truckman
Discussed with:	jeff
X-MFC with:	r329882
Differential Revision:	https://reviews.freebsd.org/D14994
2018-04-10 13:56:06 +00:00
Mark Johnston
8593136428 Assert that dtrace_probe() doesn't re-enter itself.
This helps catch cases where an instrumented function is called while
in probe context.

Submitted by:	Domagoj Stolfa <domagoj.stolfa@gmail.com>
MFC after:	2 weeks
Sponsored by:	DARPA/AFRL
Differential Revision:	https://reviews.freebsd.org/D14863
2018-04-10 13:47:09 +00:00
Alexander Motin
8b26d76a50 9434 Speculative prefetch is blocked by device removal code.
Device removal code does not set spa_indirect_vdevs_loaded for pools
that never experienced device removal.  At least one visual consequence
of it is completely blocked speculative prefetcher.  This patch sets
the variable in such situations.
2018-04-03 21:16:41 +00:00
Alexander Motin
849a7ce2d5 MFV r331712:
9280 Assertion failure while running removal_with_ganging test with 4K devices

illumos/illumos-gate@243952c7ee

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Matt Ahrens <Matt.Ahrens@delphix.com>
2018-03-28 23:17:29 +00:00
Alexander Motin
a96086ed50 MFV 331710:
9188 increase size of dbuf cache to reduce indirect block decompression

illumos/illumos-gate@268bbb2a2f

With compressed ARC (6950) we use up to 25% of our CPU to decompress indirect
blocks, under a workload of random cached reads. To reduce this decompression
cost, we would like to increase the size of the dbuf cache so that more
indirect blocks can be stored uncompressed.

If we are caching entire large files of recordsize=8K, the indirect blocks
use 1/64th as much memory as the data blocks (assuming they have the same
compression ratio). We suggest making the dbuf cache be 1/32nd of all memory,
so that in this scenario we should be able to keep all the indirect blocks
decompressed in the dbuf cache. (We want it to be more than the 1/64th that
the indirect blocks would use because we need to cache other stuff in the
dbuf cache as well.)

In real world workloads, this won't help as dramatically as the example
above, but we think it's still worth it because the risk of decreasing
performance is low. The potential negative performance impact is that we
will be slightly reducing the size of the ARC (by ~3%).

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Allan Jude <allanjude@freebsd.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: George Wilson <george.wilson@delphix.com>
2018-03-28 23:05:48 +00:00
Alexander Motin
9311abdd7e MFV r331708:
9321 arc_loan_compressed_buf() can increment arc_loaned_bytes by the wrong value

illumos/illumos-gate@9be12bd737

arc_loan_compressed_buf() increments arc_loaned_bytes by psize unconditionally
In the case of zfs_compressed_arc_enabled=0, when the buf is returned via
arc_return_buf(), if ARC_BUF_COMPRESSED(buf) is false, then arc_loaned_bytes
is decremented by lsize, not psize.

Switch to using arc_buf_size(buf), instead of psize, which will return
psize or lsize, depending on the result of ARC_BUF_COMPRESSED(buf).

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Allan Jude <allanjude@freebsd.org>
2018-03-28 22:50:05 +00:00
Alexander Motin
5c4561f332 MFV r331706:
9235 rename zpool_rewind_policy_t to zpool_load_policy_t

illumos/illumos-gate@5dafeea3eb

We want to be able to pass various settings during import/open of a pool,
which are not only related to rewind. Instead of adding a new policy and
duplicate a bunch of code, we should just rename rewind_policy to a more
generic term like load_policy.

For instance, we'd like to set spa->spa_import_flags from the nvlist,
rather from a flags parameter passed to spa_import as in some cases we want
those flags not only for the import case, but also for the open case. One
such flag could be ZFS_IMPORT_MISSING_LOG (as used in zdb) which would
allow zfs to open a pool when logs are missing.

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
2018-03-28 22:29:06 +00:00
Alexander Motin
6cc8a8260f MFV 331704:
9191 dump vdev tree to zfs_dbgmsg when spa load fails due to missing log devices

illumos/illumos-gate@ccef24b493

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
2018-03-28 22:10:06 +00:00
Alexander Motin
2221f0d8af MFV 331702:
9187 racing condition between vdev label and spa_last_synced_txg in vdev_validate

illumos/illumos-gate@d1de72cfa2

ztest failed with uncorrectable IO error despite having the fix for #7163.
Both sides of the mirror have CANT_OPEN_BAD_LABEL, which also distinguishes
it from that issue.

Definitely seems like a racing condition between the vdev_validate and spa_sync:
1. Thread A (spa_sync): vdev label is updated to latest txg
2. Thread B (vdev_validate): vdev label's txg is compared to spa_last_synced_txg and is ahead.
3. Thread A (spa_sync): spa_last_synced_txg is updated to latest txg.

Solution: do not check txg in vdev_validate unless config lock is held.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
2018-03-28 22:07:31 +00:00
Alexander Motin
0b0c76bc58 MFV r331695, 331700: 9166 zfs storage pool checkpoint
illumos/illumos-gate@8671400134

The idea of Storage Pool Checkpoint (aka zpool checkpoint) deals with
exactly that.  It can be thought of as a “pool-wide snapshot” (or a
variation of extreme rewind that doesn’t corrupt your data).  It remembers
the entire state of the pool at the point that it was taken and the user
can revert back to it later or discard it.  Its generic use case is an
administrator that is about to perform a set of destructive actions to ZFS
as part of a critical procedure.  She takes a checkpoint of the pool before
performing the actions, then rewinds back to it if one of them fails or puts
the pool into an unexpected state.  Otherwise, she discards it.  With the
assumption that no one else is making modifications to ZFS, she basically
wraps all these actions into a “high-level transaction”.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
2018-03-28 22:01:27 +00:00
Andriy Gapon
f4043145f2 ZFS vn_rele_async: catch up with the use of refcount(9) for the vnode use count
It's not sufficient nor required to use the vnode interlock when
checking if we are going to drop the last use count as the code in
vputx() uses refcount (atomic) operations for both checking and
decrementing the use code.  Apply the same method to vn_rele_async().
While here, remove vn_rele_inactive(), a wrapper around vrele() that
didn't add any value.

Also, the change required making vfs_refcount_release_if_not_last()
public.  I've made vfs_refcount_acquire_if_not_zero() public as well.
They are in sys/refcount.h now.  While making the move I've dropped the
vfs_ prefix.

Reviewed by:	mjg
MFC after:	2 weeks
Sponsored by:	Panzura
Differential Revision: https://reviews.freebsd.org/D14869
2018-03-28 08:55:31 +00:00
John Baldwin
d41e41f9f0 Remove very old and unused signal information codes.
These have been supplanted by the MI signal information codes in
<sys/signal.h> since 7.0.  The FPE_*_TRAP ones were deprecated even
earlier in 1999.

PR:		226579 (exp-run)
Reviewed by:	kib
Differential Revision:	https://reviews.freebsd.org/D14637
2018-03-27 20:57:51 +00:00
Andriy Gapon
f3fe7e5eff zfs: fix mismatch between format specifier and type
vdev_dbgmsg_print_tree printed vdev_id of uint64_t type with %u format
specifier.  That caused subsequent parameters to be incorrectly read
from the stack and lead to a crash when a wrong value was interpreted as
a string pointer.

This should be upstreamed.

Reported by:	pho
MFC after:	3 days
2018-03-23 09:42:47 +00:00
Alexander Motin
4148c56f78 Reduce struct aggsum_bucket padding to fit into one cache line.
Reported by:	mjg
2018-03-23 02:50:38 +00:00
Alexander Motin
e76e77a972 MFV r331407: 9213 zfs: sytem typo
illumos/illumos-gate@edc8ef7d92

Reviewed by: C Fraire <cfraire@me.com>
Reviewed by: Andy Fiddaman <omnios@citrus-it.co.uk>
Approved by: Joshua M. Clulow <josh@sysmgr.org>
Author: Toomas Soome <tsoome@me.com>
2018-03-23 02:30:29 +00:00
Alexander Motin
f222611ab0 MFV r331405: 9084 spa_*_ashift must ignore spare devices
illumos/illumos-gate@b037f3dbd6

Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>
2018-03-23 02:24:52 +00:00
Alexander Motin
b8436536c9 MFV r331400: 8484 Implement aggregate sum and use for arc counters
In pursuit of improving performance on multi-core systems, we should
implements fanned out counters and use them to improve the performance of
some of the arc statistics. These stats are updated extremely frequently,
and can consume a significant amount of CPU time.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Paul Dagnelie <pcd@delphix.com>
2018-03-23 02:15:05 +00:00
Mark Johnston
95099bbad1 Fix an access of an uninitialized variable in dtrace_probe().
Reported by:	Coverity, via cem
MFC after:	3 days
2018-03-18 17:01:50 +00:00
Andriy Gapon
289c14e811 MFV r330973: 9164 assert: newds == os->os_dsl_dataset
illumos/illumos-gate@5f5913bb83
5f5913bb83

https://www.illumos.org/issues/9164
  This issue has been reported by Alan Somers as
  https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225877

  dmu_objset_refresh_ownership() first disowns a dataset (and releases
  it) and then owns it again. There is an assert that the new dataset
  object is the same as the old dataset object.  When running ZFS Test
  Suite on FreeBSD we see this panic from zpool_upgrade_007_pos test:

  panic: solaris assert: newds == os->os_dsl_dataset (0xfffff80045f4c000
  == 0xfffff80021ab4800)

  I see that the old dataset has dsl_dataset_evict_async() pending in
  ds_dbu.dbu_tqent and its ds_dbuf is NULL.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Don Brady <don.brady@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Andriy Gapon <avg@FreeBSD.org>

PR:		225877
Reported by:	asomers
MFC after:	1 week
2018-03-15 08:49:21 +00:00
Steven Hartland
a59e720e65 Prevent ZFS TRIM breaking VTOC8 partitions
Update the ZFS TRIM code to ensure it respects VTOC8 partition headers as
documented by the ZFS On-Disk Specification section 1.3

Before this a zpool create on a VTOC8 partitioned device would overwrite the
partition metadata.

Reported by:	marius
Reviewed by:	marius agv
MFC after:	1 week
Sponsored by:	Multiplay
2018-03-14 21:21:03 +00:00
Andriy Gapon
4a24755d68 MFV r330591: 8984 fix for 6764 breaks ACL inheritance
illumos/illumos-gate@e9bacc6d1a
e9bacc6d1a

https://www.illumos.org/issues/8984
  Consider a directory configured as:
  drwx-ws---+ 2 henson cpp 3 Jan 23 12:35 dropbox/
  user:henson:rwxpdDaARWcC--:f-i----:allow
  owner@:--------------:f-i----:allow
  group@:--------------:f-i----:allow
  everyone@:--------------:f-i----:allow
  owner@:rwxpdDaARWcC--:-di----:allow
  group:cpp:-wx-----------:-------:allow
  owner@:rwxpdDaARWcC--:-------:allow
  A new file created in this directory ends up looking like:
  rw-r--r-+ 1 astudent cpp 0 Jan 23 12:39 testfile
  user:henson:rw-pdDaARWcC--:------I:allow
  owner@:--------------:------I:allow
  group@:--------------:------I:allow
  everyone@:--------------:------I:allow
  owner@:rw-p--aARWcCos:-------:allow
  group@:r-----a-R-c--s:-------:allow
  everyone@:r-----a-R-c--s:-------:allow
  with extraneous group@ and everyone@ entries allowing read access that
  shouldn't exist.
  Per Albert Lee on the zfs mailing list:
  "aclinherit=passthrough/passthrough-x should still
  ignore the requested mode when an inheritable ACE for owner@ group@,
  or everyone@ is present in the parent directory.
  It appears there was an oversight in my fix for
  https://www.illumos.org/issues/6764 which made calling zfs_acl_chmod
  from zfs_acl_inherit unconditional. I think the parent ACL check for
  aclinherit=passthrough needs to be reintroduced in zfs_acl_inherit."
  We have a large number of faculty who use dropbox directories like the example
  to have students submit projects. All of these directories are now allowing

Reviewed by: Sam Zaydel <szaydel@racktopsystems.com>
Reviewed by: Paul B. Henson <henson@acm.org>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Dominik Hassler <hadfl@omniosce.org>

PR:		216886
MFC after:	2 weeks
2018-03-07 13:49:26 +00:00
Andriy Gapon
f3b7b054dd add ZFS_ENTER protection to .zfs/snapshot vnode operations that need it
Those operations, zfsctl_snapdir_readdir and zfsctl_snapdir_getattr,
access the filesystem's objset and it can be unstable during operations
like receive and rollback.

MFC after:	2 weeks
2018-02-27 14:08:54 +00:00
Alexander Motin
060bac1db8 Add sysctls/tunables for dbuf cache size.
MFC after:	2 weeks
2018-02-27 01:36:43 +00:00
Alan Somers
92bd443160 Implement CTASSERT using _Static_assert
Prevents warnings about "unused typedef" with GCC-6

Reported by:	GCC-6
MFC after:	18 days
X-MFC-With:	329722
2018-02-24 16:01:21 +00:00
Andriy Gapon
de2cb430ad another rework of getzfsvfs / getzfsvfs_impl code
This change is designed to account for yet another difference between
illumos and FreeBSD VFS.  In FreeBSD a filesystem driver is supposed to
clean up mnt_data in its VFS_UNMOUNT method because it's the last call
into the driver before a struct mount object is destroyed.  The VFS
drains all references to the object before destroying it, but for the
driver it's already as good as gone.
In contrast, illumos VFS provides another method, VFS_FREEVFS, that is
called when all references are drained.  So, the driver can keep its
data after VFS_UNMOUNT and clean it up in VFS_FREEVFS after all
references are gone. This is what ZFS does on illumos.
So there a reference to a filesystem is sufficient to guarantee that the
ZFS specific data, aka zfsvfs_t, stays around (even if the filesystem
gets unmounted).  In FreeBSD we need to vfs_busy the filesystem to get
the same guarantee.  vfs_ref guarantees only that the struct mount is
kept.

The following rules should be observed in getzfsvfs / getzfsvfs_impl on
FreeBSD:
- if we need access to zfsvfs_t then we must use vfs_busy
- if only we need to access struct mount (aka vfs_t), then vfs_ref is
  enough
- when illumos code actually needs only the vfs_t, they still can pass
  the zfsvfs_t and get the vfs_t from it;  that can work in FreeBSD if
  the filesystem is busied, but when it's just referenced then we have
  to pass the vfs_t explicitly
- we cannot call vfs_busy while holding a dataset because that creates a
  LOR with dp_config_rwlock

As a result:
- getzfsvfs_impl now only references the filesystem, same as in illumos,
  but unlike illumos it has to return the vfs_t
- the consumers are updated to account for the change
- getzfsvfs busies the filesystem (and drops the reference from
  getzfsvfs_impl)

Also, zfs_unmount_snap() now gets a busied a filesystem, references it
and then unbusies it essentially reverting actions done in getzfsvfs.
This is needed because the code may perform some checks that require the
zfsvfs_t.  So, those are done before the unbusying.

MFC after:	2 weeks
2018-02-22 13:06:27 +00:00
Andriy Gapon
8d69fe5cc8 followup to r329556, completely remove the covered vnode assert
vrele() acquires the vnode lock only if the hold count drops to zero.
In other scenarios it needs only the interlock.  So,
zfsctl_snapdir_lookup() can race with vfs_mount_destroy() -> vrele()
such that the lookup adds a new reference and then vrele() drops the
mountpoint's reference and only then we check the reference count.
It would be just one in this case.

In fact, the assert should have been removed in r323483 when the code
learned how to deal with the uncovered vnode.

PR:		225795
MFC after:	4 days
X-MFC with:	r329556
2018-02-22 11:41:00 +00:00
Alexander Motin
dd9ceab333 MFV r329803:
9080 recursive enter of vdev_indirect_rwlock from vdev_indirect_remap()

illumos/illumos-gate@bdfded42e6

A scenario came up where a callback executed by vdev_indirect_remap() on a vdev, calls
vdev_indirect_remap() on the same vdev and tries to reacquire vdev_indirect_rwlock that
was already acquired from the first call to vdev_indirect_remap(). The specific scenario,
is that we want to remap a block pointer that is snapshoted but its dataset's remap_deadlist
is not cached. So in order to add it we issue a read through a vdev_indirect_remap() on the
same vdev, which brings up the aforementioned issue.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
2018-02-22 03:54:59 +00:00
Alexander Motin
064827be34 MFV r329799, r329800:
9079 race condition in starting and ending condesing thread for indirect vdevs

illumos/illumos-gate@667ec66f1b

The timeline of the race condition is the following:
[1] Thread A is about to finish condesing the first vdev in spa_condense_indirect_thread(),
so it calls the spa_condense_indirect_complete_sync() sync task which sets the
spa_condensing_indirect field to NULL. Waiting for the sync task to finish, thread A
sleeps until the txg is done. When this happens, thread A will acquire spa_async_lock
and set spa_condense_thread to NULL.
[2] While thread A waits for the txg to finish, thread B which is running spa_sync() checks
whether it should condense the second vdev in vdev_indirect_should_condense() by checking
the spa_condensing_indirect field which was set to NULL by spa_condense_indirect_thread()
from thread A. So it goes on and tries to spawn a new condensing thread in
spa_condense_indirect_start_sync() and the aforementioned assertions fails because thread A
has not set spa_condense_thread to NULL (which is basically the last thing it does before
returning).

The main issue here is that we rely on both spa_condensing_indirect and spa_condense_thread to
signify whether a condensing thread is running. Ideally we would only use one throughout the
codebase. In addition, for managing spa_condense_thread we currently use spa_async_lock which
basically tights condensing to scrubing when it comes to pausing and resuming those actions
during spa export.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
2018-02-22 03:49:06 +00:00
Alexander Motin
1ea10a60f9 MFV r329793, r329795:
9075 Improve ZFS pool import/load process and corrupted pool recovery

illumos/illumos-gate@6f7938128a

Some work has been done lately to improve the debugability of the ZFS pool
load (and import) process. This includes:

https://www.illumos.org/issues/7638: Refactor spa_load_impl into several functions
https://www.illumos.org/issues/8961: SPA load/import should tell us why it failed
https://www.illumos.org/issues/7277: zdb should be able to print zfs_dbgmsg's

To iterate on top of that, there's a few changes that were made to make the
import process more resilient and crash free. One of the first tasks during the
pool load process is to parse a config provided from userland that describes
what devices the pool is composed of. A vdev tree is generated from that config,
and then all the vdevs are opened.

The Meta Object Set (MOS) of the pool is accessed, and several metadata objects
that are necessary to load the pool are read. The exact configuration of the
pool is also stored inside the MOS. Since the configuration provided from
userland is external and might not accurately describe the vdev tree
of the pool at the txg that is being loaded, it cannot be relied upon to safely
operate the pool. For that reason, the configuration in the MOS is read early
on. In the past, the two configurations were compared together and if there was
a mismatch then the load process was aborted and an error was returned.

The latter was a good way to ensure a pool does not get corrupted, however it
made the pool load process needlessly fragile in cases where the vdev
configuration changed or the userland configuration was outdated. Since the MOS
is stored in 3 copies, the configuration provided by userland doesn't have to be
perfect in order to read its contents. Hence, a new approach has been adopted:
The pool is first opened with the untrusted userland configuration just so that
the real configuration can be read from the MOS. The trusted MOS configuration
is then used to generate a new vdev tree and the pool is re-opened.

When the pool is opened with an untrusted configuration, writes are disabled
to avoid accidentally damaging it. During reads, some sanity checks are
performed on block pointers to see if each DVA points to a known vdev;
when the configuration is untrusted, instead of panicking the system if those
checks fail we simply avoid issuing reads to the invalid DVAs.

This new two-step pool load process now allows rewinding pools accross
vdev tree changes such as device replacement, addition, etc. Loading a pool
from an external config file in a clustering environment also becomes much
safer now since the pool will import even if the config is outdated and didn't,
for instance, register a recent device addition.

With this code in place, it became relatively easy to implement a
long-sought-after feature: the ability to import a pool with missing top level
(i.e. non-redundant) devices. Note that since this almost guarantees some loss
Of data, this feature is for now restricted to a read-only import.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
2018-02-22 03:15:35 +00:00
Alexander Motin
613b0d87da 8942 zfs promote .../%recv should be an error
illumos/illumos-gate@add927f8c8

Reported on the ZFSonLinux https://github.com/zfsonlinux/zfs/issues/4843,
fixed by https://github.com/zfsonlinux/zfs/pull/6339:

If we are in the middle of an incremental zfs receive, the child .../%recv
will exist. If you concurrently run zfs promote .../%recv, it will "work",
but then zfs gets confused. For example, there's no obvious way to destroy
the containing filesystem (because it is now a clone of its invisible child).

Attempting to do this promote should be an error. We could fix this by
having zfs_ioc_promote() check if zc_name contains a %, similar to
zfs_ioc_rename().

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: loli10K <ezomori.nozomu@gmail.com>
2018-02-22 01:42:13 +00:00
Alexander Motin
a33ba3dbde MFV r329776: 8477 Assertion failed in vdev_state_dirty(): spa_writeable(spa)
illumos/illumos-gate@f4c1745bd6

Illumos 4080 allows "zpool clear" to work on readonly pools: i don't think
this is the intended behaviour, we shouldn't be allowed to clear readonly
pools. Probably.

A fix is already in the ZFS on Linux repository to addess this issue:
92e43c1718

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: loli10K <ezomori.nozomu@gmail.com>
2018-02-22 01:00:46 +00:00
Alexander Motin
eea9be67e6 MFV r329774:
8408 dsl_props_set_sync_impl() does not handle nested nvlists correctly

illumos/illumos-gate@85723e5eec

When iterating over the input nvlist in dsl_props_set_sync_impl() when we
don't preserve the nvpair name before looking up ZPROP_VALUE, so when we
later go to process it nvpair_name() is always "value" instead of the actual
property name.

This results in a couple of bugs in the recv code:

 - received properties are not restored correctly when failing to receive
   an incremental send stream
 - received properties are not completely replaced by the new ones when
   successfully receiving an incremental send stream

This was discovered on ZFS on Linux (fixed in
5f1346c299)

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: loli10K <ezomori.nozomu@gmail.com>
2018-02-22 00:55:25 +00:00
Alexander Motin
756595f675 MFV r329770: 9035 zfs: this statement may fall through
illumos/illumos-gate@46ac8fdfc5

Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Reviewed by: Andy Fiddaman <omnios@citrus-it.co.uk>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Toomas Soome <tsoome@me.com>
2018-02-22 00:47:38 +00:00
Alexander Motin
502d18a8f1 MFV r329766: 8962 zdb should work on non-idle pools
illumos/illumos-gate@e144c4e6c9

Currently `zdb` consistently fails to examine non-idle pools as it fails
during the `spa_load()` process. The main problem seems to be that
`spa_load_verify()` fails as can be seen below:

$ sudo zdb -d -G dcenter
    zdb: can't open 'dcenter': I/O error

ZFS_DBGMSG(zdb):
    spa_open_common: opening dcenter
    spa_load(dcenter): LOADING
    disk vdev '/dev/dsk/c4t11d0s0': best uberblock found for spa dcenter. txg 40824950
    spa_load(dcenter): using uberblock with txg=40824950
    spa_load(dcenter): UNLOADING
    spa_load(dcenter): RELOADING
    spa_load(dcenter): LOADING
    disk vdev '/dev/dsk/c3t10d0s0': best uberblock found for spa dcenter. txg 40824952
    spa_load(dcenter): using uberblock with txg=40824952
    spa_load(dcenter): FAILED: spa_load_verify failed [error=5]
    spa_load(dcenter): UNLOADING

This change makes `spa_load_verify()` a dryrun when ran from `zdb`. This is
done by creating a global flag in zfs and then setting it in `zdb`.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
2018-02-22 00:42:12 +00:00
Alexander Motin
fa607d017d MFV r329762: 8961 SPA load/import should tell us why it failed
illumos/illumos-gate@3ee8c80c74

When we fail to open or import a storage pool, we typically don't get any
additional diagnostic information, just "no pool found" or "can not import".

While there may be no additional user-consumable information, we should at
least make this situation easier to debug/diagnose for developers and support.
For example, we could start by using `zfs_dbgmsg()` to log each thing that we
try when importing, and which things failed. E.g. "tried uberblock of txg X
from label Y of device Z". Also, we could log each of the stages that we go
through in `spa_load_impl()`.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
2018-02-22 00:03:14 +00:00
Alexander Motin
df331510ab MFV r329760: 7638 Refactor spa_load_impl into several functions
illumos/illumos-gate@1fd3785ff6

spa_load_impl has grown out of proportions.  It is currently over 700
lines long and makes it very hard to follow or debug the import process
even for experienced ZFS developers.  The objective is to split it up
in a series of well commented functions.

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
2018-02-21 23:38:30 +00:00
Alexander Motin
b17bfcde3d 9018 Replace kmem_cache_reap_now() with kmem_cache_reap_soon()
illumos/illumos-gate@36a64e6284

To prevent kmem_cache reaping from blocking other system resources, turn
kmem_cache_reap_now() (which blocks) into kmem_cache_reap_soon(). Callers
to kmem_cache_reap_soon() should use kmem_cache_reap_active(), which
exploits #9017's new taskq_empty().

Reviewed by: Bryan Cantrill <bryan@joyent.com>
Reviewed by: Dan McDonald <danmcd@joyent.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Author: Tim Kordas <tim.kordas@joyent.com>

FreeBSD does not use taskqueue for kmem caches reaping, so this change
is less dramatic then it is on Illumos, just limiting reaping to 1 time
per second.  It may possibly be improved later, if needed.
2018-02-21 23:15:06 +00:00
Alexander Motin
d208c07cf3 MFV r329753: 8809 libzpool should leverage work done in libfakekernel
illumos/illumos-gate@f06dce2c1f

Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Gordon Ross <gordon.w.ross@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Andrew Stormont <astormont@racktopsystems.com>
2018-02-21 21:18:04 +00:00
Alexander Motin
832deba1e5 MFV r329736: 8969 Cannot boot from RAIDZ with parity > 1
illumos/illumos-gate@0fb055e81f

At present it is possible to boot from a root pool that is on RAIDZ but not
one that is on RAIDZ2 or RAIDZ3.  This is because, at the time the pool
version is checked to ensure support for dual/triple parity, the uberblock
has not yet been loaded into the SPA and therefore the code determines that
the pool version is too old and returns ENOTSUP.

Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Andy Fiddaman <omnios@citrus-it.co.uk>

FreeBSD already had this fixed, so this is just a diff reduction.
2018-02-21 18:12:19 +00:00
Alexander Motin
24433f00ea MFV r329502: 7614 zfs device evacuation/removal
illumos/illumos-gate@5cabbc6b49

https://www.illumos.org/issues/7614:
This project allows top-level vdevs to be removed from the storage pool with
“zpool remove”, reducing the total amount of storage in the pool. This
operation copies all allocated regions of the device to be removed onto other
devices, recording the mapping from old to new location. After the removal is
complete, read and free operations to the removed (now “indirect”) vdev must
be remapped and performed at the new location on disk. The indirect mapping
table is kept in memory whenever the pool is loaded, so there is minimal
performance overhead when doing operations on the indirect vdev.

The size of the in-memory mapping table will be reduced when its entries
become “obsolete” because they are no longer used by any block pointers in
the pool. An entry becomes obsolete when all the blocks that use it are
freed. An entry can also become obsolete when all the snapshots that
reference it are deleted, and the block pointers that reference it have been
“remapped” in all filesystems/zvols (and clones). Whenever an indirect block
is written, all the block pointers in it will be “remapped” to their new
(concrete) locations if possible. This process can be accelerated by using
the “zfs remap” command to proactively rewrite all indirect blocks that
reference indirect (removed) vdevs.

Note that when a device is removed, we do not verify the checksum of the data
that is copied. This makes the process much faster, but if it were used on
redundant vdevs (i.e. mirror or raidz vdevs), it would be possible to copy
the wrong data, when we have the correct data on e.g. the other side of the
mirror. Therefore, mirror and raidz devices can not be removed.

Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Prashanth Sreenivasa <pks@delphix.com>
2018-02-21 16:51:02 +00:00
Andriy Gapon
cfb675a138 MFV r329718: 8520 7198 lzc_rollback_to should support rolling back to origin
illumos/illumos-gate@95643f75d2
95643f75d2

https://www.illumos.org/issues/8520
  lzc_rollback_to() should support rolling back to a clone's origin.
  The current checks in zfs_ioc_rollback() would not allow that because the
  origin snapshot belongs to a different filesystem.
  The overly restrictive check was introduced in 7600, but it was not a
  regression as none of the existing tools provided a way to rollback to the
  origin.

https://www.illumos.org/issues/7198
  EINVAL is returned when a dataset does not have any snapshots, so there is
  nothing to roll back to.
  Although the code in zfs_do_rollback checks for that condition in advance, it's
  still possible that the snapshot(s) gets removed after the check and before the
  rollback sync task is executed.
  At the moment zfs command would crash when that happens.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>
MFC after:	2 weeks
2018-02-21 15:12:14 +00:00
Andriy Gapon
754d27df02 MFV r329715: 8997 ztest assertion failure in zil_lwb_write_issue
illumos/illumos-gate@f864f99efe
f864f99efe

https://www.illumos.org/issues/8997
  When dmu_tx_assign is called from zil_lwb_write_issue, it's possible
  for either ERESTART or EIO to be returned.
  If ERESTART is returned, this will cause an assertion to fail directly
  in zil_lwb_write_issue, where the code assumes the return value is
  EIO if dmu_tx_assign returns a non-zero value. This can occur if the
  SPA is suspended when dmu_tx_assign is called, and most often occurs
  when running zloop.
  If EIO is returned, this can cause assertions to fail elsewhere in the
  ZIL code. For example, zil_commit_waiter_timeout contains the
  following logic:
    lwb_t *nlwb = zil_lwb_write_issue(zilog, lwb);
    ASSERT3S(lwb->lwb_state, !=, LWB_STATE_OPENED);
  In this case, if dmu_tx_assign returned EIO from within
  zil_lwb_write_issue, the lwb variable passed in will not be issued
  to disk. Thus, it's lwb_state field will remain LWB_STATE_OPENED and
  this assertion will fail. zil_commit_waiter_timeout assumes that after
  it calls zil_lwb_write_issue, the lwb will be issued to disk, and
  doesn't handle the case where this is not true; i.e. it doesn't handle
  the case where dmu_tx_assign returns EIO.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>
MFC after:	3 weeks
2018-02-21 15:07:49 +00:00
Andriy Gapon
9d6810819c MFV r329713: 8731 ASSERT3U(nui64s, <=, UINT16_MAX) fails for large blocks
illumos/illumos-gate@a6c1eb3c08
a6c1eb3c08

https://www.illumos.org/issues/8731
  annotate_ecksum() asserts that nui64s, calculated as nui64s = size / sizeof
  (uint64_t), is not greater than UINT16_MAX.
  This restriction is needed because histograms of incorrectly set and cleared
  bits have 16 bit counters and if the buffer consists of too many 64-bit words,
  then a counter can potentially overflow producing an incorrect result.
  When the largest buffer size was 128KB the greatest value of nui64s was 16K,
  well within the limit.
  But now we have support for large buffers and for buffer sizes of 512KB and
  above the restriction is violated.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>
MFC after:	2 weeks
2018-02-21 14:31:48 +00:00
Andriy Gapon
70639f9a5e MFV r329710: 8966 Source file zfs_acl.c, function zfs_aclset_common contains a use after end of the lifetime of a local variable
illumos/illumos-gate@82693e09cc
82693e09cc
https://www.illumos.org/issues/8966

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: WHR <msl0000023508@gmail.com>
PR:		225162
Submitted by:	WHR <msl0000023508@gmail.com>
Reported by:	WHR <msl0000023508@gmail.com>
MFC after:	1 week
2018-02-21 14:17:07 +00:00
Alexander Motin
d8e89539c8 MFV r324198: 8081 Compiler warnings in zdb
illumos/illumos-gate@3f7978d02b
3f7978d02b

https://www.illumos.org/issues/8081
  zdb(8) is full of minor problems that generate compiler warnings. On FreeBSD,
  which uses -WError, the only way to build it is to disable all compiler
  warnings. This makes it much harder to detect newly introduced bugs. We should
  cleanup all the warnings.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Alan Somers <asomers@gmail.com>
2018-02-21 03:08:47 +00:00
Alexander Motin
03618fe74d MFV r319737: 6939 add sysevents to zfs core for commands
illumos/illumos-gate@ce1577b049
ce1577b049

https://www.illumos.org/issues/6939
  Originally created https://smartos.org/bugview/OS-4489
       sysevents should be fired in the kernel from ZFS whenever a command
       is run that is logged in zpool history.
  Example output
  Terminal 1
  root - gz sunos ~ # zfs create zones/foobar
  root - gz sunos ~ # zfs set quota=10g zones/foobar
  root - gz sunos ~ # zfs destroy zones/foobar
  Terminal 2
  root - gz sunos ~ # sysevent EC_zfs
  nvlist version: 0
      date = 2016-04-28T14:50:08.964Z
      vendor = SUNW
      publisher = zfs
      class = EC_zfs
      subclass = ESC_ZFS_history_event
      pid = 0
      data = (embedded nvlist)
      nvlist version: 0
          pool_name = zones
          pool_guid = 0x40c964e8f9a7a694
          history_record = (embedded nvlist)
          nvlist version: 0
              dsname = zones/foobar
              dsid = 0x1525
              history internal str =
              internal_name = create
              history txg = 0x4c4ef3

Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Joshua M. Clulow <jmc@joyent.com>
Reviewed by: Josh Wilsdon <jwilsdon@joyent.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Alan Somers <asomers@gmail.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Dave Eddy <dave@daveeddy.com>
2018-02-21 02:19:42 +00:00
Alexander Motin
63e739af67 MFV r319736: 6396 remove SVM
illumos/illumos-gate@5f10ef697f
5f10ef697f

https://www.illumos.org/issues/6396
  LVM = SVM = Solaris Volume Manager
  dead code and not using with ZFS based platform.

Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Yuri Pankov <yuri.pankov@nexenta.com>
2018-02-21 00:24:54 +00:00
Alexander Motin
e5a4a83784 MFV r318941: 7446 zpool create should support efi system partition
illumos/illumos-gate@7855d95b30
7855d95b30

https://www.illumos.org/issues/7446
  Since we support whole-disk configuration for boot pool, we also will need
  whole disk support with UEFI boot and for this, zpool create should create efi-
  system partition.
  I have borrowed the idea from oracle solaris, and introducing zpool create -
  B switch to provide an way to specify that boot partition should be created.
  However, there is still an question, how big should the system partition be.
  For time being, I have set default size 256MB (thats minimum size for FAT32
  with 4k blocks). To support custom size, the set on creation "bootsize"
  property is created and so the custom size can be set as: zpool create B -
  o bootsize=34MB rpool c0t0d0
  After pool is created, the "bootsize" property is read only. When -B switch is
  not used, the bootsize defaults to 0 and is shown in zpool get output with
  value ''. Older zfs/zpool implementations are ignoring this property.
  https://www.illumos.org/rb/r/219/

Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Approved by: Dan McDonald <danmcd@kebe.com>
Author: Toomas Soome <tsoome@me.com>

This commit makes no sense for FreeBSD, that is why I blocked the option,
but it should be good to stay closer to upstream.
2018-02-21 00:18:57 +00:00
Alexander Motin
89dabdb4ea MFC r316910: 7812 Remove gender specific language
illumos/illumos-gate@48bbca8168
48bbca8168

https://www.illumos.org/issues/7812
  This change removes all gendered language that did not refer specifically
  to an individual person or pet. The convention taken was to use
  variations on "they" when referring to users and/or human beings, while
  using "it" when referring to code, functions, and/or libraries.
  Additionally, we took the liberty to fix up any whitespace issues that
  were found in any files that were already being modified.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Daniel Hoffman <dj.hoffman@delphix.com>
2018-02-20 05:07:21 +00:00
Alexander Motin
3542f1bd3a MFV r307315:
7301 zpool export -f should be able to interrupt file freeing

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Author: Alek Pinchuk <alek@nexenta.com>

Closes #175
2018-02-20 04:36:51 +00:00
Alexander Motin
4a0867c8d2 MFV r302649: 7016 arc_available_memory is not 32-bit safe
illumos/illumos-gate@0dd053d7d8
0dd053d7d8

https://www.illumos.org/issues/7016
  upstream DLPX-39446 arc_available_memory is not 32-bit safe
  https://github.com/delphix/delphix-os/commit/
  6b353ea3b8a1610be22e71e657d051743c64190b
  related to this upstream:
  DLPX-38547 delphix engine hang
  https://github.com/delphix/delphix-os/commit/
  3183a567b3e8c62a74a65885ca60c86f3d693783
  DLPX-38547 delphix engine hang (fix static global)
  https://github.com/delphix/delphix-os/commit/
  22ac551d8ef085ad66cc8f65e51ac372b12993b9
  DLPX-38882 system hung waiting on free segment
  https://github.com/delphix/delphix-os/commit/
  cdd6beef7548cd3b12f0fc0328eeb3af540079c2

Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
Author: Prakash Surya <prakash.surya@delphix.com>
2018-02-20 04:14:12 +00:00
Andriy Gapon
8cebd0e419 relax an assert in zfsctl_snapdir_lookup to match r323578
Since r323578 we may remove the last reference to a covered vnode with
vrele() instead of vput().  So, v_usecount may be decremented before
the vnode is locked and zfsctl_snapdir_lookup may "catch" the vnode
with v_usecount of zero and v_holdcnt of one.

PR:		225795
Reported by:	asomers
MFC after:	1 week
2018-02-19 08:55:22 +00:00
Alan Somers
4571b2776f zfs: fix formatting in a log statement
Submitted by:	Dave Baukus <daveb@spectralogic.com>
MFC after:	3 weeks
Sponsored by:	Spectra Logic Corp
2018-02-16 21:59:08 +00:00
Alan Somers
dfbc272d5d Handle generic pathconf attributes in the .zfs ctldir
MFC instructions: change the value of _PC_LINK_MAX to INT_MAX

Reported by:	jhb
MFC after:	19 days
X-MFC-With:	329265
Sponsored by:	Spectra Logic Corp
2018-02-16 16:56:09 +00:00
Andriy Gapon
c945107d23 read-behind / read-ahead support for zfs_getpages()
ZFS caches blocks it reads in its ARC, so in general the optional
pages are not as useful as with filesystems that read the data
directly into the target pages.  But still the optional pages
are useful to reduce the number of page faults and associated
VM / VFS / ZFS calls.
Another case that gets optimized (as a side effect) is paging in
from a hole.  ZFS DMU does not currently provide a convenient
API to check for a hole.  Instead it creates a temporary zero-filled
block and allows accessing it as if it were a normal data block.
Getting multiple pages one by one from a hole results in repeated
creation and destruction of the temporary block (and an associated
ARC header).

Tested with fsx using various supported blocks sizes from 512 bytes
to 128 KB and additionally 1 MB.

Please note that in illumos and ZoL they do not do the range-locking in
the page-in path. This is because ZFS has a double-caching problem
between ARC and page cache and that requires zfs_read() and zfs_write()
to consult pages in the page cache. So, in those functions they first
lock a range and then lock pages corresponding to the range. While in
the page-in (and maybe page-out) path they first lock the pages and then
would lock the range. So, they would have a deadlock.

I believe that FreeBSD does not have that problem, because the page-in
deals only with invalid pages while zfs_read() and zfs_write() need to
access only valid pages. They do not wait on a busy page unless it's
already valid.

Reviewed by:	kib
MFC after:	3 weeks
Differential Revision: https://reviews.freebsd.org/D14263
2018-02-16 06:59:35 +00:00