Commit Graph

1611 Commits

Author SHA1 Message Date
Alexander Motin
74a148f46f Add sysctls for zfs_immediate_write_sz and zvol_immediate_write_sz. 2016-10-29 23:25:12 +00:00
Andriy Gapon
97371ba2a9 zfsbootcfg: a simple tool to set next boot (one time) options for zfsboot
(gpt)zfsboot will read one-time boot directives from a special ZFS pool
area.  The area was previously described as "Boot Block Header", but
currently it is know as Pad2, marked as reserved and is zeroed out on
pool creation.  The new code interprets data in this area, if any, using
the same format as boot.config.  The area is immediately wiped out.
Failure to parse the directives results in a reboot right after the
cleanup.  Otherwise the boot sequence proceeds as usual.

zfsbootcfg writes zfsboot arguments specified on its command line to the
Pad2 area of a disk identified by vfs.zfs.boot.primary_pool and
vfs.zfs.boot.primary_vdev kenv variables that are set by loader during
boot.  Please see the manual page for more.

Thanks to all who reviewed, contributed and made suggestions!  There are
many potential improvements to the feature, please see the review for
details.

Reviewed by:	wblock (docs)
Discussed with:	jhb, tsoome
MFC after:	3 weeks
Relnotes:	yes
Differential Revision: https://reviews.freebsd.org/D7612
2016-10-29 14:09:32 +00:00
Alexander Motin
471cf6ce7d Add vdev_reopening support to vdev_geom.
It allows to avoid extra GEOM providers flapping without significant need.
Since GEOM got resize support, we don't need to reopen provider to get new
size.  If provider was orphaned and no longer valid, ZFS should already
know that, and in such case reopen should be done in full as expected.

MFC after:	2 weeks
2016-10-28 17:05:14 +00:00
Alexander Motin
f106f43aa2 Matching GUIDs, handle possible race on vdev detach.
In case of vdev detach, causing top level mirror vdev destruction, leaf
vdev changes its GUID to one of the destroyed mirror, that creates race
condition when GUID in vdev label may not match one in the pool config.

This change replicates logic nuance of vdev_validate() by adding special
exception, matching the vdev GUID against the top level vdev GUID.
Since this exception is not completely reliable (may give false positives
if we fail to erase label on detached vdev), use it only as last resort.

Quick way to reproduce this scenario now is detach vdev from a pool with
enabled autoextend.  During vdev detach autoextend logic tries to reopen
remaining vdev, that always fails now since in-memory configuration is
already updated, while on-disk labels are not yet.

MFC after:	2 weeks
2016-10-28 16:21:31 +00:00
Alexander Motin
4be4cba048 Improve few debugging log messages. 2016-10-28 15:30:10 +00:00
Andriy Gapon
539fc86f2e 3746 ZRLs are racy
illumos/illumos-gate@260af64db7
260af64db7

https://www.illumos.org/issues/3746
  From the original change log:
  It was possible for a reference to be added even with the lock held, and
  for references added just after a lock release to be lost.
  This bug was also independently found and reported in wesunsolve.net
  issues 6985013 6995524.
  In zrl_add(), always use an atomic operation to update the refcount.
  The mutex in the ZRL only guarantees that wakeups occur for waiters on the
  lock. It offers no protection against concurrent updates of the refcount.
  The only refcount transition that is safe to perform without an atomic
  operation is from ZRL_LOCKED back to 0, since this can only be performed
  by the thread which has the ZRL locked.

Authored by: Will Andrews <will@freebsd.org>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Reviewed by: Pavel Zakharov <pavel.zakha@gmail.com>
Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Reviewed by: Justin T. Gibbs <gibbs@scsiguy.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author: Youzhong Yang <yyang@mathworks.com>
PR:		204037
MFC after:	1 week
2016-10-27 07:38:07 +00:00
Alexander Motin
f0cbbdecbc Fix panic after ZVOL renamed to name invalid for DEVFS.
MFC after:	2 weeks
2016-10-24 12:24:24 +00:00
Alexander Motin
9be66df1e1 Add vfs.zfs.zil_log_limit sysctl.
It is at least partially broken now, but that is another question.
2016-10-16 18:49:15 +00:00
Alexander Motin
a059d8ccbc Optimize ZIL itx memory allocation on FreeBSD.
These allocations can reach up to 128KB, while FreeBSD kernel allocator
can cache allocations only up to 64KB.  To avoid expensive allocations
for each large ZIL write use caching zio_buf_alloc() allocator instead.

To make it possible de-inline few instances of zil_itx_destroy().
2016-10-16 10:43:12 +00:00
Alexander Motin
1899e205d1 MFV r307314:
6988 spa_sync() spends half its time in dmu_objset_do_userquota_updates

Using a benchmark which creates 2 million files in one TXG, I observe
that the thread running spa_sync() is on CPU almost the entire time we
are syncing, and therefore can be a performance bottleneck. About 50% of
the time in spa_sync() is in dmu_objset_do_userquota_updates().

The problem is that dmu_objset_do_userquota_updates() calls
zap_increment_int(DMU_USERUSED_OBJECT) once for every file that was
modified (or created). In this benchmark, all the files are owned by the
same user/group, so all 2 million calls to zap_increment_int() are
modifying the same entry in the zap. The same issue exists for the
DMU_GROUPUSED_OBJECT.

We should keep an in-memory map from user to space delta while we are
syncing, and when we finish, iterate over the in-memory map and modify
the ZAP once per entry. This reduces the number of calls to
zap_increment_int() from "number of objects modified" to "number of
owners/groups of modified files".

This reduced the time spent in spa_sync() in the file create benchmark
by ~33%, from 11 seconds to 7 seconds.

Closes #107

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Jinshan Xiong <jinshan.xiong@intel.com>
Author: Matthew Ahrens <mahrens@delphix.com>

openzfs/openzfs@5fc46359c5
2016-10-14 12:03:04 +00:00
Alexander Motin
b3a8b04807 MFV r307313:
5120 zfs should allow large block/gzip/raidz boot pool (loader project)

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Toomas Soome <tsoome@me.com>

openzfs/openzfs@c8811bd3e2

FreeBSD still does not support booting from gzip-compressed datasets,
so keep one chunk of this commit out.
2016-10-14 12:01:33 +00:00
Konstantin Belousov
5975e53d40 Fix a race in vm_page_busy_sleep(9).
Suppose that we have an exclusively busy page, and a thread which can
accept shared-busy page.  In this case, typical code waiting for the
page xbusy state to pass is
again:
	VM_OBJECT_WLOCK(object);
	...
	if (vm_page_xbusied(m)) {
		vm_page_lock(m);
 		VM_OBJECT_WUNLOCK(object);    <---1
		vm_page_busy_sleep(p, "vmopax");
 		goto again;
	}

Suppose that the xbusy state owner locked the object, unbusied the
page and unlocked the object after we are at the line [1], but before we
executed the load of the busy_lock word in vm_page_busy_sleep().  If it
happens that there is still no waiters recorded for the busy state,
the xbusy owner did not acquired the page lock, so it proceeded.

More, suppose that some other thread happen to share-busy the page
after xbusy state was relinquished but before the m->busy_lock is read
in vm_page_busy_sleep().  Again, that thread only needs vm_object lock
to proceed.  Then, vm_page_busy_sleep() reads busy_lock value equal to
the VPB_SHARERS_WORD(1).

In this case, all tests in vm_page_busy_sleep(9) pass and we are going
to sleep, despite the page being share-busied.

Update check for m->busy_lock == VPB_UNBUSIED in vm_page_busy_sleep(9)
to also accept shared-busy state if we only wait for the xbusy state to
pass.

Merge sequential if()s with the same 'then' clause in
vm_page_busy_sleep().

Note that the current code does not share-busy pages from parallel
threads, the only way to have more that one sbusy owner is right now
is to recurse.

Reported and tested by:	pho (previous version)
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D8196
2016-10-13 14:41:05 +00:00
Konstantin Belousov
f71d08566c Limit scope of the optimization in r306608 to dounmount() caller only.
Other uses of cache_purgevfs() do rely on the cache purge for correct
operations, when paths are invalidated without unmount.

Reported and tested by:	jkim
Discussed with:	mjg
Sponsored by:	The FreeBSD Foundation
2016-10-07 11:38:28 +00:00
Andriy Gapon
6f98c83306 implement zfs_vptocnp() using z_parent property
This should allow vn_fullpath() to work even when vfs name cache is
disabled for zfs, which is the case when zfs properties like
casesensitivity and normalization are set non-default values.

The new code should be 100% reliable for directories and "mostly"
reliable for files, that is, when hardlinks across directories are
not used.

Reported by:	Frederic Chardon <chardon.frederic@gmail.com>
Reviewed by:	kib (vfs contract)
MFC after:	2 weeks
Differential Revision: https://reviews.freebsd.org/D8146
2016-10-07 06:29:24 +00:00
Andriy Gapon
9ba3abc30e zfs: fix a wrong assertion for extended attributes
For the extended attributes the order between z_teardown_lock and the
vnode lock is different.
The bug was triggered only with DIAGNOSTIC turned on.
This fix is developed in cooperation with avos.

PR:		213112
Reported by:	avos
Tested by:	avos
MFC after:	1 week
2016-10-04 08:09:25 +00:00
Mark Johnston
4538cee5bf Allow tracing of functions prefixed by "__".
This restriction was inherited from upstream but is not relevant on FreeBSD.
Furthermore, it hindered the tracing of locking primitive subroutines.

MFC after:	1 week
2016-10-02 00:35:00 +00:00
Alexander Motin
863ef2ca62 Add #ifdef _KERNEL around send_holes_without_birth_time sysctl.
Reported by:	avg@
2016-09-29 17:48:53 +00:00
Alexander Motin
226a11f81e MFV r306423: 7402 Create tunable to ignore hole_birth feature
Until we can resolve the numerous hole_birth bugs that have cropped up
recently, and come up with a way going forwards to protect users from
corruption, we should disable the hole_birth feature.  Using a tunable
allows those who are confident that their data is correct to continue to
take advantage of the feature.

Closes #188

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Author: Paul Dagnelie <pcd@delphix.com>
2016-09-29 00:00:37 +00:00
Alexander Motin
bb97118138 MFV r306422: 7254 ztest failed assertion in ztest_dataset_dirobj_verify: dirobjs + 1 == usedobjs
dsl_dataset_space is looking at the ds_bp's fill count while
dmu_objset_write_ready() is concurrently modifying it. This fix adds an
rrwlock to protect the ds_bp.

Closes #180

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Author: Paul Dagnelie <pcd@delphix.com>
2016-09-28 23:54:47 +00:00
Mark Johnston
9e579a58c3 Move implementations of uread() and uwrite() to the illumos compat layer.
MFC after:	1 week
2016-09-24 21:40:14 +00:00
Andriy Gapon
d26312a4e4 fix vnode lock assertion for extended attributes directory
Background.  In ZFS a file with extended attributes has a special
directory associated with it where each extended attribute is a file.
The attribute's name is a file name and its value is a file content.
When the ownership of a file with extended attributes is changed, ZFS
also changes ownership of the special directory.  This is where the bug
was hit.

The bug was introduced in r209158.

Nota bene.  ZFS vnode locks are typically acquired before
z_teardown_lock (i.e., before ZFS_ENTER).  But this is not the case for
the vnodes that represent the extended attribute directory and files.
Those are always locked after ZFS_ENTER.  This is confusing and fragile.

PR:		212702
Reported by:	Christian Fuss to FreeNAS
Tested by:	mav
MFC after:	1 week
2016-09-24 08:13:15 +00:00
Mark Johnston
36f5d07745 Re-check the systrace probe ID before calling dtrace_probe().
Otherwise there exists a narrow window during which a syscall probe can be
disabled and cause a concurrently-running thread to call dtrace_probe()
with an invalid probe ID.

Reported by:	ngie
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
2016-09-22 23:22:53 +00:00
Allan Jude
c2b475d0ee MFV r268120:
4936 lz4 could theoretically overflow a pointer with a certain input

  illumos/illumos-gate@58d0718061

Reviewed by:	delphij
MFC after:	2 weeks
Sponsored by:	ScaleEngine Inc.
Differential Revision:	https://reviews.freebsd.org/D7850
2016-09-11 17:48:06 +00:00
Alexander Motin
20e45e033c Switch random_get_pseudo_bytes() shim to arc4rand().
Our shim for Solaris random_get_bytes() uses read_random(), that looks
reasonable, since it guaranties reliably seeded random data.  On the other
side Solaris random_get_pseudo_bytes() does not provide this guarantie,
and its original Solaris implementation is equivalent to our arc4rand(),
using software crypto without stressing slower hardware RNG.
2016-09-10 09:37:41 +00:00
Alexander Motin
4605bf63c4 MFV r305562: 7259 DS_FIELD_LARGE_BLOCKS is unused
The DS_FIELD_LARGE_BLOCKS macro has been unused since the integration of
this patch:

    commit ca0cc3918a1789fa839194af2a9245f801a06b1a
    Author: Matthew Ahrens <mahrens@delphix.com>
    Date:   Fri Jul 24 09:53:55 2015 -0700

        5959 clean up per-dataset feature count code
        Reviewed by: Toomas Soome <tsoome@me.com>
        Reviewed by: George Wilson <george@delphix.com>
        Reviewed by: Alex Reece <alex@delphix.com>
        Approved by: Richard Lowe <richlowe@richlowe.net>

This patch simply removes this macro from dsl_dataset.h.

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2016-09-07 20:09:24 +00:00
Alexander Motin
de1fdddeda MFV r305560: 7278 tuning zfs_arc_max does not impact arc_c_min
When changing zfs_arc_max (e.g. as zdb does), it may be set to less
than the default arc_c_min. arc_c_min should decrease to not be more than
arc_c_max, but it doesn't; therefore tuning of arc_c_max is ineffective.

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Author: Matthew Ahrens <mahrens@delphix.com>

openzfs/openzfs@608764bead
2016-09-07 20:05:10 +00:00
Andriy Gapon
1a82707cd7 fix zfs pool creation accidentally broken by r305331
The upstream change introduced a new load state, SPA_LOAD_CREATE,
and vdev_geom code needs to be aware of it.

Tested by:	cy
MFC after:	1 week
X-MFC with:	r305331
2016-09-06 06:09:12 +00:00
Alexander Motin
9b9258a12a Missed FreeBSD-specific piece of r305338. 2016-09-03 11:17:33 +00:00
Alexander Motin
d7e781bda3 MFC r305337: 7004 dmu_tx_hold_zap() does dnode_hold() 7x on same object
Using a benchmark which has 32 threads creating 2 million files in the
same directory, on a machine with 16 CPU cores, I observed poor
performance. I noticed that dmu_tx_hold_zap() was using about 30% of
all CPU, and doing dnode_hold() 7 times on the same object (the ZAP
object that is being held).

dmu_tx_hold_zap() keeps a hold on the dnode_t the entire time it is
running, in dmu_tx_hold_t:txh_dnode, so it would be nice to use the
dnode_t that we already have in hand, rather than repeatedly calling
dnode_hold(). To do this, we need to pass the dnode_t down through
all the intermediate calls that dmu_tx_hold_zap() makes, making these
routines take the dnode_t* rather than an objset_t* and a uint64_t
object number. In particular, the following routines will need to have
analogous *_by_dnode() variants created:

dmu_buf_hold_noread()
dmu_buf_hold()
zap_lookup()
zap_lookup_norm()
zap_count_write()
zap_lockdir()
zap_count_write()

This can improve performance on the benchmark described above by 100%,
from 30,000 file creations per second to 60,000. (This improvement is on
top of that provided by working around the object allocation issue. Peak
performance of ~90,000 creations per second was observed with 8 CPUs;
adding CPUs past that decreased performance due to lock contention.) The
CPU used by dmu_tx_hold_zap() was reduced by 88%, from 340 CPU-seconds
to 40 CPU-seconds.

Sponsored by: Intel Corp.

Closes #109

Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Author: Matthew Ahrens <mahrens@delphix.com>

openzfs/openzfs@d3e523d489
2016-09-03 11:00:29 +00:00
Alexander Motin
4ad4b70e77 MFV r305336: 7247 zfs receive of deduplicated stream fails
This resolves two 'zfs recv' issues. First, when receiving into an
existing filesystem, a snapshot created during the receive process is
not added to the guid->dataset map for the stream, resulting in failed
lookups for deduped streams when a WRITE_BYREF record refers to a
snapshot received earlier in the stream. Second, the newly created
snapshot was also not set properly, referencing the snapshot before the
new receiving dataset rather than the existing filesystem.

Closes #159

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Author: Chris Williamson <chris.williamson@delphix.com>

openzfs/openzfs@b09697c8c1
2016-09-03 10:59:05 +00:00
Alexander Motin
070da3f779 MFV r305335: 7003 zap_lockdir() should tag hold
zap_lockdir() / zap_unlockdir() should take a "void *tag" argument which
tags the hold on the zap. This will help diagnose programming errors
which misuse the hold on the ZAP.

Sponsored by: Intel Corp.

Closes #108

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Author: Matthew Ahrens <mahrens@delphix.com>

openzfs/openzfs@0780b3eab5
2016-09-03 10:58:14 +00:00
Alexander Motin
d3ec2cdb4a MFV r304157:
7230 add assertions to dmu_send_impl() to verify that stream includes BEGIN and END records

illumos/illumos-gate@12b90ee2d3
https://github.com/illumos/illumos-gate/commit/12b90ee2d3b10689fc45f4930d2392f5f
e1d9cfa

https://www.illumos.org/issues/7230
  A test failure occurred where a send stream had only a BEGIN record. This
  should not be possible if the send returns without error. Prevented this from
  happening in the future by adding an assertion to dmu_send_impl() to verify
  that if the function returns 0 (success) both a BEGIN and END record are
  present. Did this by adding flags to dmu_sendarg_t (indicating whether BEGIN o
r
  END records sent), having dump_record() set flags appropriately, adding VERIFY
  statement to dmu_send_impl().

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matt Krantz <matt.krantz@delphix.com>
2016-09-03 10:10:58 +00:00
Alexander Motin
7aafc9d4c8 MFV r304156: 7235 remove unused func dsl_dataset_set_blkptr
illumos/illumos-gate@bd56f80007
https://github.com/illumos/illumos-gate/commit/bd56f80007857b960e0981ed0797ad8ec
844a96b

https://www.illumos.org/issues/7235
  The function dsl_dataset_set_blkptr() is unused. We should remove it.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2016-09-03 10:09:23 +00:00
Alexander Motin
c9fa25c110 MFV r304155: 7090 zfs should improve allocation order and throttle allocations
illumos/illumos-gate@0f7643c737
https://github.com/illumos/illumos-gate/commit/0f7643c7376dd69a08acbfc9d1d7d548b
10c846a

https://www.illumos.org/issues/7090
  When write I/Os are issued, they are issued in block order but the ZIO pipelin
e
  will drive them asynchronously through the allocation stage which can result i
n
  blocks being allocated out-of-order. It would be nice to preserve as much of
  the logical order as possible.
  In addition, the allocations are equally scattered across all top-level VDEVs
  but not all top-level VDEVs are created equally. The pipeline should be able t
o
  detect devices that are more capable of handling allocations and should
  allocate more blocks to those devices. This allows for dynamic allocation
  distribution when devices are imbalanced as fuller devices will tend to be
  slower than empty devices.
  The change includes a new pool-wide allocation queue which would throttle and
  order allocations in the ZIO pipeline. The queue would be ordered by issued
  time and offset and would provide an initial amount of allocation of work to
  each top-level vdev. The allocation logic utilizes a reservation system to
  reserve allocations that will be performed by the allocator. Once an allocatio
n
  is successfully completed it's scheduled on a given top-level vdev. Each top-
  level vdev maintains a maximum number of allocations that it can handle
  (mg_alloc_queue_depth). The pool-wide reserved allocations (top-levels *
  mg_alloc_queue_depth) are distributed across the top-level vdevs metaslab
  groups and round robin across all eligible metaslab groups to distribute the
  work. As top-levels complete their work, they receive additional work from the
  pool-wide allocation queue until the allocation queue is emptied.

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: George Wilson <george.wilson@delphix.com>
2016-09-03 10:04:37 +00:00
Alexander Motin
0b51a59fc7 MFV r303078:
7086 ztest attempts dva_get_dsize_sync on an embedded blockpointer

illumos/illumos-gate@926549256b
https://github.com/illumos/illumos-gate/commit/926549256b71acd595f69b236779ff6b7
8fa08ef

https://www.illumos.org/issues/7086
  In dbuf_dirty(), we need to grab the dn_struct_rwlock before looking at the
  db_blkptr, to prevent it from being changed by syncing context.
  Otherwise we may see that ztest got a segfault from this stack:
  libzpool.so.1`dva_get_dsize_sync+0x98(872f000, b32b240, fed7811b, 0, b4cda20,
0)
  libzpool.so.1`bp_get_dsize+0x60(872f000, b32b240, 0, 97cb780, 9d4c1a8, 0)
  libzpool.so.1`dbuf_dirty+0x9b3(ce0a100, 97cb780, 9, fecd2530)
  libzpool.so.1`dmu_buf_will_dirty+0xc3(ce0a100, 97cb780, ea293d6c, 1)
  libzpool.so.1`zap_lockdir+0x1a0(8aaa3c0, 1, 0, 97cb780, 1, 1)
  libzpool.so.1`zap_remove_norm+0x30(8aaa3c0, 1, 0, 8728b10, 0, 97cb780)
  libzpool.so.1`zap_remove+0x29(8aaa3c0, 1, 0, 8728b10, 97cb780, a)
  ztest_replay_remove+0x225(ea294588, 8728ae8, 0, 38010000, 0, 0)
  ztest_remove+0x9f(ea294588, ea293f50, 4, 3)
  ztest_object_init+0x78(ea294588, ea293f50, 4e0, 1)
  ztest_dmu_object_alloc_free+0x71(ea294588, 13)
  ztest_dmu_objset_create_destroy+0x224(80cef08, 13, 0, 805d36c, 9017ad44, 0)
  ztest_execute+0x89(a, 807c720, 13, 0)
  ztest_thread+0xea(13, 0, 0, 0)
  libc.so.1`_thrp_setup+0x88(f0983240)
  libc.so.1`_lwp_start(f0983240, 0, 0, 0, 0, 0)
  Looking into it a bit, we see that this is an embedded blockpointer, so
  BP_GET_NDVAS should have returned 0:
       b32b240::blkptr
  EMBEDDED [L0 ZAP_OTHER] et=0 LZ4 size=200L/4aP birth=80L
  Instead, it looks like another thread is modifying this blockpointer:
       b32b240::ugrep | ::whatis
  f47a0e0c is in [ stack tid=0x19f ]
  ebd6ec40 is in [ stack tid=0x226 ]
  ea293bd0 is in [ stack tid=0x244 ]
  ea293be4 is in [ stack tid=0x244 ]

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2016-09-03 08:43:43 +00:00
Alexander Motin
84c3781ac9 MFV r303077:
7072 zfs fails to expand if lun added when os is in shutdown state

illumos/illumos-gate@c39a2aae1e
c39a2aae1e

https://www.illumos.org/issues/7072
  upstream:
  38733 zfs fails to expand if lun added when os is in shutdown state
  DLPX-36910 spares and caches should not display expandable space
  DLPX-39262 vdev_disk_open spam zfs_dbgmsg buffer

Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: George Wilson <george.wilson@delphix.com>
2016-09-03 08:42:12 +00:00
Alexander Motin
efa0867fb0 MFV r302991: 6950 ARC should cache compressed data
illumos/illumos-gate@dcbf3bd6a1
dcbf3bd6a1

https://www.illumos.org/issues/6950
  When reading compressed data from disk, the ARC should keep the compressed
  block cached and only decompress it when consumers access the block. The
  uncompressed data should be short-lived allowing the ARC to cache a much larger
  amount of data. The DMU would also maintain a smaller cache of uncompressed
  blocks to minimize the impact of decompressing frequently accessed blocks.

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: George Wilson <george.wilson@delphix.com>
2016-09-03 08:30:51 +00:00
Alexander Motin
c543b519be MFV r304158:
7136 ESC_VDEV_REMOVE_AUX ought to always include vdev information

7115 6922 generates ESC_ZFS_VDEV_REMOVE_AUX a bit too often

illumos/illumos-gate@b72b6bb10a
https://github.com/illumos/illumos-gate/commit/b72b6bb10ad55121a1b352c6f68ebdc8e
20c9086

https://www.illumos.org/issues/7136
  6922 added ESC_ZFS_VDEV_REMOVE_AUX and ESC_ZFS_VDEV_REMOVE_DEV sysevents
  whenever an aux device gets removed from a pool. However, those sysevents will
  be created without the vdev_guid and vdev_path fields. It would be better to
  always populate those fields.

https://www.illumos.org/issues/7115
  The addition of spa_event_notify in vdev removal code (see #6922) causes event
s
  to be generated even if the spare failed to be removed with EBUSY.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Alan Somers <asomers@gmail.com>
2016-09-01 18:37:11 +00:00
Alexander Motin
25584d12e7 MFV r302993: 7104 increase indirect block size
illumos/illumos-gate@4b5c8e93ca
https://github.com/illumos/illumos-gate/commit/4b5c8e93cab28d3c65ba9d407fd8f46e3
be1db1c

https://www.illumos.org/issues/7104
  The current default indirect block size is 16KB. We can improve
  performance by increasing it to 128KB. This is especially helpful for
  any workload that needs to read most of the metadata, e.g.
  scrub/resilver, file deletion, filesystem deletion, and zfs send.
  We also need to fix a few space estimation errors to make the tests
  pass.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2016-09-01 18:33:39 +00:00
Alexander Motin
dd7f7cb7ac MFV r302992: 7071 lzc_snapshot does not fill in errlist on ENOENT
illumos/illumos-gate@25f7d993ad
https://github.com/illumos/illumos-gate/commit/25f7d993adbfb3452ac4625b379167074
6d35ae3

https://www.illumos.org/issues/7071
  upstream
  DLPX-40482 lzc_snapshot does not fill in errlist on ENOENT

Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2016-09-01 18:25:49 +00:00
Alexander Motin
3d1e0e0830 MFV r302662: 6447 handful of nvpair cleanups
illumos/illumos-gate@759e89be35
https://github.com/illumos/illumos-gate/commit/759e89be359f2af635e4122d147df56bc
e948773

https://www.illumos.org/issues/6447
  I got a patch from someone who uses nvpair code outside of illumos. It fixes a
  couple of gcc warnings/bugs for him.
     1. silence uninitialized use warnings
     2. add parentheses around assignment used as truth value
     3. fix printf format specifier (ll is for integers only)
     4. strstr, strspn, strcspn, and strcmp are declared in string.h, not
        strings.h.
     5. avoid scanning integer into boolean variable

Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Steve Dougherty <sdougherty@barracuda.com>
2016-09-01 15:17:39 +00:00
Alexander Motin
3421688c2d MFV r302661: 7082 bptree_iterate() passes wrong args to zfs_dbgmsg()
illumos/illumos-gate@10e67aa0db
https://github.com/illumos/illumos-gate/commit/10e67aa0db0823d5464aafdd681f3c966
155c68e

https://www.illumos.org/issues/7082
  upstream
  DLPX-40542 bptree_iterate() passes wrong args to zfs_dbgmsg()

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2016-09-01 15:10:40 +00:00
Alexander Motin
41b9077ef6 MFV r302660: 6314 buffer overflow in dsl_dataset_name
illumos/illumos-gate@9adfa60d48
https://github.com/illumos/illumos-gate/commit/9adfa60d484ce2435f5af77cc99dcd4e6
92b6660

https://www.illumos.org/issues/6314
  Callers of dsl_dataset_name pass a buffer of size ZFS_MAXNAMELEN, but
  dsl_dataset_name copies the datasets' name PLUS the snapshot name to it,
  resulting in a max of 2 * ZFS_MAXNAMELEN + '@'.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2016-09-01 15:08:27 +00:00
Alexander Motin
e12a269749 MFV r302659: 6931 lib/libzfs: cleanup gcc warnings
illumos/illumos-gate@88f61dee20
88f61dee20

https://www.illumos.org/issues/6931
  need cleanup:
  CERRWARN += -_gcc=-Wno-switch
  CERRWARN += -_gcc=-Wno-parentheses
  CERRWARN += -_gcc=-Wno-unused-function

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Igor Kozhukhov <ikozhukhov@gmail.com>
2016-09-01 14:57:06 +00:00
Alexander Motin
a95a9fe945 MFV r302651: 7054 dmu_tx_hold_t should use refcount_t to track space
illumos/illumos-gate@0c779ad424
https://github.com/illumos/illumos-gate/commit/0c779ad424a92a84d1e07d47cab7f8009
189202b

https://www.illumos.org/issues/7054
  upstream:
  ee0003de7d3e598499be7ac3fe6b61efcc47cb7f
  DLPX-40399 dmu_tx_hold_t should use refcount_t to track space

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2016-09-01 14:38:25 +00:00
Alexander Motin
96bf48b8cb MFV r302648: 7019 zfsdev_ioctl skips secpolicy when FKIOCTL is set
Note that the bulk of the upstream change is not applicable to FreeBSD
and the affected files are not even in the vendor area.

illumos/illumos-gate@45b1747515
45b1747515

https://www.illumos.org/issues/7019
  Currently zfsdev_ioctl, when confronted by a request with the FKIOCTL flag set,
  skips all processing of secpolicy functions. This means that ZFS is not doing
  any kind of verification of the credentials or access rights of the caller and
  assuming that (as it is an in-kernel client) all such checks have already been
  done.
  This turns out to be quite a dangerous assumption, especially with respect to
  sdev. In general I don't think it's particularly reasonable to offload this
  enforcement of access rights onto other kernel subsystems when ZFS has some
  particular local semantics in this area (delegated datasets etc) and does not
  provide any kind of API to allow other subsystems to avoid code duplication
  when doing it. ZFS should apply its normal access policy to requests from
  within the kernel, and callers should take care to give it the correct
  credentials and call it from the correct context in order to get the results
  they need.
  You can observe the currently unfortunate consequences of this bug in any non-
  global zone that has access to /dev/zvol or any subset of it via sdev profiles.
  In particular, a zone used to contain a KVM or similar which has a single zvol
  passed through to it using a <device match= block in its zone XML.
  Even though sdev makes something of an attempt to control for whether the
  caller should have access to nodes in /dev/zvol, it doesn't do this correctly,
  or really at all in the lookup call path. So, if we have a zone that's been
  given access to any part of /dev/zvol, it can simply look up the full path to
  any other zvol on the entire system, and the node will appear and be able to be
  used.

Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Alex Wilson <alex.wilson@joyent.com>
2016-09-01 14:24:54 +00:00
Alexander Motin
13876b47d7 MFV r302647: 6922 Emit ESC_ZFS_VDEV_REMOVE_AUX after removing an aux device
illumos/illumos-gate@63364b0ee2
https://github.com/illumos/illumos-gate/commit/63364b0ee2604783e7a55f84258888677
68eafa4

https://www.illumos.org/issues/6922
  ZFS does not do a config_sync after removing an aux (spare, log, or cache)
  device. AFAICT this isn't being done because it is slow and was deemed
  unnecessary. However, it should be such a rare operation that speed doesn't
  matter, and not doing it results in two problems:
  1) It is theoretically possible to remove an aux device from one pool and
  attach it to another, then lose power. When power is restored, both pools woul
d
  think that they own the aux device.
  2) Removal of the aux device doesn't send any useful sysevents to userland.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Alan Somers <asomers@gmail.com>
2016-09-01 14:17:30 +00:00
Alexander Motin
1c7d88abed MFV r302646:
6980 6902 causes zfs send to break due to 32-bit/64-bit struct mismatch

illumos/illumos-gate@ea4a67f462
https://github.com/illumos/illumos-gate/commit/ea4a67f462de0a39a9adea8197bcdef84
9de5371

https://www.illumos.org/issues/6980
  doing zfs send -i snap1 snap2 >testfile results in
  internal error: Invalid argument
  Abort (core dumped)

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2016-09-01 14:06:30 +00:00
Alexander Motin
4536fd9bed MFV r302643:
6902 speed up listing of snapshots if requesting name only and sorting by name

This was our change from the beginning, so just reduce the upstream diff.
2016-09-01 13:29:53 +00:00
Alexander Motin
5fd28943d6 MFV r302642:
6876 Stack corruption after importing a pool with a too-long name

illumos/illumos-gate@c971037baa
c971037baa

https://www.illumos.org/issues/6876
  Calling dsl_dataset_name on a dataset with a 256 byte buffer is asking for
  trouble. We should check every dataset on import, using a 1024 byte buffer and
  checking each time to see if the dataset's new name is longer than 256 bytes.

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Paul Dagnelie <pcd@delphix.com>
2016-09-01 13:04:36 +00:00