497 Commits

Author SHA1 Message Date
Andriy Gapon
f9e7ac9d61 8063 verify that we do not attempt to access inactive txg
illumos/illumos-gate@b7b2590dd9
b7b2590dd9

https://www.illumos.org/issues/8063
  A standard practice in ZFS is to keep track of "per-txg" state. Any of
  the 3 active TXG's (open, quiescing, syncing) can have different values
  for this state. We should assert that we do not attempt to modify other
  (inactive) TXG's.

Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2017-05-26 11:35:34 +00:00
Andriy Gapon
92455d0cca 7786 zfs`vdev_online() needs better notification about state changes
illumos/illumos-gate@5f368aef86
5f368aef86

https://www.illumos.org/issues/7786
  Currently, vdev_online() will only post sysevent if previous state was
  "offline". It should also post the event when the state changes from "removed"
  or "faulted" to "healthy" or "degraded".
  This will fix the following scenario:
  - pull disk from slot A
  - check that hotspare has taken its place (if available)
  - insert disk into slot B
  - check that hotspare moved back to "avail" state (if spare was used)
  The problem here is that we don't get any ESC_ZFS_VDEV_* notification and fail
  to update the vdev FRU.

Reviewed by: Matthew Ahrens mahrens@delphix.com
Reviewed by: George Wilson george.wilson@delphix.com
Approved by: Albert Lee <trisk@forkgnu.org>
Author: Yuri Pankov <yuri.pankov@nexenta.com>
2017-05-26 11:32:05 +00:00
Andriy Gapon
1c42c71f38 8025 dbuf_read() creates unnecessary zio_root() for bonus buf
illumos/illumos-gate@def4fac588
def4fac588

https://www.illumos.org/issues/8025
  dbuf_read() creates a zio_root() to track and wait for all the zio's
  that may happen as part of this call. However, if the blkptr_t for
  this buffer is NULL or a hole, we will not create any more zio's, so
  this zio_root() is unnecessary. This is always the case when calling
  dbuf_read() on a bonus buffer, because it has no blkptr (it's part of
  the containing dnode). For workloads that read a lot of bonus buffers
  (e.g. file creation and removal), creating and destroying these
  unnecessary zio's can decrease performance by around 3%.

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2017-05-26 11:29:31 +00:00
Andriy Gapon
c1290d6583 5814 bpobj_iterate_impl(): Close a refcount leak iterating on a sublist.
illumos/illumos-gate@b67dde11a7
b67dde11a7

https://www.illumos.org/issues/5814
  Lets pull in this patch from freebsd:
  http://svnweb.freebsd.org/base?view=revision&revision=271781
  bpobj_iterate_impl(): Close a refcount leak iterating on a sublist.
  If bpobj_space() returned non-zero here, the sublist would have been
  left open, along with the bonus buffer hold it requires. This call
  does not invoke any calls to bpobj_close() itself.
  This bug doesn't have any known vector, but was found on inspection.
  MFC after: 1 week
  Sponsored by: Spectra Logic
  Affects: All ZFS versions starting 21 May 2010 (illumos cde58dbc)
  MFSpectraBSD: r1050998 on 2014/03/26
  Fix bpobj_iterate_impl() to properly call bpobj_close() if bpobj_space()
  returns an error.

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Will Andrews <will@freebsd.org>
2017-04-14 18:43:10 +00:00
Andriy Gapon
9a884731a3 6914 kernel virtual memory fragmentation leads to hang
illumos/illumos-gate@af868f46a5
af868f46a5

https://www.illumos.org/issues/6914
  This change allows the kernel to use more virtual address space. This will
  allow us to devote 1.5x physmem for the zio arena, and an additional 1.5x
  physmem for the kernel heap.
  We saw a hang when unable to find any 128K contiguous memory segments. Looking
  at the core file we see many threads in stacks similar to this:
  > ffffff68c9c87c00::findstack -v
  stack pointer for thread ffffff68c9c87c00: ffffff02cd63d8b0
  [ ffffff02cd63d8b0 _resume_from_idle+0xf4() ]
    ffffff02cd63d8e0 swtch+0x141()
    ffffff02cd63d920 cv_wait+0x70(ffffff6009b1b01e, ffffff6009b1b020)
    ffffff02cd63da50 vmem_xalloc+0x640(ffffff6009b1b000, 20000, 1000, 0, 0, 0, 0, ffffff0200000004)
    ffffff02cd63dac0 vmem_alloc+0x135(ffffff6009b1b000, 20000, 4)
    ffffff02cd63db60 segkmem_xalloc+0x171(ffffff6009b1b000, 0, 20000, 4, 0, fffffffffb885fe0, fffffffffbcefa10)
    ffffff02cd63dbc0 segkmem_alloc_vn+0x4a(ffffff6009b1b000, 20000, 4, fffffffffbcefa10)
    ffffff02cd63dbf0 segkmem_zio_alloc+0x20(ffffff6009b1b000, 20000, 4)
    ffffff02cd63dd20 vmem_xalloc+0x5b1(ffffff6009b1c000, 20000, 1000, 0, 0, 0, 0, 4)
    ffffff02cd63dd90 vmem_alloc+0x135(ffffff6009b1c000, 20000, 4)
    ffffff02cd63de20 kmem_slab_create+0x8d(ffffff605fd37008, 4)
    ffffff02cd63de80 kmem_slab_alloc+0x11e(ffffff605fd37008, 4)
    ffffff02cd63dee0 kmem_cache_alloc+0x233(ffffff605fd37008, 4)
    ffffff02cd63df10 zio_data_buf_alloc+0x5b(20000)
    ffffff02cd63df70 arc_get_data_buf+0x92(ffffff6265a70588, 20000, ffffff901fd796f8)
    ffffff02cd63dfb0 arc_buf_alloc_impl+0x9c(ffffff6265a70588, ffffff6d233ab0b8)

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Matthew Ahrens <mahrens@delphix.com>
2017-04-14 18:41:37 +00:00
Andriy Gapon
9f33176926 7256 low probability race in zfs_get_data
illumos/illumos-gate@0c94e1af67
0c94e1af67

https://www.illumos.org/issues/7256
                         error = dmu_sync(zio, lr->lr_common.lrc_txg,
                              zfs_get_done, zgd);
                         ASSERT(error || lr->lr_length <= zp->z_blksz);
  It's possible, although extremely rare, that the zfs_get_done() callback is
  executed before dmu_sync() returns.
  In that case the znode's range lock is dropped and the znode is unreferenced.
  Thus, the assertion can access some invalid or wrong data via the zp pointer.
  size variable caches the correct value of z_blksz and can be safely used here.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Andriy Gapon <andriy.gapon@clusterhq.com>
2017-04-14 18:38:53 +00:00
Andriy Gapon
2ccc5f6fd1 5379 modifying a mmap()-ed file does not update its timestamps
illumos/illumos-gate@80e10fd0d2
80e10fd0d2

https://www.illumos.org/issues/5379
  The following is based on a review of the illumos code and on a similar problem
  reported for FreeBSD where the relevant code is different.
  Looking at this block of code http://src.illumos.org/source/xref/illumos-gate/
  usr/src/uts/common/fs/zfs/zfs_vnops.c#4187 I see code to set up an
  sa_bulk_attr_t object, I see code to set up mtime and ctime values, but I do
  not see code to actually apply the attributes...
  I would expect there to be a call to sa_bulk_update(), there is such a call in
  zfs_write() for instance.
  mmap_write.c [Magnifier] - demo (1.42 KB) Andriy Gapon, 2015-11-11 01:53 PM

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Author: Andriy Gapon <andriy.gapon@clusterhq.com>
2017-04-14 18:38:21 +00:00
Andriy Gapon
b7db959db3 6101 attempt to lzc_create() a filesystem under a volume results in a panic
illumos/illumos-gate@b127fe3c05
b127fe3c05

https://www.illumos.org/issues/6101
  lzc_create(), or more correctly, zfs_ioc_create() does not reject an attempt to
  create a filesystem as a child of a volume, instead it proceeds to a crash.
  A crash stack obtained on FreeBSD:
  page fault while in kernel mode

  zap_leaf_lookup()
  fzap_lookup()
  zap_lookup_norm()
  zap_lookup()
  zfs_get_zplprop()
  zfs_fill_zplprops_impl()
  zfs_ioc_create()
  zfsdev_ioctl()
  devfs_ioctl_f()
  kern_ioctl()
  sys_ioctl()
  This crash happened with a kernel without debugging assertions.
  The immediate cause of crash appears to an attempt to interpret a zvol object
  as a zap object.
  For filesystems:
  #define MASTER_NODE_OBJ 1
  For zvols:
  #define ZVOL_OBJ                1ULL
  #define ZVOL_ZAP_OBJ            2ULL
  So, I see two problems here:
     1. an attempt to create a filesystem under a zvol should be rejected as
        early as possible, maybe in zfs_fill_zplprops()
     2. maybe zap_lookup / zap_lockdir should reject objects that are not of one
        of the zap object types

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Andriy Gapon <avg@FreeBSD.org>
2017-04-14 18:33:20 +00:00
Andriy Gapon
d72c1a5bb1 8061 sa_find_idx_tab can be declared more type-safely
illumos/illumos-gate@7f0bdb4257
7f0bdb4257

https://www.illumos.org/issues/8061
  sa_find_idx_tab() is declared as taking and returning "void *" parameters.
  These can be declared to be the specific types.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2017-04-14 18:32:38 +00:00
Andriy Gapon
0843b4c8ae 8026 retire zfs_throttle_delay and zfs_throttle_resolution
illumos/illumos-gate@6b03625981
6b03625981

https://www.illumos.org/issues/8026
  zfs_throttle_delay and zfs_throttle_resolution became disused since the new
  write throttling mechanism was introduced.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Andriy Gapon <avg@FreeBSD.org>
2017-04-14 18:32:12 +00:00
Andriy Gapon
45967327e2 8027 tighten up dsl_pool_dirty_delta
illumos/illumos-gate@313ae1e182
313ae1e182

https://www.illumos.org/issues/8027
  dsl_pool_dirty_delta() should not wake up waiters when dp->dp_dirty_total ==
  zfs_dirty_data_max, because they wait for dp_dirty_total to fall strictly below
  the threshold.
  It's probably very rare for that condition to occur, but it's better to have
  more accurate code.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Andriy Gapon <avg@FreeBSD.org>
2017-04-14 18:29:35 +00:00
Andriy Gapon
bf644e9b6f 8023 Panic destroying a metaslab deferred range tree
illumos/illumos-gate@3991b535a8
3991b535a8

https://www.illumos.org/issues/8023
       $C
  ffffff0011bc0970 vpanic()
  ffffff0011bc0a00 strlog()
  ffffff0011bc0a30 range_tree_destroy+0x72(ffffff043769ad00)
  ffffff0011bc0a70 metaslab_fini+0xd5(ffffff0449acf380)
  ffffff0011bc0ab0 vdev_metaslab_fini+0x56(ffffff0462bae800)
  ffffff0011bc0af0 spa_unload+0x9b(ffffff03e3dac000)
  ffffff0011bc0b70 spa_export_common+0x115(ffffff047f4b4000, 2, 0, 0, 0)
  ffffff0011bc0b90 spa_destroy+0x1d(ffffff047f4b4000)
  ffffff0011bc0bd0 zfs_ioc_pool_destroy+0x20(ffffff047f4b4000)
  ffffff0011bc0c80 zfsdev_ioctl+0x4d7(11400000000, 5a01, 8040190, 100003,
  ffffff03e1956b10, ffffff0011bc0e68)
  ffffff0011bc0cc0 cdev_ioctl+0x39(11400000000, 5a01, 8040190, 100003,
  ffffff03e1956b10, ffffff0011bc0e68)
  ffffff0011bc0d10 spec_ioctl+0x60(ffffff03d9153b00, 5a01, 8040190, 100003,
  ffffff03e1956b10, ffffff0011bc0e68, 0)
  ffffff0011bc0da0 fop_ioctl+0x55(ffffff03d9153b00, 5a01, 8040190, 100003,
  ffffff03e1956b10, ffffff0011bc0e68, 0)
  ffffff0011bc0ec0 ioctl+0x9b(3, 5a01, 8040190)
  ffffff0011bc0f10 _sys_sysenter_post_swapgs+0x149()

Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: George Wilson <george.wilson@delphix.com>
2017-04-14 18:29:13 +00:00
Andriy Gapon
99988b024d 7885 zpool list can report 16.0e for expandsz
illumos/illumos-gate@c040c10cdd
c040c10cdd

https://www.illumos.org/issues/7885
  When a member of a RAIDZ has been replaced with a device smaller than the
  original, then the top level vdev can report its expand size as 16.0E.
  The reduced child asize causes the RAIDZ to have a vdev_asize lower than its
  vdev_max_asize which then results in an underflow during the calculation of the
  parents expand size.
  Also for RAIDZ vdevs the sum of their child vdev_min_asize could be smaller
  than the parents vdev_min_size.
  Fixed by: https://github.com/openzfs/openzfs/pull/296

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Author: Steven Hartland <steven.hartland@multiplay.co.uk>
2017-04-14 18:28:40 +00:00
Andriy Gapon
318cd645a2 7968 multi-threaded spa_sync()
illumos/illumos-gate@94c2d0eb22
94c2d0eb22

https://www.illumos.org/issues/7968
  spa_sync() iterates over all the dirty dnodes and processes each of them by
  calling dnode_sync(). If there are many dirty dnodes (e.g. because we created
  or removed a lot of files), the single thread of spa_sync() calling dnode_sync
  () can become a bottleneck. Additionally, if many dnodes are dirtied
  concurrently in open context (e.g. due to concurrent file creation), the
  os_lock will experience lock contention via dnode_setdirty().
  The solution is to track dirty dnodes on a multilist_t, and for spa_sync() to
  use separate threads to process each of the sublists in the multilist.
  On the concurrent file creation microbenchmark, the performance improvement
  from dnode_setdirty() is up to 7%. Additionally, the wall clock time spent in
  spa_sync() is reduced to 15%-40% of the single-threaded case. In terms of cost/
  reward, once the other bottlenecks are addressed, fixing this bug will provide
  a medium-large performance gain and require a medium amount of effort to
  implement.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2017-04-14 18:26:24 +00:00
Andriy Gapon
260aaa4c3d 7970 zfs_arc_num_sublists_per_state should be common to all multilists
illumos/illumos-gate@10fbdecb05
10fbdecb05

https://www.illumos.org/issues/7970
  The global tunable zfs_arc_num_sublists_per_state is used by the ARC and
  the dbuf cache, and other users are planned. We should change this
  tunable to be common to all multilists.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2017-04-14 18:25:49 +00:00
Andriy Gapon
bc390f4947 7801 add more by-dnode routines (lint)
illumos/illumos-gate@411be58a6e
411be58a6e

https://www.illumos.org/issues/7801
  Add *_by_dnode() routines for accessing objects given their
  dnode_t *, this is more efficient than accessing the object by
  (objset_t *, uint64_t object). This change converts some but
  not all of the existing consumers. As performance-sensitive
  code paths are discovered they should be converted to use
  these routines.
  Ported from: https://github.com/zfsonlinux/zfs/commit/
  0eef1bde31d67091d3deed23fe2394f5a8bf2276

Author: Matthew Ahrens <mahrens@delphix.com>
2017-04-14 18:25:22 +00:00
Andriy Gapon
fd870ba29a 7801 add more by-dnode routines
illumos/illumos-gate@b0c42cd470
b0c42cd470

https://www.illumos.org/issues/7801
  Add *_by_dnode() routines for accessing objects given their
  dnode_t *, this is more efficient than accessing the object by
  (objset_t *, uint64_t object). This change converts some but
  not all of the existing consumers. As performance-sensitive
  code paths are discovered they should be converted to use
  these routines.
  Ported from: https://github.com/zfsonlinux/zfs/commit/
  0eef1bde31d67091d3deed23fe2394f5a8bf2276

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: bzzz77 <bzzz.tomas@gmail.com>
2017-04-14 18:25:02 +00:00
Andriy Gapon
c9327351e8 7869 panic in bpobj_space(): null pointer dereference
illumos/illumos-gate@a3905a4592
a3905a4592

https://www.illumos.org/issues/7869
  The issue fixed by this patch is a race condition in the deadlist code.
  A thread executing an administrative command that uses
  `dsl_deadlist_space_range()` holds the lock of the whole `deadlist_t` to
  protect the access of all its entries that the deadlist contains in an
  avl tree.
  Sync threads trying to insert a new entry in the deadlist
  (through `dsl_deadlist_insert()` -> `dle_enqueue()`) do not hold the
  deadlist lock at that moment. If the `dle_bpobj` is the empty bpobj (our
  sentinel value), we close and reopen it. Between these two operations,
  it is possible for the `dsl_deadlist_space_range()` thread to dereference
  that bpobj which is `NULL` during that window.
  Threads should hold the a deadlist's `dl_lock` when they manipulate its
  internal data so scenarios like the one above are avoided. In addition,
  threads should also hold the bpobj lock whenever they are allocating the
  subobj list of a bpobj, and not just when they actually insert the subobj
  to the list. This way we can avoid potential memory leaks.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
2017-04-14 18:24:38 +00:00
Andriy Gapon
1ea6ff796d 7793 ztest fails assertion in dmu_tx_willuse_space
illumos/illumos-gate@61e255ce72
61e255ce72

https://www.illumos.org/issues/7793
  Background information: This assertion about tx_space_* verifies that we
  are not dirtying more stuff than we thought we would. We “need” to know
  how much we will dirty so that we can check if we should fail this
  transaction with ENOSPC/EDQUOT, in dmu_tx_assign(). While the
  transaction is open (i.e. between dmu_tx_assign() and dmu_tx_commit() —
  typically less than a millisecond), we call dbuf_dirty() on the exact
  blocks that will be modified. Once this happens, the temporary
  accounting in tx_space_* is unnecessary, because we know exactly what
  blocks are newly dirtied; we call dnode_willuse_space() to track this
  more exact accounting.
  The fundamental problem causing this bug is that dmu_tx_hold_*() relies
  on the current state in the DMU (e.g. dn_nlevels) to predict how much
  will be dirtied by this transaction, but this state can change before we
  actually perform the transaction (i.e. call dbuf_dirty()).
  This bug will be fixed by removing the assertion that the tx_space_*
  accounting is perfectly accurate (i.e. we never dirty more than was
  predicted by dmu_tx_hold_*()). By removing the requirement that this
  accounting be perfectly accurate, we can also vastly simplify it, e.g.
  removing most of the logic in dmu_tx_count_*().
  The new tx space accounting will be very approximate, and may be more or
  less than what is actually dirtied. It will still be used to determine
  if this transaction will put us over quota. Transactions that are marked
  by dmu_tx_mark_netfree() will be excepted from this check. We won’t make
  an attempt to determine how much space will be freed by the transaction
  — this was rarely accurate enough to determine if a transaction should
  be permitted when we are over quota, which is why dmu_tx_mark_netfree()
  was introduced in 2014.
  We also won’t attempt to give “credit” when overwriting existing blocks,
  if those blocks may be freed. This allows us to remove the
  do_free_accounting logic in dbuf_dirty(), and associated routines. This

Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2017-04-14 18:24:03 +00:00
Andriy Gapon
59dc4d6bbc 7816 remove static unused variable in zfs_vfsops.c
illumos/illumos-gate@2e972bf18f
2e972bf18f

https://www.illumos.org/issues/7816
  found by gcc6 build unused static variable:
  -static const fs_operation_def_t zfs_vfsops_eio_template[] = {
  -    VFSNAME_FREEVFS,    { .vfs_freevfs =  zfs_freevfs },
  -    NULL,            NULL
  -};

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Andy Stormont astormont@racktopsystems.com
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Igor Kozhukhov <igork@argotech.io>
2017-04-14 18:23:18 +00:00
Andriy Gapon
d3f85b54a4 7812 Remove gender specific language
illumos/illumos-gate@48bbca8168
48bbca8168

https://www.illumos.org/issues/7812
  This change removes all gendered language that did not refer specifically
  to an individual person or pet. The convention taken was to use
  variations on "they" when referring to users and/or human beings, while
  using "it" when referring to code, functions, and/or libraries.
  Additionally, we took the liberty to fix up any whitespace issues that
  were found in any files that were already being modified.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Daniel Hoffman <dj.hoffman@delphix.com>
2017-04-14 18:22:42 +00:00
Andriy Gapon
b71204d77d 1300 filename normalization doesn't work for removes
illumos/illumos-gate@1c17160ac5
1c17160ac5

https://www.illumos.org/issues/1300
  Problem:
  We can create invisible file in ZFS.
  How to reproduce:
  0. Prepare normalization formD ZFS.
      # cat cat /etc/release
      cat: cat: No such file or directory
                   OpenIndiana Development oi_151 X86 (powered by illumos)
              Copyright 2011 Oracle and/or its affiliates. All rights reserved.
                              Use is subject to license terms.
                                 Assembled 28 April 2011
      # mkfile 100M 100M
      # zpool create -O utf8only=on -O normalization=formD test_pool $( pwd )/
  100M
      # zpool upgrade
      This system is currently running ZFS pool version 28.

      All pools are formatted using this version.
      # zfs get normalization test_pool
      NAME       PROPERTY       VALUE          SOURCE
      test_pool  normalization  formD          -
      # chmod 777 /test_pool

  1. Create a NFD file.
      $ cd /test_pool/
      $ cp /etc/release $( echo "\x75\xcc\x88" )
      $ ls -la
      total 4
      drwxrwxrwx  2 root  root    3 2011-07-29 08:53 .
      drwxr-xr-x 25 root  root   26 2011-07-29 08:53 ..
      -r--r--r--  1 test1 staff 251 2011-07-29 08:53 u?

Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Kevin Crowe <kevin.crowe@nexenta.com>
2017-04-14 18:20:20 +00:00
Andriy Gapon
be49e7b29b 4242 file rename event fires before the rename happens
illumos/illumos-gate@54207fd2e1
54207fd2e1

https://www.illumos.org/issues/4242
  From Joyent's OS-2557:
  So we're basically just doing a check here that after we got a 'rename' event
  for our file, the file has actually been moved out of the way.
  What I've seen in bh1-stage2 is that this happens as you'd expect for all zones
  but many times over the last week it's failed because when we do the fs.exists
  () check here, the file that we got the rename event for, still exists.
  I've confirmed what's happening using the following dtrace script:
  #!/usr/sbin/dtrace -s

  #pragma D option quiet
  #pragma D option bufsize=256k

  syscall::open:entry,
  syscall::open64:entry
  /copyinstr(arg0) == "/var/svc/provisioning" || (strlen(copyinstr(arg0)) == 69
  && substr(copyinstr(arg0), 48) == "/var/svc/provisioning")/
  {
      this->watching_open = 1;
      printf("%d zone %s process %s(%d) [%s] open(%s)\\n",
          timestamp,
          zonename,
          execname,
          pid,
          curpsinfo->pr_psargs,
          copyinstr(arg0));
  }

  syscall::open:return,
  syscall::open64:return
  /this->watching_open == 1/

Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Marcel Telka <marcel@telka.sk>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Jerry Jelinek <jerry.jelinek@joyent.com>
2017-04-14 18:19:48 +00:00
Andriy Gapon
9dfe195883 7740 fix for 6513 only works in hole punching case, not truncation
illumos/illumos-gate@7de35a3ed0
7de35a3ed0

https://www.illumos.org/issues/7740
  The problem is that dbuf_findbp will return ENOENT if the block it's
  trying to find is beyond the end of the file. If that happens, we assume
  there is no birth time, and so we lose that information when we write
  out new blkptrs. We should teach dbuf_findbp to look for things that are
  beyond the current end, but not beyond the absolute end of the file.
  To verify, create a large file, truncate it to a short length, and then
  write beyond the end. Check with zdb to make sure that there are no
  holes with birth time zero (will appear as gaps).

Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Paul Dagnelie <pcd@delphix.com>
2017-04-14 18:15:47 +00:00
Andriy Gapon
16244ff154 7779 clean up unused definitions in zfs ctldir code
illumos/illumos-gate@b7f9f60c8e
b7f9f60c8e

https://www.illumos.org/issues/7779
  zfsctl_ops_shares_dir and ZFSCTL_INO_SHARES are essentially dead code.
  While there, fix the index range check in zfsctl_root_inode_cb.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Andriy Gapon <andriy.gapon@clusterhq.com>
2017-04-14 18:14:41 +00:00
Andriy Gapon
768d3677f6 7743 per-vdev-zaps have no initialize path on upgrade
illumos/illumos-gate@555da5111b
555da5111b

https://www.illumos.org/issues/7743
  When loading a pool that had been created before the existance of
  per-vdev zaps, on a system that knows about per-vdev zaps, the
  per-vdev zaps will not be allocated and initialized.
  This appears to be because the logic that would have done so, in
  spa_sync_config_object(), is not reached under normal operation. It is
  only reached if spa_config_dirty_list is non-empty.
  The fix is to add another `AVZ_ACTION_` enum that will allow this code
  to be reached when we detect that we're loading an old pool, even when
  there are no dirty configs.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Paul Dagnelie <pcd@delphix.com>
2017-04-14 18:13:06 +00:00
Andriy Gapon
2bb93dd84d 7659 Missing thread_exit() in dmu_send.c
illumos/illumos-gate@f2c1e9bc48
f2c1e9bc48

https://www.illumos.org/issues/7659
  Two threads send_traverse_thread() and receive_writer_thread() should
  end with thread_exit();
  Mostly a cosmetic issue under IllumOS.
  https://github.com/openzfs/openzfs/pull/252

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Jorgen Lundman <lundman@lundman.net>
2017-04-14 18:11:53 +00:00
Andriy Gapon
c5c6a287f1 7613 ms_freetree[4] is only used in syncing context
illumos/illumos-gate@5f14577801
5f14577801

https://www.illumos.org/issues/7613
  metaslab_t:ms_freetree[TXG_SIZE] is only used in syncing context. We should
  replace it with two trees: the freeing tree (ranges that we are freeing this
  syncing txg) and the freed tree (ranges which have been freed this txg).

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2017-04-14 18:11:16 +00:00
Andriy Gapon
26af00ef5b 7586 remove #ifdef __lint hack from dmu.h
illumos/illumos-gate@4ba5b96163
4ba5b96163

https://www.illumos.org/issues/7586
  The #ifdef __lint in dmu.h is ugly, and it would be nice not to duplicate it if
  we add other inline functions into header files in ZFS, especially since it is
  difficult to make any other solution work across all compilation targets. We
  should switch to disabling the lint flags that are failing instead.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Dan Kimmel <dan.kimmel@delphix.com>
2017-04-14 18:10:23 +00:00
Andriy Gapon
2bfa921051 7580 ztest failure in dbuf_read_impl
illumos/illumos-gate@1a01181fdc
1a01181fdc

https://www.illumos.org/issues/7580
  We need to prevent any reader whenever we're about the zero out all the
  blkptrs. To do this we need to grab the dn_struct_rwlock as writer in
  dbuf_write_children_ready and free_children just prior to calling bzero.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: George Wilson <george.wilson@delphix.com>
2017-04-14 18:09:32 +00:00
Andriy Gapon
cdeb4e5f08 7606 dmu_objset_find_dp() takes a long time while importing pool
illumos/illumos-gate@7588687e6b
7588687e6b

https://www.illumos.org/issues/7606
  When importing a pool with a large number of filesystems within the same
  parent filesystem, we see that dmu_objset_find_dp() takes a long time.
  It is called from 3 places: spa_check_logs(), spa_ld_claim_log_blocks(),
  and spa_load_verify().
  There are several ways to improve performance here:
  1. We don't really need to do spa_check_logs() or
         spa_ld_claim_log_blocks() if the pool was closed cleanly.
  2. spa_load_verify() uses dmu_objset_find_dp() to check that no
         datasets have too long of names.
  3. dmu_objset_find_dp() is slow because it's doing
         zap_value_search() (which is O(N sibling datasets)) to determine
         the name of each dsl_dir when it's opened. In this case we
         actually know the name when we are opening it, so we can provide
         it and avoid the lookup.
  This change implements fix #3 from the above list; i.e. make
  dmu_objset_find_dp() provide the name of the dataset so that we don't
  have to search for it.

Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <prashksp@gmail.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2017-04-14 18:08:50 +00:00
Andriy Gapon
b3264caf7b 7252 7628 compressed zfs send / receive
illumos/illumos-gate@5602294fda
5602294fda

https://www.illumos.org/issues/7252
  This feature includes code to allow a system with compressed ARC enabled to
  send data in its compressed form straight out of the ARC, and receive data in
  its compressed form directly into the ARC.

https://www.illumos.org/issues/7628
  We should have longer, more readable versions of the ZFS send / recv options.

7628 create long versions of ZFS send / receive options

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: David Quigley <dpquigl@davequigley.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Dan Kimmel <dan.kimmel@delphix.com>
2017-04-14 18:07:43 +00:00
Andriy Gapon
8f77e16c27 7490 real checksum errors are silenced when zinject is on
illumos/illumos-gate@6cedfc397d
6cedfc397d

https://www.illumos.org/issues/7490
  When zinject is on, error codes from zfs_checksum_error() can be overwritten
  due to an incorrect and overly-complex if condition.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
2017-04-14 17:15:49 +00:00
Andriy Gapon
aa06caf86c 7448 ZFS doesn't notice when disk vdevs have no write cache
illumos/illumos-gate@295438ba32
295438ba32

https://www.illumos.org/issues/7448
       I built a SmartOS image with all the NVMe commits including 7372
       (support NVMe volatile write cache) and repeated my dd testing:
       > #!/bin/bash
       > for i in `seq 1 1000`; do
       > dd if=/dev/zero of=file00 bs=1M count=102400 oflag=sync &
       > dd if=/dev/zero of=file01 bs=1M count=102400 oflag=sync &
       > wait
       > rm file00 file01
       > done
       >
       Previously each dd command took ~145 seconds to finish, now it takes
       ~400 seconds.
       Eventually I figured out it is 7372 that causes unnecessary
       nvme_bd_sync() executions which wasted CPU cycles.
  If a NVMe device doesn't support a write cache, the nvme_bd_sync function will
  return ENOTSUP to indicate this to upper layers.
  It seems this returned value is ignored by ZFS, and as such this bug is not
  really specific to NVMe. In vdev_disk_io_start() ZFS sends the flush to the
  disk driver (blkdev) with a callback to vdev_disk_ioctl_done(). As nvme filled
  in the bd_sync_cache function pointer, blkdev will not return ENOTSUP, as the
  nvme driver in general does support cache flush. Instead it will issue an
  asynchronous flush to nvme and immediately return 0, and hence ZFS will not set
  vdev_nowritecache here. The nvme driver will at some point process the cache
  flush command, and if there is no write cache on the device it will return
  ENOTSUP, which will be delivered to the vdev_disk_ioctl_done() callback. This
  function will not check the error code and not set nowritecache.
  The right place to check the error code from the cache flush is in
  zio_vdev_io_assess(). This would catch both cases, synchronous and asynchronous
  cache flushes. This would also be independent of the implementation detail that
  some drivers can return ENOTSUP immediately.

Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
2017-04-14 17:15:11 +00:00
Andriy Gapon
40e6e931e0 7430 Backfill metadnode more intelligently
illumos/illumos-gate@af346df588
af346df588

https://www.illumos.org/issues/7430
  Description and patch from brought over from the following ZoL commit: https://
  github.com/zfsonlinux/zfs/commit/68cbd56e182ab949f58d004778d463aeb3f595c6
  Only attempt to backfill lower metadnode object numbers if at least
  4096 objects have been freed since the last rescan, and at most once
  per transaction group. This avoids a pathology in dmu_object_alloc()
  that caused O(N^2) behavior for create-heavy workloads and
  substantially improves object creation rates. As summarized by
  @mahrens in #4636:
  "Normally, the object allocator simply checks to see if the next
  object is available. The slow calls happened when dmu_object_alloc()
  checks to see if it can backfill lower object numbers. This happens
  every time we move on to a new L1 indirect block (i.e. every 32 *
  128 = 4096 objects). When re-checking lower object numbers, we use
  the on-disk fill count (blkptr_t:blk_fill) to quickly skip over
  indirect blocks that don’t have enough free dnodes (defined as an L2
  with at least 393,216 of 524,288 dnodes free). Therefore, we may
  find that a block of dnodes has a low (or zero) fill count, and yet
  we can’t allocate any of its dnodes, because they've been allocated
  in memory but not yet written to disk. In this case we have to hold
  each of the dnodes and then notice that it has been allocated in
  memory.
  The end result is that allocating N objects in the same TXG can
  require CPU usage proportional to N^2."
  Add a tunable dmu_rescan_dnode_threshold to define the number of
  objects that must be freed before a rescan is performed. Don't bother
  to export this as a module option because testing doesn't show a
  compelling reason to change it. The vast majority of the performance
  gain comes from limit the rescan to at most once per TXG.

Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Author: Ned Bass <bass6@llnl.gov>
2017-04-14 17:13:49 +00:00
Josh Paetzel
fedc1382bd 7603 xuio_stat_wbuf_* should be declared (void)
illumos/illumos-gate@99aa8b5505
99aa8b5505

https://www.illumos.org/issues/7603

  The funcs are declared k&r style, where the args are not specified:

  void xuio_stat_wbuf_copied();
  They should be declared to take no arguments:

  void xuio_stat_wbuf_copied(void);
  Need to change both .c and .h.

Author: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
2017-03-26 16:49:20 +00:00
Josh Paetzel
7e45ac57cb 7303 dynamic metaslab selection
illumos/illumos-gate@8363e80ae7
https://github.com/illumos/illumos-gate/commit/8363e80ae72609660f6090766ca8c2c18aa53f0

https://www.illumos.org/issues/7303

  This change introduces a new weighting algorithm to improve metaslab selection.
  The new weighting algorithm relies on the SPACEMAP_HISTOGRAM feature. As a result,
  the metaslab weight now encodes the type of weighting algorithm used
  (size-based vs segment-based).

  This also introduce a new allocation tracing facility and two new dcmds to help
  debug allocation problems. Each zio now contains a zio_alloc_list_t structure
  that is populated as the zio goes through the allocations stage. Here's an
  example of how to use the tracing facility:

> c5ec000::print zio_t io_alloc_list | ::walk list | ::metaslab_trace
  MSID    DVA    ASIZE      WEIGHT             RESULT               VDEV
     -      0      400           0    NOT_ALLOCATABLE           ztest.0a
     -      0      400           0    NOT_ALLOCATABLE           ztest.0a
     -      0      400           0             ENOSPC           ztest.0a
     -      0      200           0    NOT_ALLOCATABLE           ztest.0a
     -      0      200           0    NOT_ALLOCATABLE           ztest.0a
     -      0      200           0             ENOSPC           ztest.0a
     1      0      400      1 x 8M            17b1a00           ztest.0a

> 1ff2400::print zio_t io_alloc_list | ::walk list | ::metaslab_trace
  MSID    DVA    ASIZE      WEIGHT             RESULT               VDEV
     -      0      200           0    NOT_ALLOCATABLE           mirror-2
     -      0      200           0    NOT_ALLOCATABLE           mirror-0
     1      0      200      1 x 4M            112ae00           mirror-1
     -      1      200           0    NOT_ALLOCATABLE           mirror-2
     -      1      200           0    NOT_ALLOCATABLE           mirror-0
     1      1      200      1 x 4M            112b000           mirror-1
     -      2      200           0    NOT_ALLOCATABLE           mirror-2

  If the metaslab is using segment-based weighting then the WEIGHT column will
  display the number of segments available in the bucket where the allocation
  attempt was made.

Author: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Chris Siden <christopher.siden@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
2017-03-15 04:18:40 +00:00
Andriy Gapon
b060bbc16a 7867 ARC space accounting leak
illumos/illumos-gate@6de76ce2a9
6de76ce2a9

https://www.illumos.org/issues/7867
  It seems that in the case where arc_hdr_free_pdata() sees HDR_L2_WRITING() we
  would fail to update the ARC space statistics.
  In the normal case those statistics are updated in arc_free_data_buf(). But in
  the arc_hdr_free_on_write() path we don't do that.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Andriy Gapon <avg@FreeBSD.org>
2017-03-08 13:40:24 +00:00
Andriy Gapon
f1a99e4268 7843 get_clones_stat() is suboptimal for lots of clones
illumos/illumos-gate@c5bde7273e
c5bde7273e

https://www.illumos.org/issues/7843
  get_clones_stat() could be very slow if a snapshot has many (thousands) clones.
  Clone names are added to an nvlist that's created with NV_UNIQUE_NAME.
  So, each time a new name is appended to the list, the whole list is searched
  linearly to see if that name is not already in the list. That results in the
  quadratic complexity.
  That should be easy to fix as we know in advance that we should not get any
  duplicate names, so we can drop NV_UNIQUE_NAME when creating the list.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Andriy Gapon <avg@FreeBSD.org>
2017-03-08 13:39:22 +00:00
Josh Paetzel
dbac7ab9c0 7570 tunable to allow zvol SCSI unmap to return on commit of txn to ZIL
illumos/illumos-gate@1c9272b861
1c9272b861

https://www.illumos.org/issues/7570

  Based on the discovery that every unmap waits for the commit of the txn to the ZIL,
  introducing a very high latency to unmap commands, this behavior was made into a
  tunable zvol_unmap_sync_enabled and set to false. The net impact of this change is
  that by default SCSI unmap commands will result in space being freed within the zvol
  (today they are ignored and returned with good status). However, unlike the code
  today, instead of 18+ms per unmap, they take about 30us.

  With the testing done on NTFS against a Win2k12 target, the new behavior should work
  seamlessly. Files on the zvol that have already been set with the zfree application
  will continue to write 0's when deleted, and any new files created since zvol
  creation will send unmap commands when deleted. This behavior exists today, but with
  this change the unmap commands will be processed and result in reclaim of space.

Author: Stephen Blinick <stephen.blinick@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Approved by: Robert Mustacchi <rm@joyent.com>
2017-02-25 19:16:20 +00:00
Josh Paetzel
acf1cb602b 6676 Race between unique_insert() and unique_remove() causes ZFS fsid change
illumos/illumos-gate@40510e8eba
40510e8eba

https://www.illumos.org/issues/6676

  The fsid of zfs filesystems might change after reboot or remount. The problem seems to
  be caused by a race between unique_insert() and unique_remove(). The unique_remove()
  is called from dsl_dataset_evict() which is now an asynchronous thread. In a case the
  dsl_dataset_evict() thread is very slow and calls unique_remove() too late we will end
  up with changed fsid on zfs mount.

  This problem is very likely caused by #5056.

  Steps to Reproduce
  Note: I'm able to reproduce this always on a single core (virtual) machine. On multicore
  machines it is not so easy to reproduce.

# uname -a
SunOS openindiana 5.11 illumos-633aa80 i86pc i386 i86pc Solaris
# zfs create rpool/TEST
# FS=$(echo ::fsinfo | mdb -k | grep TEST | awk '{print $1}')
# echo $FS::print vfs_t vfs_fsid | mdb -k
vfs_fsid = {
    vfs_fsid.val = [ 0x54d7028a, 0x70311508 ]
}
# zfs umount rpool/TEST
# zfs mount rpool/TEST
# FS=$(echo ::fsinfo | mdb -k | grep TEST | awk '{print $1}')
# echo $FS::print vfs_t vfs_fsid | mdb -k
vfs_fsid = {
    vfs_fsid.val = [ 0xd9454e49, 0x6b36d08 ]
}
#

  Impact
  The persistent fsid (filesystem id) is essential for proper NFS functionality.
  If the fsid of a filesystem changes on remount (or after reboot) the NFS
  clients might not be able to automatically recover from such event and the
  manual remount of the NFS filesystems on every NFS client might be needed.

Author: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Dan Vatca <dan.vatca@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
2017-02-25 03:34:22 +00:00
Josh Paetzel
91c038e77a 7504 kmem_reap hangs spa_sync and administrative tasks
illumos/illumos-gate@405a5a0f5c
https://github.com/illumos/illumos-gate/commit/405a5a0f5c3ab36cb76559467d1a62ba648bd80

https://www.illumos.org/issues/7504

  We see long spa_sync(). We are waiting to hold dp_config_rwlock for writer. Some
  other thread holds dp_config_rwlock for reader, then calls arc_get_data_buf(),
  which finds that arc_is_overflowing()==B_TRUE. So it waits (while holding
  dp_config_rwlock for reader) for arc_reclaim_thread to signal arc_reclaim_waiters_cv.
  Before signaling, arc_reclaim_thread does arc_kmem_reap_now(), which takes ~seconds.

Author: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
2017-02-17 15:00:13 +00:00
Josh Paetzel
281825cdf1 7500 Simplify dbuf_free_range by removing dn_unlisted_l0_blkid
illumos/illumos-gate@653af1b809
653af1b809

https://www.illumos.org/issues/7500
  With the integration of:

    commit 0f6d88aded0d165f5954688a9b13bac76c38da84
    Author: Alex Reece <alex@delphix.com>
    Date:   Sat Jul 26 13:40:04 2014 -0800
    4873 zvol unmap calls can take a very long time for larger datasets

  the dnode's dn_bufs field was changed from a list to a tree. As a result,
  the dn_unlisted_l0_blkid field is no longer necessary.

Author: Stephen Blinick <stephen.blinick@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
2017-02-16 01:44:56 +00:00
Josh Paetzel
1e37d7e555 6569 large file delete can starve out write ops
illumos/illumos-gate@ff5177ee8b
ff5177ee8b

https://www.illumos.org/issues/6569
  The core issue I've found is that there is no throttle for how many
  deletes get assigned to one TXG. As a results when deleting large files
  we end up filling consecutive TXGs with deletes/frees, then write
  throttling other (more important) ops.

  There is an easy test case for this problem. Try deleting several
  large files (at least 1/2 TB) while you do write ops on the same
  pool. What we've seen is performance of these write ops (let's
  call it sideload I/O) would drop to zero.

  More specifically the problem is that dmu_free_long_range_impl()
  can/will fill up all of the dirty data in the pool "instantly",
  before many of the sideload ops can get in. So sideload
  performance will be impacted until all the files are freed.

  The solution we have tested at Nexenta (with positive results)
  creates a relatively simple throttle for how many "free" ops we let
  into one TXG.

  However this solution exposes other problems that should also be
  addressed. If we are to slow down freeing of data that means one
  has to wait even longer (assuming vnode ref count of 1) to get shell
  back after an rm or for NFS thread to finish the free-ing op.
  To avoid this the proposed solution is to call zfs_inactive() async
  for "large" files. Async freeing then begs for the reclaimed space
  to be accounted for in the zpool's "freeing" prop.

  The other issue with having a longer delete is the inability to
  export/unmount for a longer period of time. The proposed solution
  is to interrupt freeing of blocks when a fs is unmounted.

Author: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
2017-01-19 20:44:29 +00:00
Andriy Gapon
aed3e94d2f 3821 Race in rollback, zil close, and zil flush
illumos/illumos-gate@43297f973a
43297f973a

https://www.illumos.org/issues/3821
  We recently had nodes with some of the latest zfs bits panic on us in a
  rollback-heavy environment. The following is from my preliminary analysis:
  Let's look at where we died:
  > $C
  ffffff01ea6b9a10 taskq_dispatch+0x3a(0, fffffffff7d20450, ffffff5551dea920, 1)
  ffffff01ea6b9a60 zil_clean+0xce(ffffff4b7106c080, 7e0f1)
  ffffff01ea6b9aa0 dsl_pool_sync_done+0x47(ffffff4313065680, 7e0f1)
  ffffff01ea6b9b70 spa_sync+0x55f(ffffff4310c1d040, 7e0f1)
  ffffff01ea6b9c20 txg_sync_thread+0x20f(ffffff4313065680)
  ffffff01ea6b9c30 thread_start+8()
  If we dig in we can find that this dataset corresponds to a zone:
  > ffffff4b7106c080::print zilog_t zl_os->os_dsl_dataset->ds_dir->dd_myname
  zl_os->os_dsl_dataset->ds_dir->dd_myname = [ "8ffce16a-13c2-4efa-a233-
  9e378e89877b" ]
  Okay so we have a null taskq pointer. That only happens during the calls to
  zil_open and zil_close. If we poke around we can see that we're actually in
  midst of a rollback:
  > ::pgrep zfs | ::printf "0x%x %s\\n" proc_t . p_user.u_psargs
  0xffffff43262800a0 zfs rollback zones/15714eb6-f5ea-469f-ac6d-
  4b8ab06213c2@marlin_init
  0xffffff54e22a1028 zfs rollback zones/8ffce16a-13c2-4efa-a233-
  9e378e89877b@marlin_init
  0xffffff4362f3a058 zfs rollback zones/0ddb8e49-ca7e-42e1-8fdc-
  4ac4ba8fe9f8@marlin_init
  0xffffff5748e8d020 zfs rollback zones/426357b5-832d-4430-953e-
  10cd45ff8e9f@marlin_init
  0xffffff436b867008 zfs rollback zones/8f36bf37-8a9c-4a44-995c-
  6d1b2751e6f5@marlin_init
  0xffffff4381ad4090 zfs rollback zones/6c8eca18-fbd6-46dd-ac24-
  2ed45cd0da70@marlin_init

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: George Wilson <george.wilson@delphix.com>
2016-11-28 15:09:58 +00:00
Andriy Gapon
aee4554721 7181 race between zfs_mount and zfs_ioc_rollback
illumos/illumos-gate@90f2c094b3
90f2c094b3

https://www.illumos.org/issues/7181
  zfsvfs_setup() is called in both zfs_mount and zfs_resume_fs paths.
  dmu_objset_set_user(zfsvfs->z_os, zfsvfs) is called early in zfsvfs_setup()
  before the setup is actually completed,
  thus an under-constructed zfsvfs becomes visible.
  Additionally, there is nothing to serialize the two call paths. As a result two
  threads can step on each other's toes.
  assertion failed: zilog->zl_clean_taskq == NULL, file:
  ../../common/fs/zfs/zil.c, line: 1772

  > $c
  vpanic()
  0xfffffffffbdf6928()
  zil_open+0x45(ffffff1bbc5dd000, fffffffff7993880)
  zfsvfs_setup+0x84(ffffffb378d77000, 0)
  zfs_resume_fs+0x132(ffffffb378d77000, ffffffb37ddcf000)
  zfs_ioc_rollback+0x96(ffffffb37ddcf000, ffffff01dcdc4cd0, ffffff01aa091000)
  zfsdev_ioctl+0x215(10a00000000, 5a19, 80465f8, 100003, ffffff01ab318368,
  ffffff0004b59e58)
  cdev_ioctl+0x39(10a00000000, 5a19, 80465f8, 100003, ffffff01ab318368,
  ffffff0004b59e58)
  spec_ioctl+0x60(ffffff0197737700, 5a19, 80465f8, 100003,
  ffffff01ab318368, ffffff0004b59e58)
  fop_ioctl+0x55(ffffff0197737700, 5a19, 80465f8, 100003,
  ffffff01ab318368, ffffff0004b59e58)
  ioctl+0x9b(7, 5a19, 80465f8)
  sys_syscall32+0x1f7()

  > ffffff1bbc5dd000::print objset_t os_zil
  os_zil = 0xffffff1c053cf7c0
  > 0xffffff1c053cf7c0::print zilog_t zl_clean_taskq

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Author: Andriy Gapon <andriy.gapon@clusterhq.com>
2016-11-22 11:51:55 +00:00
Andriy Gapon
7f74889d89 7199, 7200 dsl_dataset_rollback_sync may try to free already free blocks
7199 dsl_dataset_rollback_sync may try to free already free blocks
7200 no blocks must be born in a txg after a snaphot is created

illumos/illumos-gate@bfaed0b91e
bfaed0b91e

https://www.illumos.org/issues/7199
  dsl_dataset_rollback_sync may try to free already freed blocks when it calls
  dsl_destroy_head_sync_impl to destroy a temporary clone.
  That happens if a snapshot to which we are rolling back and from which the
  clone is created has some ZIL records.

https://www.illumos.org/issues/7200
  No new blocks must be born in a dataset in the same TXG after a snapshot of the
  dataset is taken.
  Those blocks would have the same blk_birth as the dataset's ds_prev_snap_txg
  and as such they would be presumed to belong o the snapshot while in fact they
  do not.
  All the datasets must be clean before sync tasks are run, so the described
  scenario may happen only if one of the sync tasks dirties the dataset and
  another sync task takes its snapshot.
  Then, there will be another sync pass because of the dirty data and the new
  blocks will be born in the same TXG when the data is written out.
  It seems that almost all of the existing sync tasks modify only MOS and do not
  dirty any objsets.
  The only exception that I've been able to identify so far is the rollback which
  can modify an objset when it zeroes out the objset's ZIL.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Author: Andriy Gapon <andriy.gapon@clusterhq.com>
2016-11-22 11:49:55 +00:00
Andriy Gapon
d414bfe23b 7180 potential race between zfs_suspend_fs+zfs_resume_fs and zfs_ioc_rename
illumos/illumos-gate@690041b9ca
690041b9ca

https://www.illumos.org/issues/7180
  If a filesystem is not unmounted while the rename is being performed, then, for
  example, a concurrect zfs rollback may call zfs_suspend_fs followed by
  zfs_resume_fs on the same filesystem.
  The latter takes the filesystem's name as an argument. If the filesystem name
  changes as a result of the rename, then dmu_objset_hold(osname, zfsvfs, &os)
  call in zfs_resume_fs would fail resulting in a kernel panic.
  So far I have been able to reproduce this problem on FreeBSD where zfs rename
  has -u option that skips the unmounting before doing the renaming.
  But I think that in theory the same problem can occur on illumos as well,
  because the unmounting is done in userland before invoking the rename ioctl and
  there could be a race with, e.g., zfs mount.
  panic: solaris assert: dmu_objset_hold(osname, zfsvfs, &zfsvfs->z_os) == 0 (0x2
  == 0x0), file: /usr/devel/svn/head/sys/cddl/contrib/opensolaris/uts/common/fs/
  zfs/zfs_vfsops.c, line: 2210
  KDB: stack backtrace:
  db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe004df30710
  vpanic() at vpanic+0x182/frame 0xfffffe004df30790
  panic() at panic+0x43/frame 0xfffffe004df307f0
  assfail3() at assfail3+0x2c/frame 0xfffffe004df30810
  zfs_resume_fs() at zfs_resume_fs+0xb9/frame 0xfffffe004df30860
  zfs_ioc_rollback() at zfs_ioc_rollback+0x61/frame 0xfffffe004df308a0
  zfsdev_ioctl() at zfsdev_ioctl+0x65c/frame 0xfffffe004df30940
  devfs_ioctl_f() at devfs_ioctl_f+0x156/frame 0xfffffe004df309a0
  kern_ioctl() at kern_ioctl+0x246/frame 0xfffffe004df30a00
  sys_ioctl() at sys_ioctl+0x171/frame 0xfffffe004df30ae0
  amd64_syscall() at amd64_syscall+0x2db/frame 0xfffffe004df30bf0
  Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe004df30bf0

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Andriy Gapon <andriy.gapon@clusterhq.com>
2016-11-22 11:47:27 +00:00
Andriy Gapon
0a682531f4 3746 ZRLs are racy
illumos/illumos-gate@260af64db7
260af64db7

https://www.illumos.org/issues/3746
  From the original change log:
  It was possible for a reference to be added even with the lock held, and
  for references added just after a lock release to be lost.
  This bug was also independently found and reported in wesunsolve.net
  issues 6985013 6995524.
  In zrl_add(), always use an atomic operation to update the refcount.
  The mutex in the ZRL only guarantees that wakeups occur for waiters on the
  lock. It offers no protection against concurrent updates of the refcount.
  The only refcount transition that is safe to perform without an atomic
  operation is from ZRL_LOCKED back to 0, since this can only be performed
  by the thread which has the ZRL locked.

Authored by: Will Andrews <will@freebsd.org>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Reviewed by: Pavel Zakharov <pavel.zakha@gmail.com>
Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Reviewed by: Justin T. Gibbs <gibbs@scsiguy.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author: Youzhong Yang <yyang@mathworks.com>
2016-10-27 07:11:31 +00:00
Alexander Motin
760c70cd25 7301 zpool export -f should be able to interrupt file freeing
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Author: Alek Pinchuk <alek@nexenta.com>

Closes #175
2016-10-14 11:49:36 +00:00