Commit Graph

631 Commits

Author SHA1 Message Date
Tim Chase
84b0aac5fd Fix atime handling.
Previously, the atime-modifying vnops called ZFS_ACCESSTIME_STAMP()
followed by zfs_inode_update() to update the atime.  However, since atimes
are cached in the znode for delayed writing, the zfs_inode_update()
function would effectively ignore the cached atime by reading it from
the SA.

This commit moves the updating of the atime in the inode into
zfs_tstamp_update_setup() which is called by the ZFS_ACCESSTIME_STAMP()
macro and eliminates the call to zfs_inode_update() in the atime-modifying
vnops.

It's possible the same thing could have been done directly in
zfs_inode_update() but I wasn't sure that it was safe in all cases where
it is called.

The effect is that atime handling is as if "strictatime" were selected;
even if the filesystem is mounted with "relatime".

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1949
2013-12-12 10:23:58 -08:00
david.chen
be5db977ea Remove MAX when initializing arc_c_max
The MAX when initializing arc_c_max doesn't make any sense because
it hasn't been set anywhere before. Though, arc_c_max should be
implicitly set to zero when initializing arc_stats, so the MAX
doesn't make any difference.

The MAX was mistakenly left if place when the Illumos default
values were changed for Linux.

Signed-off-by: david.chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1941
2013-12-10 10:05:40 -08:00
Ned Bass
b6e335bfc4 Revert "Use directory xattrs for symlinks"
This reverts commit 6a7c0ccca4.

A proper fix for Issue #1648 was landed under Issue #1890, so this is no
longer needed.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1648
2013-12-10 09:48:30 -08:00
James Pan
472e7c6085 sa_find_sizes() may compute wrong SA header size
Under the right conditions sa_find_sizes() will compute an incorrect
size of the system attribute (SA) header.  This causes a failed assertion
when the SA_HDR_SIZE_MATCH_LAYOUT() test returns false, and may lead
to corruption of SA data.

The bug presents itself when there are more than two variable-length SAs
of just the right size to fit in the bonus buffer of a dnode.  The
existing logic fails to account for the SA header space needed to store
the sizes of all the variable-length SAs.

A reproducer was possible on Linux by setting the xattr=sa dataset
property and storing xattrs on symbolic links (Issue #1648).  Note the
corrupt link target name:

$ zfs set xattr=sa tank/fish
$ cd /tank/fish
$ ln -fs 12345678901234567 link
$ setfattr -n trusted.0000000000000000000 -v 0x000000000000000000000000 -h link
$ setfattr -n trusted.1111111111111111111 -v 0x000000000000000000000000 -h link
$ ls -l link
lrwxrwxrwx 1 root root 17 Dec  6 15:40 link -> 90123456701234567

Commit 6a7c0ccca4 worked around this bug
by forcing xattr's on symlinks to be stored in directory format.  This
change implements a proper fix, so the workaround can now be reverted.

The reference link below contains a reproducer for FreeBSD.

References:
  http://lists.open-zfs.org/pipermail/developer/2013-November/000306.html

Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1890
2013-12-10 09:48:15 -08:00
Brian Behlendorf
90ee9ed32f Fix 'zfs diff' shares error
When creating a dataset with ZoL a zsb->z_shares_dir ZAP object
will not be created because shares are unimplemented.  Instead ZoL
just sets zsb->z_shares_dir to zero to indicate there are no shares.

However, if you import a pool which was created with a different
ZFS implementation then the shares ZAP object may exist.  Code was
added to handle this case but it clearly wasn't sufficiently tested
with other ZFS pools.

There was a bug in the zpl_shares_getattr() function which passed
the wrong inode to zfs_getattr_fast() for the case where are shares
ZAP object does exist.  This causes an EIO to be returned to stat64()
which in turn causes 'zfs diff' to fail.

This fix is the pass the correct inode after a sucessful zfs_zget().
Additionally, only put away the references if we were able to get one.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Graham Booker <https://github.com/gbooker>
Signed-off-by: timemaster67 <https://github.com/timemaster67>
Closes #1426
Closes #481
2013-12-06 09:42:39 -08:00
Brian Behlendorf
99e349db92 Add module versioning
Use the standard Linux MODULE_VERSION macro to expose the installed
zavl, znvpair, zunicode, zcommon, zfs, and zpios module versions.
This will also automatically add a checksum of the .c files and
headers in "srcversion".  See:

  /sys/module/zavl/version
  /sys/module/zavl/srcversion
  /sys/module/znvpair/version
  /sys/module/znvpair/srcversion
  /sys/module/zunicode/version
  /sys/module/zunicode/srcversion
  /sys/module/zcommon/version
  /sys/module/zcommon/srcversion
  /sys/module/zfs/version
  /sys/module/zfs/srcversion
  /sys/module/zpios/version
  /sys/module/zpios/srcversion

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1923
2013-12-06 09:34:41 -08:00
Matthew Ahrens
e8b96c6007 Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work

1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver.  The scheduler
issues a number of concurrent i/os from each class to the device.  Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes).  The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is.  See the block comment in vdev_queue.c (reproduced
below) for more details.

2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load.  The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system.  When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount.  This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens.  One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync().  Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes.  See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.

This diff has several other effects, including:

 * the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.

 * the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently.  There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.

 * zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc.  This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).

--matt

APPENDIX: problems with the current i/o scheduler

The current ZFS i/o scheduler (vdev_queue.c) is deadline based.  The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.

For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due".  One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).

If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os.  This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future.  If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due.  Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).

Notes on porting to ZFS on Linux:

- zio_t gained new members io_physdone and io_phys_children.  Because
  object caches in the Linux port call the constructor only once at
  allocation time, objects may contain residual data when retrieved
  from the cache. Therefore zio_create() was updated to zero out the two
  new fields.

- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
  (vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
  This tree has been replaced by vq->vq_active_tree which is now used
  for the same purpose.

- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
  the number of vdev I/O buffers to pre-allocate.  That global no longer
  exists, so we instead use the sum of the *_max_active values for each of
  the five I/O classes described above.

- The Illumos implementation of dmu_tx_delay() delays a transaction by
  sleeping in condition variable embedded in the thread
  (curthread->t_delay_cv).  We do not have an equivalent CV to use in
  Linux, so this change replaced the delay logic with a wrapper called
  zfs_sleep_until(). This wrapper could be adopted upstream and in other
  downstream ports to abstract away operating system-specific delay logic.

- These tunables are added as module parameters, and descriptions added
  to the zfs-module-parameters.5 man page.

  spa_asize_inflation
  zfs_deadman_synctime_ms
  zfs_vdev_max_active
  zfs_vdev_async_write_active_min_dirty_percent
  zfs_vdev_async_write_active_max_dirty_percent
  zfs_vdev_async_read_max_active
  zfs_vdev_async_read_min_active
  zfs_vdev_async_write_max_active
  zfs_vdev_async_write_min_active
  zfs_vdev_scrub_max_active
  zfs_vdev_scrub_min_active
  zfs_vdev_sync_read_max_active
  zfs_vdev_sync_read_min_active
  zfs_vdev_sync_write_max_active
  zfs_vdev_sync_write_min_active
  zfs_dirty_data_max_percent
  zfs_delay_min_dirty_percent
  zfs_dirty_data_max_max_percent
  zfs_dirty_data_max
  zfs_dirty_data_max_max
  zfs_dirty_data_sync
  zfs_delay_scale

  The latter four have type unsigned long, whereas they are uint64_t in
  Illumos.  This accommodates Linux's module_param() supported types, but
  means they may overflow on 32-bit architectures.

  The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
  likely to overflow on 32-bit systems, since they express physical RAM
  sizes in bytes.  In fact, Illumos initializes zfs_dirty_data_max_max to
  2^32 which does overflow. To resolve that, this port instead initializes
  it in arc_init() to 25% of physical RAM, and adds the tunable
  zfs_dirty_data_max_max_percent to override that percentage.  While this
  solution doesn't completely avoid the overflow issue, it should be a
  reasonable default for most systems, and the minority of affected
  systems can work around the issue by overriding the defaults.

- Fixed reversed logic in comment above zfs_delay_scale declaration.

- Clarified comments in vdev_queue.c regarding when per-queue minimums take
  effect.

- Replaced dmu_tx_write_limit in the dmu_tx kstat file
  with dmu_tx_dirty_delay and dmu_tx_dirty_over_max.  The first counts
  how many times a transaction has been delayed because the pool dirty
  data has exceeded zfs_delay_min_dirty_percent.  The latter counts how
  many times the pool dirty data has exceeded zfs_dirty_data_max (which
  we expect to never happen).

- The original patch would have regressed the bug fixed in
  zfsonlinux/zfs@c418410, which prevented users from setting the
  zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
  A similar fix is added to vdev_queue_aggregate().

- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
  heap instead of the stack.  In Linux we can't afford such large
  structures on the stack.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  http://www.illumos.org/issues/4045
  illumos/illumos-gate@69962b5647

Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-12-06 09:32:43 -08:00
Matthew Ahrens
384f8a09f8 Illumos #4347 ZPL can use dmu_tx_assign(TXG_WAIT)
Fix a lock contention issue by allowing threads not holding
ZPL locks to block when waiting to assign a transaction.

Porting Notes:

zfs_putpage() still uses TXG_NOWAIT, unlike the upstream version.  This
case may be a contention point just like zfs_write(), however it is not
safe to block here since it may be called during memory reclaim.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Boris Protopopov <boris.protopopov@nexenta.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/4347
  illumos/illumos-gate@e722410c49

Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-12-06 09:30:51 -08:00
Brian Behlendorf
2e40f09410 Remove incorrect ASSERT in zfs_sb_teardown()
As part of zfs_sb_teardown() there is an assertion that all inodes
which are part of the zsb->z_all_znodes list have at least one
reference on them.  This is always true for the standard unmount
case but there are two other cases where it is not strictly true.

* zfs_ioc_rollback() - This is the most common case and it results
  from the fact that we aren't unmounting the filesystem.  During a
  normal unmount the MS_ACTIVE flag will be cleared on the super block
  causing iput_final() to evict the inode when its reference count
  drops to zero.  However, during a rollback MS_ACTIVE remains set
  since we're rolling back a live filesystem and need to preserve the
  existing super block.  This allows inodes with a zero reference count
  to stay in the cache thereby violating the assertion.

* destroy_inode() / zfs_sb_teardown() - There exists a small race
  between dropping the last reference on an inode and removing it from
  the zsb->z_all_znodes list.  This is unlikely to occur but could also
  trigger the assertion which is incorrect.  The inode may safely have
  a zero reference count in this case.

Since allowing a zero reference count on the inode is expected and
safe for both of these cases the simplest thing to do is remove the
ASSERT.  This code is only enabled for default builds so removing
this entirely is a very safe change.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #1417
Closes #1536
2013-12-02 15:58:58 -08:00
Tim Chase
f707635fa5 Some nvlist allocations in hold processing need to use KM_PUSHPAGE.
This should hopefully catch the rest of the allocations in the
user hold/release processing that were missed by commit
65c67ea86e.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1852
Closes #1855
2013-12-02 14:02:46 -08:00
Etienne Dechamps
119a394ab0 Only commit the ZIL once in zpl_writepages() (msync() case).
Currently, using msync() results in the following code path:

    sys_msync -> zpl_fsync -> filemap_write_and_wait_range -> zpl_writepages -> write_cache_pages -> zpl_putpage

In such a code path, zil_commit() is called as part of zpl_putpage().
This means that for each page, the write is handed to the DMU, the ZIL
is committed, and only then do we move on to the next page. As one might
imagine, this results in atrocious performance where there is a large
number of pages to write: instead of committing a batch of N writes,
we do N commits containing one page each. In some extreme cases this
can result in msync() being ~700 times slower than it should be, as well
as very inefficient use of ZIL resources.

This patch fixes this issue by making sure that the requested writes
are batched and then committed only once. Unfortunately, the
implementation is somewhat non-trivial because there is no way to run
write_cache_pages in SYNC mode (so that we get all pages) without
making it wait on the writeback tag for each page.

The solution implemented here is composed of two parts:

 - I added a new callback system to the ZIL, which allows the caller to
   be notified when its ITX gets written to stable storage. One nice
   thing is that the callback is called not only in zil_commit() but
   in zil_sync() as well, which means that the caller doesn't have to
   care whether the write ended up in the ZIL or the DMU: it will get
   notified as soon as it's safe, period. This is an improvement over
   dmu_tx_callback_register() that was used previously, which only
   supports DMU writes. The rationale for this change is to allow
   zpl_putpage() to be notified when a ZIL commit is completed without
   having to block on zil_commit() itself.

 - zpl_writepages() now calls write_cache_pages in non-SYNC mode, which
   will prevent (1) write_cache_pages from blocking, and (2) zpl_putpage
   from issuing ZIL commits. zpl_writepages() will issue the commit
   itself instead of relying on zpl_putpage() to do it, thus nicely
   batching the writes. Note, however, that we still have to call
   write_cache_pages() again in SYNC mode because there is an edge case
   documented in the implementation of write_cache_pages() whereas it
   will not give us all dirty pages when running in non-SYNC mode. Thus
   we need to run it at least once in SYNC mode to make sure we honor
   persistency guarantees. This only happens when the pages are
   modified at the same time msync() is running, which should be rare.
   In most cases there won't be any additional pages and this second
   call will do nothing.

Note that this change also fixes a bug related to #907 whereas calling
msync() on pages that were already handed over to the DMU in a previous
writepages() call would make msync() block until the next TXG sync
instead of returning as soon as the ZIL commit is complete. The new
callback system fixes that problem.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1849
Closes #907
2013-11-23 15:08:29 -08:00
Brian Behlendorf
e3dc14b861 Add I/O Read/Write Accounting
Because ZFS bypasses the page cache we don't inherit per-task I/O
accounting for free.  However, the Linux kernel does provide helper
functions allow us to perform our own accounting.  These are most
commonly used for direct IO which also bypasses the page cache, but
they can be used for the common read/write call paths as well.

Signed-off-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #313
Closes #1275
2013-11-21 08:56:24 -08:00
Steven Hartland
e5bacf2109 Illumos #4322
4322 ZFS deadlock on dp_config_rwlock
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Ilya Usvyatsky <ilya.usvyatsky@nexenta.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/4322
  illumos/illumos-gate@c50d56f667

Ported by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1886
2013-11-20 15:27:32 -08:00
Brian Behlendorf
64ad2b26e2 Remove the slog restriction on bootfs pools
Under Linux this restriction does not apply because we have access
to all the required devices.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1631
2013-11-14 14:28:35 -08:00
Matthew Thode
227bc96951 Fixes (extends) support for selinux xattrs to more inode types
Properly initialize SELinux xattrs for all inode types.  The
initial implementation accidentally only did this for files.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1832
2013-11-14 14:28:35 -08:00
Brian Behlendorf
a168788053 Reduce stack for traverse_visitbp() recursion
During pool import stack overflows may still occur due to the
potentially deep recursion of traverse_visitbp().  This is most
likely to occur when additional layers are added to the block
device stack such as DM multipath.  To minimize the stack usage
for this call path the following changes were made:

1) Added the keywork 'noinline' to the vdev_*_map_alloc() functions
   to prevent them from being inlined by gcc.  This reduced the
   stack usage of vdev_raidz_io_start() from 208 to 128 bytes, and
   vdev_mirror_io_start() from 144 to 128 bytes.

2) The 'saved_poolname' charater array in zfsdev_ioctl() was moved
   from the stack to the heap.  This reduced the stack usage of
   zfsdev_ioctl() from 368 to 112 bytes.

3) The major saving came from slimming down traverse_visitbp() from
   from 224 to 144 bytes.  Since this function is called recursively
   the 80 bytes saved per invokation adds up.  The following changes
   were made:

  a) The 'hard' local variable was replaced by a TD_HARD() macro.

  b) The 'pd' local variable was replaced by 'td->td_pfd' references.

  c) The zbookmark_t was moved to the heap.  This does cost us an
     additional memory allocation per recursion by that cost should
     still be minimal.  The cost could be further reduced by adding
     a dedicated zbookmark_t slab cache.

  d) The variable declarations in 'if (BP_GET_LEVEL()) { }' were
     restructured to use the minimum amount of stack.  This includes
     removing the 'cbp' local variable.

Overall for the offending use case roughly 1584 of total stack space
has been saved.  This is enough to avoid overflowing the stack on
stock kernels with 8k stacks.  See #1778 for additional details.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #1778
2013-11-14 14:28:12 -08:00
Tim Chase
65c67ea86e Some nvlist allocations in hold processing need to use KM_PUSHPAGE.
Commit 95fd54a1c5 restructured the
hold/release processing and moved some of the work into the sync task.
A number of nvlist allocations now need to use KM_PUSHPAGE.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1852
Closes #1855
2013-11-14 11:11:37 -08:00
Tim Chase
2008e9209f Fix rollback of mounted filesystem regression
The Illumos #3875 patch reverted a part of ZoL's 7b3e34b which added
special-case error handling for zfs_rezget().  The error handling dealt
with the case in which an all-ones object number ended up being passed
to dnode_hold() and causing an EINVAL to be returned from zfs_rezget().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1859
Closes #1861
2013-11-14 10:44:03 -08:00
Tim Chase
fd4f76160c Handle concurrent snapshot automounts failing due to EBUSY.
In the current snapshot automount implementation, it is possible for
multiple mounts to attempted concurrently.  Only one of the mounts will
succeed and the other will fail.  The failed mounts will cause an EREMOTE
to be propagated back to the application.

This commit works around the problem by adding a new exit status,
MOUNT_BUSY to the mount.zfs program which is used when the underlying
mount(2) call returns EBUSY.  The zfs code detects this condition and
treats it as if the mount had succeeded.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1819
2013-11-08 10:45:14 -08:00
Massimo Maggi
b695c34ea4 Honor CONFIG_FS_POSIX_ACL kernel option
The required Posix ACL interfaces are only available for kernels
with CONFIG_FS_POSIX_ACL defined.  Therefore, only enable Posix
ACL support for these kernels.  All major distribution kernels
enable CONFIG_FS_POSIX_ACL by default.

If your kernel does not support Posix ACLs the following warning
will be printed at ZFS module load time.

  "ZFS: Posix ACLs disabled by kernel"

Signed-off-by: Massimo Maggi <me@massimo-maggi.eu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1825
2013-11-05 16:22:05 -08:00
Matthew Ahrens
78e2739d3c 26126 panic system rather than corrupting pool if we hit bug 26100
References:
  delphix/delphix-os@931c8aaab7

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1650
2013-11-05 13:18:26 -08:00
Brian Behlendorf
2517c8ee08 Switch allocations from KM_SLEEP to KM_PUSHPAGE
A couple of kmem_alloc() allocations were using KM_SLEEP in
the sync thread context.  These were accidentally introduced
by the recent set of Illumos patches.  The solution is to
switch to KM_PUSHPAGE.

dsl_dataset_promote_sync() -> promote_hold() -> snaplist_make() ->
kmem_alloc(sizeof (*snap), KM_SLEEP);

dsl_dataset_user_hold_sync() -> dsl_onexit_hold_cleanup() ->
kmem_alloc(sizeof (*ca), KM_SLEEP)

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-11-05 12:26:14 -08:00
Saso Kiselkov
1ca546b338 Illumos #3995
3995 Memory leak of compressed buffers in l2arc_write_done

References:
  https://illumos.org/issues/3995

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1688
Issue #1775
2013-11-05 12:26:00 -08:00
George Wilson
43a696ed38 Illumos #4168, #4169, #4170
4168 ztest assertion failure in dbuf_undirty
4169 verbatim import causes zdb to segfault
4170 zhack leaves pool in ACTIVE state
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/4168
  https://www.illumos.org/issues/4169
  https://www.illumos.org/issues/4170
  illumos/illumos-gate@7fdd916c47

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-11-05 12:25:44 -08:00
Matthew Ahrens
92bc214c2e Illumos #4082
4082 zfs receive gets EFBIG from dmu_tx_hold_free()
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/4082
  illumos/illumos-gate@5253393b09

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-11-05 12:25:26 -08:00
George Wilson
ac72fac3ea Illumos #3954, #4080, #4081
3954 metaslabs continue to load even after hitting zfs_mg_alloc_failure limit
4080 zpool clear fails to clear pool
4081 need zfs_mg_noalloc_threshold
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/3954
  https://www.illumos.org/issues/4080
  https://www.illumos.org/issues/4081
  illumos/illumos-gate@22e30981d8

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-11-05 12:25:01 -08:00
Matthew Ahrens
a169a625a6 Illumos #4046
4046 dsl_dataset_t ds_dir->dd_lock is highly contended
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/4046
  illumos/illumos-gate@b62969f868

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Porting notes:

1. This commit removed dsl_dataset_namelen in Illumos, but that
   appears to have been removed from ZFSOnLinux in an earlier commit.
2013-11-05 12:24:24 -08:00
Matthew Ahrens
b663a23d36 Illumos #4047
4047 panic from dbuf_free_range() from dmu_free_object() while
     doing zfs receive
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/4047
  illumos/illumos-gate@713d6c2088

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Porting notes:

1. The exported symbol dmu_free_object() was renamed to
   dmu_free_long_object() in Illumos.
2013-11-05 12:23:35 -08:00
Matthew Ahrens
46ba1e59d3 Illumos #3996
3996 want a libzfs_core API to rollback to latest snapshot
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andy Stormont <andyjstormont@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/3996
  illumos/illumos-gate@a7027df17f

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-11-05 12:23:11 -08:00
George Wilson
5d1f7fb647 Illumos #3956, #3957, #3958, #3959, #3960, #3961, #3962
3956 ::vdev -r should work with pipelines
3957 ztest should update the cachefile before killing itself
3958 multiple scans can lead to partial resilvering
3959 ddt entries are not always resilvered
3960 dsl_scan can skip over dedup-ed blocks if physical birth != logical birth
3961 freed gang blocks are not resilvered and can cause pool to suspend
3962 ztest should print out zfs debug buffer before exiting
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/3956
  https://www.illumos.org/issues/3957
  https://www.illumos.org/issues/3958
  https://www.illumos.org/issues/3959
  https://www.illumos.org/issues/3960
  https://www.illumos.org/issues/3961
  https://www.illumos.org/issues/3962
  illumos/illumos-gate@b4952e17e8

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting notes:

1. zfs_dbgmsg_print() is only used in userland. Since we do not have
   mdb on Linux, it does not make sense to make it available in the
   kernel. This means that a build failure will occur if any future
   kernel patch depends on it. However, that is unlikely given that
   this functionality was added to support zdb.

2. zfs_dbgmsg_print() is only invoked for -VVV or greater log levels.
   This preserves the existing behavior of minimal noise when running
   with -V, and -VV.

3. In vdev_config_generate() the call to nvlist_alloc() was not
   changed to fnvlist_alloc() because we must pass KM_PUSHPAGE in
   the txg_sync context.
2013-11-05 12:23:05 -08:00
George Wilson
621dd7bb2c Illumos #3949, #3950, #3952, #3953
3949 ztest fault injection should avoid resilvering devices
3950 ztest: deadman fires when we're doing a scan
3951 ztest hang when running dedup test
3952 ztest: ztest_reguid test and ztest_fault_inject don't place nice together
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/3949
  https://www.illumos.org/issues/3950
  https://www.illumos.org/issues/3951
  https://www.illumos.org/issues/3952
  illumos/illumos-gate@2c1e2b4414

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Porting notes:

1. The deadman thread was removed from ztest during the original
   port because it depended on Solaris thr_create() interface.
   This functionality should be reintroduced using the more
   portable pthreads.
2013-11-05 12:17:07 -08:00
Matthew Ahrens
383fc4a997 Illumos #3955
3955 ztest failure: assertion refcount_count(&tx->tx_space_written) +
     delta <= tx->tx_space_towrite
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/3955
  illumos/illumos-gate@be9000cc67

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-11-05 12:16:14 -08:00
Steven Hartland
9554185d90 Illumos #3973
3973 zfs_ioc_rename alters passed in zc->zc_name
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://www.illumos.org/issues/3973
  illumos/illumos-gate@a0c1127b14

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-11-05 12:15:50 -08:00
Matthew Ahrens
ea97f8ce35 Illumos #3834
3834 incremental replication of 'holey' file systems is slow
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/3834
  illumos/illumos-gate@ca48f36f20

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-11-05 12:15:00 -08:00
Matthew Ahrens
2883cad5b7 Illumos #3836
3836 zio_free() can be processed immediately in the common case
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/3836
  illumos/illumos-gate@9cb154a3c9

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-11-05 12:14:56 -08:00
Matthew Ahrens
498877baf5 Illumos #3112, #3113, #3114
3112 ztest does not honor ZFS_DEBUG
3113 ztest should use watchpoints to protect frozen arc bufs
3114 some leaked nvlists in zfsdev_ioctl

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matt Amdur <Matt.Amdur@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Approved by: Eric Schrock <eric.schrock@delphix.com>

References:
  https://www.illumos.org/issues/3112
  https://www.illumos.org/issues/3113
  https://www.illumos.org/issues/3114
  illumos/illumos-gate@cd1c8b85eb

The /proc/self/cmd watchpoint interface is specific to Solaris.
Therefore, the #3113 implementation was reworked to use the more
portable mprotect(2) system call.  When the pages are watched they
are marked read-only for protection.  Any write to the protected
address range immediately trigger a SIGSEGV.  The pages are marked
writable again when they are unwatched.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1489
2013-11-05 12:14:48 -08:00
George Wilson
03c6040bee Illumos #3236
3236 zio nop-write
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  illumos/illumos-gate@80901aea8e
  https://www.illumos.org/issues/3236

Porting Notes

1. This patch is being merged dispite an increased instance of
   https://www.illumos.org/issues/3113 being triggered by ztest.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1489
2013-11-05 12:14:21 -08:00
Keith M Wesolowski
831baf06ef Illumos #3875
3875 panic in zfs_root() after failed rollback
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
  https://www.illumos.org/issues/3875
  illumos/illumos-gate@91948b51b8

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-11-04 11:27:41 -08:00
Matthew Ahrens
1958067629 Illumos #3888
3888 zfs recv -F should destroy any snapshots created since
     the incremental source
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Peng Dai <peng.dai@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/3888
  illumos/illumos-gate@34f2f8cf94

Porting notes:

1. Commit 1fde1e3720 wrapped a
   declaration in dsl_dataset_modified_since_lastsnap in ASSERTV().
   The ASSERTV() and local variable have been removed to avoid an
   unused variable warning.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Richard Yao <ryao@gentoo.org>
Issue #1775
2013-11-04 11:18:14 -08:00
Keith M Wesolowski
96c2e96193 Illumos #3894
3894 zfs should not allow snapshot of inconsistent dataset
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
  https://www.illumos.org/issues/3894
  illumos/illumos-gate@ca48f36f20

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-11-04 11:18:14 -08:00
Matthew Ahrens
1a077756e8 Illumos #3829
3829 fix for 3740 changed behavior of zfs destroy/hold/release ioctl
Reviewed by: Matt Amdur <matt.amdur@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/3829
  illumos/illumos-gate@bb6e70758d

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-11-04 11:18:14 -08:00
Steven Hartland
95fd54a1c5 Illumos #3740
3740 Poor ZFS send / receive performance due to snapshot
     hold / release processing
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://www.illumos.org/issues/3740
  illumos/illumos-gate@a7a845e4bf

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Porting notes:

1. 13fe019870 introduced a merge conflict
   in dsl_dataset_user_release_tmp where some variables were moved
   outside of the preprocessor directive.

2. dea9dfefdd747534b3846845629d2200f0616dad made the previous merge
   conflict worse by switching KM_SLEEP to KM_PUSHPAGE. This is notable
   because this commit refactors the code, adding a new KM_SLEEP
   allocation. It is not clear to me whether this should be converted
   to KM_PUSHPAGE.

3. We had a merge conflict in libzfs_sendrecv.c because of copyright
   notices.

4. Several small C99 compatibility fixed were made.
2013-11-04 11:17:48 -08:00
Will Andrews
d09f25dc66 Illumos #3744
3744 zfs shouldn't ignore errors unmounting snapshots
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://www.illumos.org/issues/3744
  illumos/illumos-gate@fc7a6e3fef

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Porting notes:

1. There is no clear way to distinguish between a failure when we
   tried to unmount the snapdir of a zvol (which does not exist)
   and the failure when we try to unmount a snapdir of a dataset,
   so the changes to zfs_unmount_snap() were dropped in favor of
   an altered Linux function that unconditionally returns 0.
2013-11-04 10:55:25 -08:00
Will Andrews
3a84951d7d Illumos #3743
3743 zfs needs a refcount audit
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://www.illumos.org/issues/3743
  illumos/illumos-gate@b287be1ba8

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-11-04 10:55:25 -08:00
Will Andrews
d3cc8b152e Illumos #3742
3742 zfs comments need cleaner, more consistent style
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://www.illumos.org/issues/3742
  illumos/illumos-gate@f717074149

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Porting notes:

1. The change to zfs_vfsops.c was dropped because it involves
   zfs_mount_label_policy, which does not exist in the Linux port.
2013-11-04 10:55:25 -08:00
Will Andrews
e49f1e20a0 Illumos #3741
3741 zfs needs better comments
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://www.illumos.org/issues/3741
  illumos/illumos-gate@3e30c24aee

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-11-04 10:55:25 -08:00
Martin Matuska
b1118acbb1 Illumos #3699, #3739
3699 zfs hold or release of a non-existent snapshot does not output error
3739 cannot set zfs quota or reservation on pool version < 22
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Eric Shrock <eric.schrock@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/3699
  https://www.illumos.org/issues/3739
  illumos/illumos-gate@013023d4ed

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-11-04 10:55:25 -08:00
Adam Leventhal
63fd3c6cfd Illumos #3582, #3584
3582 zfs_delay() should support a variable resolution
3584 DTrace sdt probes for ZFS txg states

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
    https://www.illumos.org/issues/3582
    illumos/illumos-gate@0689f76

Ported by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-11-04 10:55:25 -08:00
Mark Shellenbaum
c1fabe7961 6977619 NULL pointer deference in sa_handle_get_from_db()
References:
  illumos/illumos-gate@44bffe012c

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-11-04 10:54:48 -08:00
Mark Shellenbaum
c0ebc844c7 6939941 problem with moving files in zfs
References:
  illumos/illumos-gate@d39ee142a9

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Porting notes:

1. This commit was so old that only two lines applied to the modern
   code base.
2013-11-04 10:53:18 -08:00