Commit Graph

525 Commits

Author SHA1 Message Date
Matthew Ahrens
13fe019870 Illumos #3464
3464 zfs synctask code needs restructuring
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/3464
  illumos/illumos-gate@3b2aab1880

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1495
2013-09-04 16:01:24 -07:00
Matthew Ahrens
6f1ffb0665 Illumos #2882, #2883, #2900
2882 implement libzfs_core
2883 changing "canmount" property to "on" should not always remount dataset
2900 "zfs snapshot" should be able to create multiple, arbitrary snapshots at once

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Chris Siden <christopher.siden@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Dan Kruchinin <dan.kruchinin@gmail.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>

References:
  https://www.illumos.org/issues/2882
  https://www.illumos.org/issues/2883
  https://www.illumos.org/issues/2900
  illumos/illumos-gate@4445fffbbb

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1293

Porting notes:

WARNING: This patch changes the user/kernel ABI.  That means that
the zfs/zpool utilities built from master are NOT compatible with
the 0.6.2 kernel modules.  Ensure you load the matching kernel
modules from master after updating the utilities.  Otherwise the
zfs/zpool commands will be unable to interact with your pool and
you will see errors similar to the following:

  $ zpool list
  failed to read pool configuration: bad address
  no pools available

  $ zfs list
  no datasets available

Add zvol minor device creation to the new zfs_snapshot_nvl function.

Remove the logging of the "release" operation in
dsl_dataset_user_release_sync().  The logging caused a null dereference
because ds->ds_dir is zeroed in dsl_dataset_destroy_sync() and the
logging functions try to get the ds name via the dsl_dataset_name()
function. I've got no idea why this particular code would have worked
in Illumos.  This code has subsequently been completely reworked in
Illumos commit 3b2aab1 (3464 zfs synctask code needs restructuring).

Squash some "may be used uninitialized" warning/erorrs.

Fix some printf format warnings for %lld and %llu.

Apply a few spa_writeable() changes that were made to Illumos in
illumos/illumos-gate.git@cd1c8b8 as part of the 3112, 3113, 3114 and
3115 fixes.

Add a missing call to fnvlist_free(nvl) in log_internal() that was added
in Illumos to fix issue 3085 but couldn't be ported to ZoL at the time
(zfsonlinux/zfs@9e11c73) because it depended on future work.
2013-09-04 15:49:00 -07:00
Brian Behlendorf
6a7c0ccca4 Use directory xattrs for symlinks
There is currently a subtle bug in the SA implementation which
can crop up which prevents us from safely using multiple variable
length SAs in one object.

Fortunately, the only existing use case for this are symlinks with
SA based xattrs.  Therefore, until the root cause in the SA code
can be identified and fixed we prevent adding SA xattrs to symlinks.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1468
2013-08-22 13:30:44 -07:00
Brian Behlendorf
c273d60d80 Revert "Evict meta data from ghost lists + l2arc headers"
This reverts commit fadd0c4da1 which
introduced a regression in honoring the meta limit.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Close #1660
2013-08-22 12:15:37 -07:00
Richard Yao
0f37d0c8be Linux 3.11 compat: fops->iterate()
Commit torvalds/linux@2233f31aad
replaced ->readdir() with ->iterate() in struct file_operations.
All filesystems must now use the new ->iterate method.

To handle this the code was reworked to use the new ->iterate
interface.  Care was taken to keep the majority of changes
confined to the ZPL layer which is already Linux specific.
However, minor changes were required to the common zfs_readdir()
function.

Compatibility with older kernels was accomplished by adding
versions of the trivial dir_emit* helper functions.  Also the
various *_readdir() functions were reworked in to wrappers
which create a dir_context structure to pass to the new
*_iterate() functions.

Unfortunately, the new dir_emit* functions prevent us from
passing a private pointer to the filldir function.  The xattr
directory code leveraged this ability through zfs_readdir()
to generate the list of xattr names.  Since we can no longer
use zfs_readdir() a simplified zpl_xattr_readdir() function
was added to perform the same task.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1653
Issue #1591
2013-08-15 16:19:07 -07:00
Brian Behlendorf
34e143323e Fix z_wr_iss_h zio_execute() import hang
Because we need to be more frugal about our stack usage under
Linux.  The __zio_execute() function was modified to re-dispatch
zios to a ZIO_TASKQ_ISSUE thread when we're in a context which
is known to be stack heavy.  Those two contexts are the sync
thread and what ever thread is performing spa initialization.

Unfortunately, this change introduced an unlikely bug which can
result in a zio being re-dispatched indefinitely and never being
executed.  If during spa initialization we handle a zio with
ZIO_PRIORITY_NOW it will be moved to the high priority queue.
When __zio_execute() is called again for the zio it will mis-
interpret the context and re-dispatch it again.  The system
will get stuck spinning re-dispatching the zio and making no
forward progress.

To fix this rare issue __zio_execute() has been updated not
to re-dispatch zios on either the ZIO_TASKQ_ISSUE or
ZIO_TASKQ_ISSUE_HIGH task queues.

In practice this issue was rarely reported and can usually
be fixed by rebooting the system and importing the pool again.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1455
2013-08-15 15:20:36 -07:00
Matthew Ahrens
cb682a173a Illumos #3618 ::zio dcmd does not show timestamp data
3618 ::zio dcmd does not show timestamp data
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  http://www.illumos.org/issues/3618
  illumos/illumos-gate@c55e05cb35

Notes on porting to ZFS on Linux:

The original changeset mostly deals with mdb ::zio dcmd.
However, in order to provide the requested functionality
it modifies vdev and zio structures to keep the timing data
in nanoseconds instead of ticks. It is these changes that
are ported over in the commit in hand.

One visible change of this commit is that the default value
of 'zfs_vdev_time_shift' tunable is changed:

    zfs_vdev_time_shift = 6
        to
    zfs_vdev_time_shift = 29

The original value of 6 was inherited from OpenSolaris and
was subotimal - since it shifted the raw tick value - it
didn't compensate for different tick frequencies on Linux and
OpenSolaris. The former has HZ=1000, while the latter HZ=100.

(Which itself led to other interesting performance anomalies
under non-trivial load. The deadline scheduler delays the IO
according to its priority - the lower priority the further
the deadline is set. The delay is measured in units of
"shifted ticks". Since the HZ value was 10 times higher,
the delay units were 10 times shorter. Thus really low
priority IO like resilver (delay is 10 units) and scrub
(delay is 20 units) were scheduled much sooner than intended.
The overall effect is that resilver and scrub IO consumed
more bandwidth at the expense of the other IO.)

Now that the bookkeeping is done is nanoseconds the shift
behaves correctly for any tick frequency (HZ).

Ported-by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1643
2013-08-12 16:46:50 -07:00
Richard Yao
570d6edf1d Linux 3.8 compat: Support CONFIG_UIDGID_STRICT_TYPE_CHECKS
When CONFIG_UIDGID_STRICT_TYPE_CHECKS is enabled uid_t/git_t are
replaced by kuid_t/kgid_t, which are structures instead of integral
types. This causes any code that uses an integral type to fail to build.
The User Namespace functionality introduced in Linux 3.8 requires
CONFIG_UIDGID_STRICT_TYPE_CHECKS, so we could not build against any
kernel that supported it.

We resolve this by converting between the new kuid_t/kgid_t structures
and the original uid_t/gid_t types.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1589
2013-08-09 15:31:52 -07:00
Brian Behlendorf
fadd0c4da1 Evict meta data from ghost lists + l2arc headers
When the meta limit is exceeded the ARC evicts some meta data
buffers from the mfu+mru lists.  Unfortunately, for meta data
heavy workloads it's possible for these buffers to accumulate
on the ghost lists if arc_c doesn't exceed arc_size.

To handle this case arc_adjust_meta() has been entended to
explicitly evict meta data buffers from the ghost lists in
proportion to what was evicted from the mfu+mru lists.

If this is insufficient we request that the VFS release
some inodes and dentries.  This will result in the release
of some dnodes which are counted as 'other' metadata.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-08-09 10:06:12 -07:00
Brian Behlendorf
68121a03da Allow arc_evict_ghost() to only evict meta data
The default behavior of arc_evict_ghost() is to start by evicting
data buffers.  Then only if the requested number of bytes to evict
cannot be satisfied by data buffers move on to meta data buffers.

This is ideal for honoring arc_c since it's preferable to keep the
meta data cached.  However, if we're trying to free memory from the
arc to honor the meta limit it's a problem because we will need to
discard all the data to get to the meta data.

To avoid this issue the arc_evict_ghost() is now passed a fourth
argumented describing which buffer type to start with.  The
arc_evict() function already behaves exactly like this for a
same reason so this is consistent with the existing code.

All existing callers have been updated to pass ARC_BUFC_DATA so
this patch introduces no functional change.  New callers may
pass ARC_BUFC_METADATA to skip immediately to evicting meta
data leaving the normal data untouched.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-08-09 10:06:08 -07:00
Saso Kiselkov
3a17a7a99a Illumos #3137 L2ARC compression
3137 L2ARC compression
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  illumos/illumos-gate@aad02571bc
  https://www.illumos.org/issues/3137
  http://wiki.illumos.org/display/illumos/L2ARC+Compression

Notes for Linux port:

A l2arc_nocompress module option was added to prevent the
compression of l2arc buffers regardless of how a dataset's
compression property is set.  This allows the legacy behavior
to be preserved.

Ported by: James H <james@kagisoft.co.uk>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1379
2013-08-08 13:27:21 -07:00
Richard Yao
c11a12bc3b Return -1 from arc_shrinker_func()
This is analogous to SPL commit zfsonlinux/spl@b9b3715.  While
we don't have clear evidence of systems getting caught here
indefinately like in the SPL this ensures that it will never
happen.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1579
2013-08-08 09:20:56 -07:00
Richard Yao
8170d28126 Return correct type and offset from zfs_readdir
zfs_readdir() is used by getdents(), which provides a list of all files
in directory, their types and an offset that be used by llseek() to seek
to the next directory entry.

On Solaris, the first two directory entries "." and ".." respectively
have offsets 1 and 2 on ZFS while the other files have rather large
numbers. Currently, ZFSOnLinux is  giving "." offset 0 and all other
entries large numbers. The first entry's next entry offset points to
itself, which causes software that uses llseek() in conjunction with
getdents() for filesystem navigation to enter an infinite loop.  The
offsets used for each directory entry are filesystem specific on all
platforms, so we can fix this by adopting the Solaris behavior.

Also, we currently report each directory entry as having type 0 (???).
This is not wrong, but we can do better. getdents() on Solaris does not
appear to provide this information, but it does on Linux and Mac OS X
do. ZFS provides easy access to type information in zfs_readdir(), so
this patch provides this as well.

Reported-by: Andrey <andrey@kudinov.su>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1624
2013-08-07 16:16:43 -07:00
George Wilson
c61f97f426 Illumos #3639 zpool.cache should skip over readonly pools
3639 zpool.cache should skip over readonly pools
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Basil Crow <basil.crow@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
  illumos/illumos-gate@fb02ae0252
  https://www.illumos.org/issues/3639

Normally we don't list pools that are imported read-only in the cache
file, however you can accidentally get one into the cache file by
importing and exporting a read-write pool while a read-only pool is
imported:

$ zpool import -o readonly test1
$ zpool import test2
$ zpool export test2
$ zdb -C

This is a problem because if the machine reboots we import all pools in
the cache file as read-write.

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-08-07 16:13:56 -07:00
Brian Behlendorf
78d7a5d780 Write dirty inodes on close
When the property atime=on is set operations which only access
and inode do cause an atime update.  However, it turns out that
dirty inodes with updated atimes are only written to disk when
the inodes get evicted from the cache.  Somewhat surprisingly
the source suggests that this isn't a ZoL specific issue.

This behavior may in part explain why zfs's reclaim logic has
been observed to be slow.  When reclaiming inodes its likely
that they have a dirty atime which will force a write to disk.

Obviously we don't want to force a write to disk for every
atime update, these needs to be batched.  The right way to
do this is to fully implement the .dirty_inode and .write_inode
callbacks.  However, to do that right requires proper unification
of some fields in the znode/inode.  Then we could just mark the
inode dirty and leave it to the VFS to call .write_inode
periodically.

Until that work gets done we have to settle for some middle
ground.  The simplest and safest thing we can do for now is
to write the dirty inode on last close.  This should prevent
the majority of inodes in the cache from having dirty atimes
and not drastically increase the number of writes.

Some rudimentally testing to show how long it takes to drop
500,000 inodes from the cache shows promising results.  This
is as expected because we're no longer do lots of IO as part
of the eviction, it was done earlier during the close.

w/out patch: ~30s to drop 500,000 inodes with drop_caches.
with patch:  ~3s to drop 500,000 inodes with drop_caches.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-08-07 16:11:19 -07:00
Brian Behlendorf
57b650b86f Export additional dmu symbols
The dmu_prefetch, dmu_free_long_range, dmu_free_object,
dmu_prealloc, dmu_write_policy, and dmu_sync symbols have
been exported so they may be used by other modules.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-08-01 09:48:07 -07:00
Nathaniel Clark
7d63721118 dmu_tx: Fix possible NULL pointer dereference
dmu_tx_hold_object_impl can return NULL on error.  Check for this
condition prior to dereferencing pointer.  This can only occur if
the passed object was invalid or unallocated.

Signed-off-by: Nathaniel Clark <Nathaniel.Clark@misrule.us>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1610
2013-08-01 09:48:07 -07:00
Richard Yao
cb543e6b5e Remove b_thawed from arc_buf_hdr_t
The code involving b_thawed appears to be dead, so lets discard it.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1614
2013-08-01 09:48:07 -07:00
Richard Yao
3f4058cd15 Remove arc_data_buf_alloc()/arc_data_buf_free()
These functions are used in neither Illumos nor ZFSOnLinux. They appear
to have been replaced by arc_buf_alloc()/arc_buf_free(), so lets remove
them.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1614
2013-08-01 09:48:07 -07:00
Richard Yao
4edbd2f79a Remove zio_alloc_arena
We declare zio_alloc_arena using extern, but it does not appear to exist
anywhere in the code. This permits undefined behavior, so lets remove
it.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1614
2013-08-01 09:48:06 -07:00
Brian Behlendorf
bce45ec9fb Make arc+l2arc module options writable
The l2arc module options can be made safely writable.  This allows
the options to be changed without unloading/loading the modules.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-07-30 15:40:20 -07:00
Brian Behlendorf
c93504f03a Change l2arc_norw default to zero
These days modern SSDs can efficiently service concurrent reads
and writes.  When this flag was added that wasn't really the
case for a variety of SSD controllers.  But now we can set the
default value to take advantage of this parallelism and only
disable this as needed for specific troublesome hardware.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-07-29 22:05:32 -07:00
Ying Zhu
6e1d7276c9 Fix inaccurate arcstat_l2_hdr_size calculations
Based on the comments in arc.c we know that buffers can exist both
in arc and l2arc, under this circumstance both arc_buf_hdr_t and
l2arc_buf_hdr_t will be allocated. However the current logic only
cares for memory that l2arc_buf_hdr takes up when the buffer's
state transfers from or to arc_l2c_only. This will cause obvious
deviations for illumos's zfs version since the sizeof(l2arc_buf_hdr)
is larger than ZOL's. We can implement the calcuation in the
following simple way:

1. When allocate a l2arc_buf_hdr_t we add its memory consumption
   instantly and subtract it when we free or evict the l2arc buf.
2. According to l2arc_hdr_stat_add and l2arc_hdr_stat_remove, if
   the buffer only stays in l2arc we should also add the memory
   its arc_buf_hdr_t consumes, so we only need to add HDR_SIZE to
   arcstat_l2_hdr_size since we already concerned with L2HDR_SIZE
   in step 1 and the same for transfering arc bufs from l2arc only
   state.

The testbox has 2 4-core Intel Xeon CPUs(2.13GHz), with 16GB memory
and tests were set upped in the following way:

1. Fdisked a SATA disk into two partitions, one partition for zpool
   storage and the other one was used as the cache device.
2. Generated some files occupying 14GB altogether in the zpool
   prepared in step 1 using iozone.
3. Read them all using md5sum and watched the l2arc related statistics
   in /proc/spl/kstat/zfs/arcstats. After the reading ended the
   l2_hdr_size and l2_size were shown like this:

      l2_size             4       4403780608
      l2_hdr_size         4       0

   which was weird.

4. After applying this patch and reran step 1-3, the results were
   as following:

      l2_size             4       4306443264
      l2_hdr_size         4       535600

   these numbers made sense, on 64-bit systems the
   sizeof(l2arc_buf_hdr_t) is 16 bytes.  Assue all blocks cached by
   l2arc are 128KB, so 535600/16*128*1024=4387635200, since not all
   blocks are equal-sized, the theoretical result will be a little
   bigger, as we can see.

Since I'm familiar with systemtap instrumentation tool I used it to
examine what had happened. The script looked like this:

probe module("zfs").function("arc_chage_state")
{
	if ($new_state == $arc_l2_only)
		printf("change arc buf to arc_l2_only\n")
}

It will print out some information each time we call funciton
arc_chage_state if the argument new_state is arc_l2_only.  I
gathered the trace logs and found that none of the arc bufs ran
into arc state arc_l2_only when the tests was running, this was
the reason why l2_hdr_size in step 3 was 0. The arc bufs fell into
arc_l2_only when the pool or the filesystem was offlined.

Signed-off-by: Ying Zhu <casualfisher@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-07-29 22:05:26 -07:00
Brian Behlendorf
dba1d70566 Fix arc_adapt() spinning in iterate_supers_type()
The iterate_supers_type() function which was introduced in the
3.0 kernel was supposed to provide a safe way to call an arbitrary
function on all super blocks of a specific type.  Unfortunately,
because a list_head was used a bug was introduced which made it
possible for iterate_supers_type() to get stuck spinning on a
super block which was just deactivated.

This can occur because when the list head is removed from the
fs_supers list it is reinitialized to point to itself.  If the
iterate_supers_type() function happened to be processing the
removed list_head it will get stuck spinning on that list_head.

The bug was fixed in the 3.3 kernel by converting the list_head
to an hlist_node.  However, to resolve the issue for existing
3.0 - 3.2 kernels we detect when a list_head is used.  Then to
prevent the spinning from occurring the .next pointer is set to
the fs_supers list_head which ensures the iterate_supers_type()
function will always terminate.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1045
Closes #861
Closes #790
2013-07-17 09:28:06 -07:00
Brian Behlendorf
c9ada6d5a0 Fix read-only pool hang on unmount
During mount a filesystem dataset would have the MS_RDONLY bit
incorrectly cleared even if the entire pool was read-only.
There is existing to code to handle this case but it was being run
before the property callbacks were registered.  To resolve the
issue we move this read-only code after the callback registration.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1338
2013-07-17 09:22:23 -07:00
Brian Behlendorf
76351672c2 Fix zfsctl_expire_snapshot() deadlock
It is possible for an automounted snapshot which is expiring to
deadlock with a manual unmount of the snapshot.  This can occur
because taskq_cancel_id() will block if the task is currently
executing until it completes.  But it will never complete because
zfsctl_unmount_snapshot() is holding the zsb->z_ctldir_lock which
zfsctl_expire_snapshot() must acquire.

---------------------- z_unmount/0:2153 ---------------------
  mutex_lock                <blocking on zsb->z_ctldir_lock>
  zfsctl_unmount_snapshot
  zfsctl_expire_snapshot
  taskq_thread

------------------------- zfs:10690 -------------------------
  taskq_wait_id             <waiting for z_unmount to exit>
  taskq_cancel_id
  __zfsctl_unmount_snapshot
  zfsctl_unmount_snapshot   <takes zsb->z_ctldir_lock>
  zfs_unmount_snap
  zfs_ioc_destroy_snaps_nvl
  zfsdev_ioctl
  do_vfs_ioctl

We resolve the deadlock by dropping the zsb->z_ctldir_lock before
calling __zfsctl_unmount_snapshot().  The lock is only there to
prevent concurrent modification to the zsb->z_ctldir_snaps AVL
tree.  Moreover, we're careful to remove the zfs_snapentry_t from
the AVL tree before dropping the lock which ensures no other tasks
can find it.  On failure it's added back to the tree.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Closes #1527
2013-07-12 10:06:53 -07:00
Brian Behlendorf
556011dbec Improve N-way mirror performance
The read bandwidth of an N-way mirror can by increased by 50%,
and the IOPs by 10%, by more carefully selecting the preferred
leaf vdev.

The existing algorthm selects a perferred leaf vdev based on
offset of the zio request modulo the number of members in the
mirror.  It assumes the drives are of equal performance and
that spreading the requests randomly over both drives will be
sufficient to saturate them.  In practice this results in the
leaf vdevs being under utilized.

Utilization can be improved by preferentially selecting the leaf
vdev with the least pending IO.  This prevents leaf vdevs from
being starved and compensates for performance differences between
disks in the mirror.  Faster vdevs will be sent more work and
the mirror performance will not be limitted by the slowest drive.

In the common case where all the pending queues are full and there
is no single least busy leaf vdev a batching stratagy is employed.
Of the N least busy vdevs one is selected with equal probability
to be the preferred vdev for T microseconds.  Compared to randomly
selecting a vdev to break the tie batching the requests greatly
improves the odds of merging the requests in the Linux elevator.

The testing results show a significant performance improvement
for all four workloads tested.  The workloads were generated
using the fio benchmark and are as follows.

1) 1MB sequential reads from 16 threads to 16 files (MB/s).
2) 4KB sequential reads from 16 threads to 16 files (MB/s).
3) 1MB random reads from 16 threads to 16 files (IOP/s).
4) 4KB random reads from 16 threads to 16 files (IOP/s).

               | Pristine              |  With 1461             |
               | Sequential  Random    |  Sequential  Random    |
               | 1MB  4KB    1MB  4KB  |  1MB  4KB    1MB  4KB  |
               | MB/s MB/s   IO/s IO/s |  MB/s MB/s   IO/s IO/s |
---------------+-----------------------+------------------------+
2 Striped      | 226  243     11  304  |  222  255     11  299  |
2 2-Way Mirror | 302  324     16  534  |  433  448     23  571  |
2 3-Way Mirror | 429  458     24  714  |  648  648     41  808  |
2 4-Way Mirror | 562  601     36  849  |  816  828     82  926  |

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1461
2013-07-11 13:53:50 -07:00
Prakash Surya
92334b14ec Add new kstat for monitoring time in dmu_tx_assign
This change adds a new kstat to gain some visibility into the amount of
time spent in each call to dmu_tx_assign. A histogram is exported via
a new dmu_tx_assign_histogram-$POOLNAME file. The information contained
in this histogram is the frequency dmu_tx_assign took to complete given
an interval range. For example, given the below histogram file:

    $ cat /proc/spl/kstat/zfs/dmu_tx_assign_histogram-tank
    12 1 0x01 32 1536 19792068076691 20516481514522
    name                            type data
    1 us                            4    859
    2 us                            4    252
    4 us                            4    171
    8 us                            4    2
    16 us                           4    0
    32 us                           4    2
    64 us                           4    0
    128 us                          4    0
    256 us                          4    0
    512 us                          4    0
    1024 us                         4    0
    2048 us                         4    0
    4096 us                         4    0
    8192 us                         4    0
    16384 us                        4    0
    32768 us                        4    1
    65536 us                        4    1
    131072 us                       4    1
    262144 us                       4    4
    524288 us                       4    0
    1048576 us                      4    0
    2097152 us                      4    0
    4194304 us                      4    0
    8388608 us                      4    0
    16777216 us                     4    0
    33554432 us                     4    0
    67108864 us                     4    0
    134217728 us                    4    0
    268435456 us                    4    0
    536870912 us                    4    0
    1073741824 us                   4    0
    2147483648 us                   4    0

one can see most calls to dmu_tx_assign completed in 32us or less, but a
few outliers did not. Specifically, 4 of the calls took between 262144us
and 131072us. This information is difficult, if not impossible, to gather
without this change.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1584
2013-07-11 13:53:44 -07:00
Brian Behlendorf
bf89c19914 Log pool suspension warnings to the console
In the event that a pool gets suspended log this information to
the console.  This is critical information and we want to make
sure it gets logged.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1555
2013-07-10 15:15:52 -07:00
Brian Behlendorf
abc41ac7c7 Use GFP_NOIO in vdev_disk_io_flush()
To avoid a potential deadlock when using a zvol as a swap
device prevent vdev_disk_io_flush() from performing IO during
the bio_alloc().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1508
2013-07-10 14:12:21 -07:00
Ying Zhu
b4f7f10527 Improve code in arc_buf_remove_ref
When we remove references of arc bufs in the arc_anon state we
needn't take its header's hash_lock, so postpone it to where we
really need it to avoid unnecessary invocations of function buf_hash.

Signed-off-by: Ying Zhu <casualfisher@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1557
2013-07-09 11:53:28 -07:00
Shen Yan
8e07b99b2f Update zio.c
The cv_wait_io is used to account io time instead of cv_wait.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1566
2013-07-09 10:41:46 -07:00
Brian Behlendorf
31455ab130 Add zfs_autoimport_disable tunable
There are times when it is desirable for zfs to not automatically
populate the spa namespace at module load time using the pools
in the /etc/zfs/zpool.cache file.  The zfs_autoimport_disable
module option has been added to control this behavior.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #330
2013-07-09 10:11:19 -07:00
Chris Dunlop
a1d9543a39 3.10 API change: block_device_operations->release() returns void
Linux kernel commit torvalds/linux@db2a144 changed the return type
of block_device_operations->release() to void.  Detect the expected
prototype and defined our callout accordingly.

Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1494
2013-07-08 15:41:57 -07:00
Brian Behlendorf
91604b298c Open pools asynchronously after module load
One of the side effects of calling zvol_create_minors() in
zvol_init() is that all pools listed in the cache file will
be opened.  Depending on the state and contents of your pool
this operation can take a considerable length of time.

Doing this at load time is undesirable because the kernel
is holding a global module lock.  This prevents other modules
from loading and can serialize an otherwise parallel boot
process.  Doing this after module inititialization also
reduces the chances of accidentally introducing a race
during module init.

To ensure that /dev/zvol/<pool>/<dataset> devices are
still automatically created after the module load completes
a udev rules has been added.  When udev notices that the
/dev/zfs device has been create the 'zpool list' command
will be run.  This then will cause all the pools listed
in the zpool.cache file to be opened.

Because this process in now driven asynchronously by udev
there is the risk of problems in downstream distributions.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #756
Issue #1020
Issue #1234
2013-07-03 09:24:38 -07:00
Richard Yao
2a3871d4bc Cleanup zvol initialization code
The following error will occur on some (possibly all) kernels
because blk_init_queue() will try to take the spinlock before
we initialize it.

  BUG: spinlock bad magic on CPU#0, zpool/4054
   lock: 0xffff88021a73de60, .magic: 00000000,
   .owner: <none>/-1, .owner_cpu: 0
  Pid: 4054, comm: zpool Not tainted 3.9.3 #11
  Call Trace:
   [<ffffffff81478ef8>] spin_dump+0x8c/0x91
   [<ffffffff81478f1e>] spin_bug+0x21/0x26
   [<ffffffff812da097>] do_raw_spin_lock+0x127/0x130
   [<ffffffff8147d851>] _raw_spin_lock_irq+0x21/0x30
   [<ffffffff812c2c1e>] cfq_init_queue+0x1fe/0x350
   [<ffffffff812aacb8>] elevator_init+0x78/0x140
   [<ffffffff812b2677>] blk_init_allocated_queue+0x87/0xb0
   [<ffffffff812b26d5>] blk_init_queue_node+0x35/0x70
   [<ffffffff812b271e>] blk_init_queue+0xe/0x10
   [<ffffffff8125211b>] __zvol_create_minor+0x24b/0x620
   [<ffffffff81253264>] zvol_create_minors_cb+0x24/0x30
   [<ffffffff811bd9ca>] dmu_objset_find_spa+0xea/0x510
   [<ffffffff811bda71>] dmu_objset_find_spa+0x191/0x510
   [<ffffffff81253ea2>] zvol_create_minors+0x92/0x180
   [<ffffffff811f8d80>] spa_open_common+0x250/0x380
   [<ffffffff811f8ece>] spa_open+0xe/0x10
   [<ffffffff8122817e>] pool_status_check.part.22+0x1e/0x80
   [<ffffffff81228a55>] zfsdev_ioctl+0x155/0x190
   [<ffffffff8116a695>] do_vfs_ioctl+0x325/0x5a0
   [<ffffffff8116a950>] sys_ioctl+0x40/0x80
   [<ffffffff814812c9>] ? do_page_fault+0x9/0x10
   [<ffffffff81483929>] system_call_fastpath+0x16/0x1b
   zd0: unknown partition table

We fix this by calling spin_lock_init before blk_init_queue.

The manner in which zvol_init() initializes structures is
suspectible to a race between initialization and a probe on
a zvol. We reorganize zvol_init() to prevent that.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-07-03 09:23:35 -07:00
Pawel Jakub Dawidek
526af78550 Call zvol_create_minors() in spa_open_common() when initializing pool
There is an extremely odd bug that causes zvols to fail to appear on
some systems, but not others. Recently, I was able to consistently
reproduce this issue over a period of 1 month. The issue disappeared
after I applied this change from FreeBSD.

This is from FreeBSD's pool version 28 import, which occurred in
revision 219089.

Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #441
Issue #599
2013-07-03 09:22:44 -07:00
George Wilson
294f68063b Illumos #3498 panic in arc_read()
3498 panic in arc_read(): !refcount_is_zero(&pbuf->b_hdr->b_refcnt)
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  illumos/illumos-gate@1b912ec710
  https://www.illumos.org/issues/3498

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1249
2013-07-02 13:34:31 -07:00
Matthew Ahrens
96b89346c0 Illumos #3122 zfs destroy filesystem should prefetch blocks
3122 zfs destroy filesystem should prefetch blocks
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  illumos/illumos-gate@b4709335aa
  https://www.illumos.org/issues/3122

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1565
2013-07-02 13:34:02 -07:00
Cyril Plisko
29dee3ee9a Add zfs_sync_pass_* tunable parameters
Commit 55d85d5a8c (backport of
the upstream changes) replaced three hardcoded constants:

    #define SYNC_PASS_DEFERRED_FREE 2 /* defer frees after this pass */
    #define SYNC_PASS_DONT_COMPRESS 4 /* don't compress after this pass */
    #define SYNC_PASS_REWRITE       1 /* rewrite new bps after this pass */

with a tunable parameters:

    int zfs_sync_pass_deferred_free = 2; /* defer frees starting in this pass */
    int zfs_sync_pass_dont_compress = 5; /* don't compress starting in this pass */
    int zfs_sync_pass_rewrite = 2;       /* rewrite new bps starting in this pass */

This commit makes these tunables available as module parameters
in Linux.  They should only be used for performance analysis
because changing them can result in subtle and pathological
performance problems.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1562
2013-07-02 09:34:18 -07:00
Li Dongyang
802e7b5feb Add SEEK_DATA/SEEK_HOLE to lseek()/llseek()
The approach taken was the rework zfs_holey() as little as
possible and then just wrap the code as needed to ensure
correct locking and error handling.

Tested with xfstests 285 and 286.  All tests pass except for
7-9 of 285 which try to reserve blocks first via fallocate(2)
and fail because fallocate(2) is not yet supported.

Note that the filp->f_lock spinlock did not exist prior to
Linux 2.6.30, but we avoid the need for autotools check by
virtue of the fact that SEEK_DATA/SEEK_HOLE support was not
added until Linux 3.1.

An autoconf check was added for lseek_execute() which is
currently a private function but the expectation is that it
will be exported perhaps as early as Linux 3.11.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1384
2013-07-02 09:24:43 -07:00
Matthew Ahrens
cf91b2b6b2 Readd zfs_holey() from OpenSolaris
This patch restores the zfs_holey() function from OpenSolaris.
This was removed by commit 3558fd7 because it wasn't clear we
had a use for it in ZoL.  However, this functionality is a
prerequisite for adding SEEK_DATA/SEEK_HOLE support to the ZPL.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #1384
2013-07-02 09:24:18 -07:00
shenyan1
0a6bef26ec kmem_zalloc(..., KM_SLEEP) will never fail
By definitition these allocations will never fail.  For
consistency with the rest of the code remove this dead error
handling code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1558
2013-07-01 14:51:48 -07:00
Tim Chase
ab68b6e5db Fix zfs_sb_teardown/zfs_resume_fs NULL dereference
Fix a pair of conditions in which a concurrent umount can cause
NULL pointer dereferences:

* zfs_sb_teardown - prevent a NULL dereference by not calling
                    dmu_objset_pool with a null z_os.

* zfs_resume_fs - don't try to unmount with a null z_os.  This
                  change makes the ZoL code more consistent
                  with both Illumos and FreeBSD.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1543
2013-07-01 14:51:45 -07:00
Ying Zhu
c12936b141 Fix module probe failure on 32-bit systems
Previous commit 7ef5e54e2e caused
module probe failure on 32-bit systems, dmesg showed

  Unknown symbol __moddi3

This was caused by the modulo operation 'gethrtime() % tqs->stqs_count'
in the committed code.  Instead of implementing __moddi3 for all 32-bit
systems, Behlendorf advised we can just cast the return value of
gethrtime() into a uint64_t, since gethrtime does not return negative
value on all circumstances we need not care about the potential overflow.

Signed-off-by: Ying Zhu <casualfisher@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1551
2013-06-27 10:01:25 -07:00
Brian Behlendorf
88c283952f Return -EOPNOTSUPP for ZFS_IOC_{GET|SET}FLAGS
Until these hooks are fully implemented return the expected
-EOPNOTSUPP error to indicate they are not functional.  This
allows test suites such as xfstests to cleanly skip testing
this functionality until it's implemented.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #229
2013-06-26 15:20:13 -07:00
Matthew Ahrens
df4474f92d Illumos #3805 arc shouldn't cache freed blocks
3805 arc shouldn't cache freed blocks
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Reviewed by: Will Andrews <will@firepipe.net>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  illumos/illumos-gate@6e6d5868f5
  https://www.illumos.org/issues/3805

ZFS should proactively evict freed blocks from the cache.

On dcenter, we saw that we were caching ~256GB of metadata, while the
pool only had <4GB of metadata on disk.  We were wasting about half the
system's RAM (252GB) on blocks that have been freed.

Even though these freed blocks will never be used again, and thus will
eventually be evicted, this causes us to use memory inefficiently for 2
reasons:

1. A block that is freed has no chance of being accessed again, but will
be kept in memory preferentially to a block that was accessed before it
(and is thus older) but has not been freed and thus has at least some
chance of being accessed again.

2. We partition the ARC into several buckets:
user data that has been accessed only once (MRU)
metadata that has been accessed only once (MRU)
user data that has been accessed more than once (MFU)
metadata that has been accessed more than once (MFU)

The user data vs metadata split is somewhat arbitrary, and the primary
control on how much memory is used to cache data vs metadata is to
simply try to keep the proportion the same as it has been in the past
(each bucket "evicts against" itself).  The secondary control is to
evict data before evicting metadata.

Because of this bucketing, we may end up with one bucket mostly
containing freed blocks that are very old, while another bucket has more
recently accessed, still-allocated blocks.  Data in the useful bucket
(with still-allocated blocks) may be evicted in preference to data in
the useless bucket (with old, freed blocks).

On dcenter, we saw that the MFU metadata bucket was 230MB, while the MFU
data bucket was 27GB and the MRU metadata bucket was 256GB.  However,
the vast majority of data in the MRU metadata bucket (256GB) was freed
blocks, and thus useless.  Meanwhile, the MFU metadata bucket (230MB)
was constantly evicting useful blocks that will be soon needed.

The problem of cache segmentation is a larger problem that needs more
investigation.  However, if we stop caching freed blocks, it should
reduce the impact of this more fundamental issue.

Ported-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1503
2013-06-20 09:55:52 -07:00
George Wilson
e51be06697 Illumos #3552, #3564
3552 condensing one space map burns 3 seconds of CPU in spa_sync() thread
3564 spa_sync() spends 5-10% of its time in metaslab_sync() (when not condensing)
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  illumos/illumos-gate@16a4a80742
  https://www.illumos.org/issues/3552
  https://www.illumos.org/issues/3564

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1513
2013-06-19 16:22:39 -07:00
Madhav Suresh
c99c90015e Illumos #3006
3006 VERIFY[S,U,P] and ASSERT[S,U,P] frequently check if first
     argument is zero

Reviewed by Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by George Wilson <george.wilson@delphix.com>
Approved by Eric Schrock <eric.schrock@delphix.com>

References:
  illumos/illumos-gate@fb09f5aad4
  https://illumos.org/issues/3006

Requires:
  zfsonlinux/spl@1c6d149feb

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1509
2013-06-19 15:14:10 -07:00
Brian Behlendorf
0377189b88 Only check directory xattr on ENOENT
When SA xattrs are enabled only fallback to checking the directory
xattrs when the name is not found as a SA xattr.  Otherwise, the SA
error which should be returned to the caller is overwritten by the
directory xattr errors.  Positive return values indicating success
will also be immediately returned.

In the case of #1437 the ERANGE error was being correctly returned
by zpl_xattr_get_sa() only to be overridden with ENOENT which was
returned by the subsequent unnessisary call to zpl_xattr_get_dir().

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1437
2013-05-10 12:24:56 -07:00