Commit Graph

559 Commits

Author SHA1 Message Date
Yuri Pankov
7011fb6004 Illumos #3517
3517 importing pool with autoreplace=on and "hole" vdevs crashes syseventd
Reviewed by: Albert Lee <trisk@nexenta.com>
Reviewed by: Jeffry Molanus <jeffry.molanus@nexenta.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://www.illumos.org/issues/3517
  illumos/illumos-gate@efb4a871d8

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-10-31 14:57:59 -07:00
Matthew Ahrens
d1fada1e6d Illumos #3603, #3604: bpobj improvements
3603 panic from bpobj_enqueue_subobj()
3604 zdb should print bpobjs more verbosely
3871 GCC 4.5.3 does not like issue 3604 patch
Reviewed by: Henrik Mattson <henrik.mattson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/3603
  https://www.illumos.org/issues/3604
  https://www.illumos.org/issues/3871
  illumos/illumos-gate@d04756377d

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Note that the patch from Illumos issue 3871 is not accepted into Illumos
at the time of this writing. It is something that I wrote when porting
this. Documentation is in the Illumos issue.
2013-10-31 14:57:51 -07:00
Matthew Ahrens
24a64651b4 Illumos #3588
3588 provide zfs properties for logical (uncompressed) space
     used and referenced
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/3588
  illumos/illumos-gate@77372cb0f3

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-10-31 10:16:11 -07:00
George Wilson
c2e42f9d53 Illumos #3578, #3579
3578 transferring the freed map to the defer map should be constant time
3579 ztest trips assertion in metaslab_weight()
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Elling <richard.elling@dey-sys.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/3578
  https://www.illumos.org/issues/3579
  illumos/illumos-gate@9eb57f7f3f

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-10-31 09:23:40 -07:00
George Wilson
23c0a1333c Illumos #3561, #3116
3561 arc_meta_limit should be exposed via kstats
3116 zpool reguid may log negative guids to internal SPA history
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/3561
  https://www.illumos.org/issues/3116
  illumos/illumos-gate@20128a0826

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting Notes:

1. The spa change was accidentally included in the libzfs_core merge.

2. "Add missing arcstats" (1834f2d8b7)
   already implemented these kstats a few years ago.
2013-10-31 09:23:40 -07:00
Matthew Ahrens
330847ff36 Illumos #3537
3537 want pool io kstats

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Sašo Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
  http://www.illumos.org/issues/3537
  illumos/illumos-gate@c3a6601

Ported-by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting Notes:

1. The patch was restructured to take advantage of the existing
   spa statistics infrastructure.  To accomplish this the kstat
   was moved in to spa->io_stats and the init/destroy code moved
   to spa_stats.c.

2. The I/O kstat was simply named <pool>, which conflicted with the
   pool directory we had already created.  Therefore it was renamed
   to <pool>/io.

3. An update handler was added to allow the kstat to be zeroed.
2013-10-31 09:16:03 -07:00
George Wilson
a117a6d66e Illumos #3522
3522 zfs module should not allow uninitialized variables
Reviewed by: Sebastien Roy <seb@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/3522
  illumos/illumos-gate@d5285cae91

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting notes:

1. ZFSOnLinux had already addressed many of these issues because of
   its use of -Wall. However, the manner in which they were addressed
   differed. The illumos fixes replace the ones previously made in
   ZFSOnLinux to reduce code differences.

2. Part of the upstream patch made a small change to arc.c that might
   address zfsonlinux/zfs#1334.

3. The initialization of aclsize in zfs_log_create() differs because
   vsecp is a NULL pointer on ZFSOnLinux.

4. The changes to zfs_register_callbacks() were dropped because it
   has diverged and needs to be resynced.
2013-10-30 14:51:27 -07:00
Richard Yao
495b25a91a Add missing code to zfs_debug.{c,h}
This is required to make Illumos 3962 merge.

Signed-off-by: Richard Yao <ryao@gentoo.org>
2013-10-29 15:06:18 -07:00
Richard Yao
20f04f08aa Fix incorrect usage of strdup() in zfs_unmount_snap()
Modifying the length of a string returned by strdup() is incorrect
because strfree() is allowed to use strlen() to determine which slab
cache was used to do the allocation.
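
A toy model of the hazard, as a minimal sketch. The size classes and the
ToyAllocator class are hypothetical; they only imitate an allocator whose
free routine derives the allocation size from the string's current length
and are not SPL code.

```python
# Toy model (hypothetical, not SPL code): the allocator picks a size class
# at allocation time and, on free, re-derives the class from the string's
# current length, as a strlen()-based strfree() would.
SIZE_CLASSES = (8, 16, 32, 64, 128)


def size_class(nbytes):
    """Return the smallest size class that holds nbytes."""
    for cls in SIZE_CLASSES:
        if nbytes <= cls:
            return cls
    raise ValueError("allocation too large for this toy model")


class ToyAllocator:
    def __init__(self):
        self.live = {}  # buffer id -> size class used at allocation

    def strdup(self, s):
        buf = bytearray(s.encode() + b"\0")
        self.live[id(buf)] = size_class(len(buf))
        return buf

    def strfree(self, buf):
        # The size is recomputed from the contents, like strlen() + 1.
        derived = size_class(buf.index(0) + 1)
        actual = self.live.pop(id(buf))
        return derived == actual  # False: the wrong cache would be used


alloc = ToyAllocator()
name = alloc.strdup("pool/fs@snapshot-2013-10-29")
name[name.index(b"@")] = 0  # shorten the string in place
freed_correctly = alloc.strfree(name)
```

In this model the shortened string maps to a smaller size class than the
one it was allocated from, so the free is misdirected; an unmodified
string frees correctly.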

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-10-29 15:06:18 -07:00
Richard Yao
8c8417933f Fix order of function calls in zio_free_sync()
The resolution of a merge conflict when merging Illumos #3464 caused us
to invert the order of a couple of function calls in zio_free_sync()
relative to their order in Illumos.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-10-29 15:06:18 -07:00
Richard Yao
9cac042cfe Reintroduce uio_prefaultpages()
This was accidentally removed by overzealous commenting.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-10-29 15:06:18 -07:00
Massimo Maggi
023699cd62 Posix ACL Support
This change adds support for Posix ACLs by storing them as an xattr
which is common practice for many Linux file systems.  Since the
Posix ACL is stored as an xattr it will not overwrite any existing
ZFS/NFSv4 ACLs which may have been set.  The Posix ACL will also
be non-functional on other platforms although it may be visible
as an xattr if that platform understands SA based xattrs.

By default Posix ACLs are disabled but they may be enabled with
the new 'acltype=noacl|posixacl' property.  Set the property to
'posixacl' to enable them.  If ZFS/NFSv4 ACL support is ever added
an appropriate acltype will be added.

This change passes the POSIX Test Suite cleanly with the exception
of xacl/00.t test 45 which is incorrect for Linux (Ext4 fails too).

  http://www.tuxera.com/community/posix-test-suite/

Signed-off-by: Massimo Maggi <me@massimo-maggi.eu>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #170
2013-10-29 14:54:26 -07:00
Brian Behlendorf
fc9e0530c9 Prevent xattr remove from creating xattr directory
Attempting to remove an xattr from a file which does not contain
any directory based xattrs would result in the xattr directory
being created.  This behavior is non-optimal because it results
in write operations to the pool in addition to the expected error
being returned.

To prevent this the CREATE_XATTR_DIR flag is only passed in
zpl_xattr_set_dir() when setting a non-NULL xattr value.  In
addition, zpl_xattr_set() is updated similarly such that it will
return immediately if passed an xattr name which doesn't exist
and a NULL value.

Signed-off-by: Massimo Maggi <me@massimo-maggi.eu>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #170
2013-10-29 13:23:53 -07:00
Richard Yao
c12e3a594a Restructure zfs_readdir() to fix regressions
This does the following:

1. It creates a uint8_t type value, which is initialized to DT_DIR on
dot directories and ZFS_DIRENT_TYPE(zap.za_first_integer) otherwise.
This resolves a regression where we return uninitialized values as the
directory entry type on dot directories. This was accidentally
introduced by commit 8170d28126.

2. It restructures zfs_readdir() code to use `uint64_t offset` like
Illumos instead of `loff_t *pos`. This resolves a regression where
negative ZAP cursors were treated as if they were dot directories.

3. It restructures the function to more closely match the structure of
zfs_readdir() on Illumos and removes the unused variable outcount, which
was only used on Illumos.
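
The signedness problem in item 2 can be illustrated with a small sketch.
The threshold DOT_ENTRY_LIMIT is a hypothetical stand-in for "offsets
reserved for dot entries"; the actual check in zfs_readdir() may differ.

```python
# Sketch: why a signed offset can misclassify a ZAP cursor.
# DOT_ENTRY_LIMIT is a hypothetical stand-in, not taken from the source.
DOT_ENTRY_LIMIT = 3
MASK64 = (1 << 64) - 1


def as_signed64(value):
    """Reinterpret a 64-bit unsigned value as signed, like loff_t."""
    value &= MASK64
    return value - (1 << 64) if value >= (1 << 63) else value


def is_dot_entry_signed(cursor):
    # Signed comparison: any cursor with the high bit set looks negative.
    return as_signed64(cursor) < DOT_ENTRY_LIMIT


def is_dot_entry_unsigned(cursor):
    # Unsigned comparison, as with a uint64_t offset.
    return (cursor & MASK64) < DOT_ENTRY_LIMIT


serialized_cursor = 0x8000_0000_0000_1234  # ZAP cursor with the high bit set
```

With the signed interpretation the cursor above is classified as a dot
entry; with the unsigned interpretation it is not.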

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1750
2013-10-29 09:51:59 -07:00
Brian Behlendorf
e0b0ca983d Add visibility in to cached dbufs
Currently there is no mechanism to inspect which dbufs are being
cached by the system.  There are some coarse counters in arcstats
but they only give a rough idea of what's being cached.  This patch
aims to improve the current situation by adding a new dbufs kstat.

When read this new kstat will walk all cached dbufs linked in to
the dbuf_hash.  For each dbuf it will dump detailed information
about the buffer.  It will also dump additional information about
the referenced arc buffer and its related dnode.  This provides a
more complete view in to exactly what is being cached.

With this generic infrastructure in place utilities can be written
to post-process the data to understand exactly how the caching is
working.  For example, the data could be processed to show a list
of all cached dnodes and how much space they're consuming.  Or a
similar list could be generated based on dnode type.  Many other
ways to interpret the data exist based on what kinds of questions
you're trying to answer.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
2013-10-25 13:59:40 -07:00
Brian Behlendorf
2d37239a28 Add visibility in to dmu_tx_assign times
This change adds a new kstat to gain some visibility into the
amount of time spent in each call to dmu_tx_assign. A histogram
is exported via the new dmu_tx_assign file. The histogram records
how many dmu_tx_assign calls completed within each time interval.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-10-25 13:57:25 -07:00
Brian Behlendorf
0b1401ee91 Add visibility in to txg sync behavior
This change is an attempt to add visibility in to how txgs are being
formed on a system, in real time. To do this, a list was added to the
in memory SPA data structure for a pool, with each element on the list
corresponding to a txg. These entries are then exported through the kstat
interface, which can then be interpreted in userspace.

For each txg, the following information is exported:

 * Unique txg number (uint64_t)
 * The time the txg was born (hrtime_t)
   (*not* wall clock time; relative to the other entries on the list)
 * The current txg state ((O)pen/(Q)uiescing/(S)yncing/(C)ommitted)
 * The number of reserved bytes for the txg (uint64_t)
 * The number of bytes read during the txg (uint64_t)
 * The number of bytes written during the txg (uint64_t)
 * The number of read operations during the txg (uint64_t)
 * The number of write operations during the txg (uint64_t)
 * The time the txg was closed (hrtime_t)
 * The time the txg was quiesced (hrtime_t)
 * The time the txg was synced (hrtime_t)

Note that the raw kstat now stores relative hrtimes for the open,
quiesce, and sync times.  Those relative times are used to calculate
how long each state took, and these deltas are printed by the output
handlers.
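
As a rough sketch of the delta calculation, assuming each state's
duration is the difference between consecutive timestamps. The function
name and the sample values are made up for illustration and are not taken
from the kstat code.

```python
# Sketch with made-up timestamps (nanoseconds, relative to each other).
# txg_state_durations() is a hypothetical helper, not part of ZFS.
def txg_state_durations(born, closed, quiesced, synced):
    """Return how long a txg spent in each state."""
    return {
        "open": closed - born,
        "quiescing": quiesced - closed,
        "syncing": synced - quiesced,
    }


durations = txg_state_durations(
    born=1_000,
    closed=5_001_000,
    quiesced=5_003_000,
    synced=5_250_000,
)
```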

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-10-25 13:57:25 -07:00
Prakash Surya
1421c89142 Add visibility in to arc_read
This change is an attempt to add visibility into the arc_read calls
occurring on a system, in real time. To do this, a list was added to the
in memory SPA data structure for a pool, with each element on the list
corresponding to a call to arc_read. These entries are then exported
through the kstat interface, which can then be interpreted in userspace.

For each arc_read call, the following information is exported:

 * A unique identifier (uint64_t)
 * The time the entry was added to the list (hrtime_t)
   (*not* wall clock time; relative to the other entries on the list)
 * The objset ID (uint64_t)
 * The object number (uint64_t)
 * The indirection level (uint64_t)
 * The block ID (uint64_t)
 * The name of the function originating the arc_read call (char[24])
 * The arc_flags from the arc_read call (uint32_t)
 * The PID of the reading thread (pid_t)
 * The command or name of thread originating read (char[16])

From this exported information one can see, in real time, exactly what
is being read, what function is generating the read, and whether or not
the read was found to be already cached.

There is still some work to be done, but this should serve as a good
starting point.

Specifically, dbuf_read's are not accounted for in the currently
exported information. Thus, a follow up patch should probably be added
to export these calls that never call into arc_read (they only hit the
dbuf hash table). In addition, it might be nice to create a utility
similar to "arcstat.py" to digest the exported information and display
it in a more readable format. Or perhaps, log the information and allow
for it to be "replayed" at a later time.

Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-10-25 13:57:25 -07:00
Brian Behlendorf
76463d4026 Revert "Add txgs-<pool> kstat file"
This reverts commit e95853a331.
2013-10-25 13:57:25 -07:00
Brian Behlendorf
98ab38d109 Revert "Add new kstat for monitoring time in dmu_tx_assign"
This reverts commit 92334b14ec.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-10-25 13:57:25 -07:00
Richard Yao
b3c49d3df8 Linux 3.11 compat: Rename LZ4 symbols
Linus Torvalds merged LZ4 into Linux 3.11. This causes a conflict
whenever CONFIG_LZ4_DECOMPRESS=y or CONFIG_LZ4_COMPRESS=y are set in the
kernel's .config. We rename the symbols to avoid the conflict.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1789
2013-10-22 10:12:39 -07:00
Tim Chase
fbcb768c8f Add missing dsl pool configuration lock
The semantics introduced by the restructured sync task of illumos
3464 require this lock when calling dmu_snapshot_list_next().
The pool is locked/unlocked for each iteration to reduce the
chance of long-running locks.

This was accidentally missed when doing the original port because
ZoL's control directory code is Linux-specific and is in a
different file than in illumos.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1785
2013-10-22 08:31:20 -07:00
George Wilson
7a61440761 Illumos #3552
3552 condensing one space map burns 3 seconds of CPU in spa_sync()
     thread (fix race condition)

References:
  https://www.illumos.org/issues/3552
  illumos/illumos-gate@03f8c36688

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting notes:

This fixes an upstream regression that was introduced in commit
zfsonlinux/zfs@e51be06697, which
ported the Illumos 3552 changes. This fix was added to upstream
rather quickly, but at the time of the port, no one spotted it and
the race was rare enough that it passed our regression tests. I
discovered this when comparing our metaslab.c to the illumos
metaslab.c.

Without this change it is possible for metaslab_group_alloc() to
consume a large amount of cpu time.  Since this occurs under a
mutex in a rcu critical section the kernel will log this to the
console as a self-detected cpu stall as follows:

  INFO: rcu_sched self-detected stall on CPU { 0}
  (t=60000 jiffies g=11431890 c=11431889 q=18271)

Closes #1687
Closes #1720
Closes #1731
Closes #1747
2013-10-18 14:34:01 -07:00
Ned Bass
40a806df25 Export symbols dsl_pool_config_{enter,exit}
These are needed by consumers (i.e. Lustre) who wish to use the
dsl_prop_register() interface to register callbacks when pool
properties of interest change.  This interface requires that the
DSL pool configuration lock is held when called.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1762
2013-10-10 16:56:51 -07:00
Brian Behlendorf
222b948059 Fix memory leak false positive in log_internal()
When building the spl with --enable-debug-kmem-tracking a memory
leak is detected in log_internal().  This happens to be a false
positive because the memory was freed using strfree() instead of
kmem_free().  All kmem_alloc()'s must be released with kmem_free()
to ensure correct accounting.

  SPL: kmem leaked 135/5641311 bytes
  address          size  data             func:line
  ffff8800cba7cd80 135   ZZZZZZZZZZZZZZZZ log_internal:456

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-10-09 09:16:36 -07:00
Brian Behlendorf
36342b13d9 Export additional dsl_prop_* symbols
The recent sync task restructuring in 13fe019 introduced several
new symbols which should be exported for use by consumers such
as Lustre.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-09-25 15:44:22 -07:00
Tim Chase
8769db3966 Allocate the ioctl "output" nvlist with KM_PUSHPAGE.
Some ZFS errors such as certain snapshot failures can occur in
the sync task context.  Because they may require additional memory
allocations, the initial nvlist must be allocated with KM_PUSHPAGE.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1746
Issue #1737
2013-09-25 15:44:22 -07:00
Tim Chase
c5322236ec Fix several new KM_SLEEP warnings
A handful of allocations now occur in the sync path and need
to use KM_PUSHPAGE.  These were introduced by commit 13fe019.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1746
Issue #1737
2013-09-25 15:44:22 -07:00
Brian Behlendorf
cbfa294de4 Fix spa_deadman() TQ_SLEEP warning
The spa_deadman() and spa_sync() functions can both be run in the
spa_sync context and therefore should use TQ_PUSHPAGE instead of
TQ_SLEEP.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1734
Closes #1749
2013-09-25 15:38:44 -07:00
GregorKopka
f9f3f1ef98 Removing unneeded mutex for reading vq_pending_tree size
Locking mutex &vq->vq_lock in vdev_mirror_pending is unneeded:

* no data is modified
* only vq_pending_tree is read
* in case garbage is returned (e.g. vq_pending_tree being updated
  while the read is made) the worst case would be that a single
  read could be queued on a mirror side which is busier than thought

The benefit of this change is streamlining of the code path since
it is taken for *every* mirror member on *every* read.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1739
2013-09-25 15:29:45 -07:00
Kohsuke Kawaguchi
77831e1738 Reduce the stack usage of dsl_dataset_remove_clones_key
dsl_dataset_remove_clones_key() recurses, so a deep recursion can
overrun the Linux kernel stack size of 8KB. I have seen this happen in
an actual deployment, and subsequently confirmed it by running a test
workload on a custom-built kernel that uses a 32KB stack.

The following stack trace is an example of a case that would have
overrun an 8KB kernel stack:

        Depth    Size   Location    (42 entries)
        -----    ----   --------
  0)    11192      72   __kmalloc+0x2e/0x240
  1)    11120     144   kmem_alloc_debug+0x20e/0x500
  2)    10976      72   dbuf_hold_impl+0x4a/0xa0
  3)    10904     120   dbuf_prefetch+0xd3/0x280
  4)    10784      80   dmu_zfetch_dofetch.isra.5+0x10f/0x180
  5)    10704     240   dmu_zfetch+0x5f7/0x10e0
  6)    10464     168   dbuf_read+0x71e/0x8f0
  7)    10296     104   dnode_hold_impl+0x1ee/0x620
  8)    10192      16   dnode_hold+0x19/0x20
  9)    10176      88   dmu_buf_hold+0x42/0x1b0
 10)    10088     144   zap_lockdir+0x48/0x730
 11)     9944     128   zap_cursor_retrieve+0x1c4/0x2f0
 12)     9816     392   dsl_dataset_remove_clones_key.isra.14+0xab/0x190
 13)     9424     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 14)     9032     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 15)     8640     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 16)     8248     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 17)     7856     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 18)     7464     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 19)     7072     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 20)     6680     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 21)     6288     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 22)     5896     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 23)     5504     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 24)     5112     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 25)     4720     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 26)     4328     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 27)     3936     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 28)     3544     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 29)     3152     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 30)     2760     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 31)     2368     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 32)     1976     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 33)     1584     392   dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
 34)     1192     232   dsl_dataset_destroy_sync+0x311/0xf60
 35)      960      72   dsl_sync_task_group_sync+0x12f/0x230
 36)      888     168   dsl_pool_sync+0x48b/0x5c0
 37)      720     184   spa_sync+0x417/0xb00
 38)      536     184   txg_sync_thread+0x325/0x5b0
 39)      352      48   thread_generic_wrapper+0x7a/0x90
 40)      304     128   kthread+0xc0/0xd0
 41)      176     176   ret_from_fork+0x7c/0xb0

This change reduces the stack usage in dsl_dataset_remove_clones_key
by allocating structures on the heap, not on the stack.  This is not a
fundamental fix, as one can create an arbitrarily large data set that
overruns any fixed size stack, but it makes the problem far less likely.
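
The numbers in the trace can be checked with a little arithmetic. This is
only a sketch: the frame size, frame count, and total depth are read off
the trace above, while the 64-byte frame in the last line is a
hypothetical value chosen for illustration, not the measured size after
this change.

```python
# Figures read from the stack trace above.
FRAME_BYTES = 392          # size of each dsl_dataset_remove_clones_key frame
RECURSIVE_FRAMES = 22      # entries 12 through 33
TOTAL_DEPTH = 11192        # depth at entry 0
STACK_LIMIT = 8 * 1024     # 8KB kernel stack

recursion_bytes = FRAME_BYTES * RECURSIVE_FRAMES
other_bytes = TOTAL_DEPTH - recursion_bytes


def max_frames(frame_bytes, other, limit=STACK_LIMIT):
    """Recursion depth that fits once the non-recursive frames are counted."""
    return max(0, (limit - other) // frame_bytes)


frames_before = max_frames(FRAME_BYTES, other_bytes)
# Hypothetical smaller frame, only to show how the limit scales.
frames_with_small_frame = max_frames(64, other_bytes)
```

The recursive frames alone account for 8624 bytes, which already exceeds
the 8192-byte stack before any other frame is counted.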

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Kohsuke Kawaguchi <kk@kohsuke.org>
Closes #1726
2013-09-25 15:18:32 -07:00
Brian Behlendorf
34d5a5fd03 Fix zpl_mknod() return values
The zpl_mknod() function was incorrectly negating its return value.
This doesn't cause any problems in the success case, but it does
prevent us from returning the correct error code for a failure.
The implementation of this function is now consistent with all
the other zpl_* functions.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1717
2013-09-13 13:31:24 -07:00
Brian Behlendorf
17897ce2c8 Fix uninitialized variables
When compiling on an ARM device using gcc 4.7.3 several variables
in the zfs_obj_to_path_impl() function were flagged as uninitialized.
To resolve the warnings explicitly initialize them to zero.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1716
2013-09-13 13:31:24 -07:00
Tim Chase
4cf652e5d4 Fix dmu_objset_find_dp() KM_SLEEP warning
After the restructuring in 13fe019 the 'zfs rename' command will
result in a KM_SLEEP allocation in the sync context.  This may
deadlock due to reclaim so it was changed to KM_PUSHPAGE.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1711
2013-09-11 11:49:32 -07:00
Matthew Ahrens
13fe019870 Illumos #3464
3464 zfs synctask code needs restructuring
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/3464
  illumos/illumos-gate@3b2aab1880

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1495
2013-09-04 16:01:24 -07:00
Matthew Ahrens
6f1ffb0665 Illumos #2882, #2883, #2900
2882 implement libzfs_core
2883 changing "canmount" property to "on" should not always remount dataset
2900 "zfs snapshot" should be able to create multiple, arbitrary snapshots at once

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Chris Siden <christopher.siden@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Dan Kruchinin <dan.kruchinin@gmail.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>

References:
  https://www.illumos.org/issues/2882
  https://www.illumos.org/issues/2883
  https://www.illumos.org/issues/2900
  illumos/illumos-gate@4445fffbbb

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1293

Porting notes:

WARNING: This patch changes the user/kernel ABI.  That means that
the zfs/zpool utilities built from master are NOT compatible with
the 0.6.2 kernel modules.  Ensure you load the matching kernel
modules from master after updating the utilities.  Otherwise the
zfs/zpool commands will be unable to interact with your pool and
you will see errors similar to the following:

  $ zpool list
  failed to read pool configuration: bad address
  no pools available

  $ zfs list
  no datasets available

Add zvol minor device creation to the new zfs_snapshot_nvl function.

Remove the logging of the "release" operation in
dsl_dataset_user_release_sync().  The logging caused a null dereference
because ds->ds_dir is zeroed in dsl_dataset_destroy_sync() and the
logging functions try to get the ds name via the dsl_dataset_name()
function. I've got no idea why this particular code would have worked
in Illumos.  This code has subsequently been completely reworked in
Illumos commit 3b2aab1 (3464 zfs synctask code needs restructuring).

Squash some "may be used uninitialized" warnings/errors.

Fix some printf format warnings for %lld and %llu.

Apply a few spa_writeable() changes that were made to Illumos in
illumos/illumos-gate.git@cd1c8b8 as part of the 3112, 3113, 3114 and
3115 fixes.

Add a missing call to fnvlist_free(nvl) in log_internal() that was added
in Illumos to fix issue 3085 but couldn't be ported to ZoL at the time
(zfsonlinux/zfs@9e11c73) because it depended on future work.
2013-09-04 15:49:00 -07:00
Brian Behlendorf
6a7c0ccca4 Use directory xattrs for symlinks
There is currently a subtle bug in the SA implementation which
prevents us from safely using multiple variable length SAs in
one object.

Fortunately, the only existing use case for this is symlinks with
SA based xattrs.  Therefore, until the root cause in the SA code
can be identified and fixed we prevent adding SA xattrs to symlinks.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1468
2013-08-22 13:30:44 -07:00
Brian Behlendorf
c273d60d80 Revert "Evict meta data from ghost lists + l2arc headers"
This reverts commit fadd0c4da1 which
introduced a regression in honoring the meta limit.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Close #1660
2013-08-22 12:15:37 -07:00
Richard Yao
0f37d0c8be Linux 3.11 compat: fops->iterate()
Commit torvalds/linux@2233f31aad
replaced ->readdir() with ->iterate() in struct file_operations.
All filesystems must now use the new ->iterate method.

To handle this the code was reworked to use the new ->iterate
interface.  Care was taken to keep the majority of changes
confined to the ZPL layer which is already Linux specific.
However, minor changes were required to the common zfs_readdir()
function.

Compatibility with older kernels was accomplished by adding
versions of the trivial dir_emit* helper functions.  Also the
various *_readdir() functions were reworked in to wrappers
which create a dir_context structure to pass to the new
*_iterate() functions.

Unfortunately, the new dir_emit* functions prevent us from
passing a private pointer to the filldir function.  The xattr
directory code leveraged this ability through zfs_readdir()
to generate the list of xattr names.  Since we can no longer
use zfs_readdir() a simplified zpl_xattr_readdir() function
was added to perform the same task.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1653
Issue #1591
2013-08-15 16:19:07 -07:00
Brian Behlendorf
34e143323e Fix z_wr_iss_h zio_execute() import hang
Because we need to be more frugal about our stack usage under
Linux, the __zio_execute() function was modified to re-dispatch
zios to a ZIO_TASKQ_ISSUE thread when we're in a context which
is known to be stack heavy.  Those two contexts are the sync
thread and whatever thread is performing spa initialization.

Unfortunately, this change introduced an unlikely bug which can
result in a zio being re-dispatched indefinitely and never being
executed.  If during spa initialization we handle a zio with
ZIO_PRIORITY_NOW it will be moved to the high priority queue.
When __zio_execute() is called again for the zio it will
misinterpret the context and re-dispatch it again.  The system
will get stuck spinning re-dispatching the zio and making no
forward progress.

To fix this rare issue __zio_execute() has been updated not
to re-dispatch zios on either the ZIO_TASKQ_ISSUE or
ZIO_TASKQ_ISSUE_HIGH task queues.

In practice this issue was rarely reported and can usually
be fixed by rebooting the system and importing the pool again.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1455
2013-08-15 15:20:36 -07:00
Matthew Ahrens
cb682a173a Illumos #3618 ::zio dcmd does not show timestamp data
3618 ::zio dcmd does not show timestamp data
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  http://www.illumos.org/issues/3618
  illumos/illumos-gate@c55e05cb35

Notes on porting to ZFS on Linux:

The original changeset mostly deals with mdb ::zio dcmd.
However, in order to provide the requested functionality
it modifies vdev and zio structures to keep the timing data
in nanoseconds instead of ticks. It is these changes that
are ported over in the commit at hand.

One visible change of this commit is that the default value
of 'zfs_vdev_time_shift' tunable is changed:

    zfs_vdev_time_shift = 6
        to
    zfs_vdev_time_shift = 29

The original value of 6 was inherited from OpenSolaris and
was suboptimal - since it shifted the raw tick value - it
didn't compensate for different tick frequencies on Linux and
OpenSolaris. The former has HZ=1000, while the latter HZ=100.

(Which itself led to other interesting performance anomalies
under non-trivial load. The deadline scheduler delays the IO
according to its priority - the lower the priority the further
the deadline is set. The delay is measured in units of
"shifted ticks". Since the HZ value was 10 times higher,
the delay units were 10 times shorter. Thus really low
priority IO like resilver (delay is 10 units) and scrub
(delay is 20 units) were scheduled much sooner than intended.
The overall effect is that resilver and scrub IO consumed
more bandwidth at the expense of the other IO.)

Now that the bookkeeping is done in nanoseconds the shift
behaves correctly for any tick frequency (HZ).
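
The arithmetic can be checked with a short sketch.  It assumes
that one deadline unit is 2^shift of the underlying time base
(ticks before this change, nanoseconds after); the helper names
are hypothetical and do not appear in the patch.

```c
#include <stdint.h>

#define	SKETCH_NANOS_PER_SECOND	1000000000ULL

/* Length of one deadline unit, in ns, when raw ticks are shifted. */
uint64_t
sketch_unit_ns_from_ticks(unsigned int shift, uint64_t hz)
{
	return ((1ULL << shift) * (SKETCH_NANOS_PER_SECOND / hz));
}

/* Length of one deadline unit, in ns, when nanoseconds are shifted. */
uint64_t
sketch_unit_ns_from_nanos(unsigned int shift)
{
	return (1ULL << shift);
}
```

With a shift of 6 this gives 640 ms per unit at HZ=100 but only
64 ms at HZ=1000.  With a shift of 29 on nanoseconds the unit is
536870912 ns, roughly 537 ms, regardless of HZ.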

Ported-by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1643
2013-08-12 16:46:50 -07:00
Richard Yao
570d6edf1d Linux 3.8 compat: Support CONFIG_UIDGID_STRICT_TYPE_CHECKS
When CONFIG_UIDGID_STRICT_TYPE_CHECKS is enabled uid_t/gid_t are
replaced by kuid_t/kgid_t, which are structures instead of integral
types. This causes any code that uses an integral type to fail to build.
The User Namespace functionality introduced in Linux 3.8 requires
CONFIG_UIDGID_STRICT_TYPE_CHECKS, so we could not build against any
kernel that supported it.

We resolve this by converting between the new kuid_t/kgid_t structures
and the original uid_t/gid_t types.
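
The idea can be illustrated in userspace with a minimal sketch.  The type
and function names below are hypothetical and are not the kernel helpers
used by the patch; they only show why a structure wrapper breaks integral
comparisons and how explicit conversion restores them.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the original integral uid_t. */
typedef uint32_t sketch_uid_t;

/* Hypothetical stand-in for kuid_t: a struct, so "==" no longer compiles. */
typedef struct {
	sketch_uid_t val;
} sketch_kuid_t;

/* Wrap an integral id in the structure type. */
sketch_kuid_t
sketch_make_kuid(sketch_uid_t uid)
{
	sketch_kuid_t kuid = { uid };

	return (kuid);
}

/* Recover the integral id from the structure type. */
sketch_uid_t
sketch_kuid_val(sketch_kuid_t kuid)
{
	return (kuid.val);
}

/* Code that compares ids must convert explicitly before comparing. */
bool
sketch_owner_matches(sketch_kuid_t owner, sketch_uid_t caller)
{
	return (sketch_kuid_val(owner) == caller);
}
```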

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1589
2013-08-09 15:31:52 -07:00
Brian Behlendorf
fadd0c4da1 Evict meta data from ghost lists + l2arc headers
When the meta limit is exceeded the ARC evicts some meta data
buffers from the mfu+mru lists.  Unfortunately, for meta data
heavy workloads it's possible for these buffers to accumulate
on the ghost lists if arc_c doesn't exceed arc_size.

To handle this case arc_adjust_meta() has been extended to
explicitly evict meta data buffers from the ghost lists in
proportion to what was evicted from the mfu+mru lists.

If this is insufficient we request that the VFS release
some inodes and dentries.  This will result in the release
of some dnodes which are counted as 'other' metadata.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-08-09 10:06:12 -07:00
Brian Behlendorf
68121a03da Allow arc_evict_ghost() to only evict meta data
The default behavior of arc_evict_ghost() is to start by evicting
data buffers.  Then only if the requested number of bytes to evict
cannot be satisfied by data buffers move on to meta data buffers.

This is ideal for honoring arc_c since it's preferable to keep the
meta data cached.  However, if we're trying to free memory from the
arc to honor the meta limit it's a problem because we will need to
discard all the data to get to the meta data.

To avoid this issue arc_evict_ghost() is now passed a fourth
argument describing which buffer type to start with.  The
arc_evict() function already behaves exactly like this for the
same reason so this is consistent with the existing code.

All existing callers have been updated to pass ARC_BUFC_DATA so
this patch introduces no functional change.  New callers may
pass ARC_BUFC_METADATA to skip immediately to evicting meta
data leaving the normal data untouched.
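
A minimal sketch of the behavior described above, assuming a
ghost list can be modeled as one byte count per buffer type.
All names are hypothetical and the real function walks buffer
lists rather than counters.  Starting with data falls through
to meta data when data alone is not enough; starting with meta
data never touches data.

```c
#include <stdint.h>

/* Hypothetical stand-ins for ARC_BUFC_DATA and ARC_BUFC_METADATA. */
typedef enum {
	SKETCH_BUFC_DATA = 0,
	SKETCH_BUFC_METADATA = 1,
	SKETCH_BUFC_NUMTYPES = 2
} sketch_buf_type_t;

/* A ghost list reduced to a byte count per buffer type. */
typedef struct {
	uint64_t bytes[SKETCH_BUFC_NUMTYPES];
} sketch_ghost_state_t;

/* Remove up to 'want' bytes of one type; return how many were removed. */
uint64_t
sketch_take(sketch_ghost_state_t *state, sketch_buf_type_t type,
    uint64_t want)
{
	uint64_t avail = state->bytes[type];
	uint64_t taken = (avail < want) ? avail : want;

	state->bytes[type] -= taken;
	return (taken);
}

/* Evict up to 'want' bytes, beginning with the 'start' buffer type. */
uint64_t
sketch_evict_ghost(sketch_ghost_state_t *state, uint64_t want,
    sketch_buf_type_t start)
{
	uint64_t freed = sketch_take(state, start, want);

	/* Only the data-first path moves on to the other type. */
	if (start == SKETCH_BUFC_DATA && freed < want)
		freed += sketch_take(state, SKETCH_BUFC_METADATA,
		    want - freed);

	return (freed);
}
```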

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-08-09 10:06:08 -07:00
Saso Kiselkov
3a17a7a99a Illumos #3137 L2ARC compression
3137 L2ARC compression
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  illumos/illumos-gate@aad02571bc
  https://www.illumos.org/issues/3137
  http://wiki.illumos.org/display/illumos/L2ARC+Compression

Notes for Linux port:

A l2arc_nocompress module option was added to prevent the
compression of l2arc buffers regardless of how a dataset's
compression property is set.  This allows the legacy behavior
to be preserved.

Ported by: James H <james@kagisoft.co.uk>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1379
2013-08-08 13:27:21 -07:00
Richard Yao
c11a12bc3b Return -1 from arc_shrinker_func()
This is analogous to SPL commit zfsonlinux/spl@b9b3715.  While
we don't have clear evidence of systems getting caught here
indefinitely like in the SPL this ensures that it will never
happen.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1579
2013-08-08 09:20:56 -07:00
Richard Yao
8170d28126 Return correct type and offset from zfs_readdir
zfs_readdir() is used by getdents(), which provides a list of all files
in a directory, their types and an offset that can be used by llseek() to
seek to the next directory entry.

On Solaris, the first two directory entries "." and ".." respectively
have offsets 1 and 2 on ZFS while the other files have rather large
numbers. Currently, ZFSOnLinux is giving "." offset 0 and all other
entries large numbers. The first entry's next entry offset points to
itself, which causes software that uses llseek() in conjunction with
getdents() for filesystem navigation to enter an infinite loop.  The
offsets used for each directory entry are filesystem specific on all
platforms, so we can fix this by adopting the Solaris behavior.

Also, we currently report each directory entry as having type 0 (???).
This is not wrong, but we can do better. getdents() on Solaris does not
appear to provide this information, but it does on Linux and Mac OS X.
ZFS provides easy access to type information in zfs_readdir(), so
this patch provides this as well.
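
As a rough sketch of why the type is cheap to obtain: this assumes the
64-bit value stored for each directory entry packs the file type into
the top 4 bits and the object number into the low 48 bits. That layout
and the helper names below are assumptions made for illustration and
are not taken from this patch.

```c
#include <stdint.h>

/* Assumed layout: bits 60-63 hold the type, bits 0-47 the object number. */
uint8_t
sketch_dirent_type(uint64_t entry_value)
{
	return ((uint8_t)(entry_value >> 60));
}

uint64_t
sketch_dirent_obj(uint64_t entry_value)
{
	return (entry_value & ((1ULL << 48) - 1));
}
```

Under that assumption a value of (8ULL << 60) | 1234 decodes to type 8
and object 1234; on Linux a d_type of 8 is DT_REG and 4 is DT_DIR.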

Reported-by: Andrey <andrey@kudinov.su>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1624
2013-08-07 16:16:43 -07:00
George Wilson
c61f97f426 Illumos #3639 zpool.cache should skip over readonly pools
3639 zpool.cache should skip over readonly pools
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Basil Crow <basil.crow@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
  illumos/illumos-gate@fb02ae0252
  https://www.illumos.org/issues/3639

Normally we don't list pools that are imported read-only in the cache
file, however you can accidentally get one into the cache file by
importing and exporting a read-write pool while a read-only pool is
imported:

$ zpool import -o readonly test1
$ zpool import test2
$ zpool export test2
$ zdb -C

This is a problem because if the machine reboots we import all pools in
the cache file as read-write.

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-08-07 16:13:56 -07:00
Brian Behlendorf
78d7a5d780 Write dirty inodes on close
When the property atime=on is set operations which only access
an inode do cause an atime update.  However, it turns out that
dirty inodes with updated atimes are only written to disk when
the inodes get evicted from the cache.  Somewhat surprisingly
the source suggests that this isn't a ZoL specific issue.

This behavior may in part explain why zfs's reclaim logic has
been observed to be slow.  When reclaiming inodes it's likely
that they have a dirty atime which will force a write to disk.

Obviously we don't want to force a write to disk for every
atime update, these need to be batched.  The right way to
do this is to fully implement the .dirty_inode and .write_inode
callbacks.  However, to do that right requires proper unification
of some fields in the znode/inode.  Then we could just mark the
inode dirty and leave it to the VFS to call .write_inode
periodically.

Until that work gets done we have to settle for some middle
ground.  The simplest and safest thing we can do for now is
to write the dirty inode on last close.  This should prevent
the majority of inodes in the cache from having dirty atimes
and not drastically increase the number of writes.

Some rudimentary testing to show how long it takes to drop
500,000 inodes from the cache shows promising results.  This
is as expected because we're no longer doing lots of IO as part
of the eviction, it was done earlier during the close.

w/out patch: ~30s to drop 500,000 inodes with drop_caches.
with patch:  ~3s to drop 500,000 inodes with drop_caches.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-08-07 16:11:19 -07:00
Brian Behlendorf
57b650b86f Export additional dmu symbols
The dmu_prefetch, dmu_free_long_range, dmu_free_object,
dmu_prealloc, dmu_write_policy, and dmu_sync symbols have
been exported so they may be used by other modules.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-08-01 09:48:07 -07:00