Commit Graph

1800 Commits

Author SHA1 Message Date
Brian Behlendorf
340dfbe193 Change VERIFY to ASSERT in mutex_destroy()
There have been multiple reports of 'zdb' tripping the VERIFY in
mutex_destroy() because pthread_mutex_destroy() returns EBUSY.

Exactly how this can happen still needs to be explained, but this
doesn't strictly need to be fatal for non-debug builds.  Therefore,
this patch converts the VERIFY to an ASSERT until the root cause
is determined and resolved.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2027
2015-02-11 13:58:40 -08:00
Andrey Vesnovaty
5f15fa2216 Fix readdir for .zfs/snapshot directory
dmu_snapshot_list_next stores the index of the next snapshot entry to the offp
argument, which zpl_snapdir_iterate then uses for the dir_emit. This
result in an off-by-one error. Therefore a temporary variable should be
used.

This was a regression introduced in commit zfsonlinux/zfs@0f37d0c.

Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2930
2015-02-10 16:34:30 -08:00
Brian Behlendorf
3941503c0a Retire zio_cons()/zio_dest()
The zio_cons() constructor and zio_dest() destructor don't exist
in the upstream Illumos code.  They were introduced as a workaround
to avoid issue #2523.  Since this issue has now been resolved this
code is being reverted to bring ZoL back in sync with Illumos.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #3063
2015-02-10 16:09:49 -08:00
Brian Behlendorf
6442f3cfe3 Retire zio_bulk_flags
Long ago the zio_bulk_flags module parameter was introduced to
facilitate debugging and profiling the zio_buf_caches.  Today
this code works well and there's no compelling reason to keep
this functionality.  In fact it's preferable to revert this so
the code is more consistent with other ZFS implementations.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #3063
2015-02-10 16:08:49 -08:00
Jörg Thalheim
534759fad3 Linux 3.19 compat: file_inode was added
struct access f->f_dentry->d_inode was replaced by accessor function
file_inode(f)

Signed-off-by: Joerg Thalheim <joerg@higgsboson.tk>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3084
2015-02-10 11:24:51 -08:00
Brian Behlendorf
77aef6f60e Use vmem_alloc() for nvlists
Several of the nvlist functions may perform allocations larger than
the 32k warning threshold.  Convert them to use vmem_alloc() so the
best allocator is used.

Commit efcd79a retired KM_NODEBUG which was used to suppress large
allocation warnings.  Concurrently the large allocation warning threshold
was increased from 8k to 32k.  The goal was to identify the remaining
locations, such as this one, where the allocation can be larger than
32k.  This patch is expected fine tuning resulting for the kmem-rework
changes, see commit 6e9710f.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3057
Closes #3079
Closes #3081
2015-02-10 11:00:08 -08:00
Brian Behlendorf
afe373260e Revert "Don't read space maps during import for readonly pools"
This reverts commit 7fc8c33ede which
accidentally introduced a ztest failure.

ztest: '/usr/sbin/zdb -bcc -d -U /var/tmp/zpool.cache ztest' exit code 2
child exited with code 3

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-02-09 16:56:59 -08:00
Tim Chase
e890dd85a7 Produce a full snapshot list for zfs send -p
In order to accelerate zfs receive operations in the face of many
property-containing snapshots, commit 0574855 changed the header nvlist
("fss") of a send stream to exclude snapshots which aren't part of the
stream.  This, however, would cause zfs receive -F to erroneously remove
snapshots; it would remove any snapshot which wasn't listed in the header
nvlist.

This patch restores the full list of snapshots in fss[<id>[snaps]] but
still suppresses the properties of non-sent snapshots and also removes a
consistency check in which an error is raised if a listed snapshot does
not have any properties in fss[<id>[snapprops]].

The 0574855 commit also introduced a bug in which zfs send -p of a
complete stream (zfs send -p pool/fs@snap) would exclude the snapshot
properties in fss[<id>[snapprops]].  This patch detects the last snapshot
in a series when no "from" snapshot has been specified and includes its
properties.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2907
2015-02-09 16:43:17 -08:00
Brian Behlendorf
7fc8c33ede Don't read space maps during import for readonly pools
Normally when importing a pool the space maps for all top level
vdevs are read from disk.  The space maps will be required latter
when an allocation is performed and free blocks need to be located.

However, if the pool is imported readonly then we are guaranteed
that no allocations can occur.  In this case the space maps need
not be loaded..   A similar argument can be made for the DTLs
(dirty time logs).

Because a pool import will fail if the space maps cannot be read.
The ability to safely ignore them makes it more likely that a
damaged pool can be imported readonly to recover its contents.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2831
2015-02-09 16:43:03 -08:00
Lukas Wunner
bf5efb5c66 Fix Dracut scripts to allow for blanks in pool and dataset names
The ability to use blanks is documented in zpool(8) and implemented
in module/zcommon/zfs_namecheck.c:valid_char().

Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3083
2015-02-09 10:08:43 -08:00
Lukas Wunner
293d141ae4 Fix loop in Dracut shutdown script
The shell executes each command of a pipeline in a subshell,
thus $ret always had the same value after the while loop that
it had before the loop (http://mywiki.wooledge.org/BashFAQ/024),
signaling success even if some of the zpools could not be exported.

Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3083
2015-02-09 10:08:05 -08:00
Justin T. Gibbs
33b4de513e Illumos 5311 - traverse_dnode may report success when it should not
5311 traverse_dnode may report success when it should not
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Will Andrews <willa@spectralogic.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://github.com/illumos/illumos-gate/commit/2a89c2c
  https://www.illumos.org/issues/5311

Ported by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2970
2015-02-06 12:07:15 -08:00
Ned Bass
a62d1b02e3 Fix SA header size accounting
The functions sa_find_sizes() and sa_build_layout() fail to account
for the additional 2 bytes of SA header space when calculating whether
a variable size attribute might spill over. They may consequently
determine that an attribute will fit in the bonus buffer along with a
spill block pointer, when in reality the attribute would be partially
overwritten by the spill block pointer if spill over occurs. This also
causes an inconsistency between the SA header size and the number of
variable size attributes in the layout, tripping an assertion when
debugging is on. The following reproducer demonstrates the problem.

  ln -s $(perl -e 'print "z" x 20') file
  setfattr -h -n trusted.foo -v $(perl -e 'print "z" x 200') file

Even though sa_find_sizes() computes the index of the attribute where
spill-over will occur, sa_build_layouts() discards the result and
recomputes it itself. As it turns out, both functions get it wrong.
Since this computation is awkward and, as history has shown, easy to
screw up, let's just do it in one place. This patch fixes the bug in
sa_find_sizes() and updates sa_build_layout() to use the result
computed there.

Also improve the comments in sa_find_sizes().

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #3070
2015-02-06 09:26:46 -08:00
Brian Behlendorf
e2c4acde55 Skip evicting dbufs when walking the dbuf hash
When a dbuf is in the DB_EVICTING state it may no longer be on the
dn_dbufs list.  In which case it's unsafe to call DB_DNODE_ENTER.
Therefore, any dbuf which is found in this safe must be skipped.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2553
Closes #2495
2015-02-06 09:24:28 -08:00
Chunwei Chen
aa506dcb3d Fix build error when make deb
After 53698a4, the following error occurs when make deb.

  CCLD     zed
../../lib/libzfs/.libs/libzfs.so: undefined reference to `get_system_hostid'

Add libzpool.la to zed/Makefile.am to fix this

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3080
2015-02-06 09:16:32 -08:00
Chunwei Chen
53698a453d Read spl_hostid module parameter before gethostid()
If spl_hostid is set via module parameter, it's likely different from
gethostid(). Therefore, the userspace tool should read it first before
falling back to gethostid().

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3034
2015-02-04 16:44:53 -08:00
Tim Chase
aa2ef419e4 Spurious ENOMEM returns when reading dbufs kstat
Commit 7b2d78a046 fixed some improper uses
of snprintf(), however, in __dbuf_stats_hash_table_data() the return
value of snprintf is propagated to the caller.  This caused spurious
ENOMEM errors when reading the dbufs kstat.

This commit causes the actual number of characters written to be returned.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3072
2015-02-04 16:35:26 -08:00
avg
037763e44e fix l2arc compression buffers leak
Commit log from FreeBSD:

    We have observed that arc_release() can be called concurrently with a
    l2arc in-flight write.  Also, we have observed that arc_hdr_destroy()
    can be called from arc_write_done() for a zio with ZIO_FLAG_IO_REWRITE
    flag in similar circumstances.

    Previously the l2arc headers would be freed while leaking their
    associated compression buffers.  Now the buffers are placed on
    l2arc_free_on_write list for delayed freeing.  This is similar to
    what was already done to arc buffers that were supposed to be freed
    concurrently with in-flight writes of those buffers.

    In addition to fixing the discovered leaks this change also adds
    some protective code to assert that a compression buffer associated
    with a l2arc header is never leaked.

    A new kstat l2_cdata_free_on_write is added.  It keeps a count
    of delayed compression buffer frees which previously would have
    been leaks.

    Tested by:  Vitalij Satanivskij <satan@ukr.net> et al
    Requested by:       many
    MFC after:  2 weeks
    Sponsored by:       HybridCluster / ClusterHQ

References:
    https://illumos.org/issues/5222
    https://github.com/freebsd/freebsd/commit/b98f85d
    http://thread.gmane.org/gmane.os.freebsd.current/155757/focus=155781
    http://lists.open-zfs.org/pipermail/developer/2014-January/000455.html
    http://lists.open-zfs.org/pipermail/developer/2014-February/000523.html

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3029
2015-02-03 16:54:16 -08:00
Brian Behlendorf
19ea3d25df Use zio buffers in zil_itx_create()
The zil_itx_create() function uses the vmem_alloc() allocator for
its buffers because when logging a write that buffer may be as large
as 64K.  This is non-optimal because we may need to allocate many of
of these buffers and this interface has the potential to be slow.
Instead, use zio_data_buf_alloc() which is specifically designed to
be able to efficiently allocate a wide range of buffer sizes.

In addition, do some cleanup and use the zil_itx_destroy() function
to always free an itx structure.  This way we're always sure the
right allocation functions are used.  Notice that in the current
code kmem_free() and vmem_free() were both used.  This happened to
work because these wrappers map to the same internal SPL function.

This was identified as a potential problem when a low-end memory
constrained system began logging the following warnings.  There
was no deadlock here just repeated allocation failures resulting
in increased latency.

Possible memory allocation deadlock: size=65792 lflags=0x42d0
Pid: 20118, comm: kvm Tainted: P           O 3.2.0-0.bpo.4-amd64
Call Trace:
 [<ffffffffa040b834>] ? spl_kmem_alloc_impl+0x115/0x127 [spl]
 [<ffffffffa040b84f>] ? spl_kmem_alloc_debug+0x9/0x36 [spl]
 [<ffffffffa05d8a0b>] ? zil_itx_create+0x2d/0x59 [zfs]
 [<ffffffffa05c71e6>] ? zfs_log_write+0x13a/0x2f0 [zfs]
 [<ffffffffa05d41bc>] ? zfs_write+0x85b/0x9bb [zfs]
 [<ffffffffa05e37ec>] ? zpl_aio_write+0xca/0x110 [zfs]
 [<ffffffff811088e5>] ? do_sync_readv_writev+0xa3/0xde
 [<ffffffff81108f41>] ? do_readv_writev+0xaf/0x125
 [<ffffffff81109055>] ? sys_pwritev+0x55/0x9a
 [<ffffffff813721d2>] ? system_call_fastpath+0x16/0x1b

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #3059
2015-02-02 11:20:41 -08:00
Chris Dunlap
2c41df5bf8 Cleanup _zed_event_add_nvpair()
When _zed_event_add_var() was updated to be the common routine
for adding zedlet environment variables, an additional snprintf()
was added to the processing of each nvpair.  This commit changes
_zed_event_add_nvpair() to directly call _zed_event_add_var()
for nvpair non-array types, thereby removing a superfluous call to
snprintf().  For consistency, the helper functions for converting
nvpair array types are similarly adjusted to add variables.

The _zed_event_value_is_hex() and _zed_event_add_var() functions have
been moved up in the file since forward declarations are not used,
but no changes have been made to these functions.

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3042
2015-01-30 14:46:44 -08:00
Chris Dunlap
854f30a91f Protect against adding duplicate strings in ZED
The zed_strings container stores strings in an AVL, but does not
check for duplicate strings being added.  Within the AVL, strings
are indexed by the string value itself.  avl_add() requires the node
being added must not already exist in the tree, and will assert()
if this is not the case.

This should not cause problems in practice.  ZED uses this container
in two places.  In zed_conf.c, it is used to store the names of
enabled zedlets as zed scans the zedlet directory listing; duplicate
entries cannot occur here since duplicate names cannot occur within
a directory.  In zed_event.c, it is used to store the environment
variables (as "NAME=VALUE" strings) that will be passed to zedlets;
duplicate strings here should never happen unless there is a bug
resulting in a duplicate nvpair or environment variable.

This commit protects against adding a duplicate to a zed_strings
container by first checking for the string being added, and removing
the previous entry should one exist.  This implements a "last one
wins" policy.

This commit also changes the prototype for zed_strings_add() to allow
the string key (by which it is indexed in the AVL) to differ from
the string value.  By adding zedlet environment variables using the
variable name as the key, multiple adds for the same variable name
will result in only the last value being stored.

Finally, this commit routes all additions of zedlet environment
variables through the updated _zed_event_add_var().  This ensures
all zedlet environment variable names are properly converted.

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3042
2015-01-30 14:46:17 -08:00
Brian Behlendorf
0365064a97 Handle closing an unopened ZVOL
Thank to commit a4430fce69 we're
now correctly returning EROFS when opening a zvol on a read-only
pool.  Unfortunately, it looks like this causes us to trigger
some unexpected behavior by __blkdev_get().

In the failure case it's possible __blkdev_get() will call
__blkdev_put() for a bdev which was never successfully opened.
This results in us trying to close the device again and hitting
the NULL dereference.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1343
2015-01-30 14:44:14 -08:00
Brian Behlendorf
a127e841de Add zvol_open() error handling for readonly property
Rather than ASSERT when for some reason the readonly property of
a zvol can't be read cleanly handle the failure.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1343
2015-01-30 14:44:06 -08:00
Richard Yao
f9f431cd28 Use (void) memcpy(), not (void *) memcpy()
This was caught by Clang.  Clearly the intent of this code was
to explicitly ignore the return value.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3054
2015-01-30 09:43:04 -08:00
Brian Behlendorf
6466b61db6 Make zpool import -d|-c behave consistently
When importing pools with zpool import -aN there is inconsistent
behavior between '-d /dev/disk/by-id' (or another path) and
'-c /etc/zfs/zpool.cache'.

The difference in behavior is caused by zpool_find_import_cached()
returning an empty nvlist_t when there are no pools to import but
zpool_find_import_impl() returns NULL for the same situation. The
behavior of zpool_find_import_cached() is arguably more correct
because it allows returning NULL to be used for an error case and
not an empty set.

This change resolves the issue by updating get_configs() such that
it returns an empty set instead of NULL when no config is found.
The updated behavior will now always return 0 for this case.

$ zpool import -aN; echo $?
no pools available to import
0

$ zpool import -aN -d /var/tmp/; echo $?
no pools available to import
0

$ zpool import -aN -c /etc/zfs/zpool.cache; echo $?
no pools available to import
0

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2080
2015-01-28 11:12:31 -08:00
Brian Behlendorf
a485efc4cd Merge branch 'arc_summary_draft_v2'
Add a port of arc_summary.py to ZFS on Linux, arc_summary.py is a
standard tool in FreeBSD and Illumos. The version of the script used
for the port originally came from FreeNAS.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Kyle Blatter <kyleblatter@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #2920
2015-01-28 11:08:11 -08:00
Kyle Blatter
a644585549 Replace sysctl summary with tunables summary.
The original script displayed tunable parameters using sysctl calls.
This patch modifies this by displaying tunable parameters found in
/sys/modules/zfs/parameters/. modinfo calls are used to capture
descriptions.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Kyle Blatter <kyleblatter@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
2015-01-28 11:05:53 -08:00
Kyle Blatter
65dc685a1d Force all lines to be 80 columns
Ensure this script conforms to the projects style guidelines
by limiting line length to 80 columns.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Kyle Blatter <kyleblatter@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
2015-01-28 11:03:14 -08:00
Kyle Blatter
6427f39fe0 Add a help option with usage information
Add a basic help option and usage description which is consistent
with arcstat.py and dbufstat.py.  This also adds support for long
opts.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Kyle Blatter <kyleblatter@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
2015-01-28 11:00:11 -08:00
Kyle Blatter
3bfc7f1b20 Refactor arc_summary to simplify -p processing
The -p option is used to specify a specific page of output to be
displayed. If omitted, all output pages would be displayed.

arc_summary, as it stood, had really kludgy processing code for
processing the -p option. It relied on a try-except block which was
treated as an if statement and in normal operation would fail any time a
user didn't specify the -p option on the command line. When the
exception was thrown, the script would then display all output pages.
This happened whether the -p option was omitted or malformed. Thus, in
the principle use case, an exception would be raised in order to run the
script as desired. The same except code would be called regardless of
the exception, however, and malformed -p arguments would also cause the
script to execute.

Additionally, this required the function which handles the case where
all output pages were to be displayed, _call_all, to be potentially
called from several locations within main.

This commit refactors the option processing code to simplify it and make
it easier to catch runtime errors in the script. This is done by
specializing the try-except block to only have an exception when the -p
argument is malformed. When the -p option is correctly selected by the
user, it calls a function in the unSub array directly, which will only
display one page of output.

Finally in the context of this refactoring the page breaks have been
removed.  Pages seem to have been added into the output in the FreeNAS
version of the script. This patch removes pages from the output to more
closely resemble the freebsd version of the script.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Kyle Blatter <kyleblatter@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
2015-01-28 10:59:19 -08:00
cburroughs
53dc1139e7 Modified arc_summary.py to run on linux
1) Comment out stat sections whose kstats are not currently available
2) Port most of arc_summary to use spl kstats
3) Enable l2arc stats
5) Include compressed l2size
4) Minor style fixes / cleanup

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cburroughs <chris.burroughs@gmail.com>
Signed-off-by: Kyle Blatter <kyleblatter@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
2015-01-28 10:27:44 -08:00
cburroughs
edd5b80d07 Add arc_summary.py from FreeNAS
The arc_summary script is a useful utility for administrators on
other ZFS platforms.  It provides a quick and easy way to get a high
level view of the current ARC state.

Historically this was a perl script but it was rewritten in python
for FreeNAS.  We've decided to adopt the python version instead of
the perl version for a few reasons.

1) ZoL has no existing perl dependencies, but it does have a python
   dependency for scripts such as arcstat.py and dbufstat.py.  Using
   python for arc_summary.py helps us minimize dependencies.

2) Most major Linux distributions already depend heavily on python
   for their core infrastructure.  This means it's very likely to
   be available even very early in the boot process.

Original source:
  https://github.com/freenas/freenas/blob/master/gui/tools/arc_summary.py

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cburroughs <chris.burroughs@gmail.com>
Signed-off-by: Kyle Blatter <kyleblatter@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
2015-01-28 09:33:00 -08:00
Tim Chase
b0cf0676c0 Fix removal of SA in sa_modify_attrs()
The sa_modify_attrs() function can add, remove or replace an SA.
The main loop in the function uses the index "i" to iterate over the
existing SAs and uses the index "j" for writing them into a new buffer
via SA_ADD_BULK_ATTR().  The write index, "j" is incremented on remove
(SA_REMOVE) operations which leads to a corruption in the new SA buffer.
This patch remove the increment for SA_REMOVE operations.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #3028
2015-01-21 16:35:14 -08:00
Richard Yao
841c9d43c7 Use kmem_vasprintf() in log_internal()
An attempt to debug zfsonlinux/zfs#2781 revealed that this code could be
simplified by using kmem_asprintf(). It is not clear that switching to
kmem_asprintf() addresses zfsonlinux/zfs#2781. However, switching to
kmem_asprintf() is cleanup that simplifies debugging such that it would
become clear that this is a bug in glibc should the issue persist.

It also brings this function almost back in sync with Illumos.  This
was possible due to the recently reworked kmem code which allows us
to use KM_SLEEP in the same fashion as Illumos.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2791
Issue #2781
2015-01-21 15:30:24 -08:00
Tim Chase
3c832b8cc1 Linux 3.12 compat: split shrinker has s_shrink
The split count/scan shrinker callbacks introduced in 3.12 broke the
test for HAVE_SHRINK, effectively disabling the per-superblock shrinkers.

This patch re-enables the per-superblock shrinkers when the split shrinker
callbacks have been detected.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2975
2015-01-20 14:07:59 -08:00
Brian Behlendorf
6e9710f7c3 Merge branch 'kmem-rework'
The core motivation behind these changes is to minimize the
memory management differences between ZFS on Linux and other
platforms.  This simplifies the process of porting changes to
Linux from other platforms.  This is good for code quality
and is expected to reduce the number of defects accidentally
introduced due to porting.  The following key Linux specific
changes have been reverted.

* KM_PUSHPAGE changed back to KM_SLEEP.  All contexts where
  it is unsafe to perform IO have been marked with PF_FSTRANS.
  This context specific mechanism is now used exclusively
  and the KM_PUSHPAGE mechanism has been retired.

* The KM_NODEBUG flag has been retired.  Allocations larger
  than 32K should use vmem_alloc()/vmem_free().  Depending
  on the size of the allocation either kmalloc() or vmalloc()
  will be used internally, but no warning will be printed.

* Pre-allocated vdev IO buffers and the dedicated SA spill
  block cache have been retired.  It is now safe and reliable
  to allocate buffers of the needed size without fear of
  deadlocking.  This reduces our memory footprint and paves
  the way for larger block sizes.

Depends on zfsonlinux/spl#414.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #2918
2015-01-16 14:58:46 -08:00
Brian Behlendorf
81971b137a Revert "SA spill block cache"
The SA spill_cache was originally introduced to avoid the need to
perform large kmem or vmem allocations.  Instead a small dedicated
cache of preallocated SA buffers was kept.

This solution was viable while the maximum block size was limited
to 128K.  But with the planned increase of the maximum block size
to 16M callers need to migrate to the zio_buf_alloc().  However,
they should be aware this interface is expected to change again
once the zio buffers are fully backed by scatter-gather lists.

Alternately, if the callers know these buffers will never be large
or be infrequently accessed they may kmem_alloc() or vmem_alloc()
the needed temporary space.

This change has the additional benegit of bringing the code back
inline with the upstream Illumos source.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 14:41:28 -08:00
Brian Behlendorf
285b29d959 Revert "Pre-allocate vdev I/O buffers"
Commit 86dd0fd added preallocated I/O buffers.  This is no longer
required after the recent kmem changes designed to make our memory
allocation interfaces behave more like those found on Illumos.  A
deadlock in this situation is no longer possible.

However, these allocations still have the potential to be expensive.
So a potential future optimization might be to perform then KM_NOSLEEP
so that they either succeed of fail quicky.  Either case is acceptable
here because we can safely abort the aggregation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 14:41:28 -08:00
Brian Behlendorf
60e1eda929 Add kmem_cache.h include to default context
As part of the spl kmem/vmem refactoring the kmem_cache_* functions
were split in to their own kmem_cache.h header.  This was done in
part so that kmem_* consumers would not be forced to include the
kmem_cache_* functions which mask several Linux SLAB/SLAB functions.

Because of this we now much explicitly include kmem_cache.h in the
zfs_context.h.  However, consumers such as Lustre which need access
to the KM_FLAGS but not the kmem_cache_* functions can now safely
just include kmem.h.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 14:41:28 -08:00
Brian Behlendorf
79c76d5b65 Change KM_PUSHPAGE -> KM_SLEEP
By marking DMU transaction processing contexts with PF_FSTRANS
we can revert the KM_PUSHPAGE -> KM_SLEEP changes.  This brings
us back in line with upstream.  In some cases this means simply
swapping the flags back.  For others fnvlist_alloc() was replaced
by nvlist_alloc(..., KM_PUSHPAGE) and must be reverted back to
fnvlist_alloc() which assumes KM_SLEEP.

The one place KM_PUSHPAGE is kept is when allocating ARC buffers
which allows us to dip in to reserved memory.  This is again the
same as upstream.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 14:41:26 -08:00
Brian Behlendorf
efcd79a883 Retire KM_NODEBUG
Callers of kmem_alloc() which passed the KM_NODEBUG flag to suppress
the large allocation warning have been replaced by vmem_alloc() as
appropriate.  The updated vmem_alloc() call will not print a warning
regardless of the size of the allocation.

A careful reader will notice that not all callers have been changed
to vmem_alloc().  Some have only had the KM_NODEBUG flag removed.
This was possible because the default warning threshold has been
increased to 32k.  This is desirable because it minimizes the need
for Linux specific code changes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 14:40:32 -08:00
Richard Yao
71f8548ea4 Use is_vmalloc_addr() in vdev_disk.c
The initial port of ZFS to Linux required a way to identify virtual
memory to make IO to virtual memory backed slabs work, so kmem_virt()
was created. Linux 2.6.25 introduced is_vmalloc_addr(), which is
logically equivalent to kmem_virt(). Support for kernels before 2.6.26
was later dropped and more recently, support for kernels before Linux
2.6.32 has been dropped. We retire kmem_virt() in favor of
is_vmalloc_addr() to cleanup the code.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 14:28:05 -08:00
Brian Behlendorf
92119cc259 Mark IO pipeline with PF_FSTRANS
In order to avoid deadlocking in the IO pipeline it is critical that
pageout be avoided during direct memory reclaim.  This ensures that
the pipeline threads can always make forward progress and never end
up blocking on a DMU transaction.  For this very reason Linux now
provides the PF_FSTRANS flag which may be set in the process context.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 14:28:05 -08:00
Brian Behlendorf
d958324f97 Fix zfs_putpage() lock inversion (again)
This is a follow up commit to 74328ee which correctly resolved a lock
inversion between zfs_putpage() and zfs_free_range().  Unfortunately,
in the process it accidentally introduced another inversion between
zfs_putpage() and zfs_read().  The page must be unlocked before taking
the range lock.  This patch corrects that issue.

In addition, because the locking rules here are subtle a block comment
has been added clearly explaining why the ordering here is critical.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #2976
2015-01-08 16:09:41 -08:00
Ned Bass
33b6dbbc51 Document zfs_flags module parameter
Add a table describing the debugging flags that can be set in the zfs_flags
module parameter.  Also change the module_param type to 'uint' so users aren't
shown a negative value. The updated man page text is reproduced below for
convenience.

zfs_flags (int)
            Set  additional debugging flags. The following flags may be
            bitwise-or'd together.

            +-------------------------------------------------------+
            |Value   Symbolic Name                                  |
            |        Description                                    |
            +-------------------------------------------------------+
            |    1   ZFS_DEBUG_DPRINTF                              |
            |        Enable dprintf entries in the debug log.       |
            +-------------------------------------------------------+
            |    2   ZFS_DEBUG_DBUF_VERIFY *                        |
            |        Enable extra dbuf verifications.               |
            +-------------------------------------------------------+
            |    4   ZFS_DEBUG_DNODE_VERIFY *                       |
            |        Enable extra dnode verifications.              |
            +-------------------------------------------------------+
            |    8   ZFS_DEBUG_SNAPNAMES                            |
            |        Enable snapshot name verification.             |
            +-------------------------------------------------------+
            |   16   ZFS_DEBUG_MODIFY                               |
            |        Check for illegally modified ARC buffers.      |
            +-------------------------------------------------------+
            |   32   ZFS_DEBUG_SPA                                  |
            |        Enable spa_dbgmsg entries in the debug log.    |
            +-------------------------------------------------------+
            |   64   ZFS_DEBUG_ZIO_FREE                             |
            |        Enable verification of block frees.            |
            +-------------------------------------------------------+
            |  128   ZFS_DEBUG_HISTOGRAM_VERIFY                     |
            |        Enable extra spacemap histogram verifications. |
            +-------------------------------------------------------+
            * Requires debug build.

            Default value: 0.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2988
2015-01-07 15:50:49 -08:00
Ned Bass
4e30e68caf Don't use AC_LANG_SOURCE for conftest.h source
Using AC_LANG_SOURCE with some versions of autoconf is problematic if
the given source is to be written to a header file. Such versions assume
the contents are to be written to conftest.c and generate shell code to
that effect. The contents of the test program to detect support for
Linux tracepoints were consequently malformed (containing the source for
conftest.h) so the build system incorrectly disabled tracepoints
support. Fix this in ZFS_LINUX_TRY_COMPILE_HEADER by passing the header
source directly to ZFS_LINUX_COMPILE_IFELSE.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2953
2015-01-06 16:53:30 -08:00
Ned Bass
49ee64e5e6 Remove duplicate typedefs from trace.h
Older versions of GCC (e.g. GCC 4.4.7 on RHEL6) do not allow duplicate
typedef declarations with the same type. The trace.h header contains
some typedefs to avoid 'unknown type' errors for C files that haven't
declared the type in question. But this causes build failures for C
files that have already declared the type. Newer versions of GCC (e.g.
v4.6) allow duplicate typedefs with the same type unless pedantic error
checking is in force. To support the older versions we need to remove
the duplicate typedefs.

Removal of the typedefs means we can't built tracepoints code using
those types unless the required headers have been included. To
facilitate this, all tracepoint event declarations have been moved out
of trace.h into separate headers. Each new header is explicitly included
from the C file that uses the events defined therein. The trace.h header
is still indirectly included form zfs_context.h and provides the
implementation of the dprintf(), dbgmsg(), and SET_ERROR() interfaces.
This makes those interfaces readily available throughout the code base.
The macros that redefine DTRACE_PROBE* to use Linux tracepoints are also
still provided by trace.h, so it is a prerequisite for the other
trace_*.h headers.

These new Linux implementation-specific headers do introduce a small
divergence from upstream ZFS in several core C files, but this should
not present a significant maintenance burden.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2953
2015-01-06 16:53:24 -08:00
Brian Behlendorf
74328ee18f Fix zfs_putpage() lock inversion
There exists a lock inversions involving the zfs range lock and the
individual page writeback bits which can result in a deadlock.  To
prevent this we must always manipulate the writeback bit while
holding the range lock.  The exact deadlock is as follows:

------ Process A ------        ------ Process B ------
zpl_writepages                 zpl_fallocate
write_cache_pages              zpl_fallocate_common
zpl_putpage                    zfs_space
zfs_putpage (set bit)          zfs_freesp
zfs_range_lock (wait on lock)  zfs_free_range (take lock)
[has not yet initiated I/O,    truncate_inode_pages_range
the bit will not be cleared]   wait_on_page_writeback (wait on bit)

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <richard.yao@clusterhq.com>
Issue #2976
2014-12-22 09:31:56 -08:00
Ned Bass
2d9d57b0fb vdev_id: use mawk-compatible regular expression
Slot mapping in vdev_id doesn't work on systems using mawk as the 'awk'
alternative. A regular expression in map_slot() contains an unquoted
empty string following the alternation (|) operator, which results in an
"missing operand" error with mawk. The solution is to rearrange the
expression so the alternation has two operands.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/pkg-zfs#136
Closes zfsonlinux/zfs#2965
2014-12-19 12:05:16 -08:00
Brian Behlendorf
5c7afad448 Fix cstyle issue from c66989b
Commit c66989b accidentally introduced a cstyle issue which went
unnoticed.  This tiny patch corrects that oversight.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-12-19 12:00:14 -08:00