Commit Graph

1467 Commits

Author SHA1 Message Date
Adam Leventhal
64dbba3679 Illumos 5174 - add sdt probe for blocked read in dbuf_read()
5174 add sdt probe for blocked read in dbuf_read()
Reviewed by: Basil Crow <basil.crow@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/5174
  https://github.com/illumos/illumos-gate/commit/f6164ad

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2710
2014-09-22 14:20:25 -07:00
Ned Bass
cfd3549a53 Remove obsolete comment about guard pages
Remove an obsolete comment that refers to code removed by commit
79c6e4c4. The code and comment related to space consumed by guard
pages in user-space stacks, which we no longer take into account.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2722
2014-09-22 12:22:16 -07:00
Chris Dunlap
5043deaa40 Remove reverse indentation from zed comments.
Remove all occurrences of reverse indentation from zed comments for
consistency within the project code base.

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2695
2014-09-22 12:17:53 -07:00
Matthew Ahrens
6d9036f350 Illumos 5140 - message about "%recv could not be opened" is printed when booting after crash
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/projects/illumos-gate//issues/5140
  https://github.com/illumos/illumos-gate/commit/2243853

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2676
2014-09-18 15:04:59 -07:00
Turbo Fredriksson
71bd064555 Document environment variables for zdb, zfs, zinject and zpool.
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2691
2014-09-18 15:01:03 -07:00
Brian Behlendorf
2d50158343 Fix z_teardown_inactive_lock deadlock
When rolling back a mounted filesystem zfs_suspend() is called
which acquires the z_teardown_inactive_lock.  This lock can not
be dropped until the filesystem has been rolled back and resumed
in zfs_resume_fs().

Therefore, we must not call iput() under this lock because it
may result in the inode->evict() handler being called which also
takes this lock.  Instead use zfs_iput_async() to ensure dropping
the last reference is deferred and runs in a safe context.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2670
2014-09-11 09:11:45 -07:00
Tim Chase
52dd454d05 Document the "readonly" pool property
This documentation is based FreeBSD's zpool(8) man page.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2682
2014-09-09 11:35:46 -07:00
Tim Chase
223df0161f Implement fallocate FALLOC_FL_PUNCH_HOLE
Add support for the FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE mode of
fallocate(2).  Mimic the behavior of other native file systems such as
ext4 in cases where the file might be extended. If the offset is beyond
the end of the file, return success without changing the file. If the
extent of the punched hole would extend the file, only the existing tail
of the file is punched.

Add the zfs_zero_partial_page() function, modeled after update_page(),
to handle zeroing partial pages in a hole-punching operation.  It must
be used under a range lock for the requested region in order that the
ARC and page cache stay in sync.

Move the existing page cache truncation via truncate_setsize() into
zfs_freesp() for better source structure compatibility with upstream code.

Add page cache truncation to zfs_freesp() and zfs_free_range() to handle
hole punching.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #2619
2014-09-08 13:52:25 -07:00
George Wilson
4f68d7878f Illumos 5117 - spacemap reallocation can cause corruption
5117 space map reallocation can cause corruption
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/projects/illumos-gate/issues/5117
  https://github.com/illumos/illumos-gate/commit/e503a68

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2662
2014-09-08 09:42:39 -07:00
Brian Behlendorf
ceb49b0acd Add object type checking to zap_lockdir()
If a non-ZAP object is passed to zap_lockdir() it will be treated
as a valid ZAP object.  This can result in zap_lockdir() attempting
to read what it believes are leaf blocks from invalid disk locations.
The SCSI layer will eventually generate errors for these bogus IOs
but the caller will hang in zap_get_leaf_byblk().

The good news is that is a situation which can not occur unless the
pool has been damaged.  The bad news is that there are reports from
both FreeBSD and Solaris of damaged pools.  Specifically, there are
normal files in the filesystem which reference another normal file
as their parent.

Since pools like this are known to exist the zap_lockdir() function
has been updated to verify the type of the object.  If a non-ZAP
object has been passed it EINVAL will be returned immediately.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2597
Issue #2602
2014-09-08 09:15:38 -07:00
Richard Yao
cd3939c5f0 Linux AIO Support
nfsd uses do_readv_writev() to implement fops->read and fops->write.
do_readv_writev() will attempt to read/write using fops->aio_read and
fops->aio_write, but it will fallback to fops->read and fops->write when
AIO is not available. However, the fallback will perform a call for each
individual data page. Since our default recordsize is 128KB, sequential
operations on NFS will generate 32 DMU transactions where only 1
transaction was needed. That was unnecessary overhead and we implement
fops->aio_read and fops->aio_write to eliminate it.

ZFS originated in OpenSolaris, where the AIO API is entirely implemented
in userland's libc by intelligently mapping them to VOP_WRITE, VOP_READ
and VOP_FSYNC.  Linux implements AIO inside the kernel itself. Linux
filesystems therefore must implement their own AIO logic and nearly all
of them implement fops->aio_write synchronously. Consequently, they do
not implement aio_fsync(). However, since the ZPL works by mapping
Linux's VFS calls to the functions implementing Illumos' VFS operations,
we instead implement AIO in the kernel by mapping the operations to the
VOP_READ, VOP_WRITE and VOP_FSYNC equivalents. We therefore implement
fops->aio_fsync.

One might be inclined to make our fops->aio_write implementation
synchronous to make software that expects this behavior safe. However,
there are several reasons not to do this:

1. Other platforms do not implement aio_write() synchronously and since
the majority of userland software using AIO should be cross platform,
expectations of synchronous behavior should not be a problem.

2. We would hurt the performance of programs that use POSIX interfaces
properly while simultaneously encouraging the creation of more
non-compliant software.

3. The broader community concluded that userland software should be
patched to properly use POSIX interfaces instead of implementing hacks
in filesystems to cater to broken software. This concept is best
described as the O_PONIES debate.

4. Making an asynchronous write synchronous is non sequitur.

Any software dependent on synchronous aio_write behavior will suffer
data loss on ZFSOnLinux in a kernel panic / system failure of at most
zfs_txg_timeout seconds, which by default is 5 seconds. This seems like
a reasonable consequence of using non-compliant software.

It should be noted that this is also a problem in the kernel itself
where nfsd does not pass O_SYNC on files opened with it and instead
relies on a open()/write()/close() to enforce synchronous behavior when
the flush is only guarenteed on last close.

Exporting any filesystem that does not implement AIO via NFS risks data
loss in the event of a kernel panic / system failure when something else
is also accessing the file. Exporting any file system that implements
AIO the way this patch does bears similar risk. However, it seems
reasonable to forgo crippling our AIO implementation in favor of
developing patches to fix this problem in Linux's nfsd for the reasons
stated earlier. In the interim, the risk will remain. Failing to
implement AIO will not change the problem that nfsd created, so there is
no reason for nfsd's mistake to block our implementation of AIO.

It also should be noted that `aio_cancel()` will always return
`AIO_NOTCANCELED` under this implementation. It is possible to implement
aio_cancel by deferring work to taskqs and use `kiocb_set_cancel_fn()`
to set a callback function for cancelling work sent to taskqs, but the
simpler approach is allowed by the specification:

```
Which operations are cancelable is implementation-defined.
```

http://pubs.opengroup.org/onlinepubs/009695399/functions/aio_cancel.html

The only programs on my system that are capable of using `aio_cancel()`
are QEMU, beecrypt and fio use it according to a recursive grep of my
system's `/usr/src/debug`. That suggests that `aio_cancel()` users are
rare. Implementing aio_cancel() is left to a future date when it is
clear that there are consumers that benefit from its implementation to
justify the work.

Lastly, it is important to know that handling of the iovec updates differs
between Illumos and Linux in the implementation of read/write. On Linux,
it is the VFS' responsibility whle on Illumos, it is the filesystem's
responsibility.  We take the intermediate solution of copying the iovec
so that the ZFS code can update it like on Solaris while leaving the
originals alone. This imposes some overhead. We could always revisit
this should profiling show that the allocations are a problem.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #223
Closes #2373
2014-09-05 15:11:43 -07:00
ilovezfs
1ca56e6033 Fragmentation should display as '-' if spacemap_histogram=disabled
When com.delphix:spacemap_histogram is disabled, the value of
fragmentation was printing as 18446744073709551615 (UINT64_MAX),
when it should print as '-'.

The issue was caused by a small mistake during the merge of
"4980 metaslabs should have a fragmentation metric."

upstream: https://github.com/illumos/illumos-gate/commit/2e4c998
ZoL: https://github.com/zfsonlinux/zfs/commit/f3a7f66

The problem is in zpool_get_prop_literal, where the handling of the
pool property ZPOOL_PROP_FRAGMENTATION was added to wrong the
section. In particular, ZPOOL_PROP_FRAGMENTATION should not be in
the section where zpool_get_state(zhp) == POOL_STATE_UNAVAIL, but
lower down after it's already been determined that the pool is in
fact available, which is where upstream illumos correctly has had
it.

Thanks to lundman for helping to track down this bug.

Signed-off-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2664
2014-09-05 09:19:01 -07:00
Alex Reece
f38dfec3fd Illumos 5049 - panic when removing log device
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Mattew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov@gmail.com>
Approved by: Rich Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/5049
  https://github.com/illumos/illumos-gate/commit/2986efa

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2636
2014-09-05 09:06:27 -07:00
Stanislav Seletskiy
2078f21015 Fix invalid locking order in rename operation
This commit should prevent a deadlock on dp_config_rwlock when
running `zfs rename` by ensuring zvol_rename_minors() is not
called under this lock.

Signed-off-by: Stanislav Seletskiy <s.seletskiy@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2652.
Closes #2525.
2014-09-04 09:50:46 -07:00
alteriks
4f6a14798d Import zfs pools after cryptsetup
The zfs-import-cache.service and zfs-import-scan.service should
should be started after cryptsetup to ensure all LUKS devices have
been opened.

Signed-off-by: alteriks <alteriks@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1474
2014-09-04 09:50:45 -07:00
Alexey Smirnoff
0dfc732416 Change the default 'zfs_dedup_prefetch' value to '0'
This gives a huge performance improvement in operations with deduped
datasets especially when the bottleneck is the amount of ram
available for zfs.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2639
2014-09-04 09:50:45 -07:00
Dan Swartzendruber
287be44f53 Improve handling of filesystem versions
Change mount code to diagnose filesystem versions that
are not supported by the current implementation.

Change upgrade code to do likewise and refuse to upgrade
a pool if any filesystems on it are a version which is
not supported by the current implementation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Dan Swartzendruber <dswartz@druber.com>
Closes: #2616
2014-09-03 09:17:14 -07:00
louwrentius
bcd9624d0f Change delimiter for ZED email scripts
When the ZED_EMAIL_INTERVAL_SECS="3600" option is set in zed.rc
configuration file then notification emails should be rate limited.

Rate limiting is accomplished by maintaining a colon delimited state
file which includes the device name.  Unfortunately there are valid
device names which include a colon and therefore prevent the rate
limiting for working properly.  For this reason the delimiter has
been changed to a semi-colon.

Signed-off-by: louwrentius <louwrentius@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Closes #2645
2014-09-02 14:18:54 -07:00
Ralf Ertzinger
76c3a61642 Change startup mode of ZED
Change the startup mode of ZED to non-forking. While systemd can
track processes that detach from the terminal just fine, running
processes in non-forking mode is the preferred mode of operation.

Also remove user/group definitions as root/root is the default.

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2252
2014-09-02 14:18:53 -07:00
Chris Dunlap
8125fb7190 Cleanup zed logging
This is a set of minor cleanup changes related to zed logging:
- Remove the program identity prefix from messages written to stderr
  since systemd already prepends this output with the program name.
- Replace the copy of the program identity string with a ptr reference.
- Replace "pid" with "PID" for consistency in comments & strings.
- Rename the zed_log.c struct _ctx component "level" to "priority".
- Add the LOG_PID option for messages written to syslog.

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2252
2014-09-02 14:18:53 -07:00
Chris Dunlap
5a8855b716 Fix race condition with zed pidfile creation
When the zed is started as a forking daemon (by default),
a race-condition exists where the parent process can terminate before
the pidfile has been created by the grandchild process.  When invoked
as a Type=forking systemd service, this can result in the following:

  systemd[1]: Starting ZFS Event Daemon (zed)...
  systemd[1]: PID file /var/run/zed.pid not readable (yet?) after start.

This commit adds a daemonize pipe to allow the grandchild process to
signal the parent process that initialization is complete (and the
pidfile has been created).  The parent process will wait for this
notification before exiting.

Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2252
2014-09-02 14:18:53 -07:00
Brian Behlendorf
6c3c3387b1 Add a missing > to AUTHORS
An email address in the AUTHORS file was missing its trailing >.
This patch fixes that typo.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-09-02 14:18:53 -07:00
Turbo Fredriksson
c3f8dc2a48 Add a pkgconfig file
Providing a pkg-config file makes is easy for 3rd party applications
to link against the libzfs libraries.  It also allows the libzfs
developers to modify the list of required libraries and cflags
without breaking existing applications.

The following example illustrates how pkg-config can be used:

cc `pkg-config --cflags --libs libzfs` -o myapp myapp.c

/*
 * myapp.c
 */
void main()
{
	libzfs_handle_t *hdl;

	hdl = libzfs_init();
	if (hdl)
		libzfs_fini(hdl);
}

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #585
2014-08-28 07:59:43 -07:00
Brian Behlendorf
9ad656b2d0 Retire HAVE_IOCTL_* configure checks
The HAVE_IOCTL_* configure checks were originally added for
compatibility with an ancient version of glibc.  This support
and additional complexity is no longer needed and is therefore
being removed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Closes #585
2014-08-28 07:45:54 -07:00
Matthew Ahrens
dea377c0d9 Illumos 4970-4974 - extreme rewind enhancements
4970 need controls on i/o issued by zpool import -XF
4971 zpool import -T should accept hex values
4972 zpool import -T implies extreme rewind, and thus a scrub
4973 spa_load_retry retries the same txg
4974 spa_load_verify() reads all data twice
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/4970
  https://www.illumos.org/issues/4971
  https://www.illumos.org/issues/4972
  https://www.illumos.org/issues/4973
  https://www.illumos.org/issues/4974
  https://github.com/illumos/illumos-gate/commit/e42d205

Notes:
    This set of patches adds a set of tunable parameters for the
    "extreme rewind" mode of pool import which allows control over
    the traversal performed during such an import.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2598
2014-08-26 16:29:57 -07:00
Matthew Ahrens
49ddb31506 Illumos 5034 - ARC's buf_hash_table is too small
5034 ARC's buf_hash_table is too small

Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>

References:
  https://www.illumos.org/issues/5034
  https://github.com/illumos/illumos-gate/commit/63e911b

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2615
2014-08-26 16:14:49 -07:00
Andrew Hamilton
d09a99f96b 2493 change efi_rescan() to wait longer
Change efi_rescan() to loop 10 times instead of 5 on EBUSY and
to sleep at the end of each loop. This helps with some instances
where the kernel does not reload the partition table fast enough
for ZFS to detect.

Signed-off-by: Andrew Hamilton <ahamilto@tjhsst.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2493
2014-08-26 16:12:37 -07:00
Isaac Huang
0426c16804 Fixed memory leaks in zevent handling
Some nvlist_t could be leaked in error handling paths.
Also make sure cb argument to zfs_zevent_post() cannnot
be NULL.

Signed-off-by: Isaac Huang <he.huang@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2158
2014-08-20 10:45:16 -07:00
Matthew Ahrens
bd089c5477 Illumos 4631 - zvol_get_stats triggering too many reads
4631 zvol_get_stats triggering too many reads

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4631
  https://github.com/illumos/illumos-gate/commit/bbfa8ea

Ported-by: Boris Protopopov <bprotopopov@hotmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2612
Closes #2480
2014-08-20 09:17:00 -07:00
Richard Yao
2fe5011008 Drive database update
The Intel DC S3500 and Intel DC S3700 are optimized to handle 4KB
sectors well despite of their 8KB page sizes, so we move them to a new
category for enterprise drives where they will receive ashift=12. They
are joined by the Intel 730 series, which uses the same disk controller,
as well as a San Disk enterprise drive. The drive IDs for these two were
obtained by myself with the drive_id utility. The drive ID for the 240GB
Intel 730 model was extrapolated from the drive ID for the 480GB model.

Lastly, we also add some Western Digital mobile drives.  ryuo in
\#zfsonlinux on freenode obtained "ATA     WDC WD2500BEVT-0" from
running drive_id on his own hardware. The additional drives in that
family were extrapolated from that identifer.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2601
2014-08-18 10:09:03 -07:00
Tim Chase
8b0a0840b4 Don't upgrade a metaslab when the pool is not writable
Illumos 4982 added code to metaslab_fragmentation() to proactively update
space maps when the spacemap_histogram feature is enabled.  This should
only happen when the pool is writeable.

References:
  https://www.illumos.org/issues/4982
  https://github.com/illumos/illumos-gate/commit/2e4c998

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2595
2014-08-18 08:47:19 -07:00
George Wilson
f3a7f6610f Illumos 4976-4984 - metaslab improvements
4976 zfs should only avoid writing to a failing non-redundant top-level vdev
4978 ztest fails in get_metaslab_refcount()
4979 extend free space histogram to device and pool
4980 metaslabs should have a fragmentation metric
4981 remove fragmented ops vector from block allocator
4982 space_map object should proactively upgrade when feature is enabled
4983 need to collect metaslab information via mdb
4984 device selection should use fragmentation metric
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/4976
  https://www.illumos.org/issues/4978
  https://www.illumos.org/issues/4979
  https://www.illumos.org/issues/4980
  https://www.illumos.org/issues/4981
  https://www.illumos.org/issues/4982
  https://www.illumos.org/issues/4983
  https://www.illumos.org/issues/4984
  https://github.com/illumos/illumos-gate/commit/2e4c998

Notes:
    The "zdb -M" option has been re-tasked to display the new metaslab
    fragmentation metric and the new "zdb -I" option is used to control
    the maximum number of in-flight I/Os.

    The new fragmentation metric is derived from the space map histogram
    which has been rolled up to the vdev and pool level and is presented
    to the user via "zpool list".

    Add a number of module parameters related to the new metaslab weighting
    logic.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2595
2014-08-18 08:40:49 -07:00
Turbo Fredriksson
f67d709080 Create an 'overlay' property
Add a new 'overlay' property (default 'off') that controls whether the
filesystem should be mounted even if the mountpoint is busy or if it
should fail with a 'mountpoint not empty'.

Doing overlay mounts is the default mount behavior on Linux, but not
in ZFS. It have been decided that following the ZFS behavior should
be the default, but this overlay allows for site administrator to
override this decision on a per-dataset basis.

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #2503
2014-08-15 13:39:19 -07:00
Richard Yao
194e56234a Include sys/taskq.h in linux/vfs_compat.h
We should have included sys/taskq.h directly because we use the taskq
code here, but we instead had files that included sys/taskq.h also
include sys/kmem.h, which happened to include sys/taskq.h. sys/kmem.h no
longer does this, so we must define the include as we should
have done in the first place.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2411
2014-08-14 12:38:17 -07:00
Evan Susarret
dc55d79101 Correct autodetection of bootfs property
Remove lines that contain only a hyphen (match '^-$' instead of '-').

I had a root fs with a hyphen in the name (fedora/ROOT/Fedora20-Dev),
it was not detected because sed eliminated that line of output from
'zpool list -Ho bootfs'.

Signed-off-by: Evan Susarret <evansus@gmail.com>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2196
2014-08-13 16:53:06 -07:00
Alec Salazar
449e75d768 Avoid PAGESIZE redefinition
Add #ifndef PAGESIZE to avoid redefinition warning on platforms
where this value is already provided.

Signed-off-by: Alec Salazar <alec.j.salazar@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2588
2014-08-13 10:36:09 -07:00
Alec Salazar
22a11a5b5a Replace __va_list with va_list
Most of the code base already uses va_list, which is specified by
iso-c. gcc/glibc provides 'typedef __gnuc_va_list va_list'. and
when not using gcc/glibc we can't expect to find __gnuc_va_list.

Signed-off-by: Alec Salazar <alec.j.salazar@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2588
2014-08-13 10:35:00 -07:00
Brian Behlendorf
0d5c500d6c Revert "Revert "Revert "Fix unlink/xattr deadlock"""
This reverts commit 7973e46 which brings the basic flow of the
code back in line with the other ZFS implementations.  This
was possible due to the following related changes.

e89260a Directory xattr znodes hold a reference on their parent
6f9548c Fix deadlock in zfs_zget()
0a50679 Add zfs_iput_async() interface
4dd1893 Avoid 128K kmem allocations in mzap_upgrade()

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #457
Closes #2058
Closes #2128
Closes #2240
2014-08-11 16:12:36 -07:00
Brian Behlendorf
0a50679ce9 Add zfs_iput_async() interface
Handle all iputs in zfs_purgedir() and zfs_inode_destroy()
asynchronously to prevent deadlocks.  When the iputs are allowed
to run synchronously in the destroy call path deadlocks between
xattr directory inodes and their parent file inodes are possible.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #457
2014-08-11 16:11:43 -07:00
Brian Behlendorf
4dd18932ba Avoid 128K kmem allocations in mzap_upgrade()
As originally implemented the mzap_upgrade() function will
perform up to SPA_MAXBLOCKSIZE allocations using kmem_alloc().
These large allocations can potentially block indefinitely
if contiguous memory is not available.  Since this allocation
is done under the zap->zap_rwlock it can appear as if there is
a deadlock in zap_lockdir().  This is shown below.

The optimal fix for this would be to rework mzap_upgrade()
such that no large allocations are required.  This could be
done but it would result in us diverging further from the other
implementations.  Therefore I've opted against doing this
unless it becomes absolutely necessary.

Instead mzap_upgrade() has been updated to use zio_buf_alloc()
which can reliably provide buffers of up to SPA_MAXBLOCKSIZE.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Close #2580
2014-08-11 16:10:32 -07:00
Brian Behlendorf
50b25b2187 Avoid dynamic allocation of 'search zio'
As part of commit e8b96c6 the search zio used by the
vdev_queue_io_to_issue() function was moved to the heap
to minimize stack usage.  Functionally this is fine, but
to maximize performance it's best to minimize the number
of dynamic allocations.

To avoid this allocation temporary space for the search
zio has been reserved in the vdev_queue structure.  All
access must be serialized through the vq_lock.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #2572
2014-08-11 08:44:54 -07:00
Brian Behlendorf
ab6f407faa Use KM_PUSHPAGE in dsl_dataset_rollback_check()
The dsl_dataset_rollback_check() function is executed in the
txg_sync context.  To prevent a potential deadlock due to direct
memory reclaim it must use KM_PUSHPAGE.  This was introduced by
the recent 'zfs bookmark' features, commit da53684.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Eric Dillmann <eric@jave.fr>
Closes #2569
2014-08-06 16:09:28 -07:00
Turbo Fredriksson
1ffe90c5d3 Add bash completions by Aneurin Price.
These can be manually installed as needed by end users.  They
have been added to the repository so they can be kept up to date
with the latest code.

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1588
2014-08-06 15:03:28 -07:00
Matthew Ahrens
5dbd68a352 Illumos 4914 - zfs on-disk bookmark structure should be named *_phys_t
4914 zfs on-disk bookmark structure should be named *_phys_t

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/4914
  https://github.com/illumos/illumos-gate/commit/7802d7b

Porting notes:

There were a number of zfsonlinux-specific uses of zbookmark_t which
needed to be updated.  This should reduce the likelihood of further
problems like issue #2094 from occurring.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2558
2014-08-06 14:48:41 -07:00
Matthew Ahrens
1fa8f795d5 Illumos 4881 - zfs send performance regression with embedded data
4881 zfs send performance degradation when embedded block pointers
     are encountered

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4881
  https://github.com/illumos/illumos-gate/commit/06315b7

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2547
2014-08-06 14:44:10 -07:00
Saso Kiselkov
3bec585e6c Illumos 4897 - Space accounting mismatch in L2ARC/zpool
4897 Space accounting mismatch in L2ARC/zpool

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>

From the illumos issue tracker:

	L2ARC vdev space usage statistics are calculated as the delta
	between the maximum and minimum vdev offset ever written to
	by the L2ARC fill thread, but do not inform the user of how
	much space in between these two offsets is actually taken up by
	cached buffers. This fix changes that so that vdev space usage
	stats on L2ARC devices accurately track the volume of buffers
	stored on them, allowing users to see the exact L2ARC usage in
	"zpool iostat -v".

References:
  https://www.illumos.org/issues/4897
  https://github.com/illumos/illumos-gate/commit/3038a2b

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2555
2014-08-06 13:44:10 -07:00
Matthew Ahrens
fbeddd60b7 Illumos 4390 - I/O errors can corrupt space map when deleting fs/vol
4390 i/o errors when deleting filesystem/zvol can lead to space map corruption
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4390
  https://github.com/illumos/illumos-gate/commit/7fd05ac

Porting notes:

Previous stack-reduction efforts in traverse_visitb() caused a fair
number of un-mergable pieces of code.  This patch should reduce its
stack footprint a bit more.

The new local bptree_entry_phys_t in bptree_add() is dynamically-allocated
using kmem_zalloc() for the purpose of stack reduction.

The new global zfs_free_leak_on_eio has been defined as an integer
rather than a boolean_t as was the case with the related zfs_recover
global.  Also, zfs_free_leak_on_eio's definition has been inserted into
zfs_debug.c for consistency with the existing definition of zfs_recover.
Illumos placed it in spa_misc.c.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2545
2014-08-04 11:50:52 -07:00
Matthew Ahrens
9b67f60560 Illumos 4757, 4913
4757 ZFS embedded-data block pointers ("zero block compression")
4913 zfs release should not be subject to space checks

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/4757
  https://www.illumos.org/issues/4913
  https://github.com/illumos/illumos-gate/commit/5d7b4d4

Porting notes:

For compatibility with the fastpath code the zio_done() function
needed to be updated.  Because embedded-data block pointers do
not require DVAs to be allocated the associated vdevs will not
be marked and therefore should not be unmarked.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2544
2014-08-01 14:28:05 -07:00
Matthew Ahrens
faf0f58c69 Illumos 3835 zfs need not store 2 copies of all metadata
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

Description from Matt Ahrens's bug report at Delphix:

    Add a new zfs property, "redundant_metadata" which can have values
    "all" or "most".  The default will be "all", which is the current
    behavior.  Setting to "most" will cause us to only store 1 copy of
    level-1 indirect blocks of user data files.

Additional notes:

    The new man page section for this property states

        "The exact behavior of which metadata blocks
         are stored redundantly may change in future releases."

    and:

        "When set to most, ZFS stores an extra copy of most types of
         metadata. This can improve performance of random writes,
         because less metadata must be written."

    The current implementation is as described above in Matt's blog.
    It is controlled by a new global integer
    "zfs_redundant_metadata_most_ditto_level", currently initialized
    to 2. When "redundant_metadata" is set to "most", only indirect
    blocks of the specified level and higher will have additional ditto
    blocks created.

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2542
2014-07-31 09:49:34 -07:00
Tim Chase
603cb25ca5 zed needs libzfs_core
As of a recent group of Illumos/Delphix updates, zed needs libzfs_core
in order to resolve lzc_get_bookmarks() and likely other functions
going forward.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2534
2014-07-31 09:49:01 -07:00