6214 zpools going south
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
References:
https://www.illumos.org/issues/6214http://cr.illumos.org/~webrev/sensille/6214_zpools_going_south/
Porting Notes:
Reintroduce b_compress to the l2arc_buf_hdr_t. In commit b9541d6
the compression flags were moved to the generic b_flags in the
arc_buf_hdr_t. This is a problem because l2arc_compress_buf()
may manipulate the compression flags and this can only be done
safely under the hash lock which is not held. See Illumos 6214
for a detailed analysis of the race.
HDR_GET_COMPRESS() macro was removed from arc_buf_info().
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3757
When adding a zvol to the system prefetch zvol_prefetch_bytes from the
start and end of the volume. Prefetching these regions of the volume is
desirable because they are likely to be accessed immediately by blkid(8),
the kernel scanning for a partition table, or another task which probes
the devices.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3659
zfsonlinux/zfs@e20cd6f7a8 caused us to
lose IO accounting on zvols. When I originally wrote that last year, the
symbols we needed to maintain IO accounting were GPL exported, but
torvalds/linux@394ffa503b provided
suitable symbols for restoring this functionality 4 months later. We
can call them to restore the IO accounting on Linux 3.19 and later as
well as any older kernels where that patch is backported.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3741
Internally ZFS keeps a small log to facilitate debugging. By default
the log is disabled, to enable it set zfs_dbgmsg_enable=1. The contents
of the log can be accessed by reading the /proc/spl/kstat/zfs/dbgmsg file.
Writing 0 to this proc file clears the log.
$ echo 1 >/sys/module/zfs/parameters/zfs_dbgmsg_enable
$ echo 0 >/proc/spl/kstat/zfs/dbgmsg
$ zpool import tank
$ cat /proc/spl/kstat/zfs/dbgmsg
1 0 0x01 -1 0 2492357525542 2525836565501
timestamp message
1441141408 spa=tank async request task=1
1441141408 txg 70 open pool version 5000; software version 5000/5; ...
1441141409 spa=tank async request task=32
1441141409 txg 72 import pool version 5000; software version 5000/5; ...
1441141414 command: lt-zpool import tank
Note the zfs_dbgmsg() and dprintf() functions are both now mapped to
the same log. As mentioned above the kernel debug log can be accessed
though the /proc/spl/kstat/zfs/dbgmsg kstat. For user space consumers
log messages are immediately written to stdout after applying the
ZFS_DEBUG environment variable.
$ ZFS_DEBUG=on ./cmd/ztest/ztest -V
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#3728
This patch is based on the previous work done by @andrey-ve and
@yshui. It triggers the automount by using kern_path() to traverse
to the known snapshout mount point. Once the snapshot is mounted
NFS can access the contents of the snapshot.
Allowing NFS clients to access to the .zfs/snapshot directory would
normally mean that a root user on a client mounting an export with
'no_root_squash' would be able to use mkdir/rmdir/mv to manipulate
snapshots on the server. To prevent configuration mistakes a
zfs_admin_snapshot module option was added which disables the
mkdir/rmdir/mv functionally. System administators desiring this
functionally must explicitly enable it.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2797Closes#1655Closes#616
Prevents NFS client from detection of different fileids of snapshot root dentry
before & after snapshot mount.
Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Linux 2.6.36 introduced REQ_SECURE to indicate when discards *must* be
processed, such that we cannot do optimizations like block alignment.
Consequently, the discard semantics prior to 2.6.36 require us to always
process unaligned discards. Previously, we would do this optimization
regardless. This patch changes things to correctly restrict this
optimization to situations where REQ_SECURE exists, but is not included
in the flags.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Internally, zvols are files exposed through the block device API. This
is intended to reduce overhead when things require block devices.
However, the ZoL zvol code emulates a traditional block device in that
it has a top half and a bottom half. This is an unnecessary source of
overhead that does not exist on any other OpenZFS platform does this.
This patch removes it. Early users of this patch reported double digit
performance gains in IOPS on zvols in the range of 50% to 80%.
Comments in the code suggest that the current implementation was done to
obtain IO merging from Linux's IO elevator. However, the DMU already
does write merging while arc_read() should implicitly merge read IOs
because only 1 thread is permitted to fetch the buffer into ARC. In
addition, commercial ZFSOnLinux distributions report that regular files
are more performant than zvols under the current implementation, and the
main consumers of zvols are VMs and iSCSI targets, which have their own
elevators to merge IOs.
Some minor refactoring allows us to register zfs_request() as our
->make_request() handler in place of the generic_make_request()
function. This eliminates the layer of code that broke IO requests on
zvols into a top half and a bottom half. This has several benefits:
1. No per zvol spinlocks.
2. No redundant IO elevator processing.
3. Interrupts are disabled only when actually necessary.
4. No redispatching of IOs when all taskq threads are busy.
5. Linux's page out routines will properly block.
6. Many autotools checks become obsolete.
An unfortunate consequence of eliminating the layer that
generic_make_request() is that we no longer calls the instrumentation
hooks for block IO accounting. Those hooks are GPL-exported, so we
cannot call them ourselves and consequently, we lose the ability to do
IO monitoring via iostat. Since zvols are internally files mapped as
block devices, this should be okay. Anyone who is willing to accept the
performance penalty for the block IO layer's accounting could use the
loop device in between the zvol and its consumer. Alternatively, perf
and ftrace likely could be used. Also, tools like latencytop will still
work. Tools such as latencytop sometimes provide a better view of
performance bottlenecks than the traditional block IO accounting tools
do.
Lastly, if direct reclaim occurs during spacemap loading and swap is on
a zvol, this code will deadlock. That deadlock could already occur with
sync=always on zvols. Given that swap on zvols is not yet production
ready, this is not a blocker.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Reclaim in the traverse prefetch thread, which is run on the system
taskq, can overrun the stack.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#3733
Add the required kernel side infrastructure to parse arbitrary
mount options. This enables us to support temporary mount
options in largely the same way it is handled on other platforms.
See the 'Temporary Mount Point Properties' section of zfs(8)
for complete details.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#985Closes#3351
Commit 49ddb31506 added the
zfs_arc_average_blocksize parameter to allow control over the size of
the arc hash table. The dbuf hash table's size should be determined
similarly.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3721
Allow for easy turning of a pools reserved free space. Previous
versions of ZFS (v0.6.4 and earlier) held 1/64 of the pools capacity
in reserve. Commits 3d45fdd and 0c60cc3 increased this to 1/32.
Setting spa_slop_shift=6 will restore the previous default setting.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3724
The LBA weighting makes sense on rotational media where the outer tracks
have twice the bandwidth of the inner tracks. However, it is detrimental
on nonrotational media such as solid state disks, where the only effect
is to ensure that metaslabs enter the best-fit allocation behavior
sooner, which is detrimental to performance. It also makes no sense on
files where the underlying filesystem can arrange things however it
wants.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3712
Before read locking z_teardown_inactive_lock, we need to check if we have
already had write lock on it. Otherwise, we would deadlock on ourself when
doing rollback:
zfs_ioc_rollback
->zfs_suspend_fs (z_teardown_inactive_lock, RW_WRITER)
->zfs_resume_fs->zfs_rezget->zfs_iput_async->iput-> ...
->zfs_inactive (z_teardown_inactive_lock, RW_READER)
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2869
The misc_deregister() function was changed to a void return type.
Rather than add compatibility code to detect this change simply
ignore the return code on all kernels. It was only used to log
an informational error message of no real value.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Re-factor the .zfs/snapshot auto-mouting code to take in to account
changes made to the upstream kernels. And to lay the groundwork for
enabling access to .zfs snapshots via NFS clients. This patch makes
the following core improvements.
* All actively auto-mounted snapshots are now tracked in two global
trees which are indexed by snapshot name and objset id respectively.
This allows for fast lookups of any auto-mounted snapshot regardless
without needing access to the parent dataset.
* Snapshot entries are added to the tree in zfsctl_snapshot_mount().
However, they are now removed from the tree in the context of the
unmount process. This eliminates the need complicated error logic
in zfsctl_snapshot_unmount() to handle unmount failures.
* References are now taken on the snapshot entries in the tree to
ensure they always remain valid while a task is outstanding.
* The MNT_SHRINKABLE flag is set on the snapshot vfsmount_t right
after the auto-mount succeeds. This allows to kernel to unmount
idle auto-mounted snapshots if needed removing the need for the
zfsctl_unmount_snapshots() function.
* Snapshots in active use will not be automatically unmounted. As
long as at least one dentry is revalidated every zfs_expire_snapshot/2
seconds the auto-unmount expiration timer will be extended.
* Commit torvalds/linux@bafc9b7 caused snapshots auto-mounted by ZFS
to be immediately unmounted when the dentry was revalidated. This
was a consequence of ZFS invaliding all snapdir dentries to ensure that
negative dentries didn't mask new snapshots. This patch modifies the
behavior such that only negative dentries are invalidated. This solves
the issue and may result in a performance improvement.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3589Closes#3344Closes#3295Closes#3257Closes#3243Closes#3030Closes#2841
There's no metadata to write to disk for ctldir inodes. So we check if
a inode belongs to the ctldir in zpl_commit_metadata, and returns
immediately if it is.
Signed-off-by: Andrey Vesnovaty <andrey.vesnovaty@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #2797
zvols should not be an entropy source for the kernel. Disable it to be
consistent with the upstream kernel.
torvalds/linux@b277da0a8a
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3713
Add a missing space to the zfs_vdev_sync_write_min_active module
parameter description.
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3714
As part of the stack reduction effort in
50b25b2187, a zio_t containing a taskq_ent
was added to struct vdev_queue which itself is part of struct vdev.
The taskq entry should be initialized as is currently done in zio_create()
for newly-created bare zio_t object. The rationale is the same as is
described in f467b05a26.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3709
If the ZAP object containing a snapshot map is corrupted due to an
unrecoverable checksum error or otherwise, dsl_dataset_name() will
normally panic the system due to its VERIFY.
This patch attempts to allow a recovery avenue from such situations by
manufacturing a descriptive snapshot name and then ignoring the error.
Scrubbing a pool with this type of corruption will then show the affected
object in the error list rather than panicking.
The recovery code is only enabled when the zfs_recover module parameter
is set.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3705
Since ZoL allows large blocks to be used by volumes, unlike upstream
illumos, the feature flag must be checked prior to volume creation.
This is critical because unlike filesystems, volumes will create a
object which uses large blocks as part of the create. Therefore, it
cannot be safely checked in zfs_check_settable() after the dataset
can been created.
In addition this patch updates the relevant error messages to use
zfs_nicenum() to print the maximum blocksize.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3591
When support for large blocks was added DMU_MAX_ACCESS was increased
to allow for blocks of up to 16M to fit in a transaction handle.
This had the side effect of increasing the max_hw_sectors_kb for
volumes, which are scaled off DMU_MAX_ACCESS, to 64M from 10M.
This is an issue for volumes which by default use an 8K block size
because it results in dmu_buf_hold_array_by_dnode() allocating a
64K array for the dbufs. The solution is to restore the maximum
size to ~10M. This patch specifically changes it to 16M which is
close enough.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3684
Starting from Linux 4.1 allows iov_iter with bio_vec to be passed into
iter_read/iter_write. Notably, the loop device will pass bio_vec to backend
filesystem. However, current ZFS code assumes iovec without any check, so it
will always crash when using loop device.
With the restructured uio_t, we can safely pass bio_vec in uio_t with UIO_BVEC
set. The uio* functions are modified to handle bio_vec case separately.
The const uio_iov causes some warning in xuio related stuff, so explicit
convert them to non const.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3511Closes#3640
The spa_config_write() function relies on the classic method of
making sure updates to the /etc/zfs/zpool.cache file are atomic.
It writes out a temporary version of the file and then uses
vn_rename() to switch it in to place. This way there can never
exist a partial version of the file, it's all or nothing.
Conceptually this is a good strategy and it makes good sense
for platforms where it's easy to do a rename within the kernel.
Unfortunately, Linux is not one of those platforms. Even doing
basic I/O to a file system from within the kernel is strongly
discouraged. In order to support this at all the vn_rename()
implementation ends up being complex and fragile. So fragile
that recent Linux 4.2 changes have broken it.
While it is possible to update vn_rename() to work with the
latest kernels a better long term strategy is to stop using
vn_rename() entirely. Then all this complex, fragile code can
be removed. Achieving this is straight forward because
config_write() is the only consumer of vn_rename().
This patch reworks spa_config_write() to update the cache file
in place. The file will be truncated, written out, and then
synced to disk. If an error is encountered the file will be
unlinked leaving the system in a consistent state.
This does expose a tiny tiny tiny window where a system could
crash at exactly the wrong moment could leave a partially written
cache file. However, this is highly unlikely because the cache
file is 1) infrequently updated, 2) only a few kilobytes in size,
and 3) written with a single vn_rdwr() call.
If this were to somehow happen it poses no risk to pool. Simply
removing the cache file will allow the pool to be imported cleanly.
Going forward this will be even less of an issue as we intend to
disable the use of a cache file by default.
Bottom line not using vn_rename() allows us to make ZoL more
robust against upstream kernel changes.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3653
Failing to lookup a name in the spa_ddt_stat_object should not result
in a panic in ddt_object_load(). The error can be safely returned to
the caller for handling resulting in a useful user error message.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3370
Without the parenthesis, this particular ASSERT will evaluate to
"(RW_READER == (!zap->zap_ismicro && fatreader)) ? RW_READER : lti"
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3685
This brings the behavior of arc_memory_throttle() back in sync with
illumos. The updated memory throttling policy roughly goes like this:
* Never throttle if more than 10% of memory is free. This threshold
is configurable with the zfs_arc_lotsfree_percent module option.
* Minimize any throttling of kswapd even when free memory is below
the set threshold. Allow it to write out pages as quickly as
possible to help alleviate the memory pressure.
* Delay all other threads when free memory is below the set threshold
in order to avoid compounding the memory pressure. Buffers will be
evicted from the ARC to reduce the issue.
The Linux specific zfs_arc_memory_throttle_disable module option has
been removed in favor of the existing zfs_arc_lotsfree_percent tuning.
Setting zfs_arc_lotsfree_percent=0 will have the same effect as
zfs_arc_memory_throttle_disable and it was therefore redundant.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3637
While Linux doesn't provide detailed information about the state of
the VM it does provide us total free pages. This information should
be incorporated in to the arc_available_memory() calculation rather
than solely relying on a signal from direct reclaim. Conceptually
this brings arc_available_memory() back in sync with illumos.
It is also desirable that the target amount of free memory be tunable
on a system. While the default values are expected to work well
for most workloads there may be cases where custom values are needed.
The zfs_arc_sys_free module option was added for this purpose.
zfs_arc_sys_free - The target number of bytes the ARC should leave
as free memory on the system. This value can
checked in /proc/spl/kstat/zfs/arcstats and
setting this module option will override the
default value.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3637
The zvol_threads module option should be bounded to a reasonable
range. The taskq must have at least 1 thread and shouldn't have
more than 1,024 at most. The default value of 32 is a reasonable
default.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3614
Commit d958324 fixed the deadlock between page lock and range lock by
unlocking the page lock before acquiring the range lock. However,
this created a new issue #3075.
The problem is that if we can't set the write back bit before releasing
the page lock. Then other processes will be unaware that the page is
under active write back. They may therefore truncate the page,
invalidate the page, or not honor the sync semantics.
To workaround this problem we re-dirty the page before dropping the
page lock. While this doesn't prevent the page from being truncated
it does ensure it won't be invalidated. Then the range lock and the
page lock are reacquired in the correct deadlock-free order.
Once both locks are safely held the page state can be rechecked. If
all is well and the page is in the expect state the dirty bit can be
removed, the write back bit set, and the page removed from the skip
count. If not the page will be handled as appropriate.
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3075
Under Linux filesystem threads responsible for handling I/O are
normally created with the maximum priority. Non-I/O filesystem
processes run with the default priority. ZFS should adopt the
same priority scheme under Linux to maintain good performance
and so that it will complete fairly when other Linux filesystems
are active. The priorities have been updated to the following:
$ ps -eLo rtprio,cls,pid,pri,nice,cmd | egrep 'z_|spl_|zvol|arc|dbu|meta'
- TS 10743 19 -20 [spl_kmem_cache]
- TS 10744 19 -20 [spl_system_task]
- TS 10745 19 -20 [spl_dynamic_tas]
- TS 10764 19 0 [dbu_evict]
- TS 10765 19 0 [arc_prune]
- TS 10766 19 0 [arc_reclaim]
- TS 10767 19 0 [arc_user_evicts]
- TS 10768 19 0 [l2arc_feed]
- TS 10769 39 0 [z_unmount]
- TS 10770 39 -20 [zvol]
- TS 11011 39 -20 [z_null_iss]
- TS 11012 39 -20 [z_null_int]
- TS 11013 39 -20 [z_rd_iss]
- TS 11014 39 -20 [z_rd_int_0]
- TS 11022 38 -19 [z_wr_iss]
- TS 11023 39 -20 [z_wr_iss_h]
- TS 11024 39 -20 [z_wr_int_0]
- TS 11032 39 -20 [z_wr_int_h]
- TS 11033 39 -20 [z_fr_iss_0]
- TS 11041 39 -20 [z_fr_int]
- TS 11042 39 -20 [z_cl_iss]
- TS 11043 39 -20 [z_cl_int]
- TS 11044 39 -20 [z_ioctl_iss]
- TS 11045 39 -20 [z_ioctl_int]
- TS 11046 39 -20 [metaslab_group_]
- TS 11050 19 0 [z_iput]
- TS 11121 38 -19 [z_wr_iss]
Note that under Linux the meaning of a processes priority is inverted
with respect to illumos. High values on Linux indicate a _low_ priority
while high value on illumos indicate a _high_ priority.
In order to preserve the logical meaning of the minclsyspri and
maxclsyspri macros when they are used by the illumos wrapper functions
their values have been inverted. This way when changes are merged
from upstream illumos we won't need to remember to invert the macro.
It could also lead to confusion.
This patch depends on https://github.com/zfsonlinux/spl/pull/466.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#3607
A NULL should never be passed as the dnode_t pointer to the function
dmu_free_long_range_impl(). Regardless, because we have a reported
occurrence of this let's add some error handling to catch this.
Better to report a reasonable error to caller than panic the system.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3445
Address minor differences in style between upstream and ZoL. This
patch contains no functional differences and is solely designed to
minimize the delta from upstream.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3533
Commit d962d5d didn't quite properly resolve the HDR_L2ONLY_SIZE
accounting. Accounting is now performed only in the constructor
and destructor which is a nice simplification. It should have
been removed the from create and destroy functions. This brings
up back in sync with upstream.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3533
Originally removed because it wasn't required under Linux. However,
there may still be some utility in signaling the arc reclaim thread
under Linux via reclaim. This should already have happened by other
means but it's not harmless and reduces another point of divergence
with upstream.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3533
Commit f521ce1 removed the minimum value for "arc_p" allowing it to
drop to zero or grow to "arc_c". This was done to improve specific
workload which constantly dirties new "metadata" but also frequently
touches a "small" amount of mfu data (e.g. mkdir's).
This change may still be desirable but it needs to be re-investigated.
in the context of the recent ARC changes from upstream. Therefore
this code is being restored to facilitate benchmarking. By setting
"zfs_arc_p_min_shift=64" we easily compare the performance.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3533
5817 change type of arcs_size from uint64_t to refcount_t
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Garrett D'Amore <garrett@damore.org>
References:
https://www.illumos.org/issues/5817https://github.com/illumos/illumos-gate/commit/2fd872a
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3533
5445 Add more visibility via arcstats; specifically arc_state_t
stats and differentiate between "data" and "metadata"
Reviewed by: Basil Crow <basil.crow@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Bayard Bell <bayard.bell@nexenta.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://www.illumos.org/issues/5445https://github.com/illumos/illumos-gate/commit/4076b1b
Porting Notes:
This patch is an improved version of cc7f677 which was previously
merged in ZoL. This patch incorporates the additional improvements
which were made upstream.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3533
5376 arc_kmem_reap_now() should not result in clearing arc_no_grow
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Dan McDonald <danmcd@omniti.com>
References:
https://www.illumos.org/issues/5376https://github.com/illumos/illumos-gate/commit/2ec99e3
Porting Notes:
The good news is that many of the recent changes made upstream to the
ARC tackled issues previously observed by ZoL with similar solutions.
The bad news is those solution weren't identical to the ones we applied.
This patch is designed to split the difference and apply as much of the
upstream work as possible.
* The arc_available_memory() function was removed previous in ZoL but
due to the upstream changes it makes sense to add it back. This function
has been customized for Linux so that it can be used to determine a low
memory. This provides the same basic functionality as the illumos version
allowing us to minimize changes through the rest of the code base. The
exact mechanism used to detect a low memory state remains unchanged so
this change isn't a significant as it might first appear.
* This patch includes the long standing fix for arc_shrink() which was
originally proposed in #2167. Since there were related changes to this
function it made sense to include that work.
* The arc_init() function has been re-factored. As before it sets sane
default values for the ARC but then calls arc_tuning_update() to apply
user specific tuning made via module options. The arc_tuning_update()
function is then called periodically by the arc_reclaim_thread() to
apply changes to the tunings made during normal operation.
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3616Closes#2167
When an inode is detected with invalid mode bits the safe thing to
do is panic the system. This indicates a problem with the contents
of a dnode and it should never be possible. This is the default
behavior.
Unfortunately, due to flaws in the system attribute (SA) implementation
(on all platforms) it was possible that ZFS could create a damaged dnode.
This was a rare issue which only impacted dnodes which used a spill
block. Normally only symlinks and files with ACLs would require a
spill block. However, if the dataset had the xattr=sa property set
and extended attributes were used this problem could occur.
As of the 0.6.4 tag the root cause of this issue has been fixed. For
pools which are exhibiting this damage the 'zfs_recover=1' module option
may be set. This will cause ZFS to interpret the dnode with invalid
mode bits as a normal file. This may allow the files to be accessed
for recovery purposes.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3548
Build products from an out of tree build should be written
relative to the build directory. Sources should be referred
to by their locations in the source directory.
This is accomplished by adding the 'src' and 'obj' variables
for the module Makefile.am, using relative paths to reference
source files, and by setting VPATH when source files are not
co-located with the Makefile. This enables the following:
$ mkdir build
$ cd build
$ ../configure \
--with-spl=$HOME/src/git/spl/ \
--with-spl-obj=$HOME/src/git/spl/build
$ make -s
This change also has the advantage of resolving the following
warning which is generated by modern versions of automake.
Makefile.am:00: warning: source file 'xxx' is in a subdirectory,
Makefile.am:00: but option 'subdir-objects' is disabled
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1082
After a successful write the inode must be updated under the range
lock. If it is updated after dropping the lock there exists a race
where the znode and inode wile disagree about the file size. This
could result in narrow window of time where read(2) is able to access
data beyond what fstat(2) reports as the file size.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#3601
As of Linux 4.2 the kernel has completely retired the nameidata
structure. One of the few remaining consumers of this interface
were the follow_link() and put_link() callbacks.
This patch adds the required checks to configure to detect the
interface change and updates the functions accordingly. Migrating
to the simple_follow_link() interface was considered but was decided
against ironically due to the increased complexity.
It also should be noted that the kernel follow_link() and put_link()
interfaces changes several times after 4.1 and but before 4.2. This
means there is a narrow range of kernel commits which never appear
in an official tag of the Linux kernel which ZoL will not build.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #3596
Linux 4.2 commit torvalds/linux@dac5621 renamed bio->bi_cnt to
bio->__bi_cnt. Because this value is only used once in a block of
debug code it simplest just to remove the PANIC. To my knowledge
this debugging has never been hit or proved useful so this is no
great loss.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes#3596
Many key internal functions pass system return codes that are safe to
return to userland. In the case of ddi_copyin(9F), an error passes -1
and the documentation states very clearly that drivers should pass
EFAULT to userland when this happens.
http://illumos.org/man/9F/ddi_copyin
This does not happen in the ZFS source code. I believe it should be
changed to pass EFAULT. I caught this when writing man pages for the
libzfs_core API.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3575
Translate zio requests with ZIO_PRIORITY_SYNC_READ and
ZIO_PRIORITY_SYNC_WRITE into synchronous bio requests by setting
READ_SYNC and WRITE_SYNC flags. Specifically, WRITE_SYNC flag turns
out to have a pronounced effect when writing to an SSD-based SLOG.
When WRITE_SYNC is not set (WRITE is set instead), the block trace
for a SLOG device looks as follows:
...
130,96 0 3 0.008968390 0 C W 830464 + 136 [0]
130,96 0 4 0.011999161 0 C W 830720 + 136 [0]
130,96 0 5 0.023955549 0 C W 831744 + 136 [0]
130,96 0 6 0.024337663 19775 A W 832000 + 136 <- (130,97) 829952
130,96 0 7 0.024338823 19775 Q W 832000 + 136 [z_wr_iss/6]
130,96 0 8 0.024340523 19775 G W 832000 + 136 [z_wr_iss/6]
130,96 0 9 0.024343187 19775 P N [z_wr_iss/6]
130,96 0 10 0.024344120 19775 I W 832000 + 136 [z_wr_iss/6]
130,96 0 11 0.026784405 0 UT N [swapper] 1
130,96 0 12 0.026805339 202 U N [kblockd/0] 1
130,96 0 13 0.026807199 202 D W 832000 + 136 [kblockd/0]
130,96 0 14 0.026966948 0 C W 832000 + 136 [0]
130,96 3 1 0.000449358 19788 A W 829952 + 136 <- (130,97) 827904
130,96 3 2 0.000450951 19788 Q W 829952 + 136 [z_wr_iss/19]
130,96 3 3 0.000453212 19788 G W 829952 + 136 [z_wr_iss/19]
130,96 3 4 0.000455956 19788 P N [z_wr_iss/19]
130,96 3 5 0.000457076 19788 I W 829952 + 136 [z_wr_iss/19]
130,96 3 6 0.002786349 0 UT N [swapper] 1
...
Here the 130,197 is the partition created on the log device when adding it
to the pool, whereas the base device is 130,96. As one can see, the writes
to the SLOG are not marked synchronous (the S is missing next to W), and
the queue unplugs occur based on the timer (UT event) resulting in slightly
over 2 msec latency of writes. This results in a sub-par performance of
single stream synchronous writes (limited by latency of the SLOG).
When the WRITE_SYNC is set, a similar trace looks as follows:
...
130,96 4 1 0.000000000 70714 A WS 4280576 + 136 <- (130,97) 4278528
130,96 4 2 0.000000832 70714 Q WS 4280576 + 136 [(null)]
130,96 4 3 0.000002109 70714 G WS 4280576 + 136 [(null)]
130,96 4 4 0.000003394 70714 P N [(null)]
130,96 4 5 0.000003846 70714 I WS 4280576 + 136 [(null)]
130,96 4 6 0.000004854 70714 D WS 4280576 + 136 [(null)]
130,96 5 1 0.000354487 70713 A WS 4280832 + 136 <- (130,97) 4278784
130,96 5 2 0.000355072 70713 Q WS 4280832 + 136 [(null)]
130,96 5 3 0.000356383 70713 G WS 4280832 + 136 [(null)]
130,96 5 4 0.000357635 70713 P N [(null)]
130,96 5 5 0.000358088 70713 I WS 4280832 + 136 [(null)]
130,96 5 6 0.000359191 70713 D WS 4280832 + 136 [(null)]
130,96 0 76 0.000159539 0 C WS 4280576 + 136 [0]
130,96 16 85 0.000742108 70718 A WS 4281088 + 136 <- (130,97) 4279040
130,96 16 86 0.000743197 70718 Q WS 4281088 + 136 [z_wr_iss/15]
130,96 16 87 0.000744450 70718 G WS 4281088 + 136 [z_wr_iss/15]
130,96 16 88 0.000745817 70718 P N [z_wr_iss/15]
130,96 16 89 0.000746705 70718 I WS 4281088 + 136 [z_wr_iss/15]
130,96 16 90 0.000747848 70718 D WS 4281088 + 136 [z_wr_iss/15]
130,96 0 77 0.000604063 0 C WS 4280832 + 136 [0]
130,96 0 78 0.000899858 0 C WS 4281088 + 136 [0]
As one can see, all the writes are synchronous (WS), and I/O completions
(e.g. from issue I to completion C) take 160-250 usec, or about 10x faster.
Since WRITE_SYNC or READ_SYNC flags are among several factors that are
considered when processing bio requests, it seems prudent to mark all the
zio requests of synchronous priority with the READ/WRITE_SYNC flags to make
them eligible for consideration as such by the Linux block I/O layer.
Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3529