A handful of allocations now occur in the sync path and need
to use KM_PUSHPAGE. These were introduced by commit 13fe019.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1746
Issue #1737
The spa_deadman() and spa_sync() functions can both be run in the
spa_sync context and therefore should use TQ_PUSHPAGE instead of
TQ_SLEEP.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1734Closes#1749
Locking mutex &vq->vq_lock in vdev_mirror_pending is unneeded:
* no data is modified
* only vq_pending_tree is read
* in case garbage is returned (eg. vq_pending_tree being updated
while the read is made) the worst case would be that a single
read could be queued on a mirror side which more busy than thought
The benefit of this change is streamlining of the code path since
it is taken for *every* mirror member on *every* read.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1739
dataset_remove_clones_key does recursion, so if the recursion goes
deep it can overrun the linux kernel stack size of 8KB. I have seen
this happen in the actual deployment, and subsequently confirmed it by
running a test workload on a custom-built kernel that uses 32KB stack.
See the following stack trace as an example of the case where it would
have run over the 8KB stack kernel:
Depth Size Location (42 entries)
----- ---- --------
0) 11192 72 __kmalloc+0x2e/0x240
1) 11120 144 kmem_alloc_debug+0x20e/0x500
2) 10976 72 dbuf_hold_impl+0x4a/0xa0
3) 10904 120 dbuf_prefetch+0xd3/0x280
4) 10784 80 dmu_zfetch_dofetch.isra.5+0x10f/0x180
5) 10704 240 dmu_zfetch+0x5f7/0x10e0
6) 10464 168 dbuf_read+0x71e/0x8f0
7) 10296 104 dnode_hold_impl+0x1ee/0x620
8) 10192 16 dnode_hold+0x19/0x20
9) 10176 88 dmu_buf_hold+0x42/0x1b0
10) 10088 144 zap_lockdir+0x48/0x730
11) 9944 128 zap_cursor_retrieve+0x1c4/0x2f0
12) 9816 392 dsl_dataset_remove_clones_key.isra.14+0xab/0x190
13) 9424 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
14) 9032 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
15) 8640 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
16) 8248 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
17) 7856 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
18) 7464 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
19) 7072 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
20) 6680 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
21) 6288 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
22) 5896 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
23) 5504 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
24) 5112 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
25) 4720 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
26) 4328 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
27) 3936 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
28) 3544 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
29) 3152 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
30) 2760 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
31) 2368 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
32) 1976 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
33) 1584 392 dsl_dataset_remove_clones_key.isra.14+0x10c/0x190
34) 1192 232 dsl_dataset_destroy_sync+0x311/0xf60
35) 960 72 dsl_sync_task_group_sync+0x12f/0x230
36) 888 168 dsl_pool_sync+0x48b/0x5c0
37) 720 184 spa_sync+0x417/0xb00
38) 536 184 txg_sync_thread+0x325/0x5b0
39) 352 48 thread_generic_wrapper+0x7a/0x90
40) 304 128 kthread+0xc0/0xd0
41) 176 176 ret_from_fork+0x7c/0xb0
This change reduces the stack usage in dsl_dataset_remove_clones_key
by allocating structures in heap, not in stack. This is not a fundamental
fix, as one can create an arbitrary large data set that runs over any
fixed size stack, but this will make the problem far less likely.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Kohsuke Kawaguchi <kk@kohsuke.org>
Closes#1726
The zpl_mknod() function was incorrectly negating its return value.
This doesn't cause any problems in the success case, but it does
prevent us from returning the correct error code for a failure.
The implementation of this function is now consistent with all
the other zpl_* functions.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1717
When compiling on an ARM device using gcc 4.7.3 several variables
in the zfs_obj_to_path_impl() function were flagged as uninitialized.
To resolve the warnings explicitly initialize them to zero.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1716
c38367c73f was meant to eliminate runtime
function pointer modifications in autotools checks because they were
prone to false negatives on kernels hardened by the PaX project.
Unfortunately, I missed the xattr_handler and super_block->s_bdi
autotools checks. Recent changes to PaX constified
xattr_handler->get/set, which lead me to discover this oversight.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1433
After the restructuring in 13fe019 The 'zfs rename' command will
result in a KM_SLEEP being called in the sync context. This may
deadlock due to reclaim so it was changed to KM_PUSHPAGE.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1711
2882 implement libzfs_core
2883 changing "canmount" property to "on" should not always remount dataset
2900 "zfs snapshot" should be able to create multiple, arbitrary snapshots at once
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Chris Siden <christopher.siden@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Dan Kruchinin <dan.kruchinin@gmail.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>
References:
https://www.illumos.org/issues/2882https://www.illumos.org/issues/2883https://www.illumos.org/issues/2900illumos/illumos-gate@4445fffbbb
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1293
Porting notes:
WARNING: This patch changes the user/kernel ABI. That means that
the zfs/zpool utilities built from master are NOT compatible with
the 0.6.2 kernel modules. Ensure you load the matching kernel
modules from master after updating the utilities. Otherwise the
zfs/zpool commands will be unable to interact with your pool and
you will see errors similar to the following:
$ zpool list
failed to read pool configuration: bad address
no pools available
$ zfs list
no datasets available
Add zvol minor device creation to the new zfs_snapshot_nvl function.
Remove the logging of the "release" operation in
dsl_dataset_user_release_sync(). The logging caused a null dereference
because ds->ds_dir is zeroed in dsl_dataset_destroy_sync() and the
logging functions try to get the ds name via the dsl_dataset_name()
function. I've got no idea why this particular code would have worked
in Illumos. This code has subsequently been completely reworked in
Illumos commit 3b2aab1 (3464 zfs synctask code needs restructuring).
Squash some "may be used uninitialized" warning/erorrs.
Fix some printf format warnings for %lld and %llu.
Apply a few spa_writeable() changes that were made to Illumos in
illumos/illumos-gate.git@cd1c8b8 as part of the 3112, 3113, 3114 and
3115 fixes.
Add a missing call to fnvlist_free(nvl) in log_internal() that was added
in Illumos to fix issue 3085 but couldn't be ported to ZoL at the time
(zfsonlinux/zfs@9e11c73) because it depended on future work.
There is currently a subtle bug in the SA implementation which
can crop up which prevents us from safely using multiple variable
length SAs in one object.
Fortunately, the only existing use case for this are symlinks with
SA based xattrs. Therefore, until the root cause in the SA code
can be identified and fixed we prevent adding SA xattrs to symlinks.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1468
This reverts commit fadd0c4da1 which
introduced a regression in honoring the meta limit.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Close#1660
This implements vdev_bdev_database_check(). It alters the detected
sector size of any device listed in a database of drives known to lie
about their physical sector sizes.
This is based on "6931570 Add flash devices' VID/PID to disk table to
advertising 4K physical sector size" from Open Solaris and on
sg_simple4.c from sg3_utils. About two dozen lines are taken from
sg_simple4.c, which is GPLv2 licensed. However, sg_simple4.c is
analogous to a Hello World program and is safe for us to use. We
requested that Douglas Gilbert, the author of sg_simple4.c, confirm that
this is the case. A cutdown version of his response is as follows:
```
I would consider a SCSI INQUIRY example using the Linux sg
driver interface (also written by me) as the equivalent of an
"hello world" program in C.
```
The database was created with the help of the freenode and ZFSOnLinux
communities.
Some notes:
1. The following drives both were confirmed to lie via reports in IRC
and they contain capacity information in their identifiers:
INTEL SSDSA2M080
INTEL SSDSA2M160
M4-CT256M4SSD2
WDC WD15EARS-00S
WDC WD15EARS-00Z
WDC WD20EARS-00M
The identifiers for different capacity models were extrapolated and
added under the assumption that those models also lie. Google was used
to verify that the extrapolated drive identifiers existed prior to their
inclusion.
2. The OCZ-VERTEX2 3.5 identifer applies to two drives that differ
solely in page size (and slightly in capacity). One uses 4096-byte pages
and the other uses 8192-byte pages. Both are set to use 8192-byte pages.
We could detect the page size by checking the capacity, but that would
unnecessarily complicate the code.
3. It is possible for updated drive firmware to correctly report the
sector size. There were reports of a few advanced format drives doing
that. One report stated that the vendor changed the identification
string while another was unclear on this. Both reports involved WDC
models.
4. Google was used to determine the size of pages in the listed flash
devices. Reports of 8192-byte pages took precedence over reports of
4096-byte pages.
5. Devices behind USB adapters can have their identification strings
altered. Identification strings obtained across USB adapters are
omitted and no attempt is made to correct for alterations made by USB
adapters when doing comparisons against the database. Two entries in the
Open Solaris database that appear to have been altered by a USB
adapter were omitted.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1652
Commit torvalds/linux@2233f31aad
replaced ->readdir() with ->iterate() in struct file_operations.
All filesystems must now use the new ->iterate method.
To handle this the code was reworked to use the new ->iterate
interface. Care was taken to keep the majority of changes
confined to the ZPL layer which is already Linux specific.
However, minor changes were required to the common zfs_readdir()
function.
Compatibility with older kernels was accomplished by adding
versions of the trivial dir_emit* helper functions. Also the
various *_readdir() functions were reworked in to wrappers
which create a dir_context structure to pass to the new
*_iterate() functions.
Unfortunately, the new dir_emit* functions prevent us from
passing a private pointer to the filldir function. The xattr
directory code leveraged this ability through zfs_readdir()
to generate the list of xattr names. Since we can no longer
use zfs_readdir() a simplified zpl_xattr_readdir() function
was added to perform the same task.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1653
Issue #1591
Because we need to be more frugal about our stack usage under
Linux. The __zio_execute() function was modified to re-dispatch
zios to a ZIO_TASKQ_ISSUE thread when we're in a context which
is known to be stack heavy. Those two contexts are the sync
thread and what ever thread is performing spa initialization.
Unfortunately, this change introduced an unlikely bug which can
result in a zio being re-dispatched indefinitely and never being
executed. If during spa initialization we handle a zio with
ZIO_PRIORITY_NOW it will be moved to the high priority queue.
When __zio_execute() is called again for the zio it will mis-
interpret the context and re-dispatch it again. The system
will get stuck spinning re-dispatching the zio and making no
forward progress.
To fix this rare issue __zio_execute() has been updated not
to re-dispatch zios on either the ZIO_TASKQ_ISSUE or
ZIO_TASKQ_ISSUE_HIGH task queues.
In practice this issue was rarely reported and can usually
be fixed by rebooting the system and importing the pool again.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1455
reading the file, but instead use libzfs_mnttab_find() which does
the nessesary freopen() for us.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1498
For the same reasons it's used in libzfs_init(), this was just
overlooked because zinject gets minimal use.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1498
When /etc/mtab is updated on Linux it's done atomically with
rename(2). A new mtab is written, the existing mtab is unlinked,
and the new mtab is renamed to /etc/mtab. This means that we
must close the old file and open the new file to get the updated
contents. Using rewind(3) will just move the file pointer back
to the start of the file, freopen(3) will close and open the file.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1611
3098 zfs userspace/groupspace fail without saying why when run as non-root
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/3098illumos/illumos-gate@70f56fa693
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1596
3618 ::zio dcmd does not show timestamp data
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
http://www.illumos.org/issues/3618illumos/illumos-gate@c55e05cb35
Notes on porting to ZFS on Linux:
The original changeset mostly deals with mdb ::zio dcmd.
However, in order to provide the requested functionality
it modifies vdev and zio structures to keep the timing data
in nanoseconds instead of ticks. It is these changes that
are ported over in the commit in hand.
One visible change of this commit is that the default value
of 'zfs_vdev_time_shift' tunable is changed:
zfs_vdev_time_shift = 6
to
zfs_vdev_time_shift = 29
The original value of 6 was inherited from OpenSolaris and
was subotimal - since it shifted the raw tick value - it
didn't compensate for different tick frequencies on Linux and
OpenSolaris. The former has HZ=1000, while the latter HZ=100.
(Which itself led to other interesting performance anomalies
under non-trivial load. The deadline scheduler delays the IO
according to its priority - the lower priority the further
the deadline is set. The delay is measured in units of
"shifted ticks". Since the HZ value was 10 times higher,
the delay units were 10 times shorter. Thus really low
priority IO like resilver (delay is 10 units) and scrub
(delay is 20 units) were scheduled much sooner than intended.
The overall effect is that resilver and scrub IO consumed
more bandwidth at the expense of the other IO.)
Now that the bookkeeping is done is nanoseconds the shift
behaves correctly for any tick frequency (HZ).
Ported-by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1643
When CONFIG_UIDGID_STRICT_TYPE_CHECKS is enabled uid_t/git_t are
replaced by kuid_t/kgid_t, which are structures instead of integral
types. This causes any code that uses an integral type to fail to build.
The User Namespace functionality introduced in Linux 3.8 requires
CONFIG_UIDGID_STRICT_TYPE_CHECKS, so we could not build against any
kernel that supported it.
We resolve this by converting between the new kuid_t/kgid_t structures
and the original uid_t/gid_t types.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1589
When the meta limit is exceeded the ARC evicts some meta data
buffers from the mfu+mru lists. Unfortunately, for meta data
heavy workloads it's possible for these buffers to accumulate
on the ghost lists if arc_c doesn't exceed arc_size.
To handle this case arc_adjust_meta() has been entended to
explicitly evict meta data buffers from the ghost lists in
proportion to what was evicted from the mfu+mru lists.
If this is insufficient we request that the VFS release
some inodes and dentries. This will result in the release
of some dnodes which are counted as 'other' metadata.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The default behavior of arc_evict_ghost() is to start by evicting
data buffers. Then only if the requested number of bytes to evict
cannot be satisfied by data buffers move on to meta data buffers.
This is ideal for honoring arc_c since it's preferable to keep the
meta data cached. However, if we're trying to free memory from the
arc to honor the meta limit it's a problem because we will need to
discard all the data to get to the meta data.
To avoid this issue the arc_evict_ghost() is now passed a fourth
argumented describing which buffer type to start with. The
arc_evict() function already behaves exactly like this for a
same reason so this is consistent with the existing code.
All existing callers have been updated to pass ARC_BUFC_DATA so
this patch introduces no functional change. New callers may
pass ARC_BUFC_METADATA to skip immediately to evicting meta
data leaving the normal data untouched.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
3137 L2ARC compression
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@nexenta.com>
References:
illumos/illumos-gate@aad02571bchttps://www.illumos.org/issues/3137http://wiki.illumos.org/display/illumos/L2ARC+Compression
Notes for Linux port:
A l2arc_nocompress module option was added to prevent the
compression of l2arc buffers regardless of how a dataset's
compression property is set. This allows the legacy behavior
to be preserved.
Ported by: James H <james@kagisoft.co.uk>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1379
This is analogous to SPL commit zfsonlinux/spl@b9b3715. While
we don't have clear evidence of systems getting caught here
indefinately like in the SPL this ensures that it will never
happen.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1579
zfs_readdir() is used by getdents(), which provides a list of all files
in directory, their types and an offset that be used by llseek() to seek
to the next directory entry.
On Solaris, the first two directory entries "." and ".." respectively
have offsets 1 and 2 on ZFS while the other files have rather large
numbers. Currently, ZFSOnLinux is giving "." offset 0 and all other
entries large numbers. The first entry's next entry offset points to
itself, which causes software that uses llseek() in conjunction with
getdents() for filesystem navigation to enter an infinite loop. The
offsets used for each directory entry are filesystem specific on all
platforms, so we can fix this by adopting the Solaris behavior.
Also, we currently report each directory entry as having type 0 (???).
This is not wrong, but we can do better. getdents() on Solaris does not
appear to provide this information, but it does on Linux and Mac OS X
do. ZFS provides easy access to type information in zfs_readdir(), so
this patch provides this as well.
Reported-by: Andrey <andrey@kudinov.su>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1624
3639 zpool.cache should skip over readonly pools
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Basil Crow <basil.crow@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
References:
illumos/illumos-gate@fb02ae0252https://www.illumos.org/issues/3639
Normally we don't list pools that are imported read-only in the cache
file, however you can accidentally get one into the cache file by
importing and exporting a read-write pool while a read-only pool is
imported:
$ zpool import -o readonly test1
$ zpool import test2
$ zpool export test2
$ zdb -C
This is a problem because if the machine reboots we import all pools in
the cache file as read-write.
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When the property atime=on is set operations which only access
and inode do cause an atime update. However, it turns out that
dirty inodes with updated atimes are only written to disk when
the inodes get evicted from the cache. Somewhat surprisingly
the source suggests that this isn't a ZoL specific issue.
This behavior may in part explain why zfs's reclaim logic has
been observed to be slow. When reclaiming inodes its likely
that they have a dirty atime which will force a write to disk.
Obviously we don't want to force a write to disk for every
atime update, these needs to be batched. The right way to
do this is to fully implement the .dirty_inode and .write_inode
callbacks. However, to do that right requires proper unification
of some fields in the znode/inode. Then we could just mark the
inode dirty and leave it to the VFS to call .write_inode
periodically.
Until that work gets done we have to settle for some middle
ground. The simplest and safest thing we can do for now is
to write the dirty inode on last close. This should prevent
the majority of inodes in the cache from having dirty atimes
and not drastically increase the number of writes.
Some rudimentally testing to show how long it takes to drop
500,000 inodes from the cache shows promising results. This
is as expected because we're no longer do lots of IO as part
of the eviction, it was done earlier during the close.
w/out patch: ~30s to drop 500,000 inodes with drop_caches.
with patch: ~3s to drop 500,000 inodes with drop_caches.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The help output of for zfs set/get says that sync can be one of
standard | always | disabled
but the man pages claim it can be
sync=default | always | disabled
the accepted value is standard, this changes the manpage to give the
correct values.
Signed-off-by: Steven Burgess <sburgess@dattobackup.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1634
The manpage reports fletcher2, but in zio.h ZIO_CHECKSUM_ON_VALUE
is defined to ZIO_CHECKSUM_FLETCHER_4.
Signed-off-by: Massimo Maggi <me@massimo-maggi.eu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1628
When the kmod packaging infrastructure was originally added the
dependency on the rpmfusion yum repositories was disabled. This
was done at the time in favour of getting local builds working.
Now the time has come to conditionally re-enable that functionality
so we can properly provide binary kmod packages.
./configure --with-config=srpm
make SRPM_DEFINE_KMOD='--define="repo rpmfusion"' srpm-kmod
mock rebuild zfs-kmod-x.y.z-r.el6.src.rpm
One nice benefit of finishing this work is that the generic and
fedora spl-kmod spec files can be merged again.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The dmu_prefetch, dmu_free_long_range, dmu_free_object,
dmu_prealloc, dmu_write_policy, and dmu_sync symbols have
been exported so they may be used by other modules.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
dmu_tx_hold_object_impl can return NULL on error. Check for this
condition prior to dereferencing pointer. This can only occur if
the passed object was invalid or unallocated.
Signed-off-by: Nathaniel Clark <Nathaniel.Clark@misrule.us>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1610
The code involving b_thawed appears to be dead, so lets discard it.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1614
These functions are used in neither Illumos nor ZFSOnLinux. They appear
to have been replaced by arc_buf_alloc()/arc_buf_free(), so lets remove
them.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1614
We declare zio_alloc_arena using extern, but it does not appear to exist
anywhere in the code. This permits undefined behavior, so lets remove
it.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1614
The l2arc module options can be made safely writable. This allows
the options to be changed without unloading/loading the modules.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
These days modern SSDs can efficiently service concurrent reads
and writes. When this flag was added that wasn't really the
case for a variety of SSD controllers. But now we can set the
default value to take advantage of this parallelism and only
disable this as needed for specific troublesome hardware.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Based on the comments in arc.c we know that buffers can exist both
in arc and l2arc, under this circumstance both arc_buf_hdr_t and
l2arc_buf_hdr_t will be allocated. However the current logic only
cares for memory that l2arc_buf_hdr takes up when the buffer's
state transfers from or to arc_l2c_only. This will cause obvious
deviations for illumos's zfs version since the sizeof(l2arc_buf_hdr)
is larger than ZOL's. We can implement the calcuation in the
following simple way:
1. When allocate a l2arc_buf_hdr_t we add its memory consumption
instantly and subtract it when we free or evict the l2arc buf.
2. According to l2arc_hdr_stat_add and l2arc_hdr_stat_remove, if
the buffer only stays in l2arc we should also add the memory
its arc_buf_hdr_t consumes, so we only need to add HDR_SIZE to
arcstat_l2_hdr_size since we already concerned with L2HDR_SIZE
in step 1 and the same for transfering arc bufs from l2arc only
state.
The testbox has 2 4-core Intel Xeon CPUs(2.13GHz), with 16GB memory
and tests were set upped in the following way:
1. Fdisked a SATA disk into two partitions, one partition for zpool
storage and the other one was used as the cache device.
2. Generated some files occupying 14GB altogether in the zpool
prepared in step 1 using iozone.
3. Read them all using md5sum and watched the l2arc related statistics
in /proc/spl/kstat/zfs/arcstats. After the reading ended the
l2_hdr_size and l2_size were shown like this:
l2_size 4 4403780608
l2_hdr_size 4 0
which was weird.
4. After applying this patch and reran step 1-3, the results were
as following:
l2_size 4 4306443264
l2_hdr_size 4 535600
these numbers made sense, on 64-bit systems the
sizeof(l2arc_buf_hdr_t) is 16 bytes. Assue all blocks cached by
l2arc are 128KB, so 535600/16*128*1024=4387635200, since not all
blocks are equal-sized, the theoretical result will be a little
bigger, as we can see.
Since I'm familiar with systemtap instrumentation tool I used it to
examine what had happened. The script looked like this:
probe module("zfs").function("arc_chage_state")
{
if ($new_state == $arc_l2_only)
printf("change arc buf to arc_l2_only\n")
}
It will print out some information each time we call funciton
arc_chage_state if the argument new_state is arc_l2_only. I
gathered the trace logs and found that none of the arc bufs ran
into arc state arc_l2_only when the tests was running, this was
the reason why l2_hdr_size in step 3 was 0. The arc bufs fell into
arc_l2_only when the pool or the filesystem was offlined.
Signed-off-by: Ying Zhu <casualfisher@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Due to an uninitialized variable it was possible for the command
'zpool list -H' to return a non-zero error when there are no pools.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1605
zpool list has the same options for repeating as zpool iostat
has, but that is not documented. This patch adds the documentation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The iterate_supers_type() function which was introduced in the
3.0 kernel was supposed to provide a safe way to call an arbitrary
function on all super blocks of a specific type. Unfortunately,
because a list_head was used a bug was introduced which made it
possible for iterate_supers_type() to get stuck spinning on a
super block which was just deactivated.
This can occur because when the list head is removed from the
fs_supers list it is reinitialized to point to itself. If the
iterate_supers_type() function happened to be processing the
removed list_head it will get stuck spinning on that list_head.
The bug was fixed in the 3.3 kernel by converting the list_head
to an hlist_node. However, to resolve the issue for existing
3.0 - 3.2 kernels we detect when a list_head is used. Then to
prevent the spinning from occurring the .next pointer is set to
the fs_supers list_head which ensures the iterate_supers_type()
function will always terminate.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1045Closes#861Closes#790
During mount a filesystem dataset would have the MS_RDONLY bit
incorrectly cleared even if the entire pool was read-only.
There is existing to code to handle this case but it was being run
before the property callbacks were registered. To resolve the
issue we move this read-only code after the callback registration.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1338
It is possible for an automounted snapshot which is expiring to
deadlock with a manual unmount of the snapshot. This can occur
because taskq_cancel_id() will block if the task is currently
executing until it completes. But it will never complete because
zfsctl_unmount_snapshot() is holding the zsb->z_ctldir_lock which
zfsctl_expire_snapshot() must acquire.
---------------------- z_unmount/0:2153 ---------------------
mutex_lock <blocking on zsb->z_ctldir_lock>
zfsctl_unmount_snapshot
zfsctl_expire_snapshot
taskq_thread
------------------------- zfs:10690 -------------------------
taskq_wait_id <waiting for z_unmount to exit>
taskq_cancel_id
__zfsctl_unmount_snapshot
zfsctl_unmount_snapshot <takes zsb->z_ctldir_lock>
zfs_unmount_snap
zfs_ioc_destroy_snaps_nvl
zfsdev_ioctl
do_vfs_ioctl
We resolve the deadlock by dropping the zsb->z_ctldir_lock before
calling __zfsctl_unmount_snapshot(). The lock is only there to
prevent concurrent modification to the zsb->z_ctldir_snaps AVL
tree. Moreover, we're careful to remove the zfs_snapentry_t from
the AVL tree before dropping the lock which ensures no other tasks
can find it. On failure it's added back to the tree.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlap <cdunlap@llnl.gov>
Closes#1527
By adding a dkms_version conditional it's now possible to specify
an exact version of dkms. This is used by the Fedora and EPEL
yum repositories to ensure the patched version of dkms provided
by the repository is installed. The patched version of dkms
ensures that the spl modules are built before the zfs modules.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1466
The read bandwidth of an N-way mirror can by increased by 50%,
and the IOPs by 10%, by more carefully selecting the preferred
leaf vdev.
The existing algorthm selects a perferred leaf vdev based on
offset of the zio request modulo the number of members in the
mirror. It assumes the drives are of equal performance and
that spreading the requests randomly over both drives will be
sufficient to saturate them. In practice this results in the
leaf vdevs being under utilized.
Utilization can be improved by preferentially selecting the leaf
vdev with the least pending IO. This prevents leaf vdevs from
being starved and compensates for performance differences between
disks in the mirror. Faster vdevs will be sent more work and
the mirror performance will not be limitted by the slowest drive.
In the common case where all the pending queues are full and there
is no single least busy leaf vdev a batching stratagy is employed.
Of the N least busy vdevs one is selected with equal probability
to be the preferred vdev for T microseconds. Compared to randomly
selecting a vdev to break the tie batching the requests greatly
improves the odds of merging the requests in the Linux elevator.
The testing results show a significant performance improvement
for all four workloads tested. The workloads were generated
using the fio benchmark and are as follows.
1) 1MB sequential reads from 16 threads to 16 files (MB/s).
2) 4KB sequential reads from 16 threads to 16 files (MB/s).
3) 1MB random reads from 16 threads to 16 files (IOP/s).
4) 4KB random reads from 16 threads to 16 files (IOP/s).
| Pristine | With 1461 |
| Sequential Random | Sequential Random |
| 1MB 4KB 1MB 4KB | 1MB 4KB 1MB 4KB |
| MB/s MB/s IO/s IO/s | MB/s MB/s IO/s IO/s |
---------------+-----------------------+------------------------+
2 Striped | 226 243 11 304 | 222 255 11 299 |
2 2-Way Mirror | 302 324 16 534 | 433 448 23 571 |
2 3-Way Mirror | 429 458 24 714 | 648 648 41 808 |
2 4-Way Mirror | 562 601 36 849 | 816 828 82 926 |
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1461