Authored by: Stephen Blinick <stephen.blinick@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>
OpenZFS-issue: https://www.illumos.org/issues/7500
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/653af1bCloses#5639
When importing a pool with a large number of filesystems within the same
parent filesystem, we see that dmu_objset_find_dp() takes a long time.
It is called from 3 places: spa_check_logs(), spa_ld_claim_log_blocks(),
and spa_load_verify().
There are several ways to improve performance here:
1. We don't really need to do spa_check_logs() or
spa_ld_claim_log_blocks() if the pool was closed cleanly.
2. spa_load_verify() uses dmu_objset_find_dp() to check that no
datasets have too long of names.
3. dmu_objset_find_dp() is slow because it's doing
zap_value_search() (which is O(N sibling datasets)) to determine
the name of each dsl_dir when it's opened. In this case we
actually know the name when we are opening it, so we can provide
it and avoid the lookup.
This change implements fix#3 from the above list; i.e. make
dmu_objset_find_dp() provide the name of the dataset so that we don't
have to search for it.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <prashksp@gmail.com>
Reviewed-by: David Quigley <david.quigley@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>
OpenZFS-issue: https://www.illumos.org/issues/7606
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/cac6babCloses#5662
When doing recv and rollback, dsl_dataset_clone_swap_sync_impl will be
called to swap out the ds_objset and do dmu_objset_evict on the old one.
However, currently zv->zv_objset will not be swapped out accordingly, so
if anyone currently holds a fd on the zvol, we risk hitting a use-after-free.
We fix this by introducing the suspend and resume mechanism of zsb to
zv. Before recv or rollback, we use zvol_suspend to block all access to
zv_objset and shut it down. After the recv or rollback, we use zvol_resume
to swap in zv_objset with the new ds_objset and unblock the access.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes#4866Closes#5609
Porting notes:
- This issue was first fixed in ZoL by commit d862cb0d. That fix was
then modified and an equivalent version of the patch landed in the
upstream code base. For additional details see the discussion in
https://github.com/openzfs/openzfs/pull/24 .
This commit aligns ZoL with OpenZFS codebase.
Authored by: Andriy Gapon <avg@icyb.net.ua>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: George Melikov mail@gmelikov.ru
OpenZFS-issue: https://www.illumos.org/issues/6529
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/e7e978bCloses#5606
Two threads send_traverse_thread() and receive_writer_thread() should
end with thread_exit();
Mostly a cosmetic issue under IllumOS.
Authored by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>
OpenZFS-issue: https://www.illumos.org/issues/7659
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/a569268Closes#5603
Authored by: Igor Kozhukhov ikozhukhov@gmail.com
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>
OpenZFS-issue: https://www.illumos.org/issues/7071
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/25f7d99Closes#5597
Authored by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Ported-by: George Melikov <mail@gmelikov.ru>
OpenZFS-issue: https://www.illumos.org/issues/7082
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/10e67aaCloses#5596
Fix dmu_object_next() to correctly handle unallocated objects on
large_dnode datasets.
We implement this by scanning the dnode block until we find the correct
offset to be used in dnode_next_offset(). This is necessary because we
can't assume *objectp is a hole even if dmu_object_info() returns
ENOENT.
This fixes a couple of issues with zfs receive on large_dnode datasets.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#5027Closes#5532
Authored by: Andriy Gapon <andriy.gapon@clusterhq.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/7181
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/90f2c09Closes#5585
Assuming /bin/cp causes problems on systems where cp is
not in /bin such as NixOS.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Joerg Thalheim <joerg@higgsboson.tk>
Closes#5548
Add *_by_dnode() routines for accessing objects given their
dnode_t *, this is more efficient than accessing the object by
(objset_t *, uint64_t object). This change converts some but
not all of the existing consumers. As performance-sensitive
code paths are discovered they should be converted to use
these routines.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alex Zhuravlev <bzzz@whamcloud.com>
Closes#5534
Issue #4802
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Joe Stein <jas14@cs.brown.edu>
Ported-by: Don Brady <don.brady@intel.com>
When loading a pool that had been created before the existance of
per-vdev zaps, on a system that knows about per-vdev zaps, the
per-vdev zaps will not be allocated and initialized.
This appears to be because the logic that would have done so, in
spa_sync_config_object(), is not reached under normal operation. It is
only reached if spa_config_dirty_list is non-empty.
The fix is to add another `AVZ_ACTION_` enum that will allow this code
to be reached when we detect that we're loading an old pool, even when
there are no dirty configs.
OpenZFS-issue: https://www.illumos.org/issues/7743
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/e2d29d0Closes#5582
This change introduces a new weighting algorithm to improve
metaslab selection. The new weighting algorithm relies on the
SPACEMAP_HISTOGRAM feature. As a result, the metaslab weight
now encodes the type of weighting algorithm used (size-based
vs segment-based).
Porting Notes: The metaslab allocation tracing code is conditionally
removed on linux (dependent on mdb debugger).
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Chris Siden <christopher.siden@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Pavel Zakharov pavel.zakharov@delphix.com
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Don Brady <don.brady@intel.com>
OpenZFS-issue: https://www.illumos.org/issues/7303
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d5190931bdCloses#5404
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Haakan T Johansson <f96hajo@chalmers.se>
Closes#5547Closes#5543
[bio] The req_op enum was changed to req_opf. Update the "Linux 4.8 API"
autotools checks to use an int to determine whether the various REQ_OP
values are defined. This should work properly on kernels >= 4.8.
[bio] bio_set_op_attrs() is now an inline function and can't be detected
with #ifdef. Add a configure check to determine whether bio_set_op_attrs()
is defined. Move the local definition of it from vdev_disk.c to
blkdev_compat.h for consistency with other related compability shims.
[bio] The read/write flags and their modifiers, including WRITE_FLUSH,
WRITE_FUA and WRITE_FLUSH_FUA have been removed from fs.h. Add the new
bio_set_flush() compatibility wrapper to replace VDEV_WRITE_FLUSH_FUA
and set the flags appropriately for each supported kernel version.
[vfs] The generic_readlink() function has been made static. If .readlink
in inode_operations is NULL, generic_readlink() is used.
[zol typo] Completely unrelated to 4.10 compat, fix a typo in the check
for REQ_OP_SECURE_ERASE so that the proper macro is defined:
s/HAVE_REQ_OP_SECURE_DISCARD/HAVE_REQ_OP_SECURE_ERASE/
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#5499
Fix a regression accidentally introduced by e0ab3ab.
Additionally, add a new script zpool_import_014_pos.ksh to
the ZFS test suite to exercise 'zpool import -t' functionality.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#5466Closes#5515
The introduction of parallel zvol prefetch causes deadlock when using
vdev_file.
spa_async->(spa_namespace_lock)->txg_wait_synced->(wait for txg_sync)
txg_sync->zio_wait->(wait for vdev_file_io_fsync on system_taskq)
zvol_prefetch_minors_impl (on system_taskq)->spa_open_common->(wait for spa_namespace_lock)
We fix this by using dedicated taskq for vdev_file. This same change
was originally made in commit bc25c93 but reverted in commit aa9af22
when dynamic taskqs were added.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Closes#5506Closes#5495
When iterating over the input nvlist in dsl_props_set_sync_impl() when we don't
preserve the nvpair name before looking up ZPROP_VALUE, so when we later go to
process it nvpair_name() is always "value" and not the actual property name.
This fixes a couple of bugs in zfs_ioc_recv():
* Received properties were not restored correctly when failing to receive an
incremental send stream
* Received properties were not completely replaced by the new ones when
successfully receiving an incremental send stream
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#5497
This branch contains the following fixes/improvements.
* Fix setting i_flags
* Fix wrong operator in xvattr.h
* Fix fchange macro in zpl_ioctl_setflags()
* Added configure check to use inode_set_flags()
* Added a test case for chattr for better test coverage
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes#5486Closes#5470Closes#5469
zfs_sb_create would normally takes ownership of zmo, and it will be freed in
zfs_sb_free. However, when zfs_sb_create fails we need to explicit free it.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes#5490Closes#5496
The fchange in zpl_ioctl_setflags was for detecting flag change. However it
was incorrect and would always fail to detect a flag change from set to unset,
causing users without CAP_LINUX_IMMUTABLE to be able to unset flags.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Adapt avx512bw implementation for use with abd buffers. Mul2 implementation
is rewritten to take advantage of the BW instruction set.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Romain Dolbeau <romain.dolbeau@atos.net>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes#5477
Fix zfs_xvattr_set to set S_IMMUTABLE and S_APPEND flags correctly.
Reinstate zfs_set_inode_flags and use it when zfs_xvatter_set and also when
setting up inode in zfs_znode_alloc and zfs_rezget.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
User of ida needs to call ida_destroy after using it. Otherwise
ida->free_bitmap and/or other stuff may leak.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes#5484
This removes two large whitespaces in "modinfo zfs" as well as correcting
a couple typos.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: bunder2015 <omfgbunder@gmail.com>
Closes#5475
Enable picky cstyle checks and resolve the new warnings. The vast
majority of the changes needed were to handle minor issues with
whitespace formatting. This patch contains no functional changes.
Non-whitespace changes are as follows:
* 8 times ; to { } in for/while loop
* fix missing ; in cmd/zed/agents/zfs_diagnosis.c
* comment (confim -> confirm)
* change endline , to ; in cmd/zpool/zpool_main.c
* a number of /* BEGIN CSTYLED */ /* END CSTYLED */ blocks
* /* CSTYLED */ markers
* change == 0 to !
* ulong to unsigned long in module/zfs/dsl_scan.c
* rearrangement of module_param lines in module/zfs/metaslab.c
* add { } block around statement after for_each_online_node
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Håkan Johansson <f96hajo@chalmers.se>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#5465
Don't count '@' for dataset namelen if not a snapshot. This
fixes making a pool unimportable when the dataset namelen
is 255.
Add test file for zfs create name length 255.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes#5432Closes#5456
CID 154617: Memory - illegal accesses (UNINIT)
The value here just needs to be initialized to make Coverity happy.
When dsize == 0, then value of daiter.iter_mapaddr is irrelevant. That
address won't be accessed, it's only used for some arithmetic. dsize
can be zero either if dabd is null, or if code column is longer than the
current data column.
Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn>
Closes#5437
Speed up import and export speed by:
* Add system delay taskq
* Parallel prefetch zvol dnodes during zvol_create_minors
* Parallel zvol_free during zvol_remove_minors
* Reduce list linear search using ida and hash
Reviewed-by: Boris Protopopov <boris.protopopov@actifio.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes#5433
Enable zio_dva_throttle_enabled=1 by default. Subsequent
testing has been unable to reproduce the suspected regression.
Tested-by: kernelOfTruth kerneloftruth@gmail.com
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf behlendorf1@llnl.gov
Reverts #5335Closes#5289Closes#5457
Save and reuse ddt dspace calculation when there have been no ddt changes.
This avoids unnecessary traversal of 168KiB of ddt histograms.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes#5425
It was observed that even when the txg history is disabled by
setting `zfs_txg_history=0` the txg_sync thread still fetches
the vdev stats unnecessarily.
This patch refactors the code such that vdev_get_stats() is no
longer called when `zfs_txg_history=0`. And it further reduces
the differences between upstream and the ZoL txg_sync_thread()
function.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#5412
On some kernel version, blk_cleanup_queue and put_disk will wait for more then
10ms. So a pool with a lot of zvols will easily wait for more then 1 min if we
do zvol_free sequentially.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Requires-spl: refs/pull/588/head
Do parallel prefetch all zvol dnodes before actually creating each individual.
This will greatly reduce the import time when having a lot of zvols and disk
is slow.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
dbuf_read() creates a zio_root() to track and wait for all the
zio's that may happen as part of this call. However, if the blkptr_t
for this buffer is NULL or a hole, we will not create any more zio's,
so this zio_root() is unnecessary. This is always the case when calling
dbuf_read() on a bonus buffer, because it has no blkptr (it's part of
the containing dnode). For workloads that read a lot of bonus buffers
(e.g. file creation and removal), creating and destroying these
unnecessary zio's can decrease performance by around 3%.
The fix is to only create/destroy the zio_root() in dbuf_read() if
the blkptr is not NULL and not a hole.
Changes sponsored by Intel Corp.
Authored by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Alex Zhuravlev <alexey.zhuravlev@intel.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue openzfs/openzfs#137
Closes#4803Closes#5382
This should be & and not | so is_metadata is set correctly.
Reviewed-by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn>
Closes#5438
It looks like this was functionality which was added in the
original SA implementation and then never needed. It can
be safely removed now and easily added back if we find a
use for it.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes#5440
zio.h includes zio_impl.h but zio_impl.h also includes zio.h, so the
header files to contain each other. Get rid of the zio_impl.h include
in zio.h and update zio_inject.c to include zio.h instead of zio_impl.h.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes#5439
Use it for spa_deadman, zpl_posix_acl_free, snapentry_expire.
This free system_taskq from the above long delay tasks, and allow us to do
taskq_wait_outstanding on system_taskq without being blocked forever, making
system_taskq more generic and useful.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
In multiple cases zio_buf_alloc() was used instead of kmem_alloc()
or vmem_alloc(). This was often done because the allocations
could be large and it was easy to use zfs_buf_alloc() for them.
But this isn't ideal for allocations which are small or short
lived. In these cases it is better to use kmem_alloc() or
vmem_alloc(). If possible we want to avoid the case where
we have slabs allocated for kmem caches which are rarely used.
Note for small allocations vmem_alloc() will be internally
converted to kmem_alloc(). Therefore as long as large
allocations are infrequent and short lived the penalty for
using vmem_alloc() is small.
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#5409
* Convert ABD to use the Linux Kernel scatterlist implementation
instead of the hand rolled one from illumos.
* Scatter ABDs are preferentially populated with higher order
compound pages from a single zone. Allocation size is
progressively decreased until it can be satisfied without
performing reclaim or compaction.
* An alternate page allocator is provided for kernels older
than 3.6 and for CONFIG_HIGHMEM systems. This allocator
is designed as a fallback for maximum compatibility.
* Extended abdstats to provide visibility in the the allocator.
* Add cached value for PAGESIZE in userspace.
Contributions-by:
Chunwei Chen <david.chen@osnexus.com>
Gvozden Neskovic <neskovic@gmail.com>
Jinshan Xiong <jinshan.xiong@intel.com>
Isaac Huang <he.huang@intel.com>
David Quigley <david.quigley@intel.com>
Brian Behlendorf <behlendorf1@llnl.gov>
Enable vectorized raidz code on ABD buffers. The avx512f,
avx512bw, neon and aarch64_neonx2 are disabled in this commit.
With the exception of avx512bw these implementations are
updated for ABD in the subsequent commits.
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
* userspace: aligned buffers. Minimum of 32B alignment is
needed for AVX2. Kernel buffers are aligned 512B or more.
* add abd_get_offset_size() interface
* abd_iter_map(): fix calculation of iter_mapsize
* add abd_raidz_gen_iterate() and abd_raidz_rec_iterate()
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Linux kernel commit 723c038475b78 removed this field.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Closes#5393
CID 147540: unsigned_compare
- Cast nsec to a int32_t to properly detect the expected overflow.
CID 147542: unsigned_compare
- intval can never be less than ZIO_FAILURE_MODE_WAIT which is
defined to be zero. Remove this useless check.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes#5379
It's used by Lustre to determine if the objset can be upgraded.
The inline version doesn't work because dmu_objset_is_snapshot()
is not exported.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
Closes#5385
Linux 3.14 introduces inode->set_acl(). Normally, acl modification will come
from setxattr, which will handle by the acl xattr_handler, and we already
handles that well. However, nfsd will directly calls inode->set_acl or
return error if it doesn't exists.
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed-by: Massimo Maggi <me@massimo-maggi.eu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes#5371Closes#5375
The phase 2 work primarily entails the Diagnosis Engine and
the Retire Agent modules. It also includes infrastructure
to support a crude FMD environment to host these modules.
The Diagnosis Engine consumes I/O and checksum ereports and
feeds them into a SERD engine which will generate a corres-
ponding fault diagnosis when the SERD engine fires. All the
diagnosis state data is collected into cases, one case per
vdev being tracked.
The Retire Agent responds to diagnosed faults by isolating
the faulty VDEV. It will notify the ZFS kernel module of
the new VDEV state (degraded or faulted). This agent is
also responsible for managing hot spares across pools.
When it encounters a device fault or a device removal it
replaces the device with an appropriate spare if available.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@intel.com>
Closes#5343
Currently every calls to zpl_posix_acl_release will schedule a delayed task,
and each delayed task will add a timer. This used to be fine except for
possibly bad performance impact.
However, in Linux 4.8, a new timer wheel implementation[1] is introduced. In
this new implementation, the larger the delay, the less accuracy the timer is.
So when we have a flood of timer from zpl_posix_acl_release, they will expire
at the same time. Couple with the fact that task_expire will do linear search
with lock held. This causes an extreme amount of contention inside interrupt
and would actually lockup the system.
We fix this by doing batch free to prevent a flood of delayed task. Every call
to zpl_posix_acl_release will put the posix_acl to be freed on a lockless
list. Every batch window, 1 sec, the zpl_posix_acl_free will fire up and free
every posix_acl that passed the grace period on the list. This way, we only
have one delayed task every second.
[1] https://lwn.net/Articles/646950/
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Only restrict the maximum zio alloc size to 32-bit kernel space.
The same virtual address space limitations don't apply to user
space. This resolves a memory allocation failure in raidz_test
where it expects to be able to exercises all valid zio sizes.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This is the Fletcher4 algorithm implemented in pure C, but using
multiple counters using algorithms identical to those used for
SSE/NEON and AVX2.
This allows for faster execution on core with strong superscalar
capabilities but weak SIMD capabilities.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.dolbeau@atos.net>
Closes#5317
Linux 3.11 add O_TMPFILE to open(2), which allow creating an unlinked file on
supported filesystem. It's basically doing open(2) and unlink(2) atomically.
The filesystem support is added through i_op->tmpfile. We basically copy the
create operation except we get rid of the link and name related stuff and add
the new node to unlinked set.
We also add support for linkat(2) to link tmpfile. However, since all previous
file operation will skip ZIL, we force a txg_wait_synced to make sure we are
sync safe.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Currently, doing things like fsetxattr(2) on an unlinked file will result in
ENODATA. There's two places that cause this: zfs_dirent_lock and zfs_zget.
The fix in zfs_dirent_lock is pretty straightforward. In zfs_zget though, we
need it to not return error when the zp is unlinked. This is a pretty big
change in behavior, but skimming through all the callers, I don't think this
change would cause any problem. Also there's nothing preventing z_unlinked
from being set after the z_lock mutex is dropped before but before zfs_zget
returns anyway.
The rest of the stuff is to make sure we don't log xattr stuff when owner is
unlinked.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
avx512f should work on all AVX512 hardware, since it only uses
Foundation instructions.
avx512bw should be faster on hardware supporting the AVW512BW
extension. We can use full-width pshufb (instead of relying on the 256
bits AVX2 pshufb). As a side-effect, the code is also unrolled more.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.github@dolbeau.name>
Closes#5219
On error dsl_prop_get_all_ds() does not free the nvlist it allocates.
This behavior may have been intentional when originally written
but is atypical and often confusing. Since no callers rely on this
behavior the function has been updated to always free the nvlist
on error.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: BearBabyLiu <liu.huang@zte.com.cn>
Closes#5320
On 32-bit Linux systems use vmem_size() to correctly size the ARC
and better determine when IO should be throttle due to low memory.
On 64-bit systems this change has no effect since the virtual
address space available far exceeds the physical memory available.
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #5347
A limit of 1TB exists for zvols on 32-bit systems. Update the code
to correctly reflect this limitation in a similar manor as the
OpenZFS implementation.
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #5347
Originally the .zfs/snapshot directory was disabled for 32-bit systems
because 64-bit inode numbers were not supported. This is no longer
the case and this functionality can be enabled by default.
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #5347Closes#2002
Add the TASKQID_INVALID macros and update callers to use the macro
instead of testing against 0. There is no functional change
even though the functions in zfs_ctldir.c incorrectly used -1
instead of 0.
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #5347
Replace magic value 16 with ARRAY_SIZE() to correctly handle
when the sa_legacy_attrs array size changes.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Closes#5354
CID 147509: Explicit null dereferenced
- l2arc_sublist_lock is fragile as relied on caller too much.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: GeLiXin <ge.lixin@zte.com.cn>
Closes#5319
Ubuntu added support for checking inode permissions to lookup_bdev() in kernel
commit 193fb6a2c94fab8eb8ce70a5da4d21c7d4023bee (merged in 4.4.0-6.21).
Upstream bug: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1636517
This patch adds a test for Ubuntu's variant of lookup_bdev() to configure and
calls the function in the correct way.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Hajo Möller <dasjoe@gmail.com>
Closes#5336
Until it can be determined definitively that a performance
regression wasn't introduced accidentally by 3dfb57a this
functionality is being disabled by default. It can be re-
enabled by setting zio_dva_throttle_enabled=1.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#5335
Issue #5289
- Fix autoreplace behaviour on statechange-led.sh script.
ZED sends the following events on an auto-replace:
1. statechange: Disk goes UNAVAIL->ONLINE
2. statechange: Disk goes ONLINE->UNAVAIL
3. vdev_attach: Disk goes ONLINE
Events 1-2 happen when ZED first attempts to do an auto-online. When that
fails, ZED then tries an auto-replace, generating the vdev_attach event in #3.
In the previous code, statechange-led was only looking at the UNAVAIL->ONLINE
transition to turn off the LED. It ignored the #2 ONLINE->UNAVAIL transition,
assuming it was just the "old" VDEV going offline. This is problematic, as
a drive can go from ONLINE->UNAVAIL when it's malfunctioning, and we don't want
to ignore that.
This new patch correctly turns on the fault LED every time a drive becomes
UNAVAIL. It also monitors vdev_attach events to trigger turning off the LED
when an auto-replaced disk comes online.
- Remove unnecessary libdevmapper warning with --with-config=kernel
This fixes an unnecessary libdevmapper warning when building
--with-config=kernel. Kernel code does not use libdevmapper, so the warning
is not needed.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#2375Closes#5312Closes#5331
Similar to commit a3600a106. Asm files need an explicit note
that they do not require an executable stack.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Jason Zaman <jason@perfinion.com>
Closes#5332
This is caught by kmemleak when running compress_004_pos
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes#5244Closes#5330
When creating and destroying pools in tight loop it's possible to
exhaust the number of allowed threads on a system. This results
in taskq_create() failling and a NULL dereference.
Resolve the issue by falling back to opening the vdevs all
synchronously.
Reviewed-by: Denys Rtveliashvili <denys@rtveliashvili.name>
Reviewed-by: Håkan Johansson <f96hajo@chalmers.se>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/spl#521
Closes#4637
Previously when a drive faulted, the statechange-led.sh script would lookup
the drive's LED sysfs entry in /sys/block/sd*/device/enclosure_device, and
turn it on. During testing we noticed that if you pulled out a drive, or if
the drive was so badly broken that it no longer appeared to Linux, that the
/sys/block/sd* path would be removed, and the script could not lookup the
LED entry.
To fix this, this patch looks up the disks's more persistent
"/sys/class/enclosure/X:X:X:X/Slot N" LED sysfs path at pool import. It then
passes that path to the statechange-led script to use, rather than having the
script look it up on the fly. This allows the script to turn on/off the slot
LEDs even when the drive is missing.
Closes#5309Closes#2375
This is not useful on micro-architecture with a weak NEON
implementation (only 64 bits); the native version is slower &
the byteswap barely faster than scalar. On A53 or A57, it's
a small improvement on scalar but OK for byteswap.
Results from an A53 system:
0 0 0x01 -1 0 1499068294333000 1499101101878000
implementation native byteswap
scalar 1008227510 755880264
aarch64_neon 1198098720 1044818671
fastest aarch64_neon aarch64_neon
Results from a A57 system:
0 0 0x01 -1 0 4407214734807033 4407233933777404
implementation native byteswap
scalar 2302071241 1124873346
aarch64_neon 2542214946 2245570352
fastest aarch64_neon aarch64_neon
Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.dolbeau@atos.net>
Closes#5248
The AVL tree compare function requires that either -1, 0, or 1 be
returned. However the strcmp() function only guarantees that a
negative, zero, or positive value is returned. Therefore, the
return value of strcmp() needs to be sanitized with AVL_ISIGN.
This was initially overlooked because the x86_64 implementation
of strcmp() happens to only returns the allowed values. This
was observed on an aarch64 platform which behaves correctly but
differently as described above.
Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#5311Closes#5313
In torvalds/linux@31051c8 the inode_change_ok() function was
renamed setattr_prepare() and updated to take a dentry ratheri
than an inode. Update the code to call the setattr_prepare()
and add a wrapper function which call inode_change_ok() for
older kernels.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Requires-spl: refs/pull/581/head
In Linux 4.9, torvalds/linux@fd50eca, iops->{set,get,remove}xattr and
generic_{set,get,remove}xattr are removed. xattr operations will directly
go through sb->s_xattr.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
In Linux 4.9, torvalds/linux@2773bf0, iops->rename() and iops->rename2() are
merged together into iops->rename(), it now wants flags.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
These operations are dir specific, there's no point putting them in
zpl_inode_operations which is for regular files.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
1. Enable multipath autoreplace support for FMA.
This extends FMA autoreplace to work with multipath disks. This
requires libdevmapper to be installed at build time.
2. Turn on/off fault LEDs when VDEVs become degraded/faulted/online
Set ZED_USE_ENCLOSURE_LEDS=1 in zed.rc to have ZED turn on/off the enclosure
LED for a drive when a drive becomes FAULTED/DEGRADED. Your enclosure must
be supported by the Linux SES driver for this to work. The enclosure LED
scripts work for multipath devices as well. The scripts will clear the LED
when the fault is cleared.
3. Rate limit ZIO delay and checksum events so as not to flood ZED
ZIO delay and checksum events are rate limited to 5/sec in the zfs module.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#2449Closes#3017Closes#5159
CID 150926: Unchecked return value (CHECKED_RETURN)
- This case cannot occur given the existing taskq implementation
and flags passed to task_dispatch().
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn>
Closes#5272
Accidentally introduced by 3dfb57a, when building with debugging
disabled several variables are unused. Resolve this by wrapping
them in ASSERTV to remove them for non-debug builds.
Reviewed by: Don Brady <don.brady@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#5284
CID 150924: Unchecked return value (CHECKED_RETURN)
- On taskq_dispatch failure the reference must be dropped and
this entry can be safely skipped. This case should be impossible
in the existing implementation but should be handled regardless.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn>
Closes#5278
OpenZFS 7090 - zfs should throttle allocations
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Ported-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
When write I/Os are issued, they are issued in block order but the ZIO
pipeline will drive them asynchronously through the allocation stage
which can result in blocks being allocated out-of-order. It would be
nice to preserve as much of the logical order as possible.
In addition, the allocations are equally scattered across all top-level
VDEVs but not all top-level VDEVs are created equally. The pipeline
should be able to detect devices that are more capable of handling
allocations and should allocate more blocks to those devices. This
allows for dynamic allocation distribution when devices are imbalanced
as fuller devices will tend to be slower than empty devices.
The change includes a new pool-wide allocation queue which would
throttle and order allocations in the ZIO pipeline. The queue would be
ordered by issued time and offset and would provide an initial amount of
allocation of work to each top-level vdev. The allocation logic utilizes
a reservation system to reserve allocations that will be performed by
the allocator. Once an allocation is successfully completed it's
scheduled on a given top-level vdev. Each top-level vdev maintains a
maximum number of allocations that it can handle (mg_alloc_queue_depth).
The pool-wide reserved allocations (top-levels * mg_alloc_queue_depth)
are distributed across the top-level vdevs metaslab groups and round
robin across all eligible metaslab groups to distribute the work. As
top-levels complete their work, they receive additional work from the
pool-wide allocation queue until the allocation queue is emptied.
OpenZFS-issue: https://www.illumos.org/issues/7090
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4756c3d7Closes#5258
Porting Notes:
- Maintained minimal stack in zio_done
- Preserve linux-specific io sizes in zio_write_compress
- Added module params and documentation
- Updated to use optimize AVL cmp macros
The ICP requires destructors to for each crypto module that is added.
These do not necessarily exist in Illumos because they assume that
these modules can never be unloaded from the kernel. Some of this
cleanup code was missed when #4760 was merged, resulting in leaks.
This patch simply fixes that.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Issue #4760Closes#5265
Fix use after free in zfsctl_snapshot_unmount(). Use /usr/bin/env
instead of /bin/sh to fix a shell code injection flaw and allow use
with grsecurity.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Stian Ellingsen <stian@plaimi.net>
Closes#5250Closes#4377
This is as much an upstream compatibility as it's a bit of a performance
gain.
The illumos taskq implemention doesn't allow a TASKQ_THREADS_CPU_PCT type
to be dynamic and in fact enforces as much with an ASSERT.
As to performance, if this taskq is dynamic, it can cause excessive
contention on tq_lock as the threads are created and destroyed because it
can see bursts of many thousands of tasks in a short time, particularly
in heavy high-concurrency zvol write workloads.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#5236
When #4760 was merged tests were added to ensure that the new checksums
were working properly. However, some of the functionality for sha2
functions were not ported over, resulting in some Coverity defects and
code that would be unstable when needed in the future. This patch
simply ports over the missing code and fixes the defects in the
process.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Issue #4760Closes#5251
The following new test cases need to have execute permissions set:
userquota/groupspace_003_pos.ksh
userquota/userquota_013_pos.ksh
userquota/userspace_003_pos.ksh
upgrade/upgrade_userobj_001_pos.ksh
upgrade/setup.ksh
upgrade/cleanup.ksh
The following source files accidentally were marked executable:
lib/libzpool/kernel.c
lib/libshare/nfs.c
lib/libzfs/libzfs_dataset.c
lib/libzfs/libzfs_util.c
tests/zfs-tests/cmd/rm_lnkcnt_zero_file/rm_lnkcnt_zero_file.c
tests/zfs-tests/cmd/dir_rd_update/dir_rd_update.c
cmd/zed/zed_exec.c
module/icp/core/kcf_sched.c
module/zfs/dsl_pool.c
module/zfs/arc.c
module/nvpair/nvpair.c
man/man5/zfs-module-parameters.5
Reviewed-by: GeLiXin <ge.lixin@zte.com.cn>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#5241
Call mount and umount via /usr/bin/env instead of /bin/sh in
zfsctl_snapshot_mount() and zfsctl_snapshot_unmount().
This change fixes a shell code injection flaw. The call to /bin/sh
passed the mountpoint unescaped, only surrounded by single quotes. A
mountpoint containing one or more single quotes would cause the command
to fail or potentially execute arbitrary shell code.
This change also provides compatibility with grsecurity patches.
Grsecurity only allows call_usermodehelper() to use helper binaries in
certain paths. /usr/bin/* is allowed, /bin/* is not.
OpenZFS decided that ignore_hole_birth was too imprecise and
incorrect a name (and went with send_holes_without_birth_time).
Rename it in ZoL too, while keeping the name "ignore_hole_birth"
pointing to the same variable for existing consumers.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes#5239
Updating vd->vdev_parent->vdev_nonrot in vdev_open_child()
is a race when vdev_open_child is called for many children
from a task queue.
vdev_open_child() is only called by vdev_open_children(), let
the latter update the parent vdev_nonrot member. The update
was already there, so done twice previously. Thus using the
same logic at the end in vdev_open_children() to update
vdev_nonrot, either we are vdev_uses_zvols() or not.
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Haakan T Johansson <f96hajo@chalmers.se>
Closes#5162
Fixes ABI issues with fletcher4 code, adds support for
incremental updates, and adds ztest method for testing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes#5164
Using a benchmark which creates 2 million files in one TXG, I observe
that the thread running spa_sync() is on CPU almost the entire time we
are syncing, and therefore can be a performance bottleneck. About 50% of
the time in spa_sync() is in dmu_objset_do_userquota_updates().
The problem is that dmu_objset_do_userquota_updates() calls
zap_increment_int(DMU_USERUSED_OBJECT) once for every file that was
modified (or created). In this benchmark, all the files are owned by the
same user/group, so all 2 million calls to zap_increment_int() are
modifying the same entry in the zap. The same issue exists for the
DMU_GROUPUSED_OBJECT.
We should keep an in-memory map from user to space delta while we are
syncing, and when we finish, iterate over the in-memory map and modify
the ZAP once per entry. This reduces the number of calls to
zap_increment_int() from "number of objects modified" to "number of
owners/groups of modified files".
This reduced the time spent in spa_sync() in the file create benchmark
by ~33%, from 11 seconds to 7 seconds.
Upstream bugs: DLPX-44799
Ported by: Ned Bass <bass6@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/6988
ZFSonLinux-issue: https://github.com/zfsonlinux/zfs/issues/4642
OpenZFS-commit: unmerged
Porting notes:
- Added curly braces around declaration of userquota_cache_t cache to
quiet compiler warning;
- Handled the userobj accounting the same way it proposed in this path.
Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
This patch tracks dnode usage for each user/group in the
DMU_USER/GROUPUSED_OBJECT ZAPs. ZAP entries dedicated to dnode
accounting have the key prefixed with "obj-" followed by the UID/GID
in string format (as done for the block accounting).
A new SPA feature has been added for dnode accounting as well as
a new ZPL version. The SPA feature must be enabled in the pool
before upgrading the zfs filesystem. During the zfs version upgrade,
a "quotacheck" will be executed by marking all dnode as dirty.
ZoL-bug-id: https://github.com/zfsonlinux/zfs/issues/3500
Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
Signed-off-by: Johann Lombardi <johann.lombardi@intel.com>
Move the synchronization of inode/znode i_flgas/pflags into
the respective internal zfs function. This is mostly
mechanical work and shouldn't introduce any functional
changes.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Issue #227Closes#5223
Init, compute, and fini methods are changed to work on internal context object.
This is necessary because ABI does not guarantee that SIMD registers will be preserved
on function calls. This is technically the case in Linux kernel in between
`kfpu_begin()/kfpu_end()`, but it breaks user-space tests and some kernels that
don't require disabling preemption for using SIMD (osx).
Use scalar compute methods in-place for small buffers, and when the buffer size
does not meet SIMD size alignment.
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Combine incrementally computed fletcher4 checksums. Checksums are combined
a posteriori, allowing for parallel computation on chunks to be implemented if
required. The algorithm is general, and does not add changes in each SIMD
implementation.
New test in ztest verifies incremental fletcher computations.
Checksum combining matrix for two buffers `a` and `b`, where `Ca` and `Cb` are
respective fletcher4 checksums, `Cab` is combined checksum, `s` is size of buffer
`b` (divided by sizeof(uint32_t)) is:
Cab[A] = Cb[A] + Ca[A]
Cab[B] = Cb[B] + Ca[B] + s * Ca[A]
Cab[C] = Cb[C] + Ca[C] + s * Ca[B] + s(s+1)/2 * Ca[A]
Cab[D] = Cb[D] + Ca[D] + s * Ca[C] + s(s+1)/2 * Ca[B] + s(s+1)(s+2)/6 * Ca[A]
NOTE: this calculation overflows for larger buffers. Thus, internally, the calculation
is performed on 8MiB chunks.
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Authored by: ilovezfs <ilovezfs@icloud.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported by: Tony Hutter <hutter2@llnl.gov>
In any pool without the extensible dataset feature flag already enabled,
creating a dataset with dedup set to use one of the new checksums would
result in the following panic as soon as any data was added:
panic[cpu0]/thread=ffffff0006761c40: feature_get_refcount(spa, feature,
&refcount) != 48 (0x30 != 0x30), file: ../../common/fs/zfs/zfeature.c
line 390
Inpsection showed that feature->fi_feature was 7, which is the value of
SPA_FEATURE_EXTENSIBLE_DATASET in the spa_feature enum. This commit
adds extensible dataset as a dependency for the sha512, edonr, and skein
feature flags, which prevents the panic.
OpenZFS-issue: https://www.illumos.org/issues/6585
OpenZFS-commit: 892586e8a1
Porting Notes:
This code was originally from Illumos, but I actually ported it from:
openzfsonosx/zfs@b62a652
Authored by: ilovezfs <ilovezfs@icloud.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: Tony Hutter <hutter2@llnl.gov>
zio_checksum_to_feature() expects a zio_checksum enum not a raw property
intval, so the new checksums weren't being detected when the
ZIO_CHECKSUM_VERIFY flag got in the way.
Given a pool without feature@sha512,
zfs create -o dedup=sha512 naughty/fivetwelve_noverify_ds
would fail as expected since the raw intval would indeed be equal to
SPA_FEATURE_SHA512.
However,
zfs create -o dedup=sha512,verify naughty/fivetwelve_verify_ds
would incorrectly succeed because ZIO_CHECKSUM_VERIFY would be in the
way, the raw intval would not be a member of the enum, and
zio_checksum_to_feature() would return SPA_FEATURE_NONE, with the result
that spa_feature_is_enabled() would never be called.
This was first detected with edonr, since in that case verify is
required.
This commit clears the ZIO_CHECKSUM_VERIFY flag before calling
zio_checksum_to_feature() using the ZIO_CHECKSUM_MASK and verifies in
zio_checksum_to_feature() that ZIO_CHECKSUM_MASK has been applied by the
caller to attempt to prevent the same bug from occurring again in the
future.
OpenZFS-issue: https://www.illumos.org/issues/6541
OpenZFS-commit: 971640e6aa
Porting notes:
This code was originally from Illumos, but I actually ported it from:
openzfsonosx/zfs@bef06e1
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported by: Tony Hutter <hutter2@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/4185
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/45818ee
Porting Notes:
This code is ported on top of the Illumos Crypto Framework code:
b5e030c8db
The list of porting changes includes:
- Copied module/icp/include/sha2/sha2.h directly from illumos
- Removed from module/icp/algs/sha2/sha2.c:
#pragma inline(SHA256Init, SHA384Init, SHA512Init)
- Added 'ctx' to lib/libzfs/libzfs_sendrecv.c:zio_checksum_SHA256() since
it now takes in an extra parameter.
- Added CTASSERT() to assert.h from for module/zfs/edonr_zfs.c
- Added skein & edonr to libicp/Makefile.am
- Added sha512.S. It was generated from sha512-x86_64.pl in Illumos.
- Updated ztest.c with new fletcher_4_*() args; used NULL for new CTX argument.
- In icp/algs/edonr/edonr_byteorder.h, Removed the #if defined(__linux) section
to not #include the non-existant endian.h.
- In skein_test.c, renane NULL to 0 in "no test vector" array entries to get
around a compiler warning.
- Fixup test files:
- Rename <sys/varargs.h> -> <varargs.h>, <strings.h> -> <string.h>,
- Remove <note.h> and define NOTE() as NOP.
- Define u_longlong_t
- Rename "#!/usr/bin/ksh" -> "#!/bin/ksh -p"
- Rename NULL to 0 in "no test vector" array entries to get around a
compiler warning.
- Remove "for isa in $($ISAINFO); do" stuff
- Add/update Makefiles
- Add some userspace headers like stdio.h/stdlib.h in places of
sys/types.h.
- EXPORT_SYMBOL *_Init/*_Update/*_Final... routines in ICP modules.
- Update scripts/zfs2zol-patch.sed
- include <sys/sha2.h> in sha2_impl.h
- Add sha2.h to include/sys/Makefile.am
- Add skein and edonr dirs to icp Makefile
- Add new checksums to zpool_get.cfg
- Move checksum switch block from zfs_secpolicy_setprop() to
zfs_check_settable()
- Fix -Wuninitialized error in edonr_byteorder.h on PPC
- Fix stack frame size errors on ARM32
- Don't unroll loops in Skein on 32-bit to save stack space
- Add memory barriers in sha2.c on 32-bit to save stack space
- Add filetest_001_pos.ksh checksum sanity test
- Add option to write psudorandom data in file_write utility
This re-use the framework established for SSE2, SSSE3 and
AVX2. However, GCC is using FP registers on Aarch64, so
unlike SSE/AVX2 we can't rely on the registers being left alone
between ASM statements. So instead, the NEON code uses
C variables and GCC extended ASM syntax. Note that since
the kernel explicitly disable vector registers, they
have to be locally re-enabled explicitly.
As we use the variable's number to define the symbolic
name, and GCC won't allow duplicate symbolic names,
numbers have to be unique. Even when the code is not
going to be used (e.g. the case for 4 registers when
using the macro with only 2). Only the actually used
variables should be declared, otherwise the build
will fails in debug mode.
This requires the replacement of the XOR(X,X) syntax
by a new ZERO(X) macro, which does the same thing but
without repeating the argument. And perhaps someday
there will be a machine where there is a more efficient
way to zero a register than XOR with itself. This affects
scalar, SSE2, SSSE3 and AVX2 as they need the new macro.
It's possible to write faster implementations (different
scheduling, different unrolling, interleaving NEON and
scalar, ...) for various cores, but this one has the
advantage of fitting in the current state of the code,
and thus is likely easier to review/check/merge.
The only difference between aarch64-neon and aarch64-neonx2
is that aarch64-neonx2 unroll some functions some more.
Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.dolbeau@atos.net>
Closes#4801
In the default case the function must return to avoid dereferencing
'prov_mech' which will be NULL.
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: candychencan <chen.can2@zte.com.cn>
Closes#5134
coverity scan CID:147531,type: Argument cannot be negative
- may copy data with negative size
coverity scan CID:147532,type: resource leaks
- may close a fd which is negative
coverity scan CID:147533,type: resource leaks
- may call pwrite64 with a negative size
coverity scan CID:147535,type: resource leaks
- may call fdopen with a negative fd
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: GeLiXin <ge.lixin@zte.com.cn>
Closes#5176
Cppcheck 1.63 erroneously complains about an uninitialized value
in buf_init(). Newer versions of cppcheck (1.72) handle this
correctly but we'll initialize the value anyway to silence the
warning.
Reviewed-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#5203
Avoid calculating (1<<64) if lh_prefix_len == 0. Semantics of the method remain
the same.
Assert (lh_prefix_len > 0) in zap_expand_leaf() to detect possibly the same
problem.
Issue #4883
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Explicitly promote variables to correct type. Undefined behavior is
reported because length of int is not well defined by C standard.
Issue #4883
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Undefined operation is reported by running ztest (or zloop) compiled with GCC
UndefinedBehaviorSanitizer. Error only happens on top level of dnode indirection
with large enough offset values. Logically, left shift operation would work,
but bit shift semantics in C, and limitation of uint64_t, do not produce desired
result.
Issue #5059, #4883
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Without plugging, the default 'noop' scheduler will not merge
the BIOs which are part of a large ZIO.
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Isaac Huang <he.huang@intel.com>
Closes#5181
Refactor the code in such a way so that inode->i_mode is being set
at the same time zp->z_mode is being changed. This has the effect of
keeping both in sync without relying on zfs_inode_update.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Closes#5158
In arc_state_fini() the `arc_l2c_only->arcs_list[*]` multilists
must be destroyed. This accidentally regressed in d3c2ae1c.
Reviewed by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #5151Closes#5152
We must not use d_add_ci if the dentry already has the real name. Otherwise,
d_add_ci()->d_alloc_parallel() will find itself on the lookup hash and wait
on itself causing deadlock.
Tested-by: satmandu
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes#5124Closes#5141Closes#5147Closes#5148
coverity scan CID:147633,type: sizeof not portable
coverity scan CID:147637,type: sizeof not portable
coverity scan CID:147638,type: sizeof not portable
coverity scan CID:147640,type: sizeof not portable
In these particular cases sizeof (XX **) happens to be equal to sizeof (X *),
but this is not a portable assumption.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn>
Closes#5144
dbuf_read_impl() returns (SET_ERROR(err)) when err can be 0, which adds
lots of noise in tracing logs.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Isaac Huang <he.huang@intel.com>
Closes#4430Closes#5146
The type of "adjustmnt" was erroneously changed to unsigned when the compressed
ARC code was ported in d3c2ae1c08.
As a result of it being unsigned, the balanced metadata eviction logic
would evict all of the non-metadata.
Reviewed-by: Chris Severance <github.severach@spamgourmet.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: David Quigley <david.quigley@intel.com>
Signed-off-by: Tim Chase <tim@onlight.com>
Closes#5128Closes#5129
Enable ignore_hole_birth by default until all known hole birth bugs
have been resolved and relevant test cases added.
Reviewed-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4809Closes#5099
Simplify time handling in zfs_setattr by mimicking the logic in
setattr_copy from the linux kernel. In order to achieve this
in the case when ZFS' log is being replayed it is necessary
to unconditionally set the ctime in zfs_replay_setattr.
Also use the timespec_trunc function when assigning values to the
generic inode struct. This is currently a noop since zfs sets
s_time_gran to 1, however in the future rules about precision might
change.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Closes#4916
ZFS doesn't provide a custom update_time method meaning it delegates
this job to the generic VFS layer. The only time when it needs to
set the various *time values is when the inode is being marshalled
to/from the disk. Do this by moving the relevant code from
zfs_inode_update_impl to zfs_node_alloc and zfs_rezget. As a result
from this change it is no longer necessary to have multiple versions
of the zfs_inode_update function - so just nuke them and leave only
one.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Issue #227Closes#4916
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported by: David Quigley <david.quigley@intel.com>
This review covers the reading and writing of compressed arc headers, sharing
data between the arc_hdr_t and the arc_buf_t, and the implementation of a new
dbuf cache to keep frequently access data uncompressed.
I've added a new member to l1 arc hdr called b_pdata. The b_pdata always hangs
off the arc_buf_hdr_t (if an L1 hdr is in use) and points to the physical block
for that DVA. The physical block may or may not be compressed. If compressed
arc is enabled and the block on-disk is compressed, then the b_pdata will match
the block on-disk and remain compressed in memory. If the block on disk is not
compressed, then neither will the b_pdata. Lastly, if compressed arc is
disabled, then b_pdata will always be an uncompressed version of the on-disk
block.
Typically the arc will cache only the arc_buf_hdr_t and will aggressively evict
any arc_buf_t's that are no longer referenced. This means that the arc will
primarily have compressed blocks as the arc_buf_t's are considered overhead and
are always uncompressed. When a consumer reads a block we first look to see if
the arc_buf_hdr_t is cached. If the hdr is cached then we allocate a new
arc_buf_t and decompress the b_pdata contents into the arc_buf_t's b_data. If
the hdr already has a arc_buf_t, then we will allocate an additional arc_buf_t
and bcopy the uncompressed contents from the first arc_buf_t to the new one.
Writing to the compressed arc requires that we first discard the b_pdata since
the physical block is about to be rewritten. The new data contents will be
passed in via an arc_buf_t (uncompressed) and during the I/O pipeline stages we
will copy the physical block contents to a newly allocated b_pdata.
When an l2arc is inuse it will also take advantage of the b_pdata. Now the
l2arc will always write the contents of b_pdata to the l2arc. This means that
when compressed arc is enabled that the l2arc blocks are identical to those
stored in the main data pool. This provides a significant advantage since we
can leverage the bp's checksum when reading from the l2arc to determine if the
contents are valid. If the compressed arc is disabled, then we must first
transform the read block to look like the physical block in the main data pool
before comparing the checksum and determining it's valid.
OpenZFS-issue: https://www.illumos.org/issues/6950
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7fc10f0
Issue #5078
Several assignments to arc_c had no effect because it is ultimately
initialized to arc_c_max.
This aligns ZoL better with the upstream code which removed these
assignments some time ago.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@onlight.com>
Closes#5081
In case sav->sav_config was NULL the body of the function
would skip the iteration of the l2 cache devices and will
just cleanup the old devices. However, this wasn't very obvious
since the null check was performed after the loop body and after
the old devices were cleaned. Refactor the code so that it's now
obvious when the iteration of the l2cache devices is skipped.
This fixes the following cppcheck warning:
[module/zfs/spa.c:1552]: (error) Possible null pointer dereference: newvdevs
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Closes#5087
Since they're allocated with spa_strdup(), they should be freed with
spa_strfree() so the proper length buffer is freed.
Reviewed-by: Richard Yao <ryao@gentoo.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#5082Closes#5086
This first phase brings over the ZFS SLM module, zfs_mod.c, to handle
auto operations in response to disk events. Disk event monitoring is
provided from libudev and generates the expected payload schema for
zfs_mod. This work leverages the recently added devid and phys_path
strings in the vdev label.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@intel.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#4673
These allocations can never fail. Leaving the error handling
code here gives the impression they can so it has been removed.
Signed-off-by: luozhengzheng <luo.zhengzheng@zte.com.cn>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#5048
perf: 2.75x faster ddt_entry_compare()
First 256bits of ddt_key_t is a block checksum, which are expected
to be close to random data. Hence, on average, comparison only needs to
look at first few bytes of the keys. To reduce number of conditional
jump instructions, the result is computed as: sign(memcmp(k1, k2)).
Sign of an integer 'a' can be obtained as: `(0 < a) - (a < 0)` := {-1, 0, 1} ,
which is computed efficiently. Synthetic performance evaluation of
original and new algorithm over 1G random keys on 2.6GHz Intel(R) Xeon(R)
CPU E5-2660 v3:
old 6.85789 s
new 2.49089 s
perf: 2.8x faster vdev_queue_offset_compare() and vdev_queue_timestamp_compare()
Compute the result directly instead of using conditionals
perf: zfs_range_compare()
Speedup between 1.1x - 2.5x, depending on compiler version and
optimization level.
perf: spa_error_entry_compare()
`bcmp()` is not suitable for comparator use. Use `memcmp()` instead.
perf: 2.8x faster metaslab_compare() and metaslab_rangesize_compare()
perf: 2.8x faster zil_bp_compare()
perf: 2.8x faster mze_compare()
perf: faster dbuf_compare()
perf: faster compares in spa_misc
perf: 2.8x faster layout_hash_compare()
perf: 2.8x faster space_reftree_compare()
perf: libzfs: faster avl tree comparators
perf: guid_compare()
perf: dsl_deadlist_compare()
perf: perm_set_compare()
perf: 2x faster range_tree_seg_compare()
perf: faster unique_compare()
perf: faster vdev_cache _compare()
perf: faster vdev_uberblock_compare()
perf: faster fuid _compare()
perf: faster zfs_znode_hold_compare()
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Richard Elling <richard.elling@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#5033
`zpool get guid,freeing,leaked` shows SOURCE as `default`, it should
be `-` as those props are not editable.
Changed code to not overwrite `src` for `ZPOOL_PROP_VERSION`, so it
stays `ZPROP_SRC_NONE`. Make src const to avoid future mistakes
Signed-off-by: Hajo Möller <dasjoe@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4170
zfsctl_snapdir_inactive is defined in zfs-0.6.3. In zfs-0.6.5.7
this is declaration remains even though the implementation was
removed in commit 278bee93. Removed fastreboot_disable_highpil
which is also unused.
Signed-off-by: caoxuewen cao.xuewen@zte.com.cn
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#5042
From user perspective, I would expect that ZFS is always able
to remove files and directories even when the quota is exceeded.
Authored by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/6940
OpenZFS-issue: https://www.illumos.org/issues/6334
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/9918916Closes#5044
For quite some time I was thinking about possibility to prefetch
ZFS indirection tables while doing sequential reads or writes.
Recent changes in predictive prefetcher made that much easier to
do. My tests on zvol with 16KB block size on 5x striped and 2x
mirrored pool of 10 disks show almost double throughput on sequential
read, and almost tripple on sequential rewrite. While for read alike
effect can be received from increasing maximal prefetch distance
(though at higher memory cost), for rewrite there is no other
solution so far.
Authored by: Alexander Motin <mav@freebsd.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/6322
OpenZFS-commit: https://github.com/illumos/illumos-gate/commit/cb92f413Closes#5040
Porting notes:
- Change from upstream in module/zfs/dbuf.c in 'int dbuf_read' due
to commit 5f6d0b6 'Handle block pointers with a corrupt logical size'
- Difference from upstream in module/zfs/dmu_zfetch.c,
uint32_t zfetch_max_idistance -> unsigned int zfetch_max_idistance
- Variables have been initialized at the beginning of the function
(void dmu_zfetch) to resemble the order of occurrence and account
for C99, C11 mode errors.
In dbuf_dirty(), we need to grab the dn_struct_rwlock before looking at
the db_blkptr, to prevent it from being changed by syncing context.
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/7086
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/98fa317Closes#5039
This fix resolves warnings reported during compiling with different gcc
optimization levels in debug mode,
Test tools:
gcc version 4.4.7 20120313 (Red Hat 4.4.7-16) (GCC)
Linux version: 2.6.32-573.18.1.el6.x86_64, Red Hat Enterprise Linux Server release 6.1 (Santiago)
List of warnings:
CFLAGS=-O1 ./configure --enable-debug ;make
../../module/icp/core/kcf_sched.c: In function ‘kcf_aop_done’:
../../module/icp/core/kcf_sched.c:499: error: ‘fg’ may be used uninitialized in this function
../../module/icp/core/kcf_sched.c:499: note: ‘fg’ was declared here
CFLAGS=-Os ./configure --enable-debug ; make
libzfs_dataset.c: In function ‘zfs_prop_set_list’:
libzfs_dataset.c:1575: error: ‘nvl_len’ may be used uninitialized in this function
Signed-off-by: GeLiXin <ge.lixin@zte.com.cn>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#5022
ARC will evict meta buffers that exceed the arc_meta_limit. Before a further
investigating on whether we should take special protection on meta buffers,
this tunable make arc_meta_limit adjustable for different workloads.
People can set zfs_arc_meta_limit_percent to any value while insmod zfs.ko,
so some range check is added to guarantee a suitable arc_meta_limit.
Suggested by Tim Chase, zfs_arc_dnode_limit is changed to a percent-style
tunable as well.
Signed-off-by: GeLiXin <ge.lixin@zte.com.cn>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4957
As is the case with traverse_prefetch_thread(), the deep stacks caused
by traversal require disabling reclaim in the send traverse thread.
Also, do the same for receive_writer_thread() in which similar problems
have been observed.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4912Closes#4998
API Change: Module parameter set/get methods take const parameter in
Grsecurity kernel v4.7.1
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Jason Zaman <jason@perfinion.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4997Closes#5001
Using a benchmark which has 32 threads creating 2 million files in the
same directory, on a machine with 16 CPU cores, I observed poor
performance. I noticed that dmu_tx_hold_zap() was using about 30% of
all CPU, and doing dnode_hold() 7 times on the same object (the ZAP
object that is being held).
dmu_tx_hold_zap() keeps a hold on the dnode_t the entire time it is
running, in dmu_tx_hold_t:txh_dnode, so it would be nice to use the
dnode_t that we already have in hand, rather than repeatedly calling
dnode_hold(). To do this, we need to pass the dnode_t down through
all the intermediate calls that dmu_tx_hold_zap() makes, making these
routines take the dnode_t* rather than an objset_t* and a uint64_t
object number. In particular, the following routines will need to have
analogous *_by_dnode() variants created:
dmu_buf_hold_noread()
dmu_buf_hold()
zap_lookup()
zap_lookup_norm()
zap_count_write()
zap_lockdir()
zap_count_write()
This can improve performance on the benchmark described above by 100%,
from 30,000 file creations per second to 60,000. (This improvement is on
top of that provided by working around the object allocation issue. Peak
performance of ~90,000 creations per second was observed with 8 CPUs;
adding CPUs past that decreased performance due to lock contention.) The
CPU used by dmu_tx_hold_zap() was reduced by 88%, from 340 CPU-seconds
to 40 CPU-seconds.
Sponsored by: Intel Corp.
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/7004
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/109Closes#4641Closes#4972
zap_lockdir() / zap_unlockdir() should take a "void *tag" argument which
tags the hold on the zap. This will help diagnose programming errors
which misuse the hold on the ZAP.
Sponsored by: Intel Corp.
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Pavel Zakharov <pavel.zakha@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/7003
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/108Closes#4972
When spa retry load succeeds and spa recovery is requested it may
leak in spa_load_best function. Always free the generated config
when it is not assigned to the spa.
Signed-off-by: cao.xuewen <cao.xuewen@zte.com.cn>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4940
This is another bug in the long line of hole-birth related issues. In
this particular case, it was discovered that a previous hole-birth fix
(illumos bug 6513, commit bc77ba73) did not cover as many cases as we
thought it did. While the issue worked in the case of hole-punching
(writing zeroes to a large part of a file), it did not deal with
truncation, and then writing beyond the new end of the file.
The problem is that dbuf_findbp will return ENOENT if the block it's
trying to find is beyond the end of the file. If that happens, we assume
there is no birth time, and so we lose that information when we write
out new blkptrs. We should teach dbuf_findbp to look for things that are
beyond the current end, but not beyond the absolute end of the file.
Authored by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens mahrens@delphix.com
Reviewed by: George Wilson george.wilson@delphix.com
Ported-by: kernelOfTruth <kerneloftruth@gmail.com>
Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/7176
OpenZFS-commit: https://github.com/openzfs/openzfs/pull/173/commits/8b9f3ad
Upstream-bugs: DLPX-46009
Porting notes:
- Fix ISO C90 mixed declaration error in dbuf.c ( int nlevels, epbs; ) ;
keep previous position of the initialization
Under a workload which makes heavy use of `dbuf_hold()`, I noticed that a
considerable amount of time was spent in `dbuf_hold_impl()`, due to its call to
`kmem_zalloc(sizeof (struct dbuf_hold_impl_data) * DBUF_HOLD_IMPL_MAX_DEPTH)`,
which is around 2KiB. This structure is used as a stack, to limit the size of
the C stack as dbuf_hold() calls itself recursively. We make a recursive call
to hold the parent's dbuf when the requested dbuf is not found. The vast
majority of the time, the parent or grandparent indirect dbuf is cached, so the
number of recursive calls is very low. However, we initialize this entire
array for every call to dbuf_hold().
To improve performance, this commit changes `dbuf_hold()` to use `kmem_alloc()`
instead of `kmem_zalloc()`. __dbuf_hold_impl_init is changed to initialize all
members of the struct before they are used. I observed ~5% performance
improvement on a workload which creates many files.
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4974
- Benchmark memory block is increased to 128kiB to reflect real block sizes more
accurately. Measurements include all three stages needed for checksum generation,
i.e. `init()/compute()/fini()`. The inner loop is repeated multiple times to offset
overhead of time function.
- Fastest implementation selects native and byteswap methods independently in
benchmark. To support this new function pointers `init_byteswap()/fini_byteswap()`
are introduced.
- Implementation mutex lock is replaced by atomic variable.
- To save time, benchmark is not executed in userspace. Instead, highest supported
implementation is used for fastest. Default userspace selector is still 'cycle'.
- `fletcher_4_native/byteswap()` methods use incremental methods to finish
calculation if data size is not multiple of vector stride (currently 64B).
- Added `fletcher_4_native_varsize()` special purpose method for use when buffer size
is not known in advance. The method does not enforce 4B alignment on buffer size, and
will ignore last (size % 4) bytes of the data buffer.
- Benchmark `kstat` is changed to match the one of vdev_raidz. It now shows
throughput for all supported implementations (in B/s), native and byteswap,
as well as the code [fastest] is running.
Example of `fletcher_4_bench` running on `Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz`:
implementation native byteswap
scalar 4768120823 3426105750
sse2 7947841777 4318964249
ssse3 7951922722 6112191941
avx2 13269714358 11043200912
fastest avx2 avx2
Example of `fletcher_4_bench` running on `Intel(R) Xeon Phi(TM) CPU 7210 @ 1.30GHz`:
implementation native byteswap
scalar 1291115967 1031555336
sse2 2539571138 1280970926
ssse3 2537778746 1080016762
avx2 4950749767 1078493449
avx512f 9581379998 4010029046
fastest avx512f avx512f
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4952
Adds a module option which disables the hole_birth optimization
which has been responsible for several recent bugs, including
issue #4050.
Original-patch: https://gist.github.com/pcd1193182/2c0cd47211f3aee623958b4698836c48
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4833
Import a raidz pool which has a vdev with a bad label, zpool status
shows the right state of the dev, but the wrong state of the pool.
The pool state should be DEGRADED, not ONLINE.
We examine the label in vdev_validate while in spa_load_impl, the bad
label can be detected but doesn't propagate its state to the parent.
There are other chances to propagate state in the following vdev_load
if we failed to load DTL, but our pool is raidz1 which can tolerate a
faulted disk. So we lost the last chance to correct the pool state.
Propagate the leaf vdev's state to parent if its label was corrupted,
as is done elsewhere in vdev_validate.
Signed-off-by: GeLiXin <ge.lixin@zte.com.cn>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Don Brady <don.brady@intel.com>
Closes#4948
Authored by: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Josef Sipek <josef.sipek@nexenta.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Signed-off-by: Don Brady <don.brady@intel.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/5997
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/1437283
Porting Notes:
In addition to the OpenZFS changes this patch realigns the events
with those found in OpenZFS.
Events which would be logged as sysevents on illumos have been
been mapped to the 'sysevent' class for Linux. In addition, several
subclass names have been changed to match what is used in OpenZFS.
In all cases this means a '.' was changed to an '_' in the subclass.
The scripts provided by ZoL have been updated, however users which
provide scripts for any of the following events will need to rename
them based on the new subclass names.
ereport.fs.zfs.config.sync sysevent.fs.zfs.config_sync
ereport.fs.zfs.zpool.destroy sysevent.fs.zfs.pool_destroy
ereport.fs.zfs.zpool.reguid sysevent.fs.zfs.pool_reguid
ereport.fs.zfs.vdev.remove sysevent.fs.zfs.vdev_remove
ereport.fs.zfs.vdev.clear sysevent.fs.zfs.vdev_clear
ereport.fs.zfs.vdev.check sysevent.fs.zfs.vdev_check
ereport.fs.zfs.vdev.spare sysevent.fs.zfs.vdev_spare
ereport.fs.zfs.vdev.autoexpand sysevent.fs.zfs.vdev_autoexpand
ereport.fs.zfs.resilver.start sysevent.fs.zfs.resilver_start
ereport.fs.zfs.resilver.finish sysevent.fs.zfs.resilver_finish
ereport.fs.zfs.scrub.start sysevent.fs.zfs.scrub_start
ereport.fs.zfs.scrub.finish sysevent.fs.zfs.scrub_finish
ereport.fs.zfs.bootfs.vdev.attach sysevent.fs.zfs.bootfs_vdev_attach
If there is no explicit note in the .S files, the obj file will mark it
as requiring an executable stack. This is unneeded and causes issues on
hardened systems.
More info:
https://wiki.gentoo.org/wiki/Hardened/GNU_stack_quickstart
Signed-off-by: Jason Zaman <jason@perfinion.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4947Closes#4962
The constify plugin will automatically constify a class of types that contain
only function pointers. The icp structs fail to build if this is enabled with
the following error. The no_const attribute makes the plugin skip those
structs.
module/icp/spi/kcf_spi.c: In function ‘copy_ops_vector_v1’:
module/icp/spi/kcf_spi.c:61:16: error: assignment of read-only location ‘*dst_ops->cou.cou_v1.co_control_ops’
*((dst)->ops) = *((src)->ops);
^
module/icp/spi/kcf_spi.c:74:2: note: in expansion of macro ‘KCF_SPI_COPY_OPS’
KCF_SPI_COPY_OPS(src_ops, dst_ops, co_control_ops);
^
Signed-off-by: Jason Zaman <jason@perfinion.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4947Closes#4962
nvlist_pack() and nvlist_unpack are implemented recursively, which can
cause the stack to overflow with a deeply nested nvlist; i.e. an nvlist
which contains an nvlist, which contains an nvlist, which...
Unprivileged users can pass an nvlist to the kernel via certain ioctls
on /dev/zfs, which the kernel will unpack without additional permission
checking or validation. Therefore, an unprivileged user can cause the
kernel's stack to overflow and panic.
Ideally, these functions would be implemented non-recursively. As a
quick fix, this patch limits the depth of the recursion and returns an
error when attempting to pack and unpack a deeply-nested nvlist.
Signed-off-by: Adam Leventhal <ahl@delphix.com>
Signed-off-by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Prakash Surya <prakash.surya@delphix.com>
OpenZFS-issue: https://www.illumos.org/issues/7263
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/0511d6d
-
Fix bugs due to kernel change in torvalds/linux@4bacc9c923 ("overlayfs:
Make f_path always point to the overlay and f_inode to the underlay").
This problem crashes system when use zfs as a layer of overlayfs.
Signed-off-by: Chen Haiquan <oc@yunify.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4914Closes#4935
The indefinite article before nvlist should be "an", not "a".
We have 27 "an nvlist" and 7 "a nvlist" in our comment, they should
stay the same as we are such a strict filesystem.
Signed-off-by: GeLiXin <ge.lixin@zte.com.cn>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4941
Non-Linux OpenZFS implementations require additional support to be
used a root pool. This code should simply be removed to avoid
confusion and improve readability.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes#4951
All users of bio->bi_rw have been replaced with compatibility wrappers.
This allows the kernel specific logic to be abstracted away, and for
each of the supported cases to be documented with the wrapper. The
updated interfaces are as follows:
* void blk_queue_set_write_cache(struct request_queue *, bool, bool)
* boolean_t bio_is_flush(struct bio *)
* boolean_t bio_is_fua(struct bio *)
* boolean_t bio_is_discard(struct bio *)
* boolean_t bio_is_secure_erase(struct bio *)
* VDEV_WRITE_FLUSH_FUA
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes#4951
This fix resolves warnings reported during compiling of user-space
libraries with different gcc optimization levels.
Tested with gcc versions: 4.9.2 (Debian), and 6.1.1 (Fedora).
The patch enables use of following opt levels: O0, O1, O2, O3, Og, Os, Ofast.
List of warnings:
[GCC 4.9.2 -Os]
libzfs_sendrecv.c:3726:26: error: 'clp' may be used uninitialized in this function [-Werror=maybe-uninitialized]
[GCC 4.9.2 -Og]
fs_fletcher.c:323:26: error: 'idx' may be used uninitialized in this function [-Werror=maybe-uninitialized]
dsl_dataset.c:1290:12: error: 'atp' may be used uninitialized in this function [-Werror=maybe-uninitialized]
[GCC 4.9.2 -Ofast]
u8_textprep.c:1310:9: error: 'tc[3ul]' may be used uninitialized in this function [-Werror=maybe-uninitialized]
u8_textprep.c:177:23: error: 'u8t[0ul]' may be used uninitialized in this function [-Werror=maybe-uninitialized]
dsl_dataset.c:2089:37: error: ‘hds’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
dsl_dataset.c:3216:2: error: ‘ds’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
dsl_dataset.c:1591:2: error: ‘ds’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
dsl_dataset.c:3341:2: error: ‘ds’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
vdev_raidz.c:1153:8: error: 'dcount[2]' may be used uninitialized in this function [-Werror=maybe-uninitialized]
vdev_raidz.c:1167:17: error: 'dst[2]' may be used uninitialized in this function [-Werror=maybe-uninitialized]
kernel.c:1005:2: error: ‘resid’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
libzfs_dataset.c:2826:8: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
libzfs_dataset.c:3056:35: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
libzfs_dataset.c:1584:13: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
libzfs_dataset.c:3056:35: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
libzfs_dataset.c:1792:66: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
libzfs_dataset.c:3986:35: error: ‘val’ may be used uninitialized in this function [-Werror=maybe-uninitialized]
[GCC 6.1.1]
Resolved in PR #4907
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4937
Starting from Linux 4.7, get_acl will set acl cache pointer to temporary
sentinel value before calling i_op->get_acl. Therefore we can't compare
against ACL_NOT_CACHED and return.
Since from Linux 3.14, get_acl already check the cache for us, so we
disable this in zpl_get_acl.
Linux 4.7 also does set_cached_acl for us so we disable it in zpl_get_acl.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4944Closes#4946
The posix_acl_valid() function has been updated to require a
user namespace. Filesystem callers should normally provide the
user_ns from the super block associcated with the ACL; the
zpl_posix_acl_valid() wrapper has been added for this purpose.
See https://github.com/torvalds/linux/commit/0d4d717f for
complete details.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes#4922
Remove ZFS_AC_KERNEL_CURRENT_UMASK and ZFS_AC_KERNEL_POSIX_ACL_CACHING
configure checks, all supported kernel provide this functionality.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes#4922
* When the uid/gid change is handled in zfs_setattr we want to
actually adjust the user passed uid to a KUID and write that to disk.
* In trace points use the i_uid member without doing translation,
since it has already been performed.
* Use kuid in zfs_aclset_common
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4928
When arc_max is increased, arc_meta_limit will not be updated to 3/4
of the new arc_c_max value. This was done originally to preserve any
existing maximum value. This turned out to be counter intuitive to
users and this fix changes that behavior. If zfs_arc_meta_limit is
non-default, it will be picked up later in the ARC tuning function.
Signed-off-by: Gaurav Kumar <gaurav.kumar@nutanix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4893
As of gcc 6.1.1 20160621 (Red Hat 6.1.1-3) a self-comparison is
detected by gcc in metaslab_alloc(). Resolve the warning by passing
a physical size of 0 to BP_SET_BIRTH() as it done by other callers.
module/zfs/metaslab.c: In function ‘metaslab_alloc’:
module/zfs/metaslab.c:2575:184: error: self-comparison always evaluates
to true [-Werror=tautological-compare]
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Issue #4907
Fix a possible VDEV statistics array overflow when ZIOs with
ZIO_PRIORITY_NOW complete.
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4883Closes#4917
Currently i_blkbits is always set to SPA_MINBLOCKSHIFT every time
zfs_inode_update_impl is called. Since this value never changes
move its assignment to at inode creation time.
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4906
The newly added icp module uses a hardcoded value of CDDL for the license,
however in local development one might want to change that to something
else in order to facilitate compiling against lock debugging enabled kernel.
All modules of the zfs use the ZFS_META_LICNSE string which is replaced with
the value held in the META file. One can modify the value in the META file
once and then rerun the configure to have all modules' licenses changed.
Change the icp module license string to be ZFS_META_LICENSE so that it
falls under the same paradigm.
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4905
In zfs_ioc_log_history() function the tsd_set() function is called
with NULL which causes the zfs_allow_log_destroy() to be run. In
this case the passed value will be NULL. This is normally entirely
safe because strfree() maps directly to kfree() which may be passed
a NULL. However, since alternate implementations of strfree() may
not handle this gracefully add a check for NULL.
Observed under an embedded Linux 2.6.32.41 kernel running the
automated testing while running the ZFS Test Suite.
Signed-off-by: caoxuewen <cao.xuewen@zte.com.cn>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4872
New REQ_OP_* definitions have been introduced to separate the
WRITE, READ, and DISCARD operations from the flags. This included
changing the encoding of bi_rw. It places REQ_OP_* in high order
bits and other stuff in low order bits. This encoding is done
through the new helper function bio_set_op_attrs. For complete
details refer to:
https://github.com/torvalds/linux/commit/f215082https://github.com/torvalds/linux/commit/4e1b2d5
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4892Closes#4899
The rw argument has been removed from submit_bio/submit_bio_wait.
Callers are now expected to set bio->bi_rw instead of passing it
in. See https://github.com/torvalds/linux/commit/4e49ea4a for
complete details.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4892
Issue #4899
The memory allocation and locking in `spa_txg_history_*()` can
potentially block txg_hold_open for arbitrarily long periods of time.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4333
Here's the problem - on 4K native devices in userland on
Linux using O_DIRECT, buffers must be 4K aligned or I/O
will fail with EINVAL, causing zdb (and others) to coredump.
Since userland probably doesn't need optimized buffer caches,
we just force 4K alignment on everything.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Closes#4479
DMU_MAX_ACCESS should be cast to a uint64_t otherwise the
multiplication of DMU_MAX_ACCESS with spa_asize_inflation will
be 32 bit and may lead to an overflow. Currently DMU_MAX_ACCESS
is 64 * 1024 * 1024, so spa_asize_inflation being 64 or more will
lead to an overflow.
Found by static analysis with CoverityScan 0.8.5
CID 150942 (#1 of 1): Unintentional integer overflow
(OVERFLOW_BEFORE_WIDEN)
overflow_before_widen: Potentially overflowing expression
67108864 * spa_asize_inflation with type int (32 bits, signed)
is evaluated using 32-bit arithmetic, and then used in a context
that expects an expression of type uint64_t (64 bits, unsigned).
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4889
Metadata-intensive workloads can cause the ARC to become permanently
filled with dnode_t objects as they're pinned by the VFS layer.
Subsequent data-intensive workloads may only benefit from about
25% of the potential ARC (arc_c_max - arc_meta_limit).
In order to help track metadata usage more precisely, the other_size
metadata arcstat has replaced with dbuf_size, dnode_size and bonus_size.
The new zfs_arc_dnode_limit tunable, which defaults to 10% of
zfs_arc_meta_limit, defines the minimum number of bytes which is desirable
to be consumed by dnodes. Attempts to evict non-metadata will trigger
async prune tasks if the space used by dnodes exceeds this limit.
The new zfs_arc_dnode_reduce_percent tunable specifies the amount by
which the excess dnode space is attempted to be pruned as a percentage of
the amount by which zfs_arc_dnode_limit is being exceeded. By default,
it tries to unpin 10% of the dnodes.
The problem of dnode metadata pinning was observed with the following
testing procedure (in this example, zfs_arc_max is set to 4GiB):
- Create a large number of small files until arc_meta_used exceeds
arc_meta_limit (3GiB with default tuning) and arc_prune
starts increasing.
- Create a 3GiB file with dd. Observe arc_mata_used. It will still
be around 3GiB.
- Repeatedly read the 3GiB file and observe arc_meta_limit as before.
It will continue to stay around 3GiB.
With this modification, space for the 3GiB file is gradually made
available as subsequent demands on the ARC are made. The previous behavior
can be restored by setting zfs_arc_dnode_limit to the same value as the
zfs_arc_meta_limit.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4345
Issue #4512
Issue #4773Closes#4858
Prior to b39c22b, which was first generally available in the 0.6.5
release as b39c22b, ZoL never actually submitted synchronous read or write
requests to the Linux block layer. This means the vdev_disk_dio_is_sync()
function had always returned false and, therefore, the completion in
dio_request_t.dr_comp was never actually used.
In b39c22b, synchronous ZIO operations were translated to synchronous
BIO requests in vdev_disk_io_start(). The follow-on commits 5592404 and
aa159af fixed several problems introduced by b39c22b. In particular,
5592404 introduced the new flag parameter "wait" to __vdev_disk_physio()
but under ZoL, since vdev_disk_physio() is never actually used, the wait
flag was always zero so the new code had no effect other than to cause
a bug in the use of the dio_request_t.dr_comp which was fixed by aa159af.
The original rationale for introducing synchronous operations in b39c22b
was to hurry certains requests through the BIO layer which would have
otherwise been subject to its unplug timer which would increase the
latency. This behavior of the unplug timer, however, went away during the
transition of the plug/unplug system between kernels 2.6.32 and 2.6.39.
To handle the unplug timer behavior on 2.6.32-2.6.35 kernels the
BIO_RW_UNPLUG flag is used as a hint to suppress the plugging behavior.
For kernels 2.6.36-2.6.38, the REQ_UNPLUG macro will be available and
ise used for the same purpose.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4858
Silence the following warning when compiling with gcc 5.4.0.
Specifically gcc (Ubuntu 5.4.0-6ubuntu1~16.04.1) 5.4.0 20160609.
module/avl/avl.c: In function ‘avl_add’:
module/avl/avl.c:647:2: warning: ‘where’ may be used uninitialized
in this function [-Wmaybe-uninitialized]
avl_insert(tree, new_node, where);
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Remove duplicate z_uid/z_gid member which are also held in the
generic vfs inode struct. This is done by first removing the members
from struct znode and then using the KUID_TO_SUID/KGID_TO_SGID
macros to access the respective member from struct inode. In cases
where the uid/gids are being marshalled from/to disk, use the newly
introduced zfs_(uid|gid)_(read|write) functions to properly
save the uids rather than the internal kernel representation.
Signed-off-by: Nikolay Borisov <n.borisov.lkml@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4685
Issue #227
Currently there is an issue where metaslab_fastwrite_unmark() unmarks
fastwrites on vdev_t's that have never had fastwrites marked on them.
The 'fastwrite mark' is essentially a count of outstanding bytes that
will be written to a vdev and is used in syncing context. The problem
stems from the fact that the vdev_pending_fastwrite field is not being
transferred over when replacing a top-level vdev. As a result, the
metaslab is marked for fastwrite on the old vdev and unmarked on the
new one, which brings the fastwrite count below zero. This fix simply
assigns vdev_pending_fastwrite from the old vdev to the new one so
this count is not lost.
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4267
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4329
A port of the Illumos Crypto Framework to a Linux kernel module (found
in module/icp). This is needed to do the actual encryption work. We cannot
use the Linux kernel's built in crypto api because it is only exported to
GPL-licensed modules. Having the ICP also means the crypto code can run on
any of the other kernels under OpenZFS. I ended up porting over most of the
internals of the framework, which means that porting over other API calls (if
we need them) should be fairly easy. Specifically, I have ported over the API
functions related to encryption, digests, macs, and crypto templates. The ICP
is able to use assembly-accelerated encryption on amd64 machines and AES-NI
instructions on Intel chips that support it. There are place-holder
directories for similar assembly optimizations for other architectures
(although they have not been written).
Signed-off-by: Tom Caputi <tcaputi@datto.com>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4329
When zfs_domount fails zsb will be freed, and its caller
mount_nodev/get_sb_nodev will do deactivate_locked_super and calls into
zfs_preumount.
In order to make sure we don't touch any nonexistent stuff, we must make sure
s_fs_info is NULL in the fail path so zfs_preumount can easily check that.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4867
Issue #4854
Print table with speed of methods for each implementation.
Last line describes contents of [fastest] selection.
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4860
- Implementation lock replaced with atomic variable
- Trailing whitespace is removed from user specified parameter, to enhance
experience when using commands that add newline, e.g. `echo`
- raidz_test: remove dependency on `getrusage()` and RUSAGE_THREAD, Issue #4813
- silence `cppcheck` in vdev_raidz, partial solution of Issue #1392
- Minor fixes and cleanups
- Enable use of original parity methods in [fastest] configuration.
New opaque original ops structure, representing native methods, is added
to supported raidz methods. Original parity methods are executed if selected
implementation has NULL fn pointer.
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4813
Issue #1392
Wait for iput_async before entering evict_inodes in
generic_shutdown_super. The reason we must finish before
evict_inodes is when lazytime is on, or when zfs_purgedir calls
zfs_zget, iput would bump i_count from 0 to 1. This would race
with the i_count check in evict_inodes. This means it could
destroy the inode while we are still using it.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4854
In some cases, the compiler was not respecting the GNU aligned
attribute for stack variables in 35a76a0. This was resulting in
a segfault on CentOS 6.7 hosts using gcc 4.4.7-17. This issue
was fixed in gcc 4.6.
To prevent this from occurring, use unaligned loads and stores
for all stack and global memory references in the SSE optimized
Fletcher-4 code.
Disable zimport testing against master where this flaw exists:
TEST_ZIMPORT_VERSIONS="installed"
Signed-off-by: Tyler J. Stachecki <stachecki.tyler@gmail.com>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4862
It is possible that the given DS may have hidden child (%recv)
datasets - "leftovers" resulting from the previously interrupted
'zfs receieve'. Try to remove the hidden child (%recv) and after
that try to remove the target dataset. If the hidden child
(%recv) does not exist the original error (EEXIST) will be returned.
Signed-off-by: Roman Strashkin <roman.strashkin@nexenta.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4818
Builds off of 1eeb4562 (Implementation of AVX2 optimized Fletcher-4)
This commit adds another implementation of the Fletcher-4 algorithm.
It is automatically selected at module load if it benchmarks higher
than all other available implementations.
The module benchmark was also amended to analyze the performance of
the byteswap-ed version of Fletcher-4, as well as the non-byteswaped
version. The average performance of the two is used to select the
the fastest implementation available on the host system.
Adds a pair of fields to an existing zcommon module parameter:
- zfs_fletcher_4_impl (str)
"sse2" - new SSE2 implementation if available
"ssse3" - new SSSE3 implementation if available
Signed-off-by: Tyler J. Stachecki <stachecki.tyler@gmail.com>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4789
A mostly mechanical change, taking into account i_nlink is 32 bits vs ZFS's
64 bit on-disk link count.
We revert "xattr dir doesn't get purged during iput" (ddae16a) as this is a
more Linux-integrated fix for the same issue.
In addition, setting the initial link count on a new node has been changed
from setting one less than required in zfs_mknode() then incrementing to the
correct count in zfs_link_create() (which was somewhat bizarre in the first
place), to setting the correct count in zfs_mknode() and not incrementing it
in zfs_link_create(). This both means we no longer set the link count in
sa_bulk_update() twice (once for the initial incorrect count then again for
the correct count), as well as adhering to the Linux requirement of not
incrementing a zero link count without I_LINKABLE (see linux commit
f4e0c30c).
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes#4838
Issue #227
Dropping DBUF_HASH_MUTEX when walking the hash list is unsafe. The dbuf
can be freed at any time.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4846
In arc_buf_info(), the arc_buf_t may have no header. If not, don't try
to fetch the arc buffer stats and instead just zero them.
The null dereferences were observed while accessing the dbuf kstat with
awk on a system in which millions of small files were being created in
order to overflow the system's metadata limit.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Closes#4837
zfs_ioc_recv_impl() is changed to always allocate the 'errors'
nvlist, its callers are responsible for freeing it.
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4829
The following scenario can result in garbage in the dn_spill field.
The db->db_blkptr must be set to NULL when DNODE_FLAG_SPILL_BLKPTR
is clear to ensure the dn_spill field is cleared.
Current txg = A.
* A new spill buffer is created. Its dbuf is initialized with
db_blkptr = NULL and it's dirtied.
Current txg = B.
* The spill buffer is modified. It's marked as dirty in this txg.
* Additional changes make the spill buffer unnecessary because the
xattr fits into the bonus buffer, so it's removed. The dbuf is
undirtied in this txg, but it's still referenced and cannot be
destroyed.
Current txg = C.
* Starts syncing of txg A
* dbuf_sync_leaf() is called for the spill buffer. Since db_blkptr
is NULL, dbuf_check_blkptr() is called.
* The dbuf starts being written and it reaches the ready state
(not done yet).
* A new change makes the spill buffer necessary again.
sa_build_layouts() ends up calling dbuf_find() to locate the
dbuf. It finds the old dbuf because it has not been destroyed yet
(it will be destroyed when the previous write is done and there
are no more references). The old dbuf has db_blkptr != NULL.
* txg A write is complete and the dbuf released. However it's still
referenced, so it's not destroyed.
Current txg = D.
* Starts syncing of txg B
* dbuf_sync_leaf() is called for the bonus buffer. Its contents are
directly copied into the dnode, overwriting the blkptr area because,
in txg B, the bonus buffer was big enough to hold the entire xattr.
* At this point, the db_blkptr of the spill buffer used in txg C
gets corrupted.
Signed-off-by: Peng <peng.hse@xtaotech.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3937
zp->z_xattr_parent will pin the parent. This will cause huge issue
when unlink a file with xattr. Because the unlinked file is pinned, it
will never get purged immediately. And because of that, the xattr
stuff will never be marked as unlinked. So the whole unlinked stuff
will stay there until shrink cache or umount.
This change partially reverts e89260a. This is safe because only the
zp->z_xattr_parent optimization is removed, zpl_xattr_security_init()
is still called from the zpl outside the inode lock.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Issue #4359
Issue #3508
Issue #4413
Issue #4827
We need to set inode->i_nlink to zero so iput will purge it. Without this, it
will get purged during shrink cache or umount, which would likely result in
deadlock due to zfs_zget waiting forever on its children which are in the
dispose_list of the same thread.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Issue #4359
Issue #3508
Issue #4413
Issue #4827
When generation mismatch, it usually means the file pointed by the file handle
was deleted. We should return ESTALE to indicate this. We return ENOENT in
zfs_vget since zpl_fh_to_dentry will convert it to ESTALE.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4828
Allow accessing XATTR through export handle is a very bad idea. It
would allow user to write whatever they want in fields where they
otherwise could not.
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4828
Certain ioctl operations will call get_zfs_sb, which will holds an active
count on sb without checking whether it's active or not. This will result
in use-after-free. We fix this by using atomic_inc_not_zero to make sure
we got an active sb.
P1 P2
--- ---
deactivate_locked_super(): s_active = 0
zfs_sb_hold()
->get_zfs_sb(): s_active = 1
->zpl_kill_sb()
-->zpl_put_super()
--->zfs_umount()
---->zfs_sb_free(zsb)
zfs_sb_rele(zsb)
Signed-off-by: Chunwei Chen <david.chen@osnexus.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
If compiled with -O0, gcc doesn't do any stack frame coalescing
and -Wframe-larger-than=1024 is triggered in debug mode.
Starting with gcc 4.8, new opt level -Og is introduced for debugging, which
does not trigger this warning.
Fix bench zio size, using SPA_OLD_MAXBLOCKSHIFT
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4799
The fletcher_4_native() and fletcher_4_byteswap() functions may only
safely use the vectorized implementations when the buffer is 128-bit
aligned. This is because both the AVX2 and SSE implementations process
four 32-bit words per iterations. Fallback to the scalar implementation
which only processes a single 32-bit word for unaligned buffers.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Issue #4330
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Calling dsl_dataset_name on a dataset with a 256 byte buffer is asking
for trouble. We should check every dataset on import, using a 1024 byte
buffer and checking each time to see if the dataset's new name is longer
than 256 bytes.
OpenZFS-issue: https://www.illumos.org/issues/6876
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ca8674e
Adds ZFS_IOC_RECV_NEW for resumable streams and preserves the legacy
ZFS_IOC_RECV user/kernel interface. The new interface supports all
stream options but is currently only used for resumable streams.
This way updated user space utilities will interoperate with older
kernel modules.
ZFS_IOC_RECV_NEW is modeled after the existing ZFS_IOC_SEND_NEW
handler. Non-Linux OpenZFS platforms have opted to change the
legacy interface in an incompatible fashion instead of adding a
new ioctl.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Authored by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/6562
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5f7a8e6
Authored by: Dan McDonald <danmcd@omniti.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/4986
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/5878fad
2605 want to resume interrupted zfs send
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Xin Li <delphij@freebsd.org>
Reviewed by: Arne Jansen <sensille@gmx.net>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: kernelOfTruth <kerneloftruth@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/2605
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/9c3fd12
6980 6902 causes zfs send to break due to 32-bit/64-bit struct mismatch
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/6980
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ea4a67f
Porting notes:
- All rsend and snapshop tests enabled and updated for Linux.
- Fix misuse of input argument in traverse_visitbp().
- Fix ISO C90 warnings and errors.
- Fix gcc 'missing braces around initializer' in
'struct send_thread_arg to_arg =' warning.
- Replace 4 argument fletcher_4_native() with 3 argument version,
this change was made in OpenZFS 4185 which has not been ported.
- Part of the sections for 'zfs receive' and 'zfs send' was
rewritten and reordered to approximate upstream.
- Fix mktree xattr creation, 'user.' prefix required.
- Minor fixes to newly enabled test cases
- Long holds for volumes allowed during receive for minor registration.
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3542
Only attempt to backfill lower metadnode object numbers if at least
4096 objects have been freed since the last rescan, and at most once
per transaction group. This avoids a pathology in dmu_object_alloc()
that caused O(N^2) behavior for create-heavy workloads and
substantially improves object creation rates. As summarized by
@mahrens in #4636:
"Normally, the object allocator simply checks to see if the next
object is available. The slow calls happened when dmu_object_alloc()
checks to see if it can backfill lower object numbers. This happens
every time we move on to a new L1 indirect block (i.e. every 32 *
128 = 4096 objects). When re-checking lower object numbers, we use
the on-disk fill count (blkptr_t:blk_fill) to quickly skip over
indirect blocks that don’t have enough free dnodes (defined as an L2
with at least 393,216 of 524,288 dnodes free). Therefore, we may
find that a block of dnodes has a low (or zero) fill count, and yet
we can’t allocate any of its dnodes, because they've been allocated
in memory but not yet written to disk. In this case we have to hold
each of the dnodes and then notice that it has been allocated in
memory.
The end result is that allocating N objects in the same TXG can
require CPU usage proportional to N^2."
Add a tunable dmu_rescan_dnode_threshold to define the number of
objects that must be freed before a rescan is performed. Don't bother
to export this as a module option because testing doesn't show a
compelling reason to change it. The vast majority of the performance
gain comes from limit the rescan to at most once per TXG.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>