This change introduces a new weighting algorithm to improve
metaslab selection. The new weighting algorithm relies on the
SPACEMAP_HISTOGRAM feature. As a result, the metaslab weight
now encodes the type of weighting algorithm used (size-based
vs segment-based).
Porting Notes: The metaslab allocation tracing code is conditionally
removed on linux (dependent on mdb debugger).
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Chris Siden <christopher.siden@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Pavel Zakharov pavel.zakharov@delphix.com
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported-by: Don Brady <don.brady@intel.com>
OpenZFS-issue: https://www.illumos.org/issues/7303
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/d5190931bdCloses#5404
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Haakan T Johansson <f96hajo@chalmers.se>
Closes#5547Closes#5543
Enable zio_dva_throttle_enabled=1 by default. Subsequent
testing has been unable to reproduce the suspected regression.
Tested-by: kernelOfTruth kerneloftruth@gmail.com
Reviewed-by: Olaf Faaland <faaland1@llnl.gov>
Signed-off-by: Brian Behlendorf behlendorf1@llnl.gov
Reverts #5335Closes#5289Closes#5457
This patch adds a command (-c) option to zpool status and zpool iostat. The
-c option allows you to run an arbitrary command on each vdev and display
the first line of output in zpool status/iostat. The environment vars
VDEV_PATH and VDEV_UPATH are set to the vdev's path and "underlying path"
before running the command. For device mapper, multipath, or partitioned
vdevs, VDEV_UPATH is the actual underlying /dev/sd* disk. This can be useful
if the command you're running requires a /dev/sd* device.
The patch also uses /sys/block/<dev>/slaves/ to lookup the underlying device
instead of using libdevmapper. This not only removes the libdevmapper
requirement at build time, but also allows you to resolve device mapper
devices without being root. This means that UDEV_UPATH get set correctly
when running zpool status/iostat as an unprivileged user.
Example:
$ zpool status -c 'echo I am $VDEV_PATH, $VDEV_UPATH'
NAME STATE READ WRITE CKSUM
mypool ONLINE 0 0 0
mirror-0 ONLINE 0 0 0
mpatha ONLINE 0 0 0 I am /dev/mapper/mpatha, /dev/sdc
sdb ONLINE 0 0 0 I am /dev/sdb1, /dev/sdb
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#5368
Allow `zfs unshare <protocol> -a` command to share or unshare all datasets
of a given protocol, nfs or smb.
Additionally, enable most of ZFS Test Suite zfs_share/zfs_unshare test cases.
To work around some Illumos-specific functionalities ($SHARE/$UNSHARE) some
function wrappers were added around them.
Finally, fix and issue in smb_is_share_active() that would leave SMB shares
exported when invoking 'zfs unshare -a'
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#3238Closes#5367
Bold and Normal codes were mixed up in a few places resulting in
bad highlighting.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: DHE <git@dehacked.net>
Closes#5397
Repair indent of zpool.8 man page, just before zpool labelclear
details. Accidentally introduced by 193a37cb2 (git bisect).
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Haakan T Johansson <f96hajo@chalmers.se>
Closes#5394
avx512f should work on all AVX512 hardware, since it only uses
Foundation instructions.
avx512bw should be faster on hardware supporting the AVW512BW
extension. We can use full-width pshufb (instead of relying on the 256
bits AVX2 pshufb). As a side-effect, the code is also unrolled more.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.github@dolbeau.name>
Closes#5219
Until it can be determined definitively that a performance
regression wasn't introduced accidentally by 3dfb57a this
functionality is being disabled by default. It can be re-
enabled by setting zio_dva_throttle_enabled=1.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#5335
Issue #5289
Sometimes it is desirable to specifically disable one or several
features directly on the 'zpool create' command line.
$ zpool create -o feature@<feature>=disabled ...
Original-patch-by: Turbo Fredriksson <turbo@bayour.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Closes#3460Closes#5142Closes#5324
This is not useful on micro-architecture with a weak NEON
implementation (only 64 bits); the native version is slower &
the byteswap barely faster than scalar. On A53 or A57, it's
a small improvement on scalar but OK for byteswap.
Results from an A53 system:
0 0 0x01 -1 0 1499068294333000 1499101101878000
implementation native byteswap
scalar 1008227510 755880264
aarch64_neon 1198098720 1044818671
fastest aarch64_neon aarch64_neon
Results from a A57 system:
0 0 0x01 -1 0 4407214734807033 4407233933777404
implementation native byteswap
scalar 2302071241 1124873346
aarch64_neon 2542214946 2245570352
fastest aarch64_neon aarch64_neon
Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.dolbeau@atos.net>
Closes#5248
1. Enable multipath autoreplace support for FMA.
This extends FMA autoreplace to work with multipath disks. This
requires libdevmapper to be installed at build time.
2. Turn on/off fault LEDs when VDEVs become degraded/faulted/online
Set ZED_USE_ENCLOSURE_LEDS=1 in zed.rc to have ZED turn on/off the enclosure
LED for a drive when a drive becomes FAULTED/DEGRADED. Your enclosure must
be supported by the Linux SES driver for this to work. The enclosure LED
scripts work for multipath devices as well. The scripts will clear the LED
when the fault is cleared.
3. Rate limit ZIO delay and checksum events so as not to flood ZED
ZIO delay and checksum events are rate limited to 5/sec in the zfs module.
Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#2449Closes#3017Closes#5159
OpenZFS 7090 - zfs should throttle allocations
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Ported-by: Don Brady <don.brady@intel.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
When write I/Os are issued, they are issued in block order but the ZIO
pipeline will drive them asynchronously through the allocation stage
which can result in blocks being allocated out-of-order. It would be
nice to preserve as much of the logical order as possible.
In addition, the allocations are equally scattered across all top-level
VDEVs but not all top-level VDEVs are created equally. The pipeline
should be able to detect devices that are more capable of handling
allocations and should allocate more blocks to those devices. This
allows for dynamic allocation distribution when devices are imbalanced
as fuller devices will tend to be slower than empty devices.
The change includes a new pool-wide allocation queue which would
throttle and order allocations in the ZIO pipeline. The queue would be
ordered by issued time and offset and would provide an initial amount of
allocation of work to each top-level vdev. The allocation logic utilizes
a reservation system to reserve allocations that will be performed by
the allocator. Once an allocation is successfully completed it's
scheduled on a given top-level vdev. Each top-level vdev maintains a
maximum number of allocations that it can handle (mg_alloc_queue_depth).
The pool-wide reserved allocations (top-levels * mg_alloc_queue_depth)
are distributed across the top-level vdevs metaslab groups and round
robin across all eligible metaslab groups to distribute the work. As
top-levels complete their work, they receive additional work from the
pool-wide allocation queue until the allocation queue is emptied.
OpenZFS-issue: https://www.illumos.org/issues/7090
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/4756c3d7Closes#5258
Porting Notes:
- Maintained minimal stack in zio_done
- Preserve linux-specific io sizes in zio_write_compress
- Added module params and documentation
- Updated to use optimize AVL cmp macros
The following new test cases need to have execute permissions set:
userquota/groupspace_003_pos.ksh
userquota/userquota_013_pos.ksh
userquota/userspace_003_pos.ksh
upgrade/upgrade_userobj_001_pos.ksh
upgrade/setup.ksh
upgrade/cleanup.ksh
The following source files accidentally were marked executable:
lib/libzpool/kernel.c
lib/libshare/nfs.c
lib/libzfs/libzfs_dataset.c
lib/libzfs/libzfs_util.c
tests/zfs-tests/cmd/rm_lnkcnt_zero_file/rm_lnkcnt_zero_file.c
tests/zfs-tests/cmd/dir_rd_update/dir_rd_update.c
cmd/zed/zed_exec.c
module/icp/core/kcf_sched.c
module/zfs/dsl_pool.c
module/zfs/arc.c
module/nvpair/nvpair.c
man/man5/zfs-module-parameters.5
Reviewed-by: GeLiXin <ge.lixin@zte.com.cn>
Reviewed-by: Andreas Dilger <andreas.dilger@intel.com>
Reviewed-by: Jinshan Xiong <jinshan.xiong@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#5241
This patch tracks dnode usage for each user/group in the
DMU_USER/GROUPUSED_OBJECT ZAPs. ZAP entries dedicated to dnode
accounting have the key prefixed with "obj-" followed by the UID/GID
in string format (as done for the block accounting).
A new SPA feature has been added for dnode accounting as well as
a new ZPL version. The SPA feature must be enabled in the pool
before upgrading the zfs filesystem. During the zfs version upgrade,
a "quotacheck" will be executed by marking all dnode as dirty.
ZoL-bug-id: https://github.com/zfsonlinux/zfs/issues/3500
Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
Signed-off-by: Johann Lombardi <johann.lombardi@intel.com>
Authored by: ilovezfs <ilovezfs@icloud.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported by: Tony Hutter <hutter2@llnl.gov>
In any pool without the extensible dataset feature flag already enabled,
creating a dataset with dedup set to use one of the new checksums would
result in the following panic as soon as any data was added:
panic[cpu0]/thread=ffffff0006761c40: feature_get_refcount(spa, feature,
&refcount) != 48 (0x30 != 0x30), file: ../../common/fs/zfs/zfeature.c
line 390
Inpsection showed that feature->fi_feature was 7, which is the value of
SPA_FEATURE_EXTENSIBLE_DATASET in the spa_feature enum. This commit
adds extensible dataset as a dependency for the sha512, edonr, and skein
feature flags, which prevents the panic.
OpenZFS-issue: https://www.illumos.org/issues/6585
OpenZFS-commit: 892586e8a1
Porting Notes:
This code was originally from Illumos, but I actually ported it from:
openzfsonosx/zfs@b62a652
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Garrett D'Amore <garrett@damore.org>
Ported by: Tony Hutter <hutter2@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/4185
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/45818ee
Porting Notes:
This code is ported on top of the Illumos Crypto Framework code:
b5e030c8db
The list of porting changes includes:
- Copied module/icp/include/sha2/sha2.h directly from illumos
- Removed from module/icp/algs/sha2/sha2.c:
#pragma inline(SHA256Init, SHA384Init, SHA512Init)
- Added 'ctx' to lib/libzfs/libzfs_sendrecv.c:zio_checksum_SHA256() since
it now takes in an extra parameter.
- Added CTASSERT() to assert.h from for module/zfs/edonr_zfs.c
- Added skein & edonr to libicp/Makefile.am
- Added sha512.S. It was generated from sha512-x86_64.pl in Illumos.
- Updated ztest.c with new fletcher_4_*() args; used NULL for new CTX argument.
- In icp/algs/edonr/edonr_byteorder.h, Removed the #if defined(__linux) section
to not #include the non-existant endian.h.
- In skein_test.c, renane NULL to 0 in "no test vector" array entries to get
around a compiler warning.
- Fixup test files:
- Rename <sys/varargs.h> -> <varargs.h>, <strings.h> -> <string.h>,
- Remove <note.h> and define NOTE() as NOP.
- Define u_longlong_t
- Rename "#!/usr/bin/ksh" -> "#!/bin/ksh -p"
- Rename NULL to 0 in "no test vector" array entries to get around a
compiler warning.
- Remove "for isa in $($ISAINFO); do" stuff
- Add/update Makefiles
- Add some userspace headers like stdio.h/stdlib.h in places of
sys/types.h.
- EXPORT_SYMBOL *_Init/*_Update/*_Final... routines in ICP modules.
- Update scripts/zfs2zol-patch.sed
- include <sys/sha2.h> in sha2_impl.h
- Add sha2.h to include/sys/Makefile.am
- Add skein and edonr dirs to icp Makefile
- Add new checksums to zpool_get.cfg
- Move checksum switch block from zfs_secpolicy_setprop() to
zfs_check_settable()
- Fix -Wuninitialized error in edonr_byteorder.h on PPC
- Fix stack frame size errors on ARM32
- Don't unroll loops in Skein on 32-bit to save stack space
- Add memory barriers in sha2.c on 32-bit to save stack space
- Add filetest_001_pos.ksh checksum sanity test
- Add option to write psudorandom data in file_write utility
This re-use the framework established for SSE2, SSSE3 and
AVX2. However, GCC is using FP registers on Aarch64, so
unlike SSE/AVX2 we can't rely on the registers being left alone
between ASM statements. So instead, the NEON code uses
C variables and GCC extended ASM syntax. Note that since
the kernel explicitly disable vector registers, they
have to be locally re-enabled explicitly.
As we use the variable's number to define the symbolic
name, and GCC won't allow duplicate symbolic names,
numbers have to be unique. Even when the code is not
going to be used (e.g. the case for 4 registers when
using the macro with only 2). Only the actually used
variables should be declared, otherwise the build
will fails in debug mode.
This requires the replacement of the XOR(X,X) syntax
by a new ZERO(X) macro, which does the same thing but
without repeating the argument. And perhaps someday
there will be a machine where there is a more efficient
way to zero a register than XOR with itself. This affects
scalar, SSE2, SSSE3 and AVX2 as they need the new macro.
It's possible to write faster implementations (different
scheduling, different unrolling, interleaving NEON and
scalar, ...) for various cores, but this one has the
advantage of fitting in the current state of the code,
and thus is likely easier to review/check/merge.
The only difference between aarch64-neon and aarch64-neonx2
is that aarch64-neonx2 unroll some functions some more.
Reviewed-by: Gvozden Neskovic <neskovic@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Romain Dolbeau <romain.dolbeau@atos.net>
Closes#4801
Enable ignore_hole_birth by default until all known hole birth bugs
have been resolved and relevant test cases added.
Reviewed-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4809Closes#5099
Authored by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Tom Caputi <tcaputi@datto.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Ported by: David Quigley <david.quigley@intel.com>
This review covers the reading and writing of compressed arc headers, sharing
data between the arc_hdr_t and the arc_buf_t, and the implementation of a new
dbuf cache to keep frequently access data uncompressed.
I've added a new member to l1 arc hdr called b_pdata. The b_pdata always hangs
off the arc_buf_hdr_t (if an L1 hdr is in use) and points to the physical block
for that DVA. The physical block may or may not be compressed. If compressed
arc is enabled and the block on-disk is compressed, then the b_pdata will match
the block on-disk and remain compressed in memory. If the block on disk is not
compressed, then neither will the b_pdata. Lastly, if compressed arc is
disabled, then b_pdata will always be an uncompressed version of the on-disk
block.
Typically the arc will cache only the arc_buf_hdr_t and will aggressively evict
any arc_buf_t's that are no longer referenced. This means that the arc will
primarily have compressed blocks as the arc_buf_t's are considered overhead and
are always uncompressed. When a consumer reads a block we first look to see if
the arc_buf_hdr_t is cached. If the hdr is cached then we allocate a new
arc_buf_t and decompress the b_pdata contents into the arc_buf_t's b_data. If
the hdr already has a arc_buf_t, then we will allocate an additional arc_buf_t
and bcopy the uncompressed contents from the first arc_buf_t to the new one.
Writing to the compressed arc requires that we first discard the b_pdata since
the physical block is about to be rewritten. The new data contents will be
passed in via an arc_buf_t (uncompressed) and during the I/O pipeline stages we
will copy the physical block contents to a newly allocated b_pdata.
When an l2arc is inuse it will also take advantage of the b_pdata. Now the
l2arc will always write the contents of b_pdata to the l2arc. This means that
when compressed arc is enabled that the l2arc blocks are identical to those
stored in the main data pool. This provides a significant advantage since we
can leverage the bp's checksum when reading from the l2arc to determine if the
contents are valid. If the compressed arc is disabled, then we must first
transform the read block to look like the physical block in the main data pool
before comparing the checksum and determining it's valid.
OpenZFS-issue: https://www.illumos.org/issues/6950
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/7fc10f0
Issue #5078
ARC will evict meta buffers that exceed the arc_meta_limit. Before a further
investigating on whether we should take special protection on meta buffers,
this tunable make arc_meta_limit adjustable for different workloads.
People can set zfs_arc_meta_limit_percent to any value while insmod zfs.ko,
so some range check is added to guarantee a suitable arc_meta_limit.
Suggested by Tim Chase, zfs_arc_dnode_limit is changed to a percent-style
tunable as well.
Signed-off-by: GeLiXin <ge.lixin@zte.com.cn>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4957
Adds a module option which disables the hole_birth optimization
which has been responsible for several recent bugs, including
issue #4050.
Original-patch: https://gist.github.com/pcd1193182/2c0cd47211f3aee623958b4698836c48
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4833
Metadata-intensive workloads can cause the ARC to become permanently
filled with dnode_t objects as they're pinned by the VFS layer.
Subsequent data-intensive workloads may only benefit from about
25% of the potential ARC (arc_c_max - arc_meta_limit).
In order to help track metadata usage more precisely, the other_size
metadata arcstat has replaced with dbuf_size, dnode_size and bonus_size.
The new zfs_arc_dnode_limit tunable, which defaults to 10% of
zfs_arc_meta_limit, defines the minimum number of bytes which is desirable
to be consumed by dnodes. Attempts to evict non-metadata will trigger
async prune tasks if the space used by dnodes exceeds this limit.
The new zfs_arc_dnode_reduce_percent tunable specifies the amount by
which the excess dnode space is attempted to be pruned as a percentage of
the amount by which zfs_arc_dnode_limit is being exceeded. By default,
it tries to unpin 10% of the dnodes.
The problem of dnode metadata pinning was observed with the following
testing procedure (in this example, zfs_arc_max is set to 4GiB):
- Create a large number of small files until arc_meta_used exceeds
arc_meta_limit (3GiB with default tuning) and arc_prune
starts increasing.
- Create a 3GiB file with dd. Observe arc_mata_used. It will still
be around 3GiB.
- Repeatedly read the 3GiB file and observe arc_meta_limit as before.
It will continue to stay around 3GiB.
With this modification, space for the 3GiB file is gradually made
available as subsequent demands on the ARC are made. The previous behavior
can be restored by setting zfs_arc_dnode_limit to the same value as the
zfs_arc_meta_limit.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4345
Issue #4512
Issue #4773Closes#4858
- Implementation lock replaced with atomic variable
- Trailing whitespace is removed from user specified parameter, to enhance
experience when using commands that add newline, e.g. `echo`
- raidz_test: remove dependency on `getrusage()` and RUSAGE_THREAD, Issue #4813
- silence `cppcheck` in vdev_raidz, partial solution of Issue #1392
- Minor fixes and cleanups
- Enable use of original parity methods in [fastest] configuration.
New opaque original ops structure, representing native methods, is added
to supported raidz methods. Original parity methods are executed if selected
implementation has NULL fn pointer.
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4813
Issue #1392
Builds off of 1eeb4562 (Implementation of AVX2 optimized Fletcher-4)
This commit adds another implementation of the Fletcher-4 algorithm.
It is automatically selected at module load if it benchmarks higher
than all other available implementations.
The module benchmark was also amended to analyze the performance of
the byteswap-ed version of Fletcher-4, as well as the non-byteswaped
version. The average performance of the two is used to select the
the fastest implementation available on the host system.
Adds a pair of fields to an existing zcommon module parameter:
- zfs_fletcher_4_impl (str)
"sse2" - new SSE2 implementation if available
"ssse3" - new SSSE3 implementation if available
Signed-off-by: Tyler J. Stachecki <stachecki.tyler@gmail.com>
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4789
2605 want to resume interrupted zfs send
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Xin Li <delphij@freebsd.org>
Reviewed by: Arne Jansen <sensille@gmx.net>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported-by: kernelOfTruth <kerneloftruth@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/2605
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/9c3fd12
6980 6902 causes zfs send to break due to 32-bit/64-bit struct mismatch
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/6980
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/ea4a67f
Porting notes:
- All rsend and snapshop tests enabled and updated for Linux.
- Fix misuse of input argument in traverse_visitbp().
- Fix ISO C90 warnings and errors.
- Fix gcc 'missing braces around initializer' in
'struct send_thread_arg to_arg =' warning.
- Replace 4 argument fletcher_4_native() with 3 argument version,
this change was made in OpenZFS 4185 which has not been ported.
- Part of the sections for 'zfs receive' and 'zfs send' was
rewritten and reordered to approximate upstream.
- Fix mktree xattr creation, 'user.' prefix required.
- Minor fixes to newly enabled test cases
- Long holds for volumes allowed during receive for minor registration.
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced to from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill bocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interfaces names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by an large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3542
This is a new implementation of RAIDZ1/2/3 routines using x86_64
scalar, SSE, and AVX2 instruction sets. Included are 3 parity
generation routines (P, PQ, and PQR) and 7 reconstruction routines,
for all RAIDZ level. On module load, a quick benchmark of supported
routines will select the fastest for each operation and they will
be used at runtime. Original implementation is still present and
can be selected via module parameter.
Patch contains:
- specialized gen/rec routines for all RAIDZ levels,
- new scalar raidz implementation (unrolled),
- two x86_64 SIMD implementations (SSE and AVX2 instructions sets),
- fastest routines selected on module load (benchmark).
- cmd/raidz_test - verify and benchmark all implementations
- added raidz_test to the ZFS Test Suite
New zfs module parameters:
- zfs_vdev_raidz_impl (str): selects the implementation to use. On
module load, the parameter will only accept first 3 options, and
the other implementations can be set once module is finished
loading. Possible values for this option are:
"fastest" - use the fastest math available
"original" - use the original raidz code
"scalar" - new scalar impl
"sse" - new SSE impl if available
"avx2" - new AVX2 impl if available
See contents of `/sys/module/zfs/parameters/zfs_vdev_raidz_impl` to
get the list of supported values. If an implementation is not supported
on the system, it will not be shown. Currently selected option is
enclosed in `[]`.
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4328
ZFS allows for specific permissions to be delegated to normal users
with the `zfs allow` and `zfs unallow` commands. In addition, non-
privileged users should be able to run all of the following commands:
* zpool [list | iostat | status | get]
* zfs [list | get]
Historically this functionality was not available on Linux. In order
to add it the secpolicy_* functions needed to be implemented and mapped
to the equivalent Linux capability. Only then could the permissions on
the `/dev/zfs` be relaxed and the internal ZFS permission checks used.
Even with this change some limitations remain. Under Linux only the
root user is allowed to modify the namespace (unless it's a private
namespace). This means the mount, mountpoint, canmount, unmount,
and remount delegations cannot be supported with the existing code. It
may be possible to add this functionality in the future.
This functionality was validated with the cli_user and delegation test
cases from the ZFS Test Suite. These tests exhaustively verify each
of the supported permissions which can be delegated and ensures only
an authorized user can perform it.
Two minor bug fixes were required for test-running.py. First, the
Timer() object cannot be safely created in a `try:` block when there
is an unconditional `finally` block which references it. Second,
when running as a normal user also check for scripts using the
both the .ksh and .sh suffixes.
Finally, existing users who are simulating delegations by setting
group permissions on the /dev/zfs device should revert that
customization when updating to a version with this change.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Closes#362Closes#434Closes#4100Closes#4394Closes#4410Closes#4487
New functionality:
- Preserves existing scalar implementation.
- Adds AVX2 optimized Fletcher-4 computation.
- Fastest routines selected on module load (benchmark).
- Test case for Fletcher-4 added to ztest.
New zcommon module parameters:
- zfs_fletcher_4_impl (str): selects the implementation to use.
"fastest" - use the fastest version available
"cycle" - cycle trough all available impl for ztest
"scalar" - use the original version
"avx2" - new AVX2 implementation if available
Performance comparison (Intel i7 CPU, 1MB data buffers):
- Scalar: 4216 MB/s
- AVX2: 14499 MB/s
See contents of `/sys/module/zcommon/parameters/zfs_fletcher_4_impl`
to get list of supported values. If an implementation is not supported
on the system, it will not be shown. Currently selected option is
enclosed in `[]`.
Signed-off-by: Jinshan Xiong <jinshan.xiong@intel.com>
Signed-off-by: Andreas Dilger <andreas.dilger@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4330
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Ported by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/6531
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/97e8130
Porting notes:
- Added new IO delay tracepoints, and moved common ZIO tracepoint macros
to a new trace_common.h file.
- Used zio_delay_taskq() in place of OpenZFS's timeout_generic() function.
- Updated zinject man page
- Updated zpool_scrub test files
Various rewrites to the descriptions of module parameters. Corrects
spelling mistakes, makes descriptions them more user-friendly and
describes some ZFS quirks which should be understood before changing
parameter values.
Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4671
This is a purely cosmetical change, to consistently prefer one of
two (both acceptable) choises for the word parsable in documentation and
code. I don't really care which to use, but acording to wiktionary
https://en.wiktionary.org/wiki/parsable#English parsable is preferred.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4682
These changes should have been part of the original 930b0d4
commit but were overlooked because 193a37c had not yet been
merged when the original change was ported.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #4631
Simply containing a slash is not enough, presumably because foo/bar
could be either a dataset or a mountpoint.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
Closes#4655
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>
References:
https://www.illumos.org/issues/1644
Ported-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Richard Laager <rlaager@wiktel.com>
This change is primarily about adding inline references in the
properties section to the traditional mount option names.
There are some other editorial changes too.
Signed-off-by: Richard Laager <rlaager@wiktel.com>
3993 zpool(1M) and zfs(1M) should support -p for "list" and "get"
4700 "zpool get" doesn't support -H or -o options
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Ported by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
OpenZFS-issue: https://www.illumos.org/issues/3993
OpenZFS-issue: https://www.illumos.org/issues/4700
OpenZFS-commit: https://github.com/openzfs/openzfs/commit/c58b352
Porting notes:
I removed ZoL's zpool_get_prop_literal() in favor of
zpool_get_prop(..., boolean_t literal) since that's what OpenZFS
uses. The functionality is the same.
This is foundational work for ZED.
Updates a leaf vdev's persistent device strings on Linux platform
* only applies for a dedicated leaf vdev (aka whole disk)
* updated during pool create|add|attach|import
* used for matching device matching during auto-{online,expand,replace}
* stored in a leaf disk config label (i.e. alongside 'path' NVP)
* can opt-out using env var ZFS_VDEV_DEVID_OPT_OUT=YES
Some examples:
path: '/dev/sdb1'
devid: 'scsi-350000394a8ca4fbc-part1'
phys_path: 'pci-0000:04:00.0-sas-0x50000394a8ca4fbf-lun-0'
path: '/dev/mapper/mpatha'
devid: 'dm-uuid-mpath-35000c5006304de3f'
Signed-off-by: Don Brady <don.brady@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2856Closes#3978Closes#4416
Historically libblkid support was detected as part of configure
and optionally enabled. This was done because at the time support
for detecting ZFS pool vdevs had just be added to libblkid and
those updated packages were not yet part of many distributions.
This is no longer the case and any reasonably current distribution
will ship a version of libblkid which can detect ZFS pool vdevs.
This patch makes libblkid mandatory at build time and libblkid
the preferred method of scanning for ZFS pools. For distributions
which include a modern version of libblkid there is no change in
behavior. Explicitly scanning the default search paths is still
supported and can be enabled with the '-s' command line option.
Additionally making libblkid mandatory means that the 'zpool create'
command can reliably detect if a specified device has an existing
non-ZFS filesystem (ext4, xfs) and print a warning.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2448
The existing algorithm selects a preferred leaf vdev based on offset of the zio
request modulo the number of members in the mirror. It assumes the devices are
of equal performance and that spreading the requests randomly over both drives
will be sufficient to saturate them. In practice this results in the leaf vdevs
being under utilized.
The new algorithm takes into the following additional factors:
* Load of the vdevs (number outstanding I/O requests)
* The locality of last queued I/O vs the new I/O request.
Within the locality calculation additional knowledge about the underlying vdev
is considered such as; is the device backing the vdev a rotating media device.
This results in performance increases across the board as well as significant
increases for predominantly streaming loads and for configurations which don't
have evenly performing devices.
The following are results from a setup with 3 Way Mirror with 2 x HD's and
1 x SSD from a basic test running multiple parrallel dd's.
With pre-fetch disabled (vfs.zfs.prefetch_disable=1):
== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 161 seconds @ 95 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 297 seconds @ 51 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 54 seconds @ 284 MB/s
With pre-fetch enabled (vfs.zfs.prefetch_disable=0):
== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 91 seconds @ 168 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 108 seconds @ 142 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 48 seconds @ 320 MB/s
In addition to the performance changes the code was also restructured, with
the help of Justin Gibbs, to provide a more logical flow which also ensures
vdevs loads are only calculated from the set of valid candidates.
The following additional sysctls where added to allow the administrator
to tune the behaviour of the load algorithm:
* vfs.zfs.vdev.mirror.rotating_inc
* vfs.zfs.vdev.mirror.rotating_seek_inc
* vfs.zfs.vdev.mirror.rotating_seek_offset
* vfs.zfs.vdev.mirror.non_rotating_inc
* vfs.zfs.vdev.mirror.non_rotating_seek_inc
These changes where based on work started by the zfsonlinux developers:
https://github.com/zfsonlinux/zfs/pull/1487
Reviewed by: gibbs, mav, will
MFC after: 2 weeks
Sponsored by: Multiplay
References:
https://github.com/freebsd/freebsd@5c7a6f5dhttps://github.com/freebsd/freebsd@31b7f68dhttps://github.com/freebsd/freebsd@e186f564
Performance Testing:
https://github.com/zfsonlinux/zfs/pull/4334#issuecomment-189057141
Porting notes:
- The tunables were adjusted to have ZoL-style names.
- The code was modified to use ZoL's vd_nonrot.
- Fixes were done to make cstyle.pl happy
- Merge conflicts were handled manually
- freebsd/freebsd@e186f564bc by my
collegue Andriy Gapon has been included. It applied perfectly, but
added a cstyle regression.
- This replaces 556011dbec entirely.
- A typo "IO'a" has been corrected to say "IO's"
- Descriptions of new tunables were added to man/man5/zfs-module-parameters.5.
Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4334
Commit d2f3e29 introduced the -p option which outputs full paths
for vdevs to multiple zpool subcommands. When this was merged
there was no conflict for this flag letter. However it's certain
there will be a conflict with the -p (parsable) flag used by other
subcommands. Therefore, -p is being changed to -P to avoid this.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4368
The following options have been added to the zpool add, iostat,
list, status, and split subcommands. The default behavior was
not modified, from zfs(8).
-g Display vdev GUIDs instead of the normal short
device names. These GUIDs can be used in-place of
device names for the zpool detach/off‐
line/remove/replace commands.
-L Display real paths for vdevs resolving all symbolic
links. This can be used to lookup the current block
device name regardless of the /dev/disk/ path used
to open it.
-p Display full paths for vdevs instead of only the
last component of the path. This can be used in
conjunction with the -L flag.
This behavior may also be enabled using the following environment
variables.
ZPOOL_VDEV_NAME_GUID
ZPOOL_VDEV_NAME_FOLLOW_LINKS
ZPOOL_VDEV_NAME_PATH
This change is based on worked originally started by Richard Yao
to add a -g option. Then extended by @ilovezfs to add a -L option
for openzfsonosx. Those changes have been merged, re-factored,
a -p option added and extended to all relevant zpool subcommands.
Original-patch-by: Richard Yao <ryao@gentoo.org>
Extended-by: ilovezfs <ilovezfs@icloud.com>
Extended-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: ilovezfs <ilovezfs@icloud.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2011Closes#4341
Set a limit for the largest compressed block which can be written
to an L2ARC device. By default this limit is set to 16M so there
is no change in behavior.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Elling <Richard.Elling@RichardElling.com>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes#4323
Reintroduce a slightly adapted version of the Illumos logic for
synchronous unlinks. The basic idea here is that only files
smaller than zfs_delete_blocks (20480) blocks should be deleted
synchronously. Unlinking larger files should be handled
asynchronously to minimize impact to the caller.
To accomplish this iput() which is responsible for calling
zfs_znode_delete() on Linux is only called in the delete_now
path. Otherwise zfs_async_iput() is used which allows the
last reference to be dropped by a taskq thread effectively
making the removal asynchronous.
Porting notes:
- Add zfs_delete_blocks module option for performance analysis.
The default value is DMU_MAX_DELETEBLKCNT which is the same
as upstream. Reducing this value means that smaller files
will be unlinked asynchronously like large files.
- All occurrences of zfsvfs changes to zsb.
Ported-by: KernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Under RHEL6/CentOS6 the default stack size must be increased to 32K
to prevent overflowing the stack when running ztest. This isn't an
issue for other distributions due to either the version of pthreads
or perhaps the compiler. Doubling the stack size resolves the
issue safely for all distribution and leaves us some headroom.
$ sudo -E ztest -V -T 300 -f /var/tmp/
5 vdevs, 7 datasets, 23 threads, 300 seconds...
loading space map for vdev 0 of 1, metaslab 0 of 30 ...
...
loading space map for vdev 0 of 1, metaslab 14 of 30 ...
child died with signal 11
Exited ztest with error 3
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4215
4891 want zdb option to dump all metadata
Reviewed by: Sonu Pillai <sonu.pillai@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Garrett D'Amore <garrett@damore.org>
We'd like a way for zdb to dump metadata in a machine-readable
format, so that we can bring that back from a customer site for
in-house diagnosis. Think of it as a crash dump for zpools,
which can be used for post-mortem analysis of a malfunctioning
pool
References:
https://www.illumos.org/issues/4891https://github.com/illumos/illumos-gate/commit/df15e41
Porting notes:
- [cmd/zdb/zdb.c]
- a5778ea zdb: Introduce -V for verbatim import
- In main() getopt 'opt' variable removed and the code was
brought back in line with illumos.
- [lib/libzpool/kernel.c]
- 1e33ac1 Fix Solaris thread dependency by using pthreads
- f0e324f Update utsname support
- 4d58b69 Fix vn_open/vn_rdwr error handling
- In vn_open() allocate 'dumppath' on heap instead of stack
- Properly handle 'dump_fd == -1' error path
- Free 'realpath' after added vn_dumpdir_code block
Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
5745 zfs set allows only one dataset property to be set at a time
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Reviewed by: Richard PALO <richard@NetBSD.org>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Approved by: Rich Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/5745https://github.com/illumos/illumos-gate/commit/3092556
Porting notes:
- Fix the missing braces around initializer, zfs_cmd_t zc = {"\0"};
- Remove extra format argument in zfs_do_set()
- Declare at the top:
- zfs_prop_t prop;
- nvpair_t *elem;
- nvpair_t *next;
- int i;
- Additionally initialize:
- int added_resv = 0;
- zfs_prop_t prop = 0;
- Assign 0 install of NULL for uint64_t types.
- zc->zc_nvlist_conf = '\0';
- zc->zc_nvlist_src = '\0';
- zc->zc_nvlist_dst = '\0';
Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3574
Adds zio_taskq_batch_pct as an exported module parameter,
allowing users to modify it at module load time.
Signed-off-by: DHE <git@dehacked.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4110
Correct some misspelled words and grammatical errors, and remove
trailing white space in the man pages.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#4115
As part of the large block support effort, it makes sense to add
support for large blocks to **zpios(1)**. The specifying of a zfs
block size for zpios is optional and will default to 128K if the
block size is not specified.
`zpios ... -S size | --blocksize size ...`
This will use *size* ZFS blocks for each test, specified as a comma
delimited list with an optional unit suffix. The supported range is
powers of two from 128K through 16M. A range of block sizes can be
tested as follows: `-S 128K,256K,512K,1M`
Example run below
(non realistic results from a VM and output abbreviated for space)
```
--regioncount=750 --regionsize=8M --chunksize=1M --offset=4K
--threaddelay=0 --cleanup --human-readable --verbose --cleanup
--blocksize=128K,256K,512K,1M
th-cnt rg-cnt rg-sz ch-sz blksz wr-data wr-bw rd-data rd-bw
---------------------------------------------------------------------
4 750 8m 1m 128k 5g 90.06m 5g 93.37m
4 750 8m 1m 256k 5g 79.71m 5g 99.81m
4 750 8m 1m 512k 5g 42.20m 5g 93.14m
4 750 8m 1m 1m 5g 35.51m 5g 89.36m
8 750 8m 1m 128k 5g 85.49m 5g 90.81m
8 750 8m 1m 256k 5g 61.42m 5g 99.24m
8 750 8m 1m 512k 5g 49.09m 5g 108.78m
16 750 8m 1m 128k 5g 86.28m 5g 88.73m
16 750 8m 1m 256k 5g 64.34m 5g 93.47m
16 750 8m 1m 512k 5g 68.84m 5g 124.47m
16 750 8m 1m 1m 5g 53.97m 5g 97.20m
---------------------------------------------------------------------
```
Signed-off-by: Don Brady <don.brady@intel.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3795Closes#2071
ZFS incorrectly uses directory-based extended attributes even when
xattr=sa is specified as a dataset property or mount option. Support to
honor temporary mount options including "xattr" was added in commit
0282c4137e. There are two issues with the
mount option handling:
* Libzfs has historically included "xattr" in its list of default mount
options. This overrides the dataset property, so the dataset is always
configured to use directory-based xattrs even when the xattr dataset
property is set to off or sa. Address this by removing "xattr" from
the set of default mount options in libzfs.
* There was no way to enable system attribute-based extended attributes
using temporary mount options. Add the mount options "saxattr" and
"dirxattr" which enable the xattr behavior their names suggest. This
approach has the advantages of mirroring the valid xattr dataset
property values and following existing conventions for mount option
names.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3787
When adding a zvol to the system prefetch zvol_prefetch_bytes from the
start and end of the volume. Prefetching these regions of the volume is
desirable because they are likely to be accessed immediately by blkid(8),
the kernel scanning for a partition table, or another task which probes
the devices.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3659
Internally ZFS keeps a small log to facilitate debugging. By default
the log is disabled, to enable it set zfs_dbgmsg_enable=1. The contents
of the log can be accessed by reading the /proc/spl/kstat/zfs/dbgmsg file.
Writing 0 to this proc file clears the log.
$ echo 1 >/sys/module/zfs/parameters/zfs_dbgmsg_enable
$ echo 0 >/proc/spl/kstat/zfs/dbgmsg
$ zpool import tank
$ cat /proc/spl/kstat/zfs/dbgmsg
1 0 0x01 -1 0 2492357525542 2525836565501
timestamp message
1441141408 spa=tank async request task=1
1441141408 txg 70 open pool version 5000; software version 5000/5; ...
1441141409 spa=tank async request task=32
1441141409 txg 72 import pool version 5000; software version 5000/5; ...
1441141414 command: lt-zpool import tank
Note the zfs_dbgmsg() and dprintf() functions are both now mapped to
the same log. As mentioned above the kernel debug log can be accessed
though the /proc/spl/kstat/zfs/dbgmsg kstat. For user space consumers
log messages are immediately written to stdout after applying the
ZFS_DEBUG environment variable.
$ ZFS_DEBUG=on ./cmd/ztest/ztest -V
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#3728
This patch is based on the previous work done by @andrey-ve and
@yshui. It triggers the automount by using kern_path() to traverse
to the known snapshout mount point. Once the snapshot is mounted
NFS can access the contents of the snapshot.
Allowing NFS clients to access to the .zfs/snapshot directory would
normally mean that a root user on a client mounting an export with
'no_root_squash' would be able to use mkdir/rmdir/mv to manipulate
snapshots on the server. To prevent configuration mistakes a
zfs_admin_snapshot module option was added which disables the
mkdir/rmdir/mv functionally. System administators desiring this
functionally must explicitly enable it.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#2797Closes#1655Closes#616
Internally, zvols are files exposed through the block device API. This
is intended to reduce overhead when things require block devices.
However, the ZoL zvol code emulates a traditional block device in that
it has a top half and a bottom half. This is an unnecessary source of
overhead that does not exist on any other OpenZFS platform does this.
This patch removes it. Early users of this patch reported double digit
performance gains in IOPS on zvols in the range of 50% to 80%.
Comments in the code suggest that the current implementation was done to
obtain IO merging from Linux's IO elevator. However, the DMU already
does write merging while arc_read() should implicitly merge read IOs
because only 1 thread is permitted to fetch the buffer into ARC. In
addition, commercial ZFSOnLinux distributions report that regular files
are more performant than zvols under the current implementation, and the
main consumers of zvols are VMs and iSCSI targets, which have their own
elevators to merge IOs.
Some minor refactoring allows us to register zfs_request() as our
->make_request() handler in place of the generic_make_request()
function. This eliminates the layer of code that broke IO requests on
zvols into a top half and a bottom half. This has several benefits:
1. No per zvol spinlocks.
2. No redundant IO elevator processing.
3. Interrupts are disabled only when actually necessary.
4. No redispatching of IOs when all taskq threads are busy.
5. Linux's page out routines will properly block.
6. Many autotools checks become obsolete.
An unfortunate consequence of eliminating the layer that
generic_make_request() is that we no longer calls the instrumentation
hooks for block IO accounting. Those hooks are GPL-exported, so we
cannot call them ourselves and consequently, we lose the ability to do
IO monitoring via iostat. Since zvols are internally files mapped as
block devices, this should be okay. Anyone who is willing to accept the
performance penalty for the block IO layer's accounting could use the
loop device in between the zvol and its consumer. Alternatively, perf
and ftrace likely could be used. Also, tools like latencytop will still
work. Tools such as latencytop sometimes provide a better view of
performance bottlenecks than the traditional block IO accounting tools
do.
Lastly, if direct reclaim occurs during spacemap loading and swap is on
a zvol, this code will deadlock. That deadlock could already occur with
sync=always on zvols. Given that swap on zvols is not yet production
ready, this is not a blocker.
Signed-off-by: Richard Yao <ryao@gentoo.org>
Add the required kernel side infrastructure to parse arbitrary
mount options. This enables us to support temporary mount
options in largely the same way it is handled on other platforms.
See the 'Temporary Mount Point Properties' section of zfs(8)
for complete details.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#985Closes#3351
Allow for easy turning of a pools reserved free space. Previous
versions of ZFS (v0.6.4 and earlier) held 1/64 of the pools capacity
in reserve. Commits 3d45fdd and 0c60cc3 increased this to 1/32.
Setting spa_slop_shift=6 will restore the previous default setting.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3724
Add new keyword 'slot' to vdev_id.conf
This selects from where to get the slot number for a SAS/SATA disk
Needed to enable access to the physical position of a disk in a
Supermicro 2027R-AR24NV .
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes#3693
This brings the behavior of arc_memory_throttle() back in sync with
illumos. The updated memory throttling policy roughly goes like this:
* Never throttle if more than 10% of memory is free. This threshold
is configurable with the zfs_arc_lotsfree_percent module option.
* Minimize any throttling of kswapd even when free memory is below
the set threshold. Allow it to write out pages as quickly as
possible to help alleviate the memory pressure.
* Delay all other threads when free memory is below the set threshold
in order to avoid compounding the memory pressure. Buffers will be
evicted from the ARC to reduce the issue.
The Linux specific zfs_arc_memory_throttle_disable module option has
been removed in favor of the existing zfs_arc_lotsfree_percent tuning.
Setting zfs_arc_lotsfree_percent=0 will have the same effect as
zfs_arc_memory_throttle_disable and it was therefore redundant.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3637
While Linux doesn't provide detailed information about the state of
the VM it does provide us total free pages. This information should
be incorporated in to the arc_available_memory() calculation rather
than solely relying on a signal from direct reclaim. Conceptually
this brings arc_available_memory() back in sync with illumos.
It is also desirable that the target amount of free memory be tunable
on a system. While the default values are expected to work well
for most workloads there may be cases where custom values are needed.
The zfs_arc_sys_free module option was added for this purpose.
zfs_arc_sys_free - The target number of bytes the ARC should leave
as free memory on the system. This value can
checked in /proc/spl/kstat/zfs/arcstats and
setting this module option will override the
default value.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3637
The zvol_threads module option should be bounded to a reasonable
range. The taskq must have at least 1 thread and shouldn't have
more than 1,024 at most. The default value of 32 is a reasonable
default.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3614
Commit f521ce1 removed the minimum value for "arc_p" allowing it to
drop to zero or grow to "arc_c". This was done to improve specific
workload which constantly dirties new "metadata" but also frequently
touches a "small" amount of mfu data (e.g. mkdir's).
This change may still be desirable but it needs to be re-investigated.
in the context of the recent ARC changes from upstream. Therefore
this code is being restored to facilitate benchmarking. By setting
"zfs_arc_p_min_shift=64" we easily compare the performance.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3533
5661 ZFS: "compression = on" should use lz4 if feature is enabled
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed by: Xin LI <delphij@freebsd.org>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
https://github.com/illumos/illumos-gate/commit/db1741fhttps://www.illumos.org/issues/5661
Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3571
A few minor mistakes than should be fixed:
zpool:
compatability -> compatibility
zfs:
accessable -> accessible
availible -> available
zfs-events:
availible -> available
zfs-module-parameters:
proceding -> proceeding
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3544
This seems generally useful. metaslab_aliquot is the ZFS allocation
granularity, which is roughly equivalent to what is called the stripe
size in traditional RAID arrays. It seems relevant to performance
tuning.
Signed-off-by: Etienne Dechamps <etienne@edechamps.fr>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
sysstat's iostat omits the first report when the -y option is used.
This patch adds that functionality and omits the first report with
statistics since system boot.
Signed-off-by: Hajo Möller <dasjoe@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3439
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Porting notes and other significant code changes:
The illumos 5368 patch (ARC should cache more metadata), which
was never picked up by ZoL, is mostly reverted by this patch.
Since ZoL relies on the kernel asynchronously calling the shrinker to
actually reap memory, the shrinker wakes up arc_reclaim_waiters_cv every
time it runs.
The arc_adapt_thread() function no longer calls arc_do_user_evicts()
since the newly-added arc_user_evicts_thread() calls it periodically.
Notable conflicting ZoL commits which conflicted with this patch or
whose effects are either duplicated or un-done by this patch:
302f753 - Integrate ARC more tightly with Linux
39e055c - Adjust arc_p based on "bytes" in arc_shrink
f521ce1 - Allow "arc_p" to drop to zero or grow to "arc_c"
77765b5 - Remove "arc_meta_used" from arc_adjust calculation
94520ca - Prune metadata from ghost lists in arc_adjust_meta
Trace support for multilist_insert() and multilist_remove() has been
added and produces the following output:
fio-12498 [077] .... 112936.448324: zfs_multilist__insert: ml { offset 240 numsublists 80 sublistidx 63 }
fio-12498 [077] .... 112936.448347: zfs_multilist__remove: ml { offset 240 numsublists 80 sublistidx 29 }
The following arcstats have been removed:
recycle_miss - Used by arcstat.py and arc_summary.py, both of which
have been updated appropriately.
l2_writes_hdr_miss
The following arcstats have been added:
evict_not_enough - Number of times arc_evict_state() was unable to
evict enough buffers to reach its target amount.
evict_l2_skip - Number of times arc_evict_hdr() skipped eviction
because it was being written to the l2arc.
l2_writes_lock_retry - Replaces l2_writes_hdr_miss. Number of times
l2arc_write_done() failed to acquire hash_lock (and re-tries).
arc_meta_min - Shows the value of the zfs_arc_meta_min module
parameter (see below).
The "index" column of the "dbuf" kstat has been removed since it doesn't
have a direct analog in the new multilist scheme. Additional multilist-
related stats could be added in the future but would likely require
extensions to the mulilist API.
The following module parameters have been added:
zfs_arc_evict_batch_limit - Number of ARC headers to free per sub-list
before moving on to the next sub-list.
zfs_arc_meta_min - Enforce a floor on the amount of metadata in
the ARC.
zfs_arc_num_sublists_per_state - Number of multilist sub-lists per
ARC state.
zfs_arc_overflow_shift - Controls amount by which the ARC must exceed
the target size to be considered "overflowing".
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov
* Add information about the 'zpool events' command in zpool(8).
* More events and payloads defined in zfs-events(5).
* I/O Stages and I/O Flags sections added.
* Remove unused legacy "zio_deadline" payload define.
Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3467
If the pool/dataset command-line argument is specified with a trailing
slash, for example, "tank/", it is interpreted as the root dataset.
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3415