Commit Graph

2000 Commits

Author SHA1 Message Date
mmacy
cffd15327b ZFS/MFV: Use cached feature info in spa_add_feature_stats()
commit 417104bdd3c7ce07ec58674dd078f9891c3bc780
Author: Ned Bass <bass6@llnl.gov>
Date:   Thu Feb 26 12:24:11 2015 -0800

    Use cached feature info in spa_add_feature_stats()

    Avoid issuing I/O to the pool when retrieving feature flags information.
    Trying to read the ZAPs from disk means that zpool clear would hang if
    the pool is suspended and recovery would require a reboot. To keep the
    feature stats resident in memory, we hang a cached nvlist off of the
    spa.  It is built up from disk the first time spa_add_feature_stats() is
    called, and refreshed thereafter using the cached feature reference
    counts. spa_add_feature_stats() gets called at pool import time so we
    can be sure the cached nvlist will be available if the pool is later
    suspended.

    Signed-off-by: Ned Bass <bass6@llnl.gov>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #3082
2018-08-10 23:42:11 +00:00
mmacy
5160573320 Performance optimization of AVL tree comparator functions
MFV:
commit ee36c709c3d5f7040e1bd11f5c75318aa03e789f
Author: Gvozden Neskovic <neskovic@gmail.com>
Date:   Sat Aug 27 20:12:53 2016 +0200

    perf: 2.75x faster ddt_entry_compare()
        First 256bits of ddt_key_t is a block checksum, which are expected
    to be close to random data. Hence, on average, comparison only needs to
    look at first few bytes of the keys. To reduce number of conditional
    jump instructions, the result is computed as: sign(memcmp(k1, k2)).

    Sign of an integer 'a' can be obtained as: `(0 < a) - (a < 0)` := {-1, 0, 1} ,
    which is computed efficiently.  Synthetic performance evaluation of
    original and new algorithm over 1G random keys on 2.6GHz Intel(R) Xeon(R)
    CPU E5-2660 v3:

    old     6.85789 s
    new     2.49089 s

    perf: 2.8x faster vdev_queue_offset_compare() and vdev_queue_timestamp_compare()
        Compute the result directly instead of using conditionals

    perf: zfs_range_compare()
        Speedup between 1.1x - 2.5x, depending on compiler version and
    optimization level.

    perf: spa_error_entry_compare()
        `bcmp()` is not suitable for comparator use. Use `memcmp()` instead.

    perf: 2.8x faster metaslab_compare() and metaslab_rangesize_compare()
    perf: 2.8x faster zil_bp_compare()
    perf: 2.8x faster mze_compare()
    perf: faster dbuf_compare()
    perf: faster compares in spa_misc
    perf: 2.8x faster layout_hash_compare()
    perf: 2.8x faster space_reftree_compare()
    perf: libzfs: faster avl tree comparators
    perf: guid_compare()
    perf: dsl_deadlist_compare()
    perf: perm_set_compare()
    perf: 2x faster range_tree_seg_compare()
    perf: faster unique_compare()
    perf: faster vdev_cache _compare()
    perf: faster vdev_uberblock_compare()
    perf: faster fuid _compare()
    perf: faster zfs_znode_hold_compare()

    Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
    Signed-off-by: Richard Elling <richard.elling@gmail.com>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #5033
2018-08-10 06:42:08 +00:00
mav
592eca3179 Reduce taskq and context-switch cost of zio pipe
When doing a read from disk, ZFS creates 3 ZIO's: a zio_null(), the
logical zio_read(), and then a physical zio. Currently, each of these
results in a separate taskq_dispatch(zio_execute).

On high-read-iops workloads, this causes a significant performance
impact. By processing all 3 ZIO's in a single taskq entry, we reduce the
overhead on taskq locking and context switching.  We accomplish this by
allowing zio_done() to return a "next zio to execute" to zio_execute().

This results in a ~12% performance increase for random reads, from
96,000 iops to 108,000 iops (with recordsize=8k, on SSD's).

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: George Wilson <george.wilson@delphix.com>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
External-issue: DLPX-59292
Closes #7736

zfsonlinux/zfs@62840030a7
2018-08-03 02:16:45 +00:00
mav
a68b7794d2 MFV r337223:
9580 Add a hash-table on top of nvlist to speed-up operations

illumos/illumos-gate@2ec7644aab

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
2018-08-03 01:52:25 +00:00
mav
d1a9e67be4 MFV r337220: 8375 Kernel memory leak in nvpair code
illumos/illumos-gate@843c2111b1

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2018-08-03 01:30:03 +00:00
mav
730a64d03c MFV r337218: 7261 nvlist code should enforce name length limit
illumos/illumos-gate@48dd5e630c

Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2018-08-03 01:26:07 +00:00
mav
cd3a34292e MFV r337216: 7263 deeply nested nvlist can overflow stack
illumos/illumos-gate@9ca527c3d3

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2018-08-03 01:09:12 +00:00
mav
067c6a290b MFV 337214:
9621 Make createtxg and guid properties public

illumos/illumos-gate@e8d4a73c86

Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Yuri Pankov <yuripv@yuripv.net>
Approved by: Robert Mustacchi <rm@joyent.com>
Author:     Josh Paetzel <josh@tcbug.org>
2018-08-03 00:24:27 +00:00
mav
42b94416b6 MFV r337212:
9465 ARC check for 'anon_size > arc_c/2' can stall the system

illumos/illumos-gate@abe1fd01ce

Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author:     Don Brady <don.brady@delphix.com>
2018-08-03 00:14:36 +00:00
mav
78003d8dd8 MFV r337210: 9577 remove zfs_dbuf_evict_key tsd
The zfs_dbuf_evict_key TSD (thread-specific data) is not necessary - we can
instead pass a flag down in a few places to prevent recursive dbuf eviction.
Making this change has 3 benefits:

1. The code semantics are easier to understand.
2. On Linux, performance is improved, because creating/removing TSD values
(by setting to NULL vs non-NULL) is expensive, and we do it very often.
3. According to Nexenta, the current semantics can cause a deadlock when
concurrently calling dmu_objset_evict_dbufs() (which is rare today, but they
are working on a "parallel unmount" change that triggers this more easily)

illumos/illumos-gate@c2919acbea

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author:     Matthew Ahrens <mahrens@delphix.com>
2018-08-03 00:01:48 +00:00
mav
078a6fb8f3 MFV r337208: 9591 ms_shift can be incorrectly changed in MOS config for
indirect vdevs that have been historically expanded

illumos/illumos-gate@11f6a9680e

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author:     Serapheim Dimitropoulos <serapheim@delphix.com>
2018-08-02 23:56:07 +00:00
mav
2b17cdbaca MFV r337206: 9338 moved dnode has incorrect dn_next_type
illumos/illumos-gate@c7fbe46df9

Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author:     Matthew Ahrens <mahrens@delphix.com>
2018-08-02 23:50:03 +00:00
mav
253d52f67c MFV r337204: 9439 ZFS double-free due to failure to dirty indirect block
illumos/illumos-gate@99a19144e8

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author:     Matthew Ahrens <mahrens@delphix.com>
2018-08-02 23:46:30 +00:00
mav
b2141aeb01 MFV r337200:
9438 Holes can lose birth time info if a block has a mix of birth times

Ultimately, the problem here is that when you truncate and write a file in
the same transaction group, the dbuf for the indirect block will be zeroed
out to deal with the truncation, and then written for the write. During
this process, we will lose hole birth time information for any holes in the
range. In the case where a dnode is being freed, we need to determine
whether the block should be converted to a higher-level hole in the zio
pipeline, and if so do it when the dnode is being synced out.

illumos/illumos-gate@738e2a3ce3

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author:     Paul Dagnelie <pcd@delphix.com>
2018-08-02 23:43:01 +00:00
mav
22e6be616f Fix build after r337196 mismerge. 2018-08-02 23:40:28 +00:00
mav
4fe3524922 MFV r337197: 9456 ztest failure in zil_commit_waiter_timeout
illumos/illumos-gate@b6031810da

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author:     Prakash Surya <prakash.surya@delphix.com>
2018-08-02 23:25:49 +00:00
mav
9b1a0e4bff MFV r337195: 9454 ::zfs_blkstats should count embedded blocks
illumos/illumos-gate@dec267e7ea

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author:     Matthew Ahrens <mahrens@delphix.com>
2018-08-02 23:23:10 +00:00
mav
2aa2398e3b MFV r337193:
9424 ztest failure: "unprotected error in call to Lua API (Invalid value type 'f
unction' for key 'error')"

illumos/illumos-gate@fe3ba4d122

Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Don Brady <don.brady@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author:     Matthew Ahrens <mahrens@delphix.com>
2018-08-02 23:15:10 +00:00
mav
486e11723b MFV r337190: 9486 reduce memory used by device removal on fragmented pools
In the most fragmented real-world cases, this reduces memory used by the
mapping from ~1GB to ~50MB of RAM per 1TB of storage removed. Less
fragmented cases will typically also see around 50-100MB of RAM per 1TB
of storage.

illumos/illumos-gate@cfd63e1b1b

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author:     Matthew Ahrens <mahrens@delphix.com>
2018-08-02 21:59:46 +00:00
mav
84ffe82515 MFV r337182: 9330 stack overflow when creating a deeply nested dataset
Datasets that are deeply nested (~100 levels) are impractical. We just put
a limit of 50 levels to newly created datasets. Existing datasets should
work without a problem.

illumos/illumos-gate@5ac95da7d6

Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author:     Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
2018-08-02 21:19:35 +00:00
mav
a93c53b4de 9539 Make zvol operations use _by_dnode routines
Continues what was started in 7801 add more by-dnode routines by fully
converting zvols to avoid unnecessary dnode_hold() calls. This saves a
small amount of CPU time and slightly improves latencies of operations
on zvols.

illumos/illumos-gate@8dfe5547fb

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Rick McNeal <rick.mcneal@nexenta.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author:     Richard Yao <richard.yao@prophetstor.com>
2018-08-02 21:07:04 +00:00
mav
cdfff2b3ed MFV r337175: 9487 Free objects when receiving full stream as clone
All objects after the last written or freed object are not supposed to
exist after receiving the stream. We should free them accordingly, as if
a freeobjects record for them had been included in the stream.

zfsonlinux/zfs@48fbb9ddbf
illumos/illumos-gate@7864b8192b

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author:     Paul Dagnelie <pcd@delphix.com>
2018-08-02 20:33:13 +00:00
mav
d0e26b6b03 MFV r337171:
9464 txg_kick() fails to see that we are quiescing, forcing transactions
to their next stages without leaving them accumulate changes

Ideally we would like txg_kick() to get triggered only when we are sure
that we are not syncing AND not quiescing any txg. This way we can kick
an open TXG to the quiescing state when we are sure that there is nothing
going on and we would benefit from the different states running
concurrently.

illumos/illumos-gate@fa41d87de9

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Author:     Serapheim Dimitropoulos <serapheim@delphix.com>
2018-08-02 20:18:49 +00:00
mav
6e7827dd75 MFV r337167: 9442 decrease indirect block size of spacemaps
Updates to indirect blocks of spacemaps can contribute significantly to
write inflation.  Therefore we want to reduce the indirect block size of
spacemaps from 128K to 16K.

illumos/illumos-gate@221813c13b

Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Albert Lee <trisk@forkgnu.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Author:     Matthew Ahrens <mahrens@delphix.com>
2018-08-02 20:06:46 +00:00
mav
c6e40e5893 MFV r337029:
9426 metaslab size can exceed offset addressable by spacemap

metaslab size can exceed offset addressable by spacemap. The vdev can
address up to 2^63 * SPA_MAXBLOCKSIZE (512). A metaslab can address up to
2^47 * 2^vdev_ashift. Therefore we may need to increase the number of
metaslabs so that the maximum metaslab size is capped at the amount that
can be addressed by the spacemap. This should happen in
vdev_metaslab_set_size().

illumos/illumos-gate@b4bf0cf045

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author:     Don Brady <don.brady@delphix.com>
2018-08-01 03:21:17 +00:00
mav
89566ef6ce MFV r337027:
9328 zap code can take advantage of c99
9329 panic in zap_leaf_lookup() due to concurrent zapification

illumos/illumos-gate@bf26014c55

Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author:     Matthew Ahrens <mahrens@delphix.com>
2018-08-01 03:07:33 +00:00
mav
98de93d873 MFV r337022:
9403 assertion failed in arc_buf_destroy() when concurrently reading block with checksum error

This assertion (VERIFY) failure was reported when reading a block. Turns out
the problem is that if we get an i/o error (ECKSUM in this case), and there
are multiple concurrent ARC reads of the same block (from different clones),
then the ARC will put multiple buf's on the same ANON hdr, which isn't
supposed to happen, and then causes a panic when we try to arc_buf_destroy()
the buf.

illumos/illumos-gate@fa98e487a9

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author:     Matthew Ahrens <mahrens@delphix.com>
2018-08-01 02:39:44 +00:00
mav
9cab53843c MFV r337020:9443 panic when scrub a v10 pool
illumos/illumos-gate@bb1f424574

Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Author:     Matthew Ahrens <mahrens@delphix.com>
2018-07-31 23:00:58 +00:00
mav
3d8e119409 MFV r337014:
9421 zdb should detect and print out the number of "leaked" objects
9422 zfs diff and zdb should explicitly mark objects that are on the deleted queue

illumos/illumos-gate@20b5dafb42

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author:     Paul Dagnelie <pcd@delphix.com>
2018-07-31 22:50:50 +00:00
mav
7bffd764fe MFV r336991, r337001:
9102 zfs should be able to initialize storage devices

The first access to a disk block can incur a performance penalty on some
platforms (e.g. AWS's EBS, VMware VMDKs). Therefore it is recommended that
volumes be "thick provisioned", where supported by the platform (VMware).
Thick provisioning is time consuming and often is ignored. If the thick
provision step is omitted, customers will see suboptimal performance until
we have written to all parts of the LUN. ZFS should be able to initialize
any unused storage to remove any first-write penalty that exists.

illumos/illumos-gate@094e47e980

Reviewed by: John Wren Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author:     George Wilson <george.wilson@delphix.com>
2018-07-31 21:06:04 +00:00
mav
ecba38e72f MFV r336960: 9256 zfs send space estimation off by > 10% on some datasets
illumos/illummos-gate@df477c0afa

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author:     Paul Dagnelie <pcd@delphix.com>
2018-07-31 01:02:22 +00:00
mav
3a353ba423 MFV r336958: 9337 zfs get all is slow due to uncached metadata
This project's goal is to make read-heavy channel programs and zfs(1m)
administrative commands faster by caching all the metadata that they will
need in the dbuf layer. This will prevent the data from being evicted, so
that any future call to i.e. zfs get all won't have to go to disk (very
much).

illumos/illumos-gate@adb52d9262

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author:     Matthew Ahrens <mahrens@delphix.com>
2018-07-31 00:58:21 +00:00
mav
5947e79ef4 MFV r336955: 9236 nuke spa_dbgmsg
We should use zfs_dbgmsg instead of spa_dbgmsg.  Or at least,
metaslab_condense() should call zfs_dbgmsg because it's important and rare
enough to always log. It's possible that the message in zio_dva_allocate()
would be too high-frequency for zfs_dbgmsg.

illumos/illumos-gate@21f7c81cc1

Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>
2018-07-31 00:47:27 +00:00
mav
a431f9b1ff MFV r336952: 9192 explicitly pass good_writes to vdev_uberblock/label_sync
Currently vdev_label_sync and vdev_uberblock_sync take a zio_t and assume
that its io_private is a pointer to the good_writes count. They should
instead accept this argument explicitly.

illumos/illumos-gate@a3b5583021

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>
2018-07-31 00:37:45 +00:00
mav
5c26614dcc MFV r336950: 9290 device removal reduces redundancy of mirrors
Mirrors are supposed to provide redundancy in the face of whole-disk failure
and silent damage (e.g. some data on disk is not right, but ZFS hasn't
detected the whole device as being broken). However, the current device
removal implementation bypasses some of the mirror's redundancy.

illumos/illumos-gate@3a4b1be953

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>
2018-07-31 00:25:39 +00:00
mav
92f5755f2b MFV r336948: 9112 Improve allocation performance on high-end systems
On high-end systems running async sequential write workloads, especially
NUMA systems with flash or NVMe storage, one significant performance
bottleneck is selecting a metaslab to do allocations from. This process
can be parallelized, providing significant performance increases for
these workloads.

illumos/illumos-gate@f78cdc34af

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Alexander Motin <mav@FreeBSD.org>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Paul Dagnelie <pcd@delphix.com>
2018-07-31 00:02:42 +00:00
mav
8e1f6ca37d MFV r336946: 9238 ZFS Spacemap Encoding V2
The current space map encoding has the following disadvantages:
[1] Assuming 512 sector size each entry can represent at most 16MB for a segment.
This makes the encoding very inefficient for large regions of space.
[2] As vdev-wide space maps have started to be used by new features (i.e.
device removal, zpool checkpoint) we've started imposing limits in the
vdevs that can be used with them based on the maximum addressable offset
(currently 64PB for a top-level vdev).

The new remains backwards compatible with the old one. The introduced
two-word entry format, besides extending the limits imposed by the single-entry
layout, also includes a vdev field and some extra padding after its prefix.

The extra padding after the prefix should is reserved for future usage (e.g.
new prefixes for future encodings or new fields for flags). The new vdev field
not only makes the space maps more self-descriptive, but also opens the doors
for pool-wide space maps.

One final important note is that the number of bits used for vdevs is reduced
to 24 bits for blkptrs. That was decided as we don't know of any setups that
use more than 16M vdevs for the time being and
we wanted to fit the vdev field in the space map. In addition that gives us
some extra bits in dva_t.

illumos/illumos-gate@17f11284b4

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
2018-07-30 23:47:38 +00:00
mav
7a5d3c92c7 MFV r336942: 9189 Add debug to vdev_label_read_config when txg check fails
illumos/illumos-gate@b6bf6e1540

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
2018-07-30 22:03:29 +00:00
allanjude
b8150c4df5 ZFS: Reserve DMU_BACKUP_FEATURE flags for Native Encryption and ZSTD 2018-07-24 04:38:11 +00:00
sef
b99e763f8e Fix a couple of typos in r334844 noticed by Richard Kojedzinszky.
Submitted by:	Richard Kojedzinszky
Reviewed by:	sef
Approved by:	mav
2018-07-18 16:03:40 +00:00
jhibbits
c959497ef5 dtrace/powerpc: Correct register indices for non-indexed registers in the trapframe
Fix an off-by-one error, LR starts at index 32, not index 33, and the others
follow suit.
2018-07-16 19:47:29 +00:00
sef
bfd14e4f72 Fix up some missed and mis-merges from the sequential scan code
(r334844). Most of the changes involve moving some code around to
reduce conflicts with future merges.  One of the missing changes
included a notification on scrub cancellation.

Approved by:	mav
Sponsored by:	iXsystems Inc
2018-07-10 20:11:32 +00:00
sef
107a344bbb This exposes ZFS user and group quotas via the normal
quatactl(2) mechanism.  (Read-only at this point, however.)
In particular, this is to allow rpc.rquotad query quotas
for NFS mounts, allowing users to see their quotas on the
hosts using the datasets.

The changes specifically:

* Add new RPC entry points for querying quotas.
* Changes the library routines to allow non-UFS quotas.
* Changes rquotad to check for quotas on mounted filesystems,
rather than being limited to entries in /etc/fstab
* Lastly, adds a VFS entry-point for ZFS to query quotas.

Note that this makes one unavoidable behavioural change: if quotas
are enabled, then they can be queried, as opposed to the current
method of checking for quotas being specified in fstab.  (With
ZFS, if there are user or group quotas, they're used, always.)

Reviewed by:	delphij, mav
Approved by:	mav
Sponsored by:	iXsystems Inc
Differential Revision:	https://reviews.freebsd.org/D15886
2018-07-05 22:56:13 +00:00
mmacy
6f2eb073ec opensolaris compat: fix compile error
when opensolaris/sys/types.h is included before stddef.h ptrdiff_t
would be typedef'd twice
2018-07-03 23:45:02 +00:00
sef
a73aecd5f1 This originated from ZFS On Linux, as
d4a72f2386

During scans (scrubs or resilvers), it sorts the blocks in each transaction
group by block offset; the result can be a significant improvement. (On my
test system just now, which I put some effort to introduce fragmentation into
the pool since I set it up yesterday, a scrub went from 1h2m to 33.5m with the
changes.) I've seen similar rations on production systems.

Approved by:	Alexander Motin
Obtained from:	ZFS On Linux
Relnotes:	Yes (improved scrub performance, with tunables)
Differential Revision:	https://reviews.freebsd.org/D15562
2018-06-08 17:38:28 +00:00
benno
6dafa53868 Break recursion involving getnewvnode and zfs_rmnode.
When we're at our vnode limit, getnewvnode will call into the vnode LRU
cache to free up vnodes. If the vnode we try to recycle is a ZFS vnode we
end up, eventually, in zfs_rmnode. If the ZFS vnode we're recycling
represents something with extended attributes, zfs_rmnode will call
zfs_zget which will attempt to allocate another vnode. If the next vnode we
try to recycle is also a ZFS vnode representing something with extended
attributes we can recurse further. This ends up being unbounded and can end
up overflowing the stack.

In order to avoid this, restructure zfs_rmnode to simply add the extended
attribute directory's object ID to the unlinked set, thus not requiring the
allocation of a vnode. We then schedule a task that calls zfs_unlinked_drain
which will do the work of properly marking the vnodes for unlinking.
zfs_unlinked_drain is also called on mount so these will be cleaned up
there.

Reviewed by:	avg, mav
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D15342
2018-06-07 18:59:32 +00:00
jhibbits
45332e3d1f Revert r326083, it doesn't behave as expected.
Even though there do appear to be more artificial frames, with 12, stack
traces no longer list at all.  Revert until a better, more stable value can
be determined.
2018-06-03 03:53:11 +00:00
jhibbits
c0a10a2d85 Protect dtrace_getpcstack() from a NULL stack pointer in a trap frame
Found when trying to use lockstat on a POWER9, the stack pointer (r1) could
be NULL, and result in a NULL pointer dereference, crashing the kernel.
2018-05-30 03:48:27 +00:00
hselasky
af8c4fe8c2 Fix 32-bit buildworld for i386 after r334320.
The 64-bit atomics defined for i386 are currently only available in
the kernel space.

Found by:	cy@
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2018-05-29 13:43:16 +00:00
hselasky
831d81975f Implement atomic_add_64() and atomic_subtract_64() for the i386 target.
While at it add missing _acq_ and _rel_ variants for 64-bit atomic
operations under i386.

Reviewed by:	kib @
MFC after:	1 week
Sponsored by:	Mellanox Technologies
2018-05-29 11:59:02 +00:00