freebsd-skq

Author	SHA1	Message	Date
Andriy Gapon	bc390f4947	7801 add more by-dnode routines (lint) illumos/illumos-gate@411be58a6e `411be58a6e` https://www.illumos.org/issues/7801 Add _by_dnode() routines for accessing objects given their dnode_t , this is more efficient than accessing the object by (objset_t *, uint64_t object). This change converts some but not all of the existing consumers. As performance-sensitive code paths are discovered they should be converted to use these routines. Ported from: https://github.com/zfsonlinux/zfs/commit/ 0eef1bde31d67091d3deed23fe2394f5a8bf2276 Author: Matthew Ahrens <mahrens@delphix.com>	2017-04-14 18:25:22 +00:00
Andriy Gapon	fd870ba29a	7801 add more by-dnode routines illumos/illumos-gate@b0c42cd470 `b0c42cd470` https://www.illumos.org/issues/7801 Add _by_dnode() routines for accessing objects given their dnode_t , this is more efficient than accessing the object by (objset_t *, uint64_t object). This change converts some but not all of the existing consumers. As performance-sensitive code paths are discovered they should be converted to use these routines. Ported from: https://github.com/zfsonlinux/zfs/commit/ 0eef1bde31d67091d3deed23fe2394f5a8bf2276 Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: bzzz77 <bzzz.tomas@gmail.com>	2017-04-14 18:25:02 +00:00
Andriy Gapon	c9327351e8	7869 panic in bpobj_space(): null pointer dereference illumos/illumos-gate@a3905a4592 `a3905a4592` https://www.illumos.org/issues/7869 The issue fixed by this patch is a race condition in the deadlist code. A thread executing an administrative command that uses `dsl_deadlist_space_range()` holds the lock of the whole `deadlist_t` to protect the access of all its entries that the deadlist contains in an avl tree. Sync threads trying to insert a new entry in the deadlist (through `dsl_deadlist_insert()` -> `dle_enqueue()`) do not hold the deadlist lock at that moment. If the `dle_bpobj` is the empty bpobj (our sentinel value), we close and reopen it. Between these two operations, it is possible for the `dsl_deadlist_space_range()` thread to dereference that bpobj which is `NULL` during that window. Threads should hold the a deadlist's `dl_lock` when they manipulate its internal data so scenarios like the one above are avoided. In addition, threads should also hold the bpobj lock whenever they are allocating the subobj list of a bpobj, and not just when they actually insert the subobj to the list. This way we can avoid potential memory leaks. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: George Melikov <mail@gmelikov.ru> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Dan McDonald <danmcd@omniti.com> Author: Serapheim Dimitropoulos <serapheim@delphix.com>	2017-04-14 18:24:38 +00:00
Andriy Gapon	1ea6ff796d	7793 ztest fails assertion in dmu_tx_willuse_space illumos/illumos-gate@61e255ce72 `61e255ce72` https://www.illumos.org/issues/7793 Background information: This assertion about tx_space_* verifies that we are not dirtying more stuff than we thought we would. We “need” to know how much we will dirty so that we can check if we should fail this transaction with ENOSPC/EDQUOT, in dmu_tx_assign(). While the transaction is open (i.e. between dmu_tx_assign() and dmu_tx_commit() — typically less than a millisecond), we call dbuf_dirty() on the exact blocks that will be modified. Once this happens, the temporary accounting in tx_space_* is unnecessary, because we know exactly what blocks are newly dirtied; we call dnode_willuse_space() to track this more exact accounting. The fundamental problem causing this bug is that dmu_tx_hold_() relies on the current state in the DMU (e.g. dn_nlevels) to predict how much will be dirtied by this transaction, but this state can change before we actually perform the transaction (i.e. call dbuf_dirty()). This bug will be fixed by removing the assertion that the tx_space_ accounting is perfectly accurate (i.e. we never dirty more than was predicted by dmu_tx_hold_()). By removing the requirement that this accounting be perfectly accurate, we can also vastly simplify it, e.g. removing most of the logic in dmu_tx_count_(). The new tx space accounting will be very approximate, and may be more or less than what is actually dirtied. It will still be used to determine if this transaction will put us over quota. Transactions that are marked by dmu_tx_mark_netfree() will be excepted from this check. We won’t make an attempt to determine how much space will be freed by the transaction — this was rarely accurate enough to determine if a transaction should be permitted when we are over quota, which is why dmu_tx_mark_netfree() was introduced in 2014. We also won’t attempt to give “credit” when overwriting existing blocks, if those blocks may be freed. This allows us to remove the do_free_accounting logic in dbuf_dirty(), and associated routines. This Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com>	2017-04-14 18:24:03 +00:00
Andriy Gapon	59dc4d6bbc	7816 remove static unused variable in zfs_vfsops.c illumos/illumos-gate@2e972bf18f `2e972bf18f` https://www.illumos.org/issues/7816 found by gcc6 build unused static variable: -static const fs_operation_def_t zfs_vfsops_eio_template[] = { - VFSNAME_FREEVFS, { .vfs_freevfs = zfs_freevfs }, - NULL, NULL -}; Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Andy Stormont astormont@racktopsystems.com Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Igor Kozhukhov <igork@argotech.io>	2017-04-14 18:23:18 +00:00
Andriy Gapon	d3f85b54a4	7812 Remove gender specific language illumos/illumos-gate@48bbca8168 `48bbca8168` https://www.illumos.org/issues/7812 This change removes all gendered language that did not refer specifically to an individual person or pet. The convention taken was to use variations on "they" when referring to users and/or human beings, while using "it" when referring to code, functions, and/or libraries. Additionally, we took the liberty to fix up any whitespace issues that were found in any files that were already being modified. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: Chris Williamson <chris.williamson@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Igor Kozhukhov <igor@dilos.org> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Robert Mustacchi <rm@joyent.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Daniel Hoffman <dj.hoffman@delphix.com>	2017-04-14 18:22:42 +00:00
Andriy Gapon	b71204d77d	1300 filename normalization doesn't work for removes illumos/illumos-gate@1c17160ac5 `1c17160ac5` https://www.illumos.org/issues/1300 Problem: We can create invisible file in ZFS. How to reproduce: 0. Prepare normalization formD ZFS. # cat cat /etc/release cat: cat: No such file or directory OpenIndiana Development oi_151 X86 (powered by illumos) Copyright 2011 Oracle and/or its affiliates. All rights reserved. Use is subject to license terms. Assembled 28 April 2011 # mkfile 100M 100M # zpool create -O utf8only=on -O normalization=formD test_pool $( pwd )/ 100M # zpool upgrade This system is currently running ZFS pool version 28. All pools are formatted using this version. # zfs get normalization test_pool NAME PROPERTY VALUE SOURCE test_pool normalization formD - # chmod 777 /test_pool 1. Create a NFD file. $ cd /test_pool/ $ cp /etc/release $( echo "\x75\xcc\x88" ) $ ls -la total 4 drwxrwxrwx 2 root root 3 2011-07-29 08:53 . drwxr-xr-x 25 root root 26 2011-07-29 08:53 .. -r--r--r-- 1 test1 staff 251 2011-07-29 08:53 u? Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Kevin Crowe <kevin.crowe@nexenta.com>	2017-04-14 18:20:20 +00:00
Andriy Gapon	be49e7b29b	4242 file rename event fires before the rename happens illumos/illumos-gate@54207fd2e1 `54207fd2e1` https://www.illumos.org/issues/4242 From Joyent's OS-2557: So we're basically just doing a check here that after we got a 'rename' event for our file, the file has actually been moved out of the way. What I've seen in bh1-stage2 is that this happens as you'd expect for all zones but many times over the last week it's failed because when we do the fs.exists () check here, the file that we got the rename event for, still exists. I've confirmed what's happening using the following dtrace script: #!/usr/sbin/dtrace -s #pragma D option quiet #pragma D option bufsize=256k syscall::open:entry, syscall::open64:entry /copyinstr(arg0) == "/var/svc/provisioning" \|\| (strlen(copyinstr(arg0)) == 69 && substr(copyinstr(arg0), 48) == "/var/svc/provisioning")/ { this->watching_open = 1; printf("%d zone %s process %s(%d) [%s] open(%s)\\n", timestamp, zonename, execname, pid, curpsinfo->pr_psargs, copyinstr(arg0)); } syscall::open:return, syscall::open64:return /this->watching_open == 1/ Reviewed by: Robert Mustacchi <rm@joyent.com> Reviewed by: Marcel Telka <marcel@telka.sk> Approved by: Dan McDonald <danmcd@omniti.com> Author: Jerry Jelinek <jerry.jelinek@joyent.com>	2017-04-14 18:19:48 +00:00
Andriy Gapon	9dfe195883	7740 fix for 6513 only works in hole punching case, not truncation illumos/illumos-gate@7de35a3ed0 `7de35a3ed0` https://www.illumos.org/issues/7740 The problem is that dbuf_findbp will return ENOENT if the block it's trying to find is beyond the end of the file. If that happens, we assume there is no birth time, and so we lose that information when we write out new blkptrs. We should teach dbuf_findbp to look for things that are beyond the current end, but not beyond the absolute end of the file. To verify, create a large file, truncate it to a short length, and then write beyond the end. Check with zdb to make sure that there are no holes with birth time zero (will appear as gaps). Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Paul Dagnelie <pcd@delphix.com>	2017-04-14 18:15:47 +00:00
Andriy Gapon	16244ff154	7779 clean up unused definitions in zfs ctldir code illumos/illumos-gate@b7f9f60c8e `b7f9f60c8e` https://www.illumos.org/issues/7779 zfsctl_ops_shares_dir and ZFSCTL_INO_SHARES are essentially dead code. While there, fix the index range check in zfsctl_root_inode_cb. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Andriy Gapon <andriy.gapon@clusterhq.com>	2017-04-14 18:14:41 +00:00
Andriy Gapon	768d3677f6	7743 per-vdev-zaps have no initialize path on upgrade illumos/illumos-gate@555da5111b `555da5111b` https://www.illumos.org/issues/7743 When loading a pool that had been created before the existance of per-vdev zaps, on a system that knows about per-vdev zaps, the per-vdev zaps will not be allocated and initialized. This appears to be because the logic that would have done so, in spa_sync_config_object(), is not reached under normal operation. It is only reached if spa_config_dirty_list is non-empty. The fix is to add another `AVZ_ACTION_` enum that will allow this code to be reached when we detect that we're loading an old pool, even when there are no dirty configs. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Don Brady <don.brady@intel.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Paul Dagnelie <pcd@delphix.com>	2017-04-14 18:13:06 +00:00
Andriy Gapon	2bb93dd84d	7659 Missing thread_exit() in dmu_send.c illumos/illumos-gate@f2c1e9bc48 `f2c1e9bc48` https://www.illumos.org/issues/7659 Two threads send_traverse_thread() and receive_writer_thread() should end with thread_exit(); Mostly a cosmetic issue under IllumOS. https://github.com/openzfs/openzfs/pull/252 Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Jorgen Lundman <lundman@lundman.net>	2017-04-14 18:11:53 +00:00
Andriy Gapon	c5c6a287f1	7613 ms_freetree[4] is only used in syncing context illumos/illumos-gate@5f14577801 `5f14577801` https://www.illumos.org/issues/7613 metaslab_t:ms_freetree[TXG_SIZE] is only used in syncing context. We should replace it with two trees: the freeing tree (ranges that we are freeing this syncing txg) and the freed tree (ranges which have been freed this txg). Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Matthew Ahrens <mahrens@delphix.com>	2017-04-14 18:11:16 +00:00
Andriy Gapon	26af00ef5b	7586 remove #ifdef __lint hack from dmu.h illumos/illumos-gate@4ba5b96163 `4ba5b96163` https://www.illumos.org/issues/7586 The #ifdef __lint in dmu.h is ugly, and it would be nice not to duplicate it if we add other inline functions into header files in ZFS, especially since it is difficult to make any other solution work across all compilation targets. We should switch to disabling the lint flags that are failing instead. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Dan Kimmel <dan.kimmel@delphix.com>	2017-04-14 18:10:23 +00:00
Andriy Gapon	2bfa921051	7580 ztest failure in dbuf_read_impl illumos/illumos-gate@1a01181fdc `1a01181fdc` https://www.illumos.org/issues/7580 We need to prevent any reader whenever we're about the zero out all the blkptrs. To do this we need to grab the dn_struct_rwlock as writer in dbuf_write_children_ready and free_children just prior to calling bzero. Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: George Wilson <george.wilson@delphix.com>	2017-04-14 18:09:32 +00:00
Andriy Gapon	cdeb4e5f08	7606 dmu_objset_find_dp() takes a long time while importing pool illumos/illumos-gate@7588687e6b `7588687e6b` https://www.illumos.org/issues/7606 When importing a pool with a large number of filesystems within the same parent filesystem, we see that dmu_objset_find_dp() takes a long time. It is called from 3 places: spa_check_logs(), spa_ld_claim_log_blocks(), and spa_load_verify(). There are several ways to improve performance here: 1. We don't really need to do spa_check_logs() or spa_ld_claim_log_blocks() if the pool was closed cleanly. 2. spa_load_verify() uses dmu_objset_find_dp() to check that no datasets have too long of names. 3. dmu_objset_find_dp() is slow because it's doing zap_value_search() (which is O(N sibling datasets)) to determine the name of each dsl_dir when it's opened. In this case we actually know the name when we are opening it, so we can provide it and avoid the lookup. This change implements fix #3 from the above list; i.e. make dmu_objset_find_dp() provide the name of the dataset so that we don't have to search for it. Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prashanth Sreenivasa <prashksp@gmail.com> Approved by: Gordon Ross <gordon.w.ross@gmail.com> Author: Matthew Ahrens <mahrens@delphix.com>	2017-04-14 18:08:50 +00:00
Andriy Gapon	b3264caf7b	7252 7628 compressed zfs send / receive illumos/illumos-gate@5602294fda `5602294fda` https://www.illumos.org/issues/7252 This feature includes code to allow a system with compressed ARC enabled to send data in its compressed form straight out of the ARC, and receive data in its compressed form directly into the ARC. https://www.illumos.org/issues/7628 We should have longer, more readable versions of the ZFS send / recv options. 7628 create long versions of ZFS send / receive options Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed by: David Quigley <dpquigl@davequigley.com> Reviewed by: Thomas Caputi <tcaputi@datto.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Dan Kimmel <dan.kimmel@delphix.com>	2017-04-14 18:07:43 +00:00
Andriy Gapon	8f77e16c27	7490 real checksum errors are silenced when zinject is on illumos/illumos-gate@6cedfc397d `6cedfc397d` https://www.illumos.org/issues/7490 When zinject is on, error codes from zfs_checksum_error() can be overwritten due to an incorrect and overly-complex if condition. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Pavel Zakharov <pavel.zakharov@delphix.com>	2017-04-14 17:15:49 +00:00
Andriy Gapon	aa06caf86c	7448 ZFS doesn't notice when disk vdevs have no write cache illumos/illumos-gate@295438ba32 `295438ba32` https://www.illumos.org/issues/7448 I built a SmartOS image with all the NVMe commits including 7372 (support NVMe volatile write cache) and repeated my dd testing: > #!/bin/bash > for i in `seq 1 1000`; do > dd if=/dev/zero of=file00 bs=1M count=102400 oflag=sync & > dd if=/dev/zero of=file01 bs=1M count=102400 oflag=sync & > wait > rm file00 file01 > done > Previously each dd command took ~145 seconds to finish, now it takes ~400 seconds. Eventually I figured out it is 7372 that causes unnecessary nvme_bd_sync() executions which wasted CPU cycles. If a NVMe device doesn't support a write cache, the nvme_bd_sync function will return ENOTSUP to indicate this to upper layers. It seems this returned value is ignored by ZFS, and as such this bug is not really specific to NVMe. In vdev_disk_io_start() ZFS sends the flush to the disk driver (blkdev) with a callback to vdev_disk_ioctl_done(). As nvme filled in the bd_sync_cache function pointer, blkdev will not return ENOTSUP, as the nvme driver in general does support cache flush. Instead it will issue an asynchronous flush to nvme and immediately return 0, and hence ZFS will not set vdev_nowritecache here. The nvme driver will at some point process the cache flush command, and if there is no write cache on the device it will return ENOTSUP, which will be delivered to the vdev_disk_ioctl_done() callback. This function will not check the error code and not set nowritecache. The right place to check the error code from the cache flush is in zio_vdev_io_assess(). This would catch both cases, synchronous and asynchronous cache flushes. This would also be independent of the implementation detail that some drivers can return ENOTSUP immediately. Reviewed by: Dan Fields <dan.fields@nexenta.com> Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Hans Rosenfeld <hans.rosenfeld@nexenta.com>	2017-04-14 17:15:11 +00:00
Andriy Gapon	40e6e931e0	7430 Backfill metadnode more intelligently illumos/illumos-gate@af346df588 `af346df588` https://www.illumos.org/issues/7430 Description and patch from brought over from the following ZoL commit: https:// github.com/zfsonlinux/zfs/commit/68cbd56e182ab949f58d004778d463aeb3f595c6 Only attempt to backfill lower metadnode object numbers if at least 4096 objects have been freed since the last rescan, and at most once per transaction group. This avoids a pathology in dmu_object_alloc() that caused O(N^2) behavior for create-heavy workloads and substantially improves object creation rates. As summarized by @mahrens in #4636: "Normally, the object allocator simply checks to see if the next object is available. The slow calls happened when dmu_object_alloc() checks to see if it can backfill lower object numbers. This happens every time we move on to a new L1 indirect block (i.e. every 32 * 128 = 4096 objects). When re-checking lower object numbers, we use the on-disk fill count (blkptr_t:blk_fill) to quickly skip over indirect blocks that don’t have enough free dnodes (defined as an L2 with at least 393,216 of 524,288 dnodes free). Therefore, we may find that a block of dnodes has a low (or zero) fill count, and yet we can’t allocate any of its dnodes, because they've been allocated in memory but not yet written to disk. In this case we have to hold each of the dnodes and then notice that it has been allocated in memory. The end result is that allocating N objects in the same TXG can require CPU usage proportional to N^2." Add a tunable dmu_rescan_dnode_threshold to define the number of objects that must be freed before a rescan is performed. Don't bother to export this as a module option because testing doesn't show a compelling reason to change it. The vast majority of the performance gain comes from limit the rescan to at most once per TXG. Reviewed by: Alek Pinchuk <alek@nexenta.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Gordon Ross <gordon.w.ross@gmail.com> Author: Ned Bass <bass6@llnl.gov>	2017-04-14 17:13:49 +00:00
Josh Paetzel	fedc1382bd	7603 xuio_stat_wbuf_* should be declared (void) illumos/illumos-gate@99aa8b5505 `99aa8b5505` https://www.illumos.org/issues/7603 The funcs are declared k&r style, where the args are not specified: void xuio_stat_wbuf_copied(); They should be declared to take no arguments: void xuio_stat_wbuf_copied(void); Need to change both .c and .h. Author: Prashanth Sreenivasa <pks@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Robert Mustacchi <rm@joyent.com> Approved by: Richard Lowe <richlowe@richlowe.net>	2017-03-26 16:49:20 +00:00
Josh Paetzel	7e45ac57cb	7303 dynamic metaslab selection illumos/illumos-gate@8363e80ae7 https://github.com/illumos/illumos-gate/commit/8363e80ae72609660f6090766ca8c2c18aa53f0 https://www.illumos.org/issues/7303 This change introduces a new weighting algorithm to improve metaslab selection. The new weighting algorithm relies on the SPACEMAP_HISTOGRAM feature. As a result, the metaslab weight now encodes the type of weighting algorithm used (size-based vs segment-based). This also introduce a new allocation tracing facility and two new dcmds to help debug allocation problems. Each zio now contains a zio_alloc_list_t structure that is populated as the zio goes through the allocations stage. Here's an example of how to use the tracing facility: > c5ec000::print zio_t io_alloc_list \| ::walk list \| ::metaslab_trace MSID DVA ASIZE WEIGHT RESULT VDEV - 0 400 0 NOT_ALLOCATABLE ztest.0a - 0 400 0 NOT_ALLOCATABLE ztest.0a - 0 400 0 ENOSPC ztest.0a - 0 200 0 NOT_ALLOCATABLE ztest.0a - 0 200 0 NOT_ALLOCATABLE ztest.0a - 0 200 0 ENOSPC ztest.0a 1 0 400 1 x 8M 17b1a00 ztest.0a > 1ff2400::print zio_t io_alloc_list \| ::walk list \| ::metaslab_trace MSID DVA ASIZE WEIGHT RESULT VDEV - 0 200 0 NOT_ALLOCATABLE mirror-2 - 0 200 0 NOT_ALLOCATABLE mirror-0 1 0 200 1 x 4M 112ae00 mirror-1 - 1 200 0 NOT_ALLOCATABLE mirror-2 - 1 200 0 NOT_ALLOCATABLE mirror-0 1 1 200 1 x 4M 112b000 mirror-1 - 2 200 0 NOT_ALLOCATABLE mirror-2 If the metaslab is using segment-based weighting then the WEIGHT column will display the number of segments available in the bucket where the allocation attempt was made. Author: George Wilson <george.wilson@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Chris Siden <christopher.siden@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Don Brady <don.brady@intel.com> Approved by: Richard Lowe <richlowe@richlowe.net>	2017-03-15 04:18:40 +00:00
Andriy Gapon	b060bbc16a	7867 ARC space accounting leak illumos/illumos-gate@6de76ce2a9 `6de76ce2a9` https://www.illumos.org/issues/7867 It seems that in the case where arc_hdr_free_pdata() sees HDR_L2_WRITING() we would fail to update the ARC space statistics. In the normal case those statistics are updated in arc_free_data_buf(). But in the arc_hdr_free_on_write() path we don't do that. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Andriy Gapon <avg@FreeBSD.org>	2017-03-08 13:40:24 +00:00
Andriy Gapon	f1a99e4268	7843 get_clones_stat() is suboptimal for lots of clones illumos/illumos-gate@c5bde7273e `c5bde7273e` https://www.illumos.org/issues/7843 get_clones_stat() could be very slow if a snapshot has many (thousands) clones. Clone names are added to an nvlist that's created with NV_UNIQUE_NAME. So, each time a new name is appended to the list, the whole list is searched linearly to see if that name is not already in the list. That results in the quadratic complexity. That should be easy to fix as we know in advance that we should not get any duplicate names, so we can drop NV_UNIQUE_NAME when creating the list. Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Andriy Gapon <avg@FreeBSD.org>	2017-03-08 13:39:22 +00:00
Josh Paetzel	dbac7ab9c0	7570 tunable to allow zvol SCSI unmap to return on commit of txn to ZIL illumos/illumos-gate@1c9272b861 `1c9272b861` https://www.illumos.org/issues/7570 Based on the discovery that every unmap waits for the commit of the txn to the ZIL, introducing a very high latency to unmap commands, this behavior was made into a tunable zvol_unmap_sync_enabled and set to false. The net impact of this change is that by default SCSI unmap commands will result in space being freed within the zvol (today they are ignored and returned with good status). However, unlike the code today, instead of 18+ms per unmap, they take about 30us. With the testing done on NTFS against a Win2k12 target, the new behavior should work seamlessly. Files on the zvol that have already been set with the zfree application will continue to write 0's when deleted, and any new files created since zvol creation will send unmap commands when deleted. This behavior exists today, but with this change the unmap commands will be processed and result in reclaim of space. Author: Stephen Blinick <stephen.blinick@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Approved by: Robert Mustacchi <rm@joyent.com>	2017-02-25 19:16:20 +00:00
Josh Paetzel	acf1cb602b	6676 Race between unique_insert() and unique_remove() causes ZFS fsid change illumos/illumos-gate@40510e8eba `40510e8eba` https://www.illumos.org/issues/6676 The fsid of zfs filesystems might change after reboot or remount. The problem seems to be caused by a race between unique_insert() and unique_remove(). The unique_remove() is called from dsl_dataset_evict() which is now an asynchronous thread. In a case the dsl_dataset_evict() thread is very slow and calls unique_remove() too late we will end up with changed fsid on zfs mount. This problem is very likely caused by #5056. Steps to Reproduce Note: I'm able to reproduce this always on a single core (virtual) machine. On multicore machines it is not so easy to reproduce. # uname -a SunOS openindiana 5.11 illumos-633aa80 i86pc i386 i86pc Solaris # zfs create rpool/TEST # FS=$(echo ::fsinfo \| mdb -k \| grep TEST \| awk '{print $1}') # echo $FS::print vfs_t vfs_fsid \| mdb -k vfs_fsid = { vfs_fsid.val = [ 0x54d7028a, 0x70311508 ] } # zfs umount rpool/TEST # zfs mount rpool/TEST # FS=$(echo ::fsinfo \| mdb -k \| grep TEST \| awk '{print $1}') # echo $FS::print vfs_t vfs_fsid \| mdb -k vfs_fsid = { vfs_fsid.val = [ 0xd9454e49, 0x6b36d08 ] } # Impact The persistent fsid (filesystem id) is essential for proper NFS functionality. If the fsid of a filesystem changes on remount (or after reboot) the NFS clients might not be able to automatically recover from such event and the manual remount of the NFS filesystems on every NFS client might be needed. Author: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com> Reviewed by: Dan Vatca <dan.vatca@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com>	2017-02-25 03:34:22 +00:00
Josh Paetzel	91c038e77a	7504 kmem_reap hangs spa_sync and administrative tasks illumos/illumos-gate@405a5a0f5c https://github.com/illumos/illumos-gate/commit/405a5a0f5c3ab36cb76559467d1a62ba648bd80 https://www.illumos.org/issues/7504 We see long spa_sync(). We are waiting to hold dp_config_rwlock for writer. Some other thread holds dp_config_rwlock for reader, then calls arc_get_data_buf(), which finds that arc_is_overflowing()==B_TRUE. So it waits (while holding dp_config_rwlock for reader) for arc_reclaim_thread to signal arc_reclaim_waiters_cv. Before signaling, arc_reclaim_thread does arc_kmem_reap_now(), which takes ~seconds. Author: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com>	2017-02-17 15:00:13 +00:00
Josh Paetzel	281825cdf1	7500 Simplify dbuf_free_range by removing dn_unlisted_l0_blkid illumos/illumos-gate@653af1b809 `653af1b809` https://www.illumos.org/issues/7500 With the integration of: commit 0f6d88aded0d165f5954688a9b13bac76c38da84 Author: Alex Reece <alex@delphix.com> Date: Sat Jul 26 13:40:04 2014 -0800 4873 zvol unmap calls can take a very long time for larger datasets the dnode's dn_bufs field was changed from a list to a tree. As a result, the dn_unlisted_l0_blkid field is no longer necessary. Author: Stephen Blinick <stephen.blinick@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Gordon Ross <gordon.w.ross@gmail.com>	2017-02-16 01:44:56 +00:00
Josh Paetzel	1e37d7e555	6569 large file delete can starve out write ops illumos/illumos-gate@ff5177ee8b `ff5177ee8b` https://www.illumos.org/issues/6569 The core issue I've found is that there is no throttle for how many deletes get assigned to one TXG. As a results when deleting large files we end up filling consecutive TXGs with deletes/frees, then write throttling other (more important) ops. There is an easy test case for this problem. Try deleting several large files (at least 1/2 TB) while you do write ops on the same pool. What we've seen is performance of these write ops (let's call it sideload I/O) would drop to zero. More specifically the problem is that dmu_free_long_range_impl() can/will fill up all of the dirty data in the pool "instantly", before many of the sideload ops can get in. So sideload performance will be impacted until all the files are freed. The solution we have tested at Nexenta (with positive results) creates a relatively simple throttle for how many "free" ops we let into one TXG. However this solution exposes other problems that should also be addressed. If we are to slow down freeing of data that means one has to wait even longer (assuming vnode ref count of 1) to get shell back after an rm or for NFS thread to finish the free-ing op. To avoid this the proposed solution is to call zfs_inactive() async for "large" files. Async freeing then begs for the reclaimed space to be accounted for in the zpool's "freeing" prop. The other issue with having a longer delete is the inability to export/unmount for a longer period of time. The proposed solution is to interrupt freeing of blocks when a fs is unmounted. Author: Alek Pinchuk <alek@nexenta.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com>	2017-01-19 20:44:29 +00:00
Andriy Gapon	aed3e94d2f	3821 Race in rollback, zil close, and zil flush illumos/illumos-gate@43297f973a `43297f973a` https://www.illumos.org/issues/3821 We recently had nodes with some of the latest zfs bits panic on us in a rollback-heavy environment. The following is from my preliminary analysis: Let's look at where we died: > $C ffffff01ea6b9a10 taskq_dispatch+0x3a(0, fffffffff7d20450, ffffff5551dea920, 1) ffffff01ea6b9a60 zil_clean+0xce(ffffff4b7106c080, 7e0f1) ffffff01ea6b9aa0 dsl_pool_sync_done+0x47(ffffff4313065680, 7e0f1) ffffff01ea6b9b70 spa_sync+0x55f(ffffff4310c1d040, 7e0f1) ffffff01ea6b9c20 txg_sync_thread+0x20f(ffffff4313065680) ffffff01ea6b9c30 thread_start+8() If we dig in we can find that this dataset corresponds to a zone: > ffffff4b7106c080::print zilog_t zl_os->os_dsl_dataset->ds_dir->dd_myname zl_os->os_dsl_dataset->ds_dir->dd_myname = [ "8ffce16a-13c2-4efa-a233- 9e378e89877b" ] Okay so we have a null taskq pointer. That only happens during the calls to zil_open and zil_close. If we poke around we can see that we're actually in midst of a rollback: > ::pgrep zfs \| ::printf "0x%x %s\\n" proc_t . p_user.u_psargs 0xffffff43262800a0 zfs rollback zones/15714eb6-f5ea-469f-ac6d- 4b8ab06213c2@marlin_init 0xffffff54e22a1028 zfs rollback zones/8ffce16a-13c2-4efa-a233- 9e378e89877b@marlin_init 0xffffff4362f3a058 zfs rollback zones/0ddb8e49-ca7e-42e1-8fdc- 4ac4ba8fe9f8@marlin_init 0xffffff5748e8d020 zfs rollback zones/426357b5-832d-4430-953e- 10cd45ff8e9f@marlin_init 0xffffff436b867008 zfs rollback zones/8f36bf37-8a9c-4a44-995c- 6d1b2751e6f5@marlin_init 0xffffff4381ad4090 zfs rollback zones/6c8eca18-fbd6-46dd-ac24- 2ed45cd0da70@marlin_init Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Approved by: Richard Lowe <richlowe@richlowe.net> Author: George Wilson <george.wilson@delphix.com>	2016-11-28 15:09:58 +00:00
Andriy Gapon	aee4554721	7181 race between zfs_mount and zfs_ioc_rollback illumos/illumos-gate@90f2c094b3 `90f2c094b3` https://www.illumos.org/issues/7181 zfsvfs_setup() is called in both zfs_mount and zfs_resume_fs paths. dmu_objset_set_user(zfsvfs->z_os, zfsvfs) is called early in zfsvfs_setup() before the setup is actually completed, thus an under-constructed zfsvfs becomes visible. Additionally, there is nothing to serialize the two call paths. As a result two threads can step on each other's toes. assertion failed: zilog->zl_clean_taskq == NULL, file: ../../common/fs/zfs/zil.c, line: 1772 > $c vpanic() 0xfffffffffbdf6928() zil_open+0x45(ffffff1bbc5dd000, fffffffff7993880) zfsvfs_setup+0x84(ffffffb378d77000, 0) zfs_resume_fs+0x132(ffffffb378d77000, ffffffb37ddcf000) zfs_ioc_rollback+0x96(ffffffb37ddcf000, ffffff01dcdc4cd0, ffffff01aa091000) zfsdev_ioctl+0x215(10a00000000, 5a19, 80465f8, 100003, ffffff01ab318368, ffffff0004b59e58) cdev_ioctl+0x39(10a00000000, 5a19, 80465f8, 100003, ffffff01ab318368, ffffff0004b59e58) spec_ioctl+0x60(ffffff0197737700, 5a19, 80465f8, 100003, ffffff01ab318368, ffffff0004b59e58) fop_ioctl+0x55(ffffff0197737700, 5a19, 80465f8, 100003, ffffff01ab318368, ffffff0004b59e58) ioctl+0x9b(7, 5a19, 80465f8) sys_syscall32+0x1f7() > ffffff1bbc5dd000::print objset_t os_zil os_zil = 0xffffff1c053cf7c0 > 0xffffff1c053cf7c0::print zilog_t zl_clean_taskq Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Gordon Ross <gordon.w.ross@gmail.com> Author: Andriy Gapon <andriy.gapon@clusterhq.com>	2016-11-22 11:51:55 +00:00
Andriy Gapon	7f74889d89	7199, 7200 dsl_dataset_rollback_sync may try to free already free blocks 7199 dsl_dataset_rollback_sync may try to free already free blocks 7200 no blocks must be born in a txg after a snaphot is created illumos/illumos-gate@bfaed0b91e `bfaed0b91e` https://www.illumos.org/issues/7199 dsl_dataset_rollback_sync may try to free already freed blocks when it calls dsl_destroy_head_sync_impl to destroy a temporary clone. That happens if a snapshot to which we are rolling back and from which the clone is created has some ZIL records. https://www.illumos.org/issues/7200 No new blocks must be born in a dataset in the same TXG after a snapshot of the dataset is taken. Those blocks would have the same blk_birth as the dataset's ds_prev_snap_txg and as such they would be presumed to belong o the snapshot while in fact they do not. All the datasets must be clean before sync tasks are run, so the described scenario may happen only if one of the sync tasks dirties the dataset and another sync task takes its snapshot. Then, there will be another sync pass because of the dirty data and the new blocks will be born in the same TXG when the data is written out. It seems that almost all of the existing sync tasks modify only MOS and do not dirty any objsets. The only exception that I've been able to identify so far is the rollback which can modify an objset when it zeroes out the objset's ZIL. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Approved by: Gordon Ross <gordon.w.ross@gmail.com> Author: Andriy Gapon <andriy.gapon@clusterhq.com>	2016-11-22 11:49:55 +00:00
Andriy Gapon	d414bfe23b	7180 potential race between zfs_suspend_fs+zfs_resume_fs and zfs_ioc_rename illumos/illumos-gate@690041b9ca `690041b9ca` https://www.illumos.org/issues/7180 If a filesystem is not unmounted while the rename is being performed, then, for example, a concurrect zfs rollback may call zfs_suspend_fs followed by zfs_resume_fs on the same filesystem. The latter takes the filesystem's name as an argument. If the filesystem name changes as a result of the rename, then dmu_objset_hold(osname, zfsvfs, &os) call in zfs_resume_fs would fail resulting in a kernel panic. So far I have been able to reproduce this problem on FreeBSD where zfs rename has -u option that skips the unmounting before doing the renaming. But I think that in theory the same problem can occur on illumos as well, because the unmounting is done in userland before invoking the rename ioctl and there could be a race with, e.g., zfs mount. panic: solaris assert: dmu_objset_hold(osname, zfsvfs, &zfsvfs->z_os) == 0 (0x2 == 0x0), file: /usr/devel/svn/head/sys/cddl/contrib/opensolaris/uts/common/fs/ zfs/zfs_vfsops.c, line: 2210 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe004df30710 vpanic() at vpanic+0x182/frame 0xfffffe004df30790 panic() at panic+0x43/frame 0xfffffe004df307f0 assfail3() at assfail3+0x2c/frame 0xfffffe004df30810 zfs_resume_fs() at zfs_resume_fs+0xb9/frame 0xfffffe004df30860 zfs_ioc_rollback() at zfs_ioc_rollback+0x61/frame 0xfffffe004df308a0 zfsdev_ioctl() at zfsdev_ioctl+0x65c/frame 0xfffffe004df30940 devfs_ioctl_f() at devfs_ioctl_f+0x156/frame 0xfffffe004df309a0 kern_ioctl() at kern_ioctl+0x246/frame 0xfffffe004df30a00 sys_ioctl() at sys_ioctl+0x171/frame 0xfffffe004df30ae0 amd64_syscall() at amd64_syscall+0x2db/frame 0xfffffe004df30bf0 Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe004df30bf0 Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Andriy Gapon <andriy.gapon@clusterhq.com>	2016-11-22 11:47:27 +00:00
Andriy Gapon	0a682531f4	3746 ZRLs are racy illumos/illumos-gate@260af64db7 `260af64db7` https://www.illumos.org/issues/3746 From the original change log: It was possible for a reference to be added even with the lock held, and for references added just after a lock release to be lost. This bug was also independently found and reported in wesunsolve.net issues 6985013 6995524. In zrl_add(), always use an atomic operation to update the refcount. The mutex in the ZRL only guarantees that wakeups occur for waiters on the lock. It offers no protection against concurrent updates of the refcount. The only refcount transition that is safe to perform without an atomic operation is from ZRL_LOCKED back to 0, since this can only be performed by the thread which has the ZRL locked. Authored by: Will Andrews <will@freebsd.org> Reviewed by: Boris Protopopov <bprotopopov@hotmail.com> Reviewed by: Pavel Zakharov <pavel.zakha@gmail.com> Reviewed by: Yuri Pankov <yuri.pankov@gmail.com> Reviewed by: Justin T. Gibbs <gibbs@scsiguy.com> Approved by: Matt Ahrens <mahrens@delphix.com> Author: Youzhong Yang <yyang@mathworks.com>	2016-10-27 07:11:31 +00:00
Alexander Motin	760c70cd25	7301 zpool export -f should be able to interrupt file freeing Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Author: Alek Pinchuk <alek@nexenta.com> Closes #175	2016-10-14 11:49:36 +00:00
Alexander Motin	d503f63e1b	6988 spa_sync() spends half its time in dmu_objset_do_userquota_updates Using a benchmark which creates 2 million files in one TXG, I observe that the thread running spa_sync() is on CPU almost the entire time we are syncing, and therefore can be a performance bottleneck. About 50% of the time in spa_sync() is in dmu_objset_do_userquota_updates(). The problem is that dmu_objset_do_userquota_updates() calls zap_increment_int(DMU_USERUSED_OBJECT) once for every file that was modified (or created). In this benchmark, all the files are owned by the same user/group, so all 2 million calls to zap_increment_int() are modifying the same entry in the zap. The same issue exists for the DMU_GROUPUSED_OBJECT. We should keep an in-memory map from user to space delta while we are syncing, and when we finish, iterate over the in-memory map and modify the ZAP once per entry. This reduces the number of calls to zap_increment_int() from "number of objects modified" to "number of owners/groups of modified files". This reduced the time spent in spa_sync() in the file create benchmark by ~33%, from 11 seconds to 7 seconds. Closes #107 Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: Ned Bass <bass6@llnl.gov> Reviewed by: Jinshan Xiong <jinshan.xiong@intel.com> Author: Matthew Ahrens <mahrens@delphix.com> openzfs/openzfs@5fc46359c5	2016-10-14 11:46:17 +00:00
Alexander Motin	23f49f4ada	5120 zfs should allow large block/gzip/raidz boot pool (loader project) Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Toomas Soome <tsoome@me.com> openzfs/openzfs@c8811bd3e2	2016-10-14 11:41:06 +00:00
Alexander Motin	7fd98f779f	7402 Create tunable to ignore hole_birth feature Until we can resolve the numerous hole_birth bugs that have cropped up recently, and come up with a way going forwards to protect users from corruption, we should disable the hole_birth feature. Using a tunable allows those who are confident that their data is correct to continue to take advantage of the feature. Closes #188 Reviewed by: Matthew Ahrens <mahrens@delphix.com> Author: Paul Dagnelie <pcd@delphix.com>	2016-09-28 23:46:08 +00:00
Alexander Motin	1c13b49766	7254 ztest failed assertion in ztest_dataset_dirobj_verify: dirobjs + 1 == usedobjs dsl_dataset_space is looking at the ds_bp's fill count while dmu_objset_write_ready() is concurrently modifying it. This fix adds an rrwlock to protect the ds_bp. Closes #180 Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Author: Paul Dagnelie <pcd@delphix.com>	2016-09-28 23:44:13 +00:00
Alexander Motin	ccb5d7b8e8	7259 DS_FIELD_LARGE_BLOCKS is unused The DS_FIELD_LARGE_BLOCKS macro has been unused since the integration of this patch: commit ca0cc3918a1789fa839194af2a9245f801a06b1a Author: Matthew Ahrens <mahrens@delphix.com> Date: Fri Jul 24 09:53:55 2015 -0700 5959 clean up per-dataset feature count code Reviewed by: Toomas Soome <tsoome@me.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> This patch simply removes this macro from dsl_dataset.h. Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Author: Matthew Ahrens <mahrens@delphix.com>	2016-09-07 20:07:39 +00:00
Alexander Motin	03d951aef3	7278 tuning zfs_arc_max does not impact arc_c_min When changing zfs_arc_max (e.g. as zdb does), it may be set to less than the default arc_c_min. arc_c_min should decrease to not be more than arc_c_max, but it doesn't; therefore tuning of arc_c_max is ineffective. Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Author: Matthew Ahrens <mahrens@delphix.com> openzfs/openzfs@608764bead	2016-09-07 20:00:22 +00:00
Alexander Motin	72a9a6ded9	7004 dmu_tx_hold_zap() does dnode_hold() 7x on same object Using a benchmark which has 32 threads creating 2 million files in the same directory, on a machine with 16 CPU cores, I observed poor performance. I noticed that dmu_tx_hold_zap() was using about 30% of all CPU, and doing dnode_hold() 7 times on the same object (the ZAP object that is being held). dmu_tx_hold_zap() keeps a hold on the dnode_t the entire time it is running, in dmu_tx_hold_t:txh_dnode, so it would be nice to use the dnode_t that we already have in hand, rather than repeatedly calling dnode_hold(). To do this, we need to pass the dnode_t down through all the intermediate calls that dmu_tx_hold_zap() makes, making these routines take the dnode_t* rather than an objset_t* and a uint64_t object number. In particular, the following routines will need to have analogous *_by_dnode() variants created: dmu_buf_hold_noread() dmu_buf_hold() zap_lookup() zap_lookup_norm() zap_count_write() zap_lockdir() zap_count_write() This can improve performance on the benchmark described above by 100%, from 30,000 file creations per second to 60,000. (This improvement is on top of that provided by working around the object allocation issue. Peak performance of ~90,000 creations per second was observed with 8 CPUs; adding CPUs past that decreased performance due to lock contention.) The CPU used by dmu_tx_hold_zap() was reduced by 88%, from 340 CPU-seconds to 40 CPU-seconds. Sponsored by: Intel Corp. Closes #109 Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Ned Bass <bass6@llnl.gov> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Author: Matthew Ahrens <mahrens@delphix.com> openzfs/openzfs@d3e523d489	2016-09-03 10:54:56 +00:00
Alexander Motin	6ee1596f02	7247 zfs receive of deduplicated stream fails This resolves two 'zfs recv' issues. First, when receiving into an existing filesystem, a snapshot created during the receive process is not added to the guid->dataset map for the stream, resulting in failed lookups for deduped streams when a WRITE_BYREF record refers to a snapshot received earlier in the stream. Second, the newly created snapshot was also not set properly, referencing the snapshot before the new receiving dataset rather than the existing filesystem. Closes #159 Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Author: Chris Williamson <chris.williamson@delphix.com> openzfs/openzfs@b09697c8c1	2016-09-03 10:50:43 +00:00
Alexander Motin	7b84e6dc6f	7003 zap_lockdir() should tag hold zap_lockdir() / zap_unlockdir() should take a "void *tag" argument which tags the hold on the zap. This will help diagnose programming errors which misuse the hold on the ZAP. Sponsored by: Intel Corp. Closes #108 Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Author: Matthew Ahrens <mahrens@delphix.com> openzfs/openzfs@0780b3eab5	2016-09-03 10:48:48 +00:00
Andriy Gapon	4b498e5e9d	7136 ESC_VDEV_REMOVE_AUX ought to always include vdev information 7115 6922 generates ESC_ZFS_VDEV_REMOVE_AUX a bit too often illumos/illumos-gate@b72b6bb10a `b72b6bb10a` https://www.illumos.org/issues/7136 6922 added ESC_ZFS_VDEV_REMOVE_AUX and ESC_ZFS_VDEV_REMOVE_DEV sysevents whenever an aux device gets removed from a pool. However, those sysevents will be created without the vdev_guid and vdev_path fields. It would be better to always populate those fields. https://www.illumos.org/issues/7115 The addition of spa_event_notify in vdev removal code (see #6922) causes events to be generated even if the spare failed to be removed with EBUSY. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Approved by: Robert Mustacchi <rm@joyent.com> Author: Alan Somers <asomers@gmail.com>	2016-08-15 14:23:50 +00:00
Andriy Gapon	cb0a9ee48f	7230 add assertions to dmu_send_impl() to verify that stream includes BEGIN and END records illumos/illumos-gate@12b90ee2d3 `12b90ee2d3` https://www.illumos.org/issues/7230 A test failure occurred where a send stream had only a BEGIN record. This should not be possible if the send returns without error. Prevented this from happening in the future by adding an assertion to dmu_send_impl() to verify that if the function returns 0 (success) both a BEGIN and END record are present. Did this by adding flags to dmu_sendarg_t (indicating whether BEGIN or END records sent), having dump_record() set flags appropriately, adding VERIFY statement to dmu_send_impl(). Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matt Krantz <matt.krantz@delphix.com>	2016-08-15 14:22:12 +00:00
Andriy Gapon	2a7aac5a99	7235 remove unused func dsl_dataset_set_blkptr illumos/illumos-gate@bd56f80007 `bd56f80007` https://www.illumos.org/issues/7235 The function dsl_dataset_set_blkptr() is unused. We should remove it. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com>	2016-08-15 14:21:46 +00:00
Andriy Gapon	2e97b962e7	7090 zfs should improve allocation order and throttle allocations illumos/illumos-gate@0f7643c737 `0f7643c737` https://www.illumos.org/issues/7090 When write I/Os are issued, they are issued in block order but the ZIO pipeline will drive them asynchronously through the allocation stage which can result in blocks being allocated out-of-order. It would be nice to preserve as much of the logical order as possible. In addition, the allocations are equally scattered across all top-level VDEVs but not all top-level VDEVs are created equally. The pipeline should be able to detect devices that are more capable of handling allocations and should allocate more blocks to those devices. This allows for dynamic allocation distribution when devices are imbalanced as fuller devices will tend to be slower than empty devices. The change includes a new pool-wide allocation queue which would throttle and order allocations in the ZIO pipeline. The queue would be ordered by issued time and offset and would provide an initial amount of allocation of work to each top-level vdev. The allocation logic utilizes a reservation system to reserve allocations that will be performed by the allocator. Once an allocation is successfully completed it's scheduled on a given top-level vdev. Each top- level vdev maintains a maximum number of allocations that it can handle (mg_alloc_queue_depth). The pool-wide reserved allocations (top-levels * mg_alloc_queue_depth) are distributed across the top-level vdevs metaslab groups and round robin across all eligible metaslab groups to distribute the work. As top-levels complete their work, they receive additional work from the pool-wide allocation queue until the allocation queue is emptied. Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: George Wilson <george.wilson@delphix.com>	2016-08-15 14:21:10 +00:00
Andriy Gapon	bcb0ab1eb8	7086 ztest attempts dva_get_dsize_sync on an embedded blockpointer illumos/illumos-gate@926549256b `926549256b` https://www.illumos.org/issues/7086 In dbuf_dirty(), we need to grab the dn_struct_rwlock before looking at the db_blkptr, to prevent it from being changed by syncing context. Otherwise we may see that ztest got a segfault from this stack: libzpool.so.1`dva_get_dsize_sync+0x98(872f000, b32b240, fed7811b, 0, b4cda20, 0) libzpool.so.1`bp_get_dsize+0x60(872f000, b32b240, 0, 97cb780, 9d4c1a8, 0) libzpool.so.1`dbuf_dirty+0x9b3(ce0a100, 97cb780, 9, fecd2530) libzpool.so.1`dmu_buf_will_dirty+0xc3(ce0a100, 97cb780, ea293d6c, 1) libzpool.so.1`zap_lockdir+0x1a0(8aaa3c0, 1, 0, 97cb780, 1, 1) libzpool.so.1`zap_remove_norm+0x30(8aaa3c0, 1, 0, 8728b10, 0, 97cb780) libzpool.so.1`zap_remove+0x29(8aaa3c0, 1, 0, 8728b10, 97cb780, a) ztest_replay_remove+0x225(ea294588, 8728ae8, 0, 38010000, 0, 0) ztest_remove+0x9f(ea294588, ea293f50, 4, 3) ztest_object_init+0x78(ea294588, ea293f50, 4e0, 1) ztest_dmu_object_alloc_free+0x71(ea294588, 13) ztest_dmu_objset_create_destroy+0x224(80cef08, 13, 0, 805d36c, 9017ad44, 0) ztest_execute+0x89(a, 807c720, 13, 0) ztest_thread+0xea(13, 0, 0, 0) libc.so.1`_thrp_setup+0x88(f0983240) libc.so.1`_lwp_start(f0983240, 0, 0, 0, 0, 0) Looking into it a bit, we see that this is an embedded blockpointer, so BP_GET_NDVAS should have returned 0: b32b240::blkptr EMBEDDED [L0 ZAP_OTHER] et=0 LZ4 size=200L/4aP birth=80L Instead, it looks like another thread is modifying this blockpointer: b32b240::ugrep \| ::whatis f47a0e0c is in [ stack tid=0x19f ] ebd6ec40 is in [ stack tid=0x226 ] ea293bd0 is in [ stack tid=0x244 ] ea293be4 is in [ stack tid=0x244 ] Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com>	2016-07-20 09:49:09 +00:00
Andriy Gapon	76fbf360a4	7072 zfs fails to expand if lun added when os is in shutdown state illumos/illumos-gate@c39a2aae1e `c39a2aae1e` https://www.illumos.org/issues/7072 upstream: 38733 zfs fails to expand if lun added when os is in shutdown state DLPX-36910 spares and caches should not display expandable space DLPX-39262 vdev_disk_open spam zfs_dbgmsg buffer Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: George Wilson <george.wilson@delphix.com>	2016-07-20 09:47:35 +00:00

1 2 3 4 5 ...

382 Commits