freebsd-dev

Author	SHA1	Message	Date
Andriy Gapon	e1b8f10a5e	MFV r316920: 8023 Panic destroying a metaslab deferred range tree illumos/illumos-gate@3991b535a8 `3991b535a8` https://www.illumos.org/issues/8023 $C ffffff0011bc0970 vpanic() ffffff0011bc0a00 strlog() ffffff0011bc0a30 range_tree_destroy+0x72(ffffff043769ad00) ffffff0011bc0a70 metaslab_fini+0xd5(ffffff0449acf380) ffffff0011bc0ab0 vdev_metaslab_fini+0x56(ffffff0462bae800) ffffff0011bc0af0 spa_unload+0x9b(ffffff03e3dac000) ffffff0011bc0b70 spa_export_common+0x115(ffffff047f4b4000, 2, 0, 0, 0) ffffff0011bc0b90 spa_destroy+0x1d(ffffff047f4b4000) ffffff0011bc0bd0 zfs_ioc_pool_destroy+0x20(ffffff047f4b4000) ffffff0011bc0c80 zfsdev_ioctl+0x4d7(11400000000, 5a01, 8040190, 100003, ffffff03e1956b10, ffffff0011bc0e68) ffffff0011bc0cc0 cdev_ioctl+0x39(11400000000, 5a01, 8040190, 100003, ffffff03e1956b10, ffffff0011bc0e68) ffffff0011bc0d10 spec_ioctl+0x60(ffffff03d9153b00, 5a01, 8040190, 100003, ffffff03e1956b10, ffffff0011bc0e68, 0) ffffff0011bc0da0 fop_ioctl+0x55(ffffff03d9153b00, 5a01, 8040190, 100003, ffffff03e1956b10, ffffff0011bc0e68, 0) ffffff0011bc0ec0 ioctl+0x9b(3, 5a01, 8040190) ffffff0011bc0f10 _sys_sysenter_post_swapgs+0x149() Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: George Wilson <george.wilson@delphix.com> MFC after: 2 weeks	2017-05-24 22:25:26 +00:00
Andriy Gapon	5386d7295a	MFV r316917: 7968 multi-threaded spa_sync() illumos/illumos-gate@94c2d0eb22 `94c2d0eb22` https://www.illumos.org/issues/7968 spa_sync() iterates over all the dirty dnodes and processes each of them by calling dnode_sync(). If there are many dirty dnodes (e.g. because we created or removed a lot of files), the single thread of spa_sync() calling dnode_sync() can become a bottleneck. Additionally, if many dnodes are dirtied concurrently in open context (e.g. due to concurrent file creation), the os_lock will experience lock contention via dnode_setdirty(). The solution is to track dirty dnodes on a multilist_t, and for spa_sync() to use separate threads to process each of the sublists in the multilist. On the concurrent file creation microbenchmark, the performance improvement from dnode_setdirty() is up to 7%. Additionally, the wall clock time spent in spa_sync() is reduced to 15%-40% of the single-threaded case. In terms of cost/ reward, once the other bottlenecks are addressed, fixing this bug will provide a medium-large performance gain and require a medium amount of effort to implement. Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Dan McDonald <danmcd@omniti.com> Author: Matthew Ahrens <mahrens@delphix.com> MFC after: 3 weeks	2017-05-24 22:21:24 +00:00
Andriy Gapon	2ba631553c	MFV r316916: 7970 zfs_arc_num_sublists_per_state should be common to all multilists illumos/illumos-gate@10fbdecb05 `10fbdecb05` https://www.illumos.org/issues/7970 The global tunable zfs_arc_num_sublists_per_state is used by the ARC and the dbuf cache, and other users are planned. We should change this tunable to be common to all multilists. Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Dan McDonald <danmcd@omniti.com> Author: Matthew Ahrens <mahrens@delphix.com> MFC after: 3 weeks	2017-05-24 22:15:16 +00:00
Andriy Gapon	1d7634429c	MFC r316915: 7801 add more by-dnode routines (lint) illumos/illumos-gate@411be58a6e `411be58a6e` MFC after: 24 days X-MFC with: r318823	2017-05-24 21:52:20 +00:00
Andriy Gapon	31fd119cc2	MFC r316914: 7801 add more by-dnode routines illumos/illumos-gate@b0c42cd470 `b0c42cd470` https://www.illumos.org/issues/7801 Add _by_dnode() routines for accessing objects given their dnode_t , this is more efficient than accessing the object by (objset_t *, uint64_t object). This change converts some but not all of the existing consumers. As performance-sensitive code paths are discovered they should be converted to use these routines. Ported from: `0eef1bde31` Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: bzzz77 <bzzz.tomas@gmail.com> MFC after: 24 days	2017-05-24 21:49:21 +00:00
Andriy Gapon	5aab788866	MFC r316913: 7869 panic in bpobj_space(): null pointer dereference illumos/illumos-gate@a3905a4592 `a3905a4592` https://www.illumos.org/issues/7869 The issue fixed by this patch is a race condition in the deadlist code. A thread executing an administrative command that uses `dsl_deadlist_space_range()` holds the lock of the whole `deadlist_t` to protect the access of all its entries that the deadlist contains in an avl tree. Sync threads trying to insert a new entry in the deadlist (through `dsl_deadlist_insert()` -> `dle_enqueue()`) do not hold the deadlist lock at that moment. If the `dle_bpobj` is the empty bpobj (our sentinel value), we close and reopen it. Between these two operations, it is possible for the `dsl_deadlist_space_range()` thread to dereference that bpobj which is `NULL` during that window. Threads should hold the a deadlist's `dl_lock` when they manipulate its internal data so scenarios like the one above are avoided. In addition, threads should also hold the bpobj lock whenever they are allocating the subobj list of a bpobj, and not just when they actually insert the subobj to the list. This way we can avoid potential memory leaks. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: George Melikov <mail@gmelikov.ru> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Dan McDonald <danmcd@omniti.com> Author: Serapheim Dimitropoulos <serapheim@delphix.com> MFC after: 2 weeks	2017-05-24 21:45:52 +00:00
Andriy Gapon	930f1af491	MFC r316912: 7793 ztest fails assertion in dmu_tx_willuse_space illumos/illumos-gate@61e255ce72 `61e255ce72` https://www.illumos.org/issues/7793 Background information: This assertion about tx_space_* verifies that we are not dirtying more stuff than we thought we would. We “need” to know how much we will dirty so that we can check if we should fail this transaction with ENOSPC/EDQUOT, in dmu_tx_assign(). While the transaction is open (i.e. between dmu_tx_assign() and dmu_tx_commit() — typically less than a millisecond), we call dbuf_dirty() on the exact blocks that will be modified. Once this happens, the temporary accounting in tx_space_* is unnecessary, because we know exactly what blocks are newly dirtied; we call dnode_willuse_space() to track this more exact accounting. The fundamental problem causing this bug is that dmu_tx_hold_() relies on the current state in the DMU (e.g. dn_nlevels) to predict how much will be dirtied by this transaction, but this state can change before we actually perform the transaction (i.e. call dbuf_dirty()). This bug will be fixed by removing the assertion that the tx_space_ accounting is perfectly accurate (i.e. we never dirty more than was predicted by dmu_tx_hold_()). By removing the requirement that this accounting be perfectly accurate, we can also vastly simplify it, e.g. removing most of the logic in dmu_tx_count_(). The new tx space accounting will be very approximate, and may be more or less than what is actually dirtied. It will still be used to determine if this transaction will put us over quota. Transactions that are marked by dmu_tx_mark_netfree() will be excepted from this check. We won’t make an attempt to determine how much space will be freed by the transaction — this was rarely accurate enough to determine if a transaction should be permitted when we are over quota, which is why dmu_tx_mark_netfree() was introduced in 2014. We also won’t attempt to give “credit” when overwriting existing blocks, if those blocks may be freed. This allows us to remove the do_free_accounting logic in dbuf_dirty(), and associated routines. This Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com> MFC after: 3 weeks	2017-05-24 21:43:34 +00:00
Andriy Gapon	3a9c923927	MFC r316907: 1300 filename normalization doesn't work for removes illumos/illumos-gate@1c17160ac5 `1c17160ac5` https://www.illumos.org/issues/1300 FreeBSD note: recent FreeBSD was not affected by the issue fixed as the name cache is completely bypassed when normalization is enabled. The change is imported for the sake of ZAP infrastructure modifications. Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Kevin Crowe <kevin.crowe@nexenta.com> MFC after: 3 weeks	2017-05-24 21:29:31 +00:00
Alan Somers	7ac72c256f	vdev_geom may associate multiple vdevs per g_consumer vdev_geom.c currently uses the g_consumer's private field to point to a vdev_t. That way, a GEOM event can cause a change to a ZFS vdev. For example, when you remove a disk, the vdev's status will change to REMOVED. However, vdev_geom will sometimes attach multiple vdevs to the same GEOM consumer. If this happens, then geom events will only be propagated to one of the vdevs. Fix this by storing a linked list of vdevs in g_consumer's private field. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c * g_consumer.private now stores a linked list of vdev pointers associated with the consumer instead of just a single vdev pointer. * Change vdev_geom_set_physpath's signature to more closely match vdev_geom_set_rotation_rate * Don't bother calling g_access in vdev_geom_set_physpath. It's guaranteed that we've already accessed the consumer by the time we get here. * Don't call vdev_geom_set_physpath in vdev_geom_attach. Instead, call it in vdev_geom_open, after we know that the open has succeeded. PR: 218634 Reviewed by: gibbs MFC after: 1 week Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D10391	2017-05-11 16:26:56 +00:00
Josh Paetzel	ba13ab83f2	Fix misport of compressed ZFS send/recv from 317414 Reported by: Michael Jung <mikej@mikej.com> Reviewed by: avg	2017-05-01 12:56:12 +00:00
Mark Johnston	babf030fd6	Get rid of some ifdef soup in the fasttrap ioctl handler. No functional change intended. MFC after: 1 week	2017-04-28 22:25:22 +00:00
Josh Paetzel	285d85ab04	MFV 316905 7740 fix for 6513 only works in hole punching case, not truncation illumos/illumos-gate@7de35a3ed0 `7de35a3ed0` https://www.illumos.org/issues/7740 The problem is that dbuf_findbp will return ENOENT if the block it's trying to find is beyond the end of the file. If that happens, we assume there is no birth time, and so we lose that information when we write out new blkptrs. We should teach dbuf_findbp to look for things that are beyond the current end, but not beyond the absolute end of the file. To verify, create a large file, truncate it to a short length, and then write beyond the end. Check with zdb to make sure that there are no holes with birth time zero (will appear as gaps). Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Paul Dagnelie <pcd@delphix.com>	2017-04-28 02:11:29 +00:00
Josh Paetzel	358f157522	MFV 316900 7743 per-vdev-zaps have no initialize path on upgrade illumos/illumos-gate@555da5111b `555da5111b` https://www.illumos.org/issues/7743 When loading a pool that had been created before the existance of per-vdev zaps, on a system that knows about per-vdev zaps, the per-vdev zaps will not be allocated and initialized. This appears to be because the logic that would have done so, in spa_sync_config_object(), is not reached under normal operation. It is only reached if spa_config_dirty_list is non-empty. The fix is to add another `AVZ_ACTION_` enum that will allow this code to be reached when we detect that we're loading an old pool, even when there are no dirty configs. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Don Brady <don.brady@intel.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Paul Dagnelie <pcd@delphix.com>	2017-04-27 23:31:38 +00:00
Josh Paetzel	8ad5797208	MFV 316898 7613 ms_freetree[4] is only used in syncing context illumos/illumos-gate@5f14577801 `5f14577801` https://www.illumos.org/issues/7613 metaslab_t:ms_freetree[TXG_SIZE] is only used in syncing context. We should replace it with two trees: the freeing tree (ranges that we are freeing this syncing txg) and the freed tree (ranges which have been freed this txg). Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Matthew Ahrens <mahrens@delphix.com>	2017-04-27 22:00:03 +00:00
Josh Paetzel	e3cb0e99f8	MFV 316897 7586 remove #ifdef __lint hack from dmu.h illumos/illumos-gate@4ba5b96163 `4ba5b96163` https://www.illumos.org/issues/7586 The #ifdef __lint in dmu.h is ugly, and it would be nice not to duplicate it if we add other inline functions into header files in ZFS, especially since it is difficult to make any other solution work across all compilation targets. We should switch to disabling the lint flags that are failing instead. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Dan Kimmel <dan.kimmel@delphix.com>	2017-04-27 21:11:57 +00:00
Josh Paetzel	011275233c	MFV 316896 7580 ztest failure in dbuf_read_impl illumos/illumos-gate@1a01181fdc `1a01181fdc` https://www.illumos.org/issues/7580 We need to prevent any reader whenever we're about the zero out all the blkptrs. To do this we need to grab the dn_struct_rwlock as writer in dbuf_write_children_ready and free_children just prior to calling bzero. Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: George Wilson <george.wilson@delphix.com>	2017-04-27 16:38:28 +00:00
Josh Paetzel	fa88c78914	MFV 316895 7606 dmu_objset_find_dp() takes a long time while importing pool illumos/illumos-gate@7588687e6b `7588687e6b` https://www.illumos.org/issues/7606 When importing a pool with a large number of filesystems within the same parent filesystem, we see that dmu_objset_find_dp() takes a long time. It is called from 3 places: spa_check_logs(), spa_ld_claim_log_blocks(), and spa_load_verify(). There are several ways to improve performance here: 1. We don't really need to do spa_check_logs() or spa_ld_claim_log_blocks() if the pool was closed cleanly. 2. spa_load_verify() uses dmu_objset_find_dp() to check that no datasets have too long of names. 3. dmu_objset_find_dp() is slow because it's doing zap_value_search() (which is O(N sibling datasets)) to determine the name of each dsl_dir when it's opened. In this case we actually know the name when we are opening it, so we can provide it and avoid the lookup. This change implements fix #3 from the above list; i.e. make dmu_objset_find_dp() provide the name of the dataset so that we don't have to search for it. Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prashanth Sreenivasa <prashksp@gmail.com> Approved by: Gordon Ross <gordon.w.ross@gmail.com> Author: Matthew Ahrens <mahrens@delphix.com>	2017-04-27 15:10:45 +00:00
Josh Paetzel	c78abb8b50	MFV 316894 7252 7628 compressed zfs send / receive illumos/illumos-gate@5602294fda `5602294fda` https://www.illumos.org/issues/7252 This feature includes code to allow a system with compressed ARC enabled to send data in its compressed form straight out of the ARC, and receive data in its compressed form directly into the ARC. https://www.illumos.org/issues/7628 We should have longer, more readable versions of the ZFS send / recv options. 7628 create long versions of ZFS send / receive options Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed by: David Quigley <dpquigl@davequigley.com> Reviewed by: Thomas Caputi <tcaputi@datto.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Dan Kimmel <dan.kimmel@delphix.com>	2017-04-25 17:57:43 +00:00
Josh Paetzel	ef18459108	MFV 316891 7386 zfs get does not work properly with bookmarks illumos/illumos-gate@edb901aab9 `edb901aab9` https://www.illumos.org/issues/7386 The zfs get command does not work with the bookmark parameter while it works properly with both filesystem and snapshot: # zfs get -t all -r creation rpool/test NAME PROPERTY VALUE SOURCE rpool/test creation Fri Sep 16 15:00 2016 - rpool/test@snap creation Fri Sep 16 15:00 2016 - rpool/test#bkmark creation Fri Sep 16 15:00 2016 - # zfs get -t all -r creation rpool/test@snap NAME PROPERTY VALUE SOURCE rpool/test@snap creation Fri Sep 16 15:00 2016 - # zfs get -t all -r creation rpool/test#bkmark cannot open 'rpool/test#bkmark': invalid dataset name # The zfs get command should be modified to work properly with bookmarks too. Reviewed by: Simon Klinkert <simon.klinkert@gmail.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Approved by: Matthew Ahrens <mahrens@delphix.com> Author: Marcel Telka <marcel@telka.sk>	2017-04-21 19:53:52 +00:00
Josh Paetzel	36064ac2d5	MFV 316871 7490 real checksum errors are silenced when zinject is on illumos/illumos-gate@6cedfc397d `6cedfc397d` https://www.illumos.org/issues/7490 When zinject is on, error codes from zfs_checksum_error() can be overwritten due to an incorrect and overly-complex if condition. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Pavel Zakharov <pavel.zakharov@delphix.com>	2017-04-21 00:24:59 +00:00
Josh Paetzel	9a625bd31c	MFV 316870 7448 ZFS doesn't notice when disk vdevs have no write cache illumos/illumos-gate@295438ba32 `295438ba32` https://www.illumos.org/issues/7448 I built a SmartOS image with all the NVMe commits including 7372 (support NVMe volatile write cache) and repeated my dd testing: > #!/bin/bash > for i in `seq 1 1000`; do > dd if=/dev/zero of=file00 bs=1M count=102400 oflag=sync & > dd if=/dev/zero of=file01 bs=1M count=102400 oflag=sync & > wait > rm file00 file01 > done > Previously each dd command took ~145 seconds to finish, now it takes ~400 seconds. Eventually I figured out it is 7372 that causes unnecessary nvme_bd_sync() executions which wasted CPU cycles. If a NVMe device doesn't support a write cache, the nvme_bd_sync function will return ENOTSUP to indicate this to upper layers. It seems this returned value is ignored by ZFS, and as such this bug is not really specific to NVMe. In vdev_disk_io_start() ZFS sends the flush to the disk driver (blkdev) with a callback to vdev_disk_ioctl_done(). As nvme filled in the bd_sync_cache function pointer, blkdev will not return ENOTSUP, as the nvme driver in general does support cache flush. Instead it will issue an asynchronous flush to nvme and immediately return 0, and hence ZFS will not set vdev_nowritecache here. The nvme driver will at some point process the cache flush command, and if there is no write cache on the device it will return ENOTSUP, which will be delivered to the vdev_disk_ioctl_done() callback. This function will not check the error code and not set nowritecache. The right place to check the error code from the cache flush is in zio_vdev_io_assess(). This would catch both cases, synchronous and asynchronous cache flushes. This would also be independent of the implementation detail that some drivers can return ENOTSUP immediately. Reviewed by: Dan Fields <dan.fields@nexenta.com> Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Hans Rosenfeld <hans.rosenfeld@nexenta.com> Obtained from: Illumos	2017-04-21 00:17:54 +00:00
Josh Paetzel	47e222432b	MFV 316868 7430 Backfill metadnode more intelligently illumos/illumos-gate@af346df588 `af346df588` https://www.illumos.org/issues/7430 Description and patch from brought over from the following ZoL commit: https:// github.com/zfsonlinux/zfs/commit/68cbd56e182ab949f58d004778d463aeb3f595c6 Only attempt to backfill lower metadnode object numbers if at least 4096 objects have been freed since the last rescan, and at most once per transaction group. This avoids a pathology in dmu_object_alloc() that caused O(N^2) behavior for create-heavy workloads and substantially improves object creation rates. As summarized by @mahrens in #4636: "Normally, the object allocator simply checks to see if the next object is available. The slow calls happened when dmu_object_alloc() checks to see if it can backfill lower object numbers. This happens every time we move on to a new L1 indirect block (i.e. every 32 * 128 = 4096 objects). When re-checking lower object numbers, we use the on-disk fill count (blkptr_t:blk_fill) to quickly skip over indirect blocks that don?t have enough free dnodes (defined as an L2 with at least 393,216 of 524,288 dnodes free). Therefore, we may find that a block of dnodes has a low (or zero) fill count, and yet we can?t allocate any of its dnodes, because they've been allocated in memory but not yet written to disk. In this case we have to hold each of the dnodes and then notice that it has been allocated in memory. The end result is that allocating N objects in the same TXG can require CPU usage proportional to N^2." Add a tunable dmu_rescan_dnode_threshold to define the number of objects that must be freed before a rescan is performed. Don't bother to export this as a module option because testing doesn't show a compelling reason to change it. The vast majority of the performance gain comes from limit the rescan to at most once per TXG. Reviewed by: Alek Pinchuk <alek@nexenta.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Gordon Ross <gordon.w.ross@gmail.com> Author: Ned Bass <bass6@llnl.gov> Obtained from: Illumos	2017-04-21 00:12:47 +00:00
Gleb Smirnoff	83c9dea1ba	- Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter in place. To do per-cpu stats, convert all fields that previously were maintained in the vmmeters that sit in pcpus to counter(9). - Since some vmmeter stats may be touched at very early stages of boot, before we have set up UMA and we can do counter_u64_alloc(), provide an early counter mechanism: o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter. o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter, so that at early stages of boot, before counters are allocated we already point to a counter that can be safely written to. o For sparc64 that required a whole dummy pcpu[MAXCPU] array. Further related changes: - Don't include vmmeter.h into pcpu.h. - vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit, to match kernel representation. - struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion. This is based on benno@'s 4-year old patch: https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html Reviewed by: kib, gallatin, marius, lidl Differential Revision: https://reviews.freebsd.org/D10156	2017-04-17 17:34:47 +00:00
Gleb Smirnoff	9ed01c32e0	All these files need sys/vmmeter.h, but now they got it implicitly included via sys/pcpu.h.	2017-04-17 17:07:00 +00:00
Andriy Gapon	656074ea60	rename vfs.zfs.debug_flags to vfs.zfs.debugflags While the former name is easier to read, the "_flags" suffix has a special meaning for loader(8) and, thus, it was impossible to set the knob via loader.conf(5). The loader interpreted the setting as flags that should be passed to a kernel module named "vfs.zfs.debug". Discussed with: smh MFC after: 2 weeks	2017-04-14 15:35:07 +00:00
Alan Somers	d255847d9e	Fix vdev_geom_attach_by_guids for partitioned disks When opening a vdev whose path is unknown, vdev_geom must find a geom provider with a label whose guids match the desired vdev. However, due to partitioning, it is possible that two non-synonomous providers will share some labels. For example, if the first partition starts at the beginning of the drive, then ada0 and ada0p1 will share the first label. More troubling, if the last partition runs to the end of the drive, then ada0p3 and ada0 will share the last label. If vdev_geom opens ada0 when it should've opened ada0p3, then the pool won't be readable. If it opens ada0 when it should've opened ada0p1, then it will corrupt some other partition when it writes the 3rd and 4th labels. The easiest way to reproduce this problem is to install a mirrored root pool with the default partition layout, then swap the positions of the two boot drives and reboot. Whether the bug manifests depends on the order in which geom lists its providers, which is arbitrary. Fix this situation by modifying the search algorithm to prefer geom providers that have all four labels intact. If no such provider exists, then open whichever provider has the most. Reviewed by: mav MFC after: 3 weeks Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D10365	2017-04-13 14:51:34 +00:00
Steven Hartland	3e856909b7	Fix expandsz 16.0E vals and vdev_min_asize of RAIDZ children When a member of a RAIDZ has been replaced with a device smaller than the original, then the top level vdev can report its expand size as 16.0E. The reduced child asize causes the RAIDZ to have a vdev_asize lower than its vdev_max_asize which then results in an underflow during the calculation of the parents expand size. Fix this by updating the vdev_asize if it shrinks, which is already protected by a check against vdev_min_asize so should always be safe. Also for RAIDZ vdevs, ensure that the sum of their child vdev_min_asize is always greater than the parents vdev_min_size. Fixes: https://www.illumos.org/issues/7885 MFC after: 2 weeks Sponsored by: Multiplay	2017-04-03 13:11:28 +00:00
Josh Paetzel	e106234416	MFV: 315989 7603 xuio_stat_wbuf_* should be declared (void) illumos/illumos-gate@99aa8b5505 `99aa8b5505` https://www.illumos.org/issues/7603 The funcs are declared k&r style, where the args are not specified: void xuio_stat_wbuf_copied(); They should be declared to take no arguments: void xuio_stat_wbuf_copied(void); Need to change both .c and .h. Author: Prashanth Sreenivasa <pks@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Robert Mustacchi <rm@joyent.com> Approved by: Richard Lowe <richlowe@richlowe.net>	2017-03-27 17:27:46 +00:00
Alexander Motin	3aef5b286a	MFV r315290, r315291: 7303 dynamic metaslab selection illumos/illumos-gate@8363e80ae7 https://github.com/illumos/illumos-gate/commit/8363e80ae72609660f6090766ca8c2c18 https://www.illumos.org/issues/7303 This change introduces a new weighting algorithm to improve metaslab selection. The new weighting algorithm relies on the SPACEMAP_HISTOGRAM feature. As a result, the metaslab weight now encodes the type of weighting algorithm used (size-based vs segment-based). This also introduce a new allocation tracing facility and two new dcmds to help debug allocation problems. Each zio now contains a zio_alloc_list_t structure that is populated as the zio goes through the allocations stage. Here's an example of how to use the tracing facility: > c5ec000::print zio_t io_alloc_list \| ::walk list \| ::metaslab_trace MSID DVA ASIZE WEIGHT RESULT VDEV - 0 400 0 NOT_ALLOCATABLE ztest.0a - 0 400 0 NOT_ALLOCATABLE ztest.0a - 0 400 0 ENOSPC ztest.0a - 0 200 0 NOT_ALLOCATABLE ztest.0a - 0 200 0 NOT_ALLOCATABLE ztest.0a - 0 200 0 ENOSPC ztest.0a 1 0 400 1 x 8M 17b1a00 ztest.0a > 1ff2400::print zio_t io_alloc_list \| ::walk list \| ::metaslab_trace MSID DVA ASIZE WEIGHT RESULT VDEV - 0 200 0 NOT_ALLOCATABLE mirror-2 - 0 200 0 NOT_ALLOCATABLE mirror-0 1 0 200 1 x 4M `112ae00` mirror-1 - 1 200 0 NOT_ALLOCATABLE mirror-2 - 1 200 0 NOT_ALLOCATABLE mirror-0 1 1 200 1 x 4M 112b000 mirror-1 - 2 200 0 NOT_ALLOCATABLE mirror-2 If the metaslab is using segment-based weighting then the WEIGHT column will display the number of segments available in the bucket where the allocation attempt was made. Author: George Wilson <george.wilson@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Chris Siden <christopher.siden@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Don Brady <don.brady@intel.com> Approved by: Richard Lowe <richlowe@richlowe.net>	2017-03-24 09:37:00 +00:00
Andriy Gapon	16b46572fa	zfs_putpages: use TXG_WAIT Explicit looping using TXG_NOWAIT is more verbose and may harm performance under heavy load because of multiple waits. MFC after: 1 week	2017-03-23 09:13:21 +00:00
Andriy Gapon	3d775e193e	zfs: add zio_buf_alloc_nowait and use it in vdev_queue_aggregate This way we can avoid blocking the whole queue in the low memory situations. It's better to sacrifice some I/O performance by not doing the aggregation than to add an indefinite wait for more memory. Reviewed by: smh MFC after: 2 weeks Sponsored by: Panzura Differential Revision: https://reviews.freebsd.org/D9999	2017-03-23 08:59:17 +00:00
Steven Hartland	c76da62acf	Reduce ARC fragmentation threshold As ZFS can request up to SPA_MAXBLOCKSIZE memory block e.g. during zfs recv, update the threshold at which we start agressive reclamation to use SPA_MAXBLOCKSIZE (16M) instead of the lower zfs_max_recordsize which defaults to 1M. PR: 194513 Reviewed by: avg, mav MFC after: 1 month Sponsored by: Multiplay Differential Revision: https://reviews.freebsd.org/D10012	2017-03-17 12:34:57 +00:00
Andriy Gapon	520758a51d	zfs: provide a special vptocnp method for the .zfs vnode vop_stdvptocnp() doesn't work properly if .zfs directory is hidden. Reported by: swills, des Tested by: des MFC after: 1 week MFC with: r314048	2017-03-11 16:00:49 +00:00
Andriy Gapon	1a3c849840	MFV r314911: 7867 ARC space accounting leak illumos/illumos-gate@6de76ce2a9 `6de76ce2a9` https://www.illumos.org/issues/7867 It seems that in the case where arc_hdr_free_pdata() sees HDR_L2_WRITING() we would fail to update the ARC space statistics. In the normal case those statistics are updated in arc_free_data_buf(). But in the arc_hdr_free_on_write() path we don't do that. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Andriy Gapon <avg@FreeBSD.org> MFC after: 10 days	2017-03-08 13:52:45 +00:00
Andriy Gapon	7e4b3a6fa2	MFV r314910: 7843 get_clones_stat() is suboptimal for lots of clones illumos/illumos-gate@c5bde7273e `c5bde7273e` https://www.illumos.org/issues/7843 get_clones_stat() could be very slow if a snapshot has many (thousands) clones. Clone names are added to an nvlist that's created with NV_UNIQUE_NAME. So, each time a new name is appended to the list, the whole list is searched linearly to see if that name is not already in the list. That results in the quadratic complexity. That should be easy to fix as we know in advance that we should not get any duplicate names, so we can drop NV_UNIQUE_NAME when creating the list. Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Andriy Gapon <avg@FreeBSD.org> MFC after: 1 week Sponsored by: ClusterHQ	2017-03-08 13:48:26 +00:00
Martin Matuska	0132c9cd4a	Fix null pointer dereference in zfs_freebsd_setacl(). Prevents unprivileged users from panicking the kernel by calling __acl_delete_*() on files or directories inside a ZFS mount. MFC after: 3 days	2017-03-02 23:23:28 +00:00
Alexander Motin	6d1ccf40cc	Execute last ZIO of log commit synchronously. For short transactions overhead of context switch can be too large. Skipping it gives significant latency reduction. For large ones, including multiple ZIOs, latency is less critical, while throughput there may become limited by checksumming speed of single CPU core. To get best of both cases, execute last ZIO directly from calling thread context to save latency, while all others (if there are any) enqueue to taskqueues in traditional way. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2017-03-02 07:55:47 +00:00
Alexander Motin	e93f9c7708	Completely skip cache flushing for not supporting log devices. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2017-03-02 07:50:06 +00:00
Andrey V. Elsukov	19b60f70c0	Do not invoke the resize event when previous provider's size was zero. This is similar to r303637 fix for geom_disk. Reported by: avg Tested by: avg MFC after: 1 week	2017-03-01 18:03:32 +00:00
Josh Paetzel	b98d22744f	MFV 314276 7570 tunable to allow zvol SCSI unmap to return on commit of txn to ZIL illumos/illumos-gate@1c9272b861 `1c9272b861` https://www.illumos.org/issues/7570 Based on the discovery that every unmap waits for the commit of the txn to the ZIL, introducing a very high latency to unmap commands, this behavior was made into a tunable zvol_unmap_sync_enabled and set to false. The net impact of this change is that by default SCSI unmap commands will result in space being freed within the zvol (today they are ignored and returned with good status). However, unlike the code today, instead of 18+ms per unmap, they take about 30us. With the testing done on NTFS against a Win2k12 target, the new behavior should work seamlessly. Files on the zvol that have already been set with the zfree application will continue to write 0's when deleted, and any new files created since zvol creation will send unmap commands when deleted. This behavior exists today, but with this change the unmap commands will be processed and result in reclaim of space. Author: Stephen Blinick <stephen.blinick@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Approved by: Robert Mustacchi <rm@joyent.com>	2017-02-25 20:01:17 +00:00
Andriy Gapon	9211bb327f	l2arc: try to fix write size calculation broken by Compressed ARC commit While there, make a change to not evict a first buffer outside the requested eviciton range. To do: - give more consistent names to the size variables - upstream to OpenZFS PR: 216178 Reported by: lev Tested by: lev MFC after: 2 weeks	2017-02-25 17:03:48 +00:00
Andriy Gapon	1e1065b60f	zfs: call spa_deadman on a taskqueue thread callout(9) prohibits callout functions from sleeping. illumos mutexes are emulated using sx(9). spa_deadman() calls vdev_deadman() and the latter acquires vq_lock. As a result we can get a more confusing panic instead of a specific panic or no panic: sleepq_add: td 0xfffff80019669960 to sleep on wchan 0xfffff8001cff4d88 with sleeping prohibited This change adds another level of indirection where the deadman callout schedules spa_deadman() to be executed on taskqueue_thread. While there, use callout_schedule(0 instead of callout_reset() in spa_sync(). Discussed with: mav MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D9762	2017-02-25 16:45:53 +00:00
Josh Paetzel	029c0bfdbd	MFV 314243 6676 Race between unique_insert() and unique_remove() causes ZFS fsid change illumos/illumos-gate@40510e8eba `40510e8eba` https://www.illumos.org/issues/6676 The fsid of zfs filesystems might change after reboot or remount. The problem seems to be caused by a race between unique_insert() and unique_remove(). The unique_remove() is called from dsl_dataset_evict() which is now an asynchronous thread. In a case the dsl_dataset_evict() thread is very slow and calls unique_remove() too late we will end up with changed fsid on zfs mount. This problem is very likely caused by #5056. Steps to Reproduce Note: I'm able to reproduce this always on a single core (virtual) machine. On multicore machines it is not so easy to reproduce. # uname -a SunOS openindiana 5.11 illumos-633aa80 i86pc i386 i86pc Solaris # zfs create rpool/TEST # FS=$(echo ::fsinfo \| mdb -k \| grep TEST \| awk '{print $1}') # echo $FS::print vfs_t vfs_fsid \| mdb -k vfs_fsid = { vfs_fsid.val = [ 0x54d7028a, 0x70311508 ] } # zfs umount rpool/TEST # zfs mount rpool/TEST # FS=$(echo ::fsinfo \| mdb -k \| grep TEST \| awk '{print $1}') # echo $FS::print vfs_t vfs_fsid \| mdb -k vfs_fsid = { vfs_fsid.val = [ 0xd9454e49, 0x6b36d08 ] } # Impact The persistent fsid (filesystem id) is essential for proper NFS functionality. If the fsid of a filesystem changes on remount (or after reboot) the NFS clients might not be able to automatically recover from such event and the manual remount of the NFS filesystems on every NFS client might be needed. Author: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com> Reviewed by: Dan Vatca <dan.vatca@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com>	2017-02-25 14:45:54 +00:00
Andriy Gapon	7fa27112f3	zfs: clean up unused files and definitions MFC after: 1 month X-MFC after: r314048	2017-02-24 07:53:56 +00:00
Andriy Gapon	b93763e55d	zfs: move zio_taskq_basedc under SYSDC That knob is useless without SDC (or alike) scheduling class support. That is, it's unused on FreeBSD. MFC after: 4 days	2017-02-21 21:11:58 +00:00
Andriy Gapon	2b1bedaf06	zfs: lower priority of zio_write_issue threads by four The difference of one was insignificant because zio_write_issue threads ended up on the same run queues as other zio threads. See sys/priority.h and sys/runq.h for more details. Add a comment describing FreeBSD priority considerations and restore the illumos variant of the code for comparison. Obtained from: Panzura MFC after: 2 weeks Sponsored by: Panzura	2017-02-21 21:09:21 +00:00
Andriy Gapon	47c8e3d912	reimplement zfsctl (.zfs) support The current code is written on top of GFS, a library with the generic support for writing filesystems, which was ported from illumos. Because of significant differences between illumos VFS and FreeBSD VFS models, both the GFS and zfsctl code were heavily modified to work on FreeBSD. Nonetheless, they still contain quite a few ugly hacks and bugs. This is a reimplementation of the zfsctl code where the VFS-specific bits are written from scratch and only the code that interacts with the rest of ZFS is reused. Some highlights. We use two types of nodes, static and on-demand. The static nodes are used for permanent directories like .zfs, .zfs/snapshot, etc. The on-demand nodes are used for ephemeral directories that act as snapshot mount points. Initially only static nodes are created. Their vnodes are instantiated when they are looked up. The on-demand nodes and vnodes are instantiated as needed and the nodes are destroyed as soon as the corresponding vnodes are reclaimed. We also try very hard to ensure that uncovered snapshot vnodes do not linger. They are supposed to become inactive as soon as they are uncovered and we try to recycle them immediately. When a filesystem is unmounted all snapshots under .zfs are unmounted first, then all vnodes are flushed and finally the static .zfs nodes are destroyed. There are some changes outside of zfsctl code too. z_ctldir is never used directly (as it is an opaque pointer), zfsctl_root() has to be used instead. The function returns a locked vnode now, so it accepts a lock flags parameter. The function can also fail now, e.g. during force unmounting, whereas previously it was infallible. zfsctl_root_lookup() is retired, instead of it VOP_LOOKUP() on the .zfs vnode (obtained with zfsctl_root) is used. Some ideas are picked from an independent work by will. Reviewed by: asomers, smh MFC after: 1 month Relnotes: maybe Differential Revision: https://reviews.freebsd.org/D7421	2017-02-21 17:47:08 +00:00
Josh Paetzel	aedc925301	MVF: 313876 7504 kmem_reap hangs spa_sync and administrative tasks illumos/illumos-gate@405a5a0f5c https://github.com/illumos/illumos-gate/commit/405a5a0f5c3ab36cb76559467d1a62ba648bd80 https://www.illumos.org/issues/7504 We see long spa_sync(). We are waiting to hold dp_config_rwlock for writer. Some other thread holds dp_config_rwlock for reader, then calls arc_get_data_buf(), which finds that arc_is_overflowing()==B_TRUE. So it waits (while holding dp_config_rwlock for reader) for arc_reclaim_thread to signal arc_reclaim_waiters_cv. Before signaling, arc_reclaim_thread does arc_kmem_reap_now(), which takes ~seconds. Author: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com>	2017-02-17 17:52:12 +00:00
Josh Paetzel	c53cc7187c	MFV 313786 7500 Simplify dbuf_free_range by removing dn_unlisted_l0_blkid illumos/illumos-gate@653af1b809 `653af1b809` https://www.illumos.org/issues/7500 With the integration of: commit 0f6d88aded0d165f5954688a9b13bac76c38da84 Author: Alex Reece <alex@delphix.com> Date: Sat Jul 26 13:40:04 2014 -0800 4873 zvol unmap calls can take a very long time for larger datasets the dnode's dn_bufs field was changed from a list to a tree. As a result, the dn_unlisted_l0_blkid field is no longer necessary. Author: Stephen Blinick <stephen.blinick@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Gordon Ross <gordon.w.ross@gmail.com>	2017-02-16 19:00:09 +00:00
Mark Johnston	28180eff9e	Use pget() instead of pfind() in fasttrap_pid_{enable,disable}(). Suggested by: mjg MFC after: 1 week	2017-02-15 06:07:01 +00:00

1 2 3 4 5 ...

1406 Commits