freebsd-dev

Author	SHA1	Message	Date
Alexander Motin	25f15f404d	MFV r336960: 9256 zfs send space estimation off by > 10% on some datasets illumos/illumos-gate@df477c0afa Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Paul Dagnelie <pcd@delphix.com>	2018-07-31 01:02:22 +00:00
Alexander Motin	9abeb6d79e	MFV r336958: 9337 zfs get all is slow due to uncached metadata This project's goal is to make read-heavy channel programs and zfs(1m) administrative commands faster by caching all the metadata that they will need in the dbuf layer. This will prevent the data from being evicted, so that any future call to i.e. zfs get all won't have to go to disk (very much). illumos/illumos-gate@adb52d9262 Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Thomas Caputi <tcaputi@datto.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Matthew Ahrens <mahrens@delphix.com>	2018-07-31 00:58:21 +00:00
Alexander Motin	d1cf4052d0	MFV r336955: 9236 nuke spa_dbgmsg We should use zfs_dbgmsg instead of spa_dbgmsg. Or at least, metaslab_condense() should call zfs_dbgmsg because it's important and rare enough to always log. It's possible that the message in zio_dva_allocate() would be too high-frequency for zfs_dbgmsg. illumos/illumos-gate@21f7c81cc1 Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Matthew Ahrens <mahrens@delphix.com>	2018-07-31 00:47:27 +00:00
Alexander Motin	656d46109d	MFV r336952: 9192 explicitly pass good_writes to vdev_uberblock/label_sync Currently vdev_label_sync and vdev_uberblock_sync take a zio_t and assume that its io_private is a pointer to the good_writes count. They should instead accept this argument explicitly. illumos/illumos-gate@a3b5583021 Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Matthew Ahrens <mahrens@delphix.com>	2018-07-31 00:37:45 +00:00
Alexander Motin	194000fa21	MFV r336950: 9290 device removal reduces redundancy of mirrors Mirrors are supposed to provide redundancy in the face of whole-disk failure and silent damage (e.g. some data on disk is not right, but ZFS hasn't detected the whole device as being broken). However, the current device removal implementation bypasses some of the mirror's redundancy. illumos/illumos-gate@3a4b1be953 Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Reviewed by: Sara Hartse <sara.hartse@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Tim Chase <tim@chase2k.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Matthew Ahrens <mahrens@delphix.com>	2018-07-31 00:25:39 +00:00
Alexander Motin	9cd6f162c0	MFV r336948: 9112 Improve allocation performance on high-end systems On high-end systems running async sequential write workloads, especially NUMA systems with flash or NVMe storage, one significant performance bottleneck is selecting a metaslab to do allocations from. This process can be parallelized, providing significant performance increases for these workloads. illumos/illumos-gate@f78cdc34af Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Alexander Motin <mav@FreeBSD.org> Approved by: Gordon Ross <gwr@nexenta.com> Author: Paul Dagnelie <pcd@delphix.com>	2018-07-31 00:02:42 +00:00
Alexander Motin	6413a6d31f	MFV r336946: 9238 ZFS Spacemap Encoding V2 The current space map encoding has the following disadvantages: [1] Assuming a 512-byte sector size, each entry can represent at most 16MB for a segment. This makes the encoding very inefficient for large regions of space. [2] As vdev-wide space maps have started to be used by new features (i.e. device removal, zpool checkpoint) we've started imposing limits in the vdevs that can be used with them based on the maximum addressable offset (currently 64PB for a top-level vdev). The new encoding remains backwards compatible with the old one. The introduced two-word entry format, besides extending the limits imposed by the single-entry layout, also includes a vdev field and some extra padding after its prefix. The extra padding after the prefix is reserved for future usage (e.g. new prefixes for future encodings or new fields for flags). The new vdev field not only makes the space maps more self-descriptive, but also opens the doors for pool-wide space maps. One final important note is that the number of bits used for vdevs is reduced to 24 bits for blkptrs. That was decided as we don't know of any setups that use more than 16M vdevs for the time being and we wanted to fit the vdev field in the space map. In addition that gives us some extra bits in dva_t. illumos/illumos-gate@17f11284b4 Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <gwilson@zfsmail.com> Approved by: Gordon Ross <gwr@nexenta.com> Author: Serapheim Dimitropoulos <serapheim@delphix.com>	2018-07-30 23:47:38 +00:00
Alexander Motin	eb235f2f8e	MFV r336942: 9189 Add debug to vdev_label_read_config when txg check fails illumos/illumos-gate@b6bf6e1540 Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Approved by: Matt Ahrens <mahrens@delphix.com> Author: Pavel Zakharov <pavel.zakharov@delphix.com>	2018-07-30 22:03:29 +00:00
Allan Jude	68232f2053	ZFS: Reserve DMU_BACKUP_FEATURE flags for Native Encryption and ZSTD	2018-07-24 04:38:11 +00:00
Sean Eric Fagan	1bbaf1401c	Fix a couple of typos in r334844 noticed by Richard Kojedzinszky. Submitted by: Richard Kojedzinszky Reviewed by: sef Approved by: mav	2018-07-18 16:03:40 +00:00
Sean Eric Fagan	072ffd4b20	Fix up some missed and mis-merges from the sequential scan code (r334844). Most of the changes involve moving some code around to reduce conflicts with future merges. One of the missing changes included a notification on scrub cancellation. Approved by: mav Sponsored by: iXsystems Inc	2018-07-10 20:11:32 +00:00
Sean Eric Fagan	aad5531e71	This exposes ZFS user and group quotas via the normal quotactl(2) mechanism. (Read-only at this point, however.) In particular, this is to allow rpc.rquotad to query quotas for NFS mounts, allowing users to see their quotas on the hosts using the datasets. The changes specifically: * Add new RPC entry points for querying quotas. * Changes the library routines to allow non-UFS quotas. * Changes rquotad to check for quotas on mounted filesystems, rather than being limited to entries in /etc/fstab * Lastly, adds a VFS entry-point for ZFS to query quotas. Note that this makes one unavoidable behavioural change: if quotas are enabled, then they can be queried, as opposed to the current method of checking for quotas being specified in fstab. (With ZFS, if there are user or group quotas, they're used, always.) Reviewed by: delphij, mav Approved by: mav Sponsored by: iXsystems Inc Differential Revision: https://reviews.freebsd.org/D15886	2018-07-05 22:56:13 +00:00
Sean Eric Fagan	69724399c4	This originated from ZFS On Linux, as `d4a72f2386` During scans (scrubs or resilvers), it sorts the blocks in each transaction group by block offset; the result can be a significant improvement. (On my test system just now, where I have put some effort into introducing fragmentation into the pool since I set it up yesterday, a scrub went from 1h2m to 33.5m with the changes.) I've seen similar ratios on production systems. Approved by: Alexander Motin Obtained from: ZFS On Linux Relnotes: Yes (improved scrub performance, with tunables) Differential Revision: https://reviews.freebsd.org/D15562	2018-06-08 17:38:28 +00:00
Benno Rice	b3b11d6400	Break recursion involving getnewvnode and zfs_rmnode. When we're at our vnode limit, getnewvnode will call into the vnode LRU cache to free up vnodes. If the vnode we try to recycle is a ZFS vnode we end up, eventually, in zfs_rmnode. If the ZFS vnode we're recycling represents something with extended attributes, zfs_rmnode will call zfs_zget which will attempt to allocate another vnode. If the next vnode we try to recycle is also a ZFS vnode representing something with extended attributes we can recurse further. This ends up being unbounded and can end up overflowing the stack. In order to avoid this, restructure zfs_rmnode to simply add the extended attribute directory's object ID to the unlinked set, thus not requiring the allocation of a vnode. We then schedule a task that calls zfs_unlinked_drain which will do the work of properly marking the vnodes for unlinking. zfs_unlinked_drain is also called on mount so these will be cleaned up there. Reviewed by: avg, mav Sponsored by: iXsystems, Inc. Differential Revision: https://reviews.freebsd.org/D15342	2018-06-07 18:59:32 +00:00
Andriy Gapon	620b779158	fix zfs_getpages crash when called from sendfile, followup to r329363 It turns out that sendfile_swapin() has an optimization where it may insert pointers to bogus_page into the page array that it passes to VOP_GETPAGES. That happens to work with buffer cache, because it extensively uses bogus_page internally, so it has the necessary checks. However, ZFS did not expect bogus_page as VOP_GETPAGES(9) does not document such a (ab)use of bogus_page. So, this commit adds checks and handling of bogus_page. I expect that use of bogus_page with VOP_GETPAGES will get documented sooner rather than later. Reported by: Andrew Reilly <areilly@bigpond.net.au>, delphij Tested by: Andrew Reilly <areilly@bigpond.net.au> Requested by: many MFC after: 1 week	2018-05-25 07:29:52 +00:00
Andriy Gapon	873c2703d8	Fix 'zpool create -t <tempname>' Creating a pool with a temporary name fails when we also specify custom dataset properties: this is because we mistakenly call zfs_set_prop_nvlist() on the "real" pool name which, as expected, cannot be found because the SPA is present in the namespace with the temporary name. Fix this by specifying the correct pool name when setting the dataset properties. Author: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Obtained from: ZFS on Linux, zfsonlinux/zfs@4ceb8dd6fd MFC after: 1 week	2018-05-15 13:27:29 +00:00
Matt Macy	cbd92ce62e	Eliminate the overhead of gratuitous repeated reinitialization of cap_rights - Add macros to allow preinitialization of cap_rights_t. - Convert most commonly used code paths to use preinitialized cap_rights_t. A 3.6% speedup in fstat was measured with this change. Reported by: mjg Reviewed by: oshogbo Approved by: sbruno MFC after: 1 month	2018-05-09 18:47:24 +00:00
Jamie Gritton	0e5c6bd436	Make it easier for filesystems to count themselves as jail-enabled, by doing most of the work in a new function prison_add_vfs in kern_jail.c Now a jail-enabled filesystem need only mark itself with VFCF_JAIL, and the rest is taken care of. This includes adding a jail parameter like allow.mount.foofs, and a sysctl like security.jail.mount_foofs_allowed. Both of these used to be a static list of known filesystems, with predefined permission bits. Reviewed by: kib Differential Revision: D14681	2018-05-04 20:54:27 +00:00
Ed Maste	3804f572a3	zfs_ioctl: avoid out-of-bound read admbugs: 796 Submitted by: Domagoj Stolfa <ds815@cam.ac.uk> Reported by: Ilja Van Sprundel <ivansprundel@ioactive.com> Reviewed by: avg MFC after: 1 day	2018-05-04 00:56:41 +00:00
Alexander Motin	bbbac409fe	9433 Fix ARC hit rate When the compressed ARC feature was added in commit `d3c2ae1` the method of reference counting in the ARC was modified. As part of this accounting change the arc_buf_add_ref() function was removed entirely. This would have been fine but the arc_buf_add_ref() function served a second undocumented purpose of updating the ARC access information when taking a hold on a dbuf. Without this logic in place a cached dbuf would not migrate its associated arc_buf_hdr_t to the MFU list. This would negatively impact the ARC hit rate, particularly on systems with a small ARC. This change reinstates the missing call to arc_access() from dbuf_hold() by implementing a new arc_buf_access() function. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2018-04-16 00:54:58 +00:00
Andriy Gapon	81f187e576	allow ZFS pool to have temporary name for duration of current import The change adds -t <name> option to zpool create and -t option to zpool import in its form with an old name and a new name. This allows to import (or create) a pool under a name that's different from its real, permanent name without affecting that name. This is useful when working with VM images or images of other physical systems if they happen to have a ZFS pool with the same name as the host system. The changes come from ZoL with some small tweaks. The porting has been done by julian. The change is being submitted to OpenZFS: https://github.com/openzfs/openzfs/pull/600 Submitted by: julian Reviewed by: smh MFC after: 2 weeks Sponsored by: Panzura (porting) Differential Revision: https://reviews.freebsd.org/D14972	2018-04-12 10:37:26 +00:00
Mark Johnston	87c1cb45cd	Correct a comment. Submitted by: Domagoj Stolfa X-MFC with: r332364 Sponsored by: DARPA, AFRL	2018-04-10 14:07:02 +00:00
Mark Johnston	b32171ea5a	Set zfs_arc_free_target to v_free_target. Page daemon output is now regulated by a PID controller with a setpoint of v_free_target. Moreover, the page daemon now wakes up regularly rather than waiting for a wakeup from another thread. This means that the free page count is unlikely to drop below the old zfs_arc_free_target value, and as a result the ARC was not readily freeing pages under memory pressure. Address the immediate problem by updating zfs_arc_free_target to match the page daemon's new behaviour. Reported and tested by: truckman Discussed with: jeff X-MFC with: r329882 Differential Revision: https://reviews.freebsd.org/D14994	2018-04-10 13:56:06 +00:00
Mark Johnston	8593136428	Assert that dtrace_probe() doesn't re-enter itself. This helps catch cases where an instrumented function is called while in probe context. Submitted by: Domagoj Stolfa <domagoj.stolfa@gmail.com> MFC after: 2 weeks Sponsored by: DARPA/AFRL Differential Revision: https://reviews.freebsd.org/D14863	2018-04-10 13:47:09 +00:00
Alexander Motin	8b26d76a50	9434 Speculative prefetch is blocked by device removal code. Device removal code does not set spa_indirect_vdevs_loaded for pools that never experienced device removal. At least one visible consequence of this is a completely blocked speculative prefetcher. This patch sets the variable in such situations.	2018-04-03 21:16:41 +00:00
Alexander Motin	849a7ce2d5	MFV r331712: 9280 Assertion failure while running removal_with_ganging test with 4K devices illumos/illumos-gate@243952c7ee Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> Author: Matt Ahrens <Matt.Ahrens@delphix.com>	2018-03-28 23:17:29 +00:00
Alexander Motin	a96086ed50	MFV 331710: 9188 increase size of dbuf cache to reduce indirect block decompression illumos/illumos-gate@268bbb2a2f With compressed ARC (6950) we use up to 25% of our CPU to decompress indirect blocks, under a workload of random cached reads. To reduce this decompression cost, we would like to increase the size of the dbuf cache so that more indirect blocks can be stored uncompressed. If we are caching entire large files of recordsize=8K, the indirect blocks use 1/64th as much memory as the data blocks (assuming they have the same compression ratio). We suggest making the dbuf cache be 1/32nd of all memory, so that in this scenario we should be able to keep all the indirect blocks decompressed in the dbuf cache. (We want it to be more than the 1/64th that the indirect blocks would use because we need to cache other stuff in the dbuf cache as well.) In real world workloads, this won't help as dramatically as the example above, but we think it's still worth it because the risk of decreasing performance is low. The potential negative performance impact is that we will be slightly reducing the size of the ARC (by ~3%). Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com> Reviewed by: Allan Jude <allanjude@freebsd.org> Reviewed by: Igor Kozhukhov <igor@dilos.org> Approved by: Garrett D'Amore <garrett@damore.org> Author: George Wilson <george.wilson@delphix.com>	2018-03-28 23:05:48 +00:00
Alexander Motin	9311abdd7e	MFV r331708: 9321 arc_loan_compressed_buf() can increment arc_loaned_bytes by the wrong value illumos/illumos-gate@9be12bd737 arc_loan_compressed_buf() increments arc_loaned_bytes by psize unconditionally In the case of zfs_compressed_arc_enabled=0, when the buf is returned via arc_return_buf(), if ARC_BUF_COMPRESSED(buf) is false, then arc_loaned_bytes is decremented by lsize, not psize. Switch to using arc_buf_size(buf), instead of psize, which will return psize or lsize, depending on the result of ARC_BUF_COMPRESSED(buf). Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> Author: Allan Jude <allanjude@freebsd.org>	2018-03-28 22:50:05 +00:00
Alexander Motin	5c4561f332	MFV r331706: 9235 rename zpool_rewind_policy_t to zpool_load_policy_t illumos/illumos-gate@5dafeea3eb We want to be able to pass various settings during import/open of a pool, which are not only related to rewind. Instead of adding a new policy and duplicating a bunch of code, we should just rename rewind_policy to a more generic term like load_policy. For instance, we'd like to set spa->spa_import_flags from the nvlist, rather than from a flags parameter passed to spa_import, as in some cases we want those flags not only for the import case, but also for the open case. One such flag could be ZFS_IMPORT_MISSING_LOG (as used in zdb) which would allow zfs to open a pool when logs are missing. Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Pavel Zakharov <pavel.zakharov@delphix.com>	2018-03-28 22:29:06 +00:00
Alexander Motin	6cc8a8260f	MFV 331704: 9191 dump vdev tree to zfs_dbgmsg when spa load fails due to missing log devices illumos/illumos-gate@ccef24b493 Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Pavel Zakharov <pavel.zakharov@delphix.com>	2018-03-28 22:10:06 +00:00
Alexander Motin	2221f0d8af	MFV 331702: 9187 racing condition between vdev label and spa_last_synced_txg in vdev_validate illumos/illumos-gate@d1de72cfa2 ztest failed with uncorrectable IO error despite having the fix for #7163. Both sides of the mirror have CANT_OPEN_BAD_LABEL, which also distinguishes it from that issue. Definitely seems like a racing condition between the vdev_validate and spa_sync: 1. Thread A (spa_sync): vdev label is updated to latest txg 2. Thread B (vdev_validate): vdev label's txg is compared to spa_last_synced_txg and is ahead. 3. Thread A (spa_sync): spa_last_synced_txg is updated to latest txg. Solution: do not check txg in vdev_validate unless config lock is held. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Pavel Zakharov <pavel.zakharov@delphix.com>	2018-03-28 22:07:31 +00:00
Alexander Motin	0b0c76bc58	MFV r331695, 331700: 9166 zfs storage pool checkpoint illumos/illumos-gate@8671400134 The idea of Storage Pool Checkpoint (aka zpool checkpoint) deals with exactly that. It can be thought of as a “pool-wide snapshot” (or a variation of extreme rewind that doesn’t corrupt your data). It remembers the entire state of the pool at the point that it was taken and the user can revert back to it later or discard it. Its generic use case is an administrator that is about to perform a set of destructive actions to ZFS as part of a critical procedure. She takes a checkpoint of the pool before performing the actions, then rewinds back to it if one of them fails or puts the pool into an unexpected state. Otherwise, she discards it. With the assumption that no one else is making modifications to ZFS, she basically wraps all these actions into a “high-level transaction”. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>	2018-03-28 22:01:27 +00:00
Andriy Gapon	f4043145f2	ZFS vn_rele_async: catch up with the use of refcount(9) for the vnode use count It's neither sufficient nor required to use the vnode interlock when checking if we are going to drop the last use count as the code in vputx() uses refcount (atomic) operations for both checking and decrementing the use count. Apply the same method to vn_rele_async(). While here, remove vn_rele_inactive(), a wrapper around vrele() that didn't add any value. Also, the change required making vfs_refcount_release_if_not_last() public. I've made vfs_refcount_acquire_if_not_zero() public as well. They are in sys/refcount.h now. While making the move I've dropped the vfs_ prefix. Reviewed by: mjg MFC after: 2 weeks Sponsored by: Panzura Differential Revision: https://reviews.freebsd.org/D14869	2018-03-28 08:55:31 +00:00
John Baldwin	d41e41f9f0	Remove very old and unused signal information codes. These have been supplanted by the MI signal information codes in <sys/signal.h> since 7.0. The FPE_*_TRAP ones were deprecated even earlier in 1999. PR: 226579 (exp-run) Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D14637	2018-03-27 20:57:51 +00:00
Andriy Gapon	f3fe7e5eff	zfs: fix mismatch between format specifier and type vdev_dbgmsg_print_tree printed vdev_id of uint64_t type with %u format specifier. That caused subsequent parameters to be incorrectly read from the stack and lead to a crash when a wrong value was interpreted as a string pointer. This should be upstreamed. Reported by: pho MFC after: 3 days	2018-03-23 09:42:47 +00:00
Alexander Motin	4148c56f78	Reduce struct aggsum_bucket padding to fit into one cache line. Reported by: mjg	2018-03-23 02:50:38 +00:00
Alexander Motin	e76e77a972	MFV r331407: 9213 zfs: sytem typo illumos/illumos-gate@edc8ef7d92 Reviewed by: C Fraire <cfraire@me.com> Reviewed by: Andy Fiddaman <omnios@citrus-it.co.uk> Approved by: Joshua M. Clulow <josh@sysmgr.org> Author: Toomas Soome <tsoome@me.com>	2018-03-23 02:30:29 +00:00
Alexander Motin	f222611ab0	MFV r331405: 9084 spa_*_ashift must ignore spare devices illumos/illumos-gate@b037f3dbd6 Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Prakash Surya <prakash.surya@delphix.com>	2018-03-23 02:24:52 +00:00
Alexander Motin	b8436536c9	MFV r331400: 8484 Implement aggregate sum and use for arc counters In pursuit of improving performance on multi-core systems, we should implement fanned-out counters and use them to improve the performance of some of the arc statistics. These stats are updated extremely frequently, and can consume a significant amount of CPU time. Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Paul Dagnelie <pcd@delphix.com>	2018-03-23 02:15:05 +00:00
Mark Johnston	95099bbad1	Fix an access of an uninitialized variable in dtrace_probe(). Reported by: Coverity, via cem MFC after: 3 days	2018-03-18 17:01:50 +00:00
Andriy Gapon	289c14e811	MFV r330973: 9164 assert: newds == os->os_dsl_dataset illumos/illumos-gate@5f5913bb83 `5f5913bb83` https://www.illumos.org/issues/9164 This issue has been reported by Alan Somers as https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225877 dmu_objset_refresh_ownership() first disowns a dataset (and releases it) and then owns it again. There is an assert that the new dataset object is the same as the old dataset object. When running ZFS Test Suite on FreeBSD we see this panic from zpool_upgrade_007_pos test: panic: solaris assert: newds == os->os_dsl_dataset (0xfffff80045f4c000 == 0xfffff80021ab4800) I see that the old dataset has dsl_dataset_evict_async() pending in ds_dbu.dbu_tqent and its ds_dbuf is NULL. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Don Brady <don.brady@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Andriy Gapon <avg@FreeBSD.org> PR: 225877 Reported by: asomers MFC after: 1 week	2018-03-15 08:49:21 +00:00
Steven Hartland	a59e720e65	Prevent ZFS TRIM breaking VTOC8 partitions Update the ZFS TRIM code to ensure it respects VTOC8 partition headers as documented by the ZFS On-Disk Specification section 1.3 Before this a zpool create on a VTOC8 partitioned device would overwrite the partition metadata. Reported by: marius Reviewed by: marius agv MFC after: 1 week Sponsored by: Multiplay	2018-03-14 21:21:03 +00:00
Andriy Gapon	4a24755d68	MFV r330591: 8984 fix for 6764 breaks ACL inheritance illumos/illumos-gate@e9bacc6d1a `e9bacc6d1a` https://www.illumos.org/issues/8984 Consider a directory configured as: drwx-ws---+ 2 henson cpp 3 Jan 23 12:35 dropbox/ user:henson:rwxpdDaARWcC--:f-i----:allow owner@:--------------:f-i----:allow group@:--------------:f-i----:allow everyone@:--------------:f-i----:allow owner@:rwxpdDaARWcC--:-di----:allow group:cpp:-wx-----------:-------:allow owner@:rwxpdDaARWcC--:-------:allow A new file created in this directory ends up looking like: rw-r--r-+ 1 astudent cpp 0 Jan 23 12:39 testfile user:henson:rw-pdDaARWcC--:------I:allow owner@:--------------:------I:allow group@:--------------:------I:allow everyone@:--------------:------I:allow owner@:rw-p--aARWcCos:-------:allow group@:r-----a-R-c--s:-------:allow everyone@:r-----a-R-c--s:-------:allow with extraneous group@ and everyone@ entries allowing read access that shouldn't exist. Per Albert Lee on the zfs mailing list: "aclinherit=passthrough/passthrough-x should still ignore the requested mode when an inheritable ACE for owner@ group@, or everyone@ is present in the parent directory. It appears there was an oversight in my fix for https://www.illumos.org/issues/6764 which made calling zfs_acl_chmod from zfs_acl_inherit unconditional. I think the parent ACL check for aclinherit=passthrough needs to be reintroduced in zfs_acl_inherit." We have a large number of faculty who use dropbox directories like the example to have students submit projects. All of these directories are now allowing Reviewed by: Sam Zaydel <szaydel@racktopsystems.com> Reviewed by: Paul B. Henson <henson@acm.org> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Approved by: Matthew Ahrens <mahrens@delphix.com> Author: Dominik Hassler <hadfl@omniosce.org> PR: 216886 MFC after: 2 weeks	2018-03-07 13:49:26 +00:00
Andriy Gapon	f3b7b054dd	add ZFS_ENTER protection to .zfs/snapshot vnode operations that need it Those operations, zfsctl_snapdir_readdir and zfsctl_snapdir_getattr, access the filesystem's objset and it can be unstable during operations like receive and rollback. MFC after: 2 weeks	2018-02-27 14:08:54 +00:00
Alexander Motin	060bac1db8	Add sysctls/tunables for dbuf cache size. MFC after: 2 weeks	2018-02-27 01:36:43 +00:00
Alan Somers	92bd443160	Implement CTASSERT using _Static_assert Prevents warnings about "unused typedef" with GCC-6 Reported by: GCC-6 MFC after: 18 days X-MFC-With: 329722	2018-02-24 16:01:21 +00:00
Andriy Gapon	de2cb430ad	another rework of getzfsvfs / getzfsvfs_impl code This change is designed to account for yet another difference between illumos and FreeBSD VFS. In FreeBSD a filesystem driver is supposed to clean up mnt_data in its VFS_UNMOUNT method because it's the last call into the driver before a struct mount object is destroyed. The VFS drains all references to the object before destroying it, but for the driver it's already as good as gone. In contrast, illumos VFS provides another method, VFS_FREEVFS, that is called when all references are drained. So, the driver can keep its data after VFS_UNMOUNT and clean it up in VFS_FREEVFS after all references are gone. This is what ZFS does on illumos. So there a reference to a filesystem is sufficient to guarantee that the ZFS specific data, aka zfsvfs_t, stays around (even if the filesystem gets unmounted). In FreeBSD we need to vfs_busy the filesystem to get the same guarantee. vfs_ref guarantees only that the struct mount is kept. The following rules should be observed in getzfsvfs / getzfsvfs_impl on FreeBSD: - if we need access to zfsvfs_t then we must use vfs_busy - if we only need to access struct mount (aka vfs_t), then vfs_ref is enough - when illumos code actually needs only the vfs_t, they still can pass the zfsvfs_t and get the vfs_t from it; that can work in FreeBSD if the filesystem is busied, but when it's just referenced then we have to pass the vfs_t explicitly - we cannot call vfs_busy while holding a dataset because that creates a LOR with dp_config_rwlock As a result: - getzfsvfs_impl now only references the filesystem, same as in illumos, but unlike illumos it has to return the vfs_t - the consumers are updated to account for the change - getzfsvfs busies the filesystem (and drops the reference from getzfsvfs_impl) Also, zfs_unmount_snap() now gets a busied filesystem, references it and then unbusies it, essentially reverting actions done in getzfsvfs. This is needed because the code may perform some checks that require the zfsvfs_t. So, those are done before the unbusying. MFC after: 2 weeks	2018-02-22 13:06:27 +00:00
Andriy Gapon	8d69fe5cc8	followup to r329556, completely remove the covered vnode assert vrele() acquires the vnode lock only if the hold count drops to zero. In other scenarios it needs only the interlock. So, zfsctl_snapdir_lookup() can race with vfs_mount_destroy() -> vrele() such that the lookup adds a new reference and then vrele() drops the mountpoint's reference and only then we check the reference count. It would be just one in this case. In fact, the assert should have been removed in r323483 when the code learned how to deal with the uncovered vnode. PR: 225795 MFC after: 4 days X-MFC with: r329556	2018-02-22 11:41:00 +00:00
Alexander Motin	dd9ceab333	MFV r329803: 9080 recursive enter of vdev_indirect_rwlock from vdev_indirect_remap() illumos/illumos-gate@bdfded42e6 A scenario came up where a callback executed by vdev_indirect_remap() on a vdev calls vdev_indirect_remap() on the same vdev and tries to reacquire vdev_indirect_rwlock that was already acquired from the first call to vdev_indirect_remap(). The specific scenario is that we want to remap a block pointer that is snapshotted but its dataset's remap_deadlist is not cached. So in order to add it we issue a read through a vdev_indirect_remap() on the same vdev, which brings up the aforementioned issue. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org> Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>	2018-02-22 03:54:59 +00:00
Alexander Motin	064827be34	MFV r329799, r329800: 9079 race condition in starting and ending condensing thread for indirect vdevs illumos/illumos-gate@667ec66f1b The timeline of the race condition is the following: [1] Thread A is about to finish condensing the first vdev in spa_condense_indirect_thread(), so it calls the spa_condense_indirect_complete_sync() sync task which sets the spa_condensing_indirect field to NULL. Waiting for the sync task to finish, thread A sleeps until the txg is done. When this happens, thread A will acquire spa_async_lock and set spa_condense_thread to NULL. [2] While thread A waits for the txg to finish, thread B which is running spa_sync() checks whether it should condense the second vdev in vdev_indirect_should_condense() by checking the spa_condensing_indirect field which was set to NULL by spa_condense_indirect_thread() from thread A. So it goes on and tries to spawn a new condensing thread in spa_condense_indirect_start_sync() and the aforementioned assertion fails because thread A has not set spa_condense_thread to NULL (which is basically the last thing it does before returning). The main issue here is that we rely on both spa_condensing_indirect and spa_condense_thread to signify whether a condensing thread is running. Ideally we would only use one throughout the codebase. In addition, for managing spa_condense_thread we currently use spa_async_lock which basically ties condensing to scrubbing when it comes to pausing and resuming those actions during spa export. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org> Author: Serapheim Dimitropoulos <serapheim@delphix.com>	2018-02-22 03:49:06 +00:00
Alexander Motin	1ea10a60f9	MFV r329793, r329795: 9075 Improve ZFS pool import/load process and corrupted pool recovery illumos/illumos-gate@6f7938128a Some work has been done lately to improve the debugability of the ZFS pool load (and import) process. This includes: https://www.illumos.org/issues/7638: Refactor spa_load_impl into several functions https://www.illumos.org/issues/8961: SPA load/import should tell us why it failed https://www.illumos.org/issues/7277: zdb should be able to print zfs_dbgmsg's To iterate on top of that, there's a few changes that were made to make the import process more resilient and crash free. One of the first tasks during the pool load process is to parse a config provided from userland that describes what devices the pool is composed of. A vdev tree is generated from that config, and then all the vdevs are opened. The Meta Object Set (MOS) of the pool is accessed, and several metadata objects that are necessary to load the pool are read. The exact configuration of the pool is also stored inside the MOS. Since the configuration provided from userland is external and might not accurately describe the vdev tree of the pool at the txg that is being loaded, it cannot be relied upon to safely operate the pool. For that reason, the configuration in the MOS is read early on. In the past, the two configurations were compared together and if there was a mismatch then the load process was aborted and an error was returned. The latter was a good way to ensure a pool does not get corrupted, however it made the pool load process needlessly fragile in cases where the vdev configuration changed or the userland configuration was outdated. Since the MOS is stored in 3 copies, the configuration provided by userland doesn't have to be perfect in order to read its contents. Hence, a new approach has been adopted: The pool is first opened with the untrusted userland configuration just so that the real configuration can be read from the MOS. 
The trusted MOS configuration is then used to generate a new vdev tree and the pool is re-opened. When the pool is opened with an untrusted configuration, writes are disabled to avoid accidentally damaging it. During reads, some sanity checks are performed on block pointers to see if each DVA points to a known vdev; when the configuration is untrusted, instead of panicking the system if those checks fail we simply avoid issuing reads to the invalid DVAs. This new two-step pool load process now allows rewinding pools across vdev tree changes such as device replacement, addition, etc. Loading a pool from an external config file in a clustering environment also becomes much safer now since the pool will import even if the config is outdated and didn't, for instance, register a recent device addition. With this code in place, it became relatively easy to implement a long-sought-after feature: the ability to import a pool with missing top level (i.e. non-redundant) devices. Note that since this almost guarantees some loss of data, this feature is for now restricted to a read-only import. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org> Author: Pavel Zakharov <pavel.zakharov@delphix.com>	2018-02-22 03:15:35 +00:00
Alexander Motin	613b0d87da	8942 zfs promote .../%recv should be an error illumos/illumos-gate@add927f8c8 Reported on the ZFSonLinux https://github.com/zfsonlinux/zfs/issues/4843, fixed by https://github.com/zfsonlinux/zfs/pull/6339: If we are in the middle of an incremental zfs receive, the child .../%recv will exist. If you concurrently run zfs promote .../%recv, it will "work", but then zfs gets confused. For example, there's no obvious way to destroy the containing filesystem (because it is now a clone of its invisible child). Attempting to do this promote should be an error. We could fix this by having zfs_ioc_promote() check if zc_name contains a %, similar to zfs_ioc_rename(). Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: loli10K <ezomori.nozomu@gmail.com>	2018-02-22 01:42:13 +00:00
Alexander Motin	a33ba3dbde	MFV r329776: 8477 Assertion failed in vdev_state_dirty(): spa_writeable(spa) illumos/illumos-gate@f4c1745bd6 Illumos 4080 allows "zpool clear" to work on readonly pools: I don't think this is the intended behaviour, we shouldn't be allowed to clear readonly pools. Probably. A fix is already in the ZFS on Linux repository to address this issue: `92e43c1718` Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: loli10K <ezomori.nozomu@gmail.com>	2018-02-22 01:00:46 +00:00
Alexander Motin	eea9be67e6	MFV r329774: 8408 dsl_props_set_sync_impl() does not handle nested nvlists correctly illumos/illumos-gate@85723e5eec When iterating over the input nvlist in dsl_props_set_sync_impl() we don't preserve the nvpair name before looking up ZPROP_VALUE, so when we later go to process it nvpair_name() is always "value" instead of the actual property name. This results in a couple of bugs in the recv code: - received properties are not restored correctly when failing to receive an incremental send stream - received properties are not completely replaced by the new ones when successfully receiving an incremental send stream This was discovered on ZFS on Linux (fixed in `5f1346c299`) Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: loli10K <ezomori.nozomu@gmail.com>	2018-02-22 00:55:25 +00:00
Alexander Motin	756595f675	MFV r329770: 9035 zfs: this statement may fall through illumos/illumos-gate@46ac8fdfc5 Reviewed by: Yuri Pankov <yuripv@yuripv.net> Reviewed by: Andy Fiddaman <omnios@citrus-it.co.uk> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Toomas Soome <tsoome@me.com>	2018-02-22 00:47:38 +00:00
Alexander Motin	502d18a8f1	MFV r329766: 8962 zdb should work on non-idle pools illumos/illumos-gate@e144c4e6c9 Currently `zdb` consistently fails to examine non-idle pools as it fails during the `spa_load()` process. The main problem seems to be that `spa_load_verify()` fails as can be seen below: $ sudo zdb -d -G dcenter zdb: can't open 'dcenter': I/O error ZFS_DBGMSG(zdb): spa_open_common: opening dcenter spa_load(dcenter): LOADING disk vdev '/dev/dsk/c4t11d0s0': best uberblock found for spa dcenter. txg 40824950 spa_load(dcenter): using uberblock with txg=40824950 spa_load(dcenter): UNLOADING spa_load(dcenter): RELOADING spa_load(dcenter): LOADING disk vdev '/dev/dsk/c3t10d0s0': best uberblock found for spa dcenter. txg 40824952 spa_load(dcenter): using uberblock with txg=40824952 spa_load(dcenter): FAILED: spa_load_verify failed [error=5] spa_load(dcenter): UNLOADING This change makes `spa_load_verify()` a dry run when run from `zdb`. This is done by creating a global flag in zfs and then setting it in `zdb`. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andy Stormont <astormont@racktopsystems.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Pavel Zakharov <pavel.zakharov@delphix.com>	2018-02-22 00:42:12 +00:00
Alexander Motin	fa607d017d	MFV r329762: 8961 SPA load/import should tell us why it failed illumos/illumos-gate@3ee8c80c74 When we fail to open or import a storage pool, we typically don't get any additional diagnostic information, just "no pool found" or "can not import". While there may be no additional user-consumable information, we should at least make this situation easier to debug/diagnose for developers and support. For example, we could start by using `zfs_dbgmsg()` to log each thing that we try when importing, and which things failed. E.g. "tried uberblock of txg X from label Y of device Z". Also, we could log each of the stages that we go through in `spa_load_impl()`. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Pavel Zakharov <pavel.zakharov@delphix.com>	2018-02-22 00:03:14 +00:00
Alexander Motin	df331510ab	MFV r329760: 7638 Refactor spa_load_impl into several functions illumos/illumos-gate@1fd3785ff6 spa_load_impl has grown out of proportions. It is currently over 700 lines long and makes it very hard to follow or debug the import process even for experienced ZFS developers. The objective is to split it up in a series of well commented functions. Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Pavel Zakharov <pavel.zakharov@delphix.com>	2018-02-21 23:38:30 +00:00
Alexander Motin	b17bfcde3d	9018 Replace kmem_cache_reap_now() with kmem_cache_reap_soon() illumos/illumos-gate@36a64e6284 To prevent kmem_cache reaping from blocking other system resources, turn kmem_cache_reap_now() (which blocks) into kmem_cache_reap_soon(). Callers to kmem_cache_reap_soon() should use kmem_cache_reap_active(), which exploits #9017's new taskq_empty(). Reviewed by: Bryan Cantrill <bryan@joyent.com> Reviewed by: Dan McDonald <danmcd@joyent.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Yuri Pankov <yuripv@yuripv.net> Author: Tim Kordas <tim.kordas@joyent.com> FreeBSD does not use taskqueue for kmem cache reaping, so this change is less dramatic than it is on Illumos, just limiting reaping to 1 time per second. It may possibly be improved later, if needed.	2018-02-21 23:15:06 +00:00
Alexander Motin	d208c07cf3	MFV r329753: 8809 libzpool should leverage work done in libfakekernel illumos/illumos-gate@f06dce2c1f Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Gordon Ross <gordon.w.ross@gmail.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Andrew Stormont <astormont@racktopsystems.com>	2018-02-21 21:18:04 +00:00
Alexander Motin	832deba1e5	MFV r329736: 8969 Cannot boot from RAIDZ with parity > 1 illumos/illumos-gate@0fb055e81f At present it is possible to boot from a root pool that is on RAIDZ but not one that is on RAIDZ2 or RAIDZ3. This is because, at the time the pool version is checked to ensure support for dual/triple parity, the uberblock has not yet been loaded into the SPA and therefore the code determines that the pool version is too old and returns ENOTSUP. Reviewed by: Igor Kozhukhov <igor@dilos.org> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Andy Stormont <astormont@racktopsystems.com> Reviewed by: Toomas Soome <tsoome@me.com> Approved by: Gordon Ross <gwr@nexenta.com> Author: Andy Fiddaman <omnios@citrus-it.co.uk> FreeBSD already had this fixed, so this is just a diff reduction.	2018-02-21 18:12:19 +00:00
Alexander Motin	24433f00ea	MFV r329502: 7614 zfs device evacuation/removal illumos/illumos-gate@5cabbc6b49 https://www.illumos.org/issues/7614: This project allows top-level vdevs to be removed from the storage pool with “zpool remove”, reducing the total amount of storage in the pool. This operation copies all allocated regions of the device to be removed onto other devices, recording the mapping from old to new location. After the removal is complete, read and free operations to the removed (now “indirect”) vdev must be remapped and performed at the new location on disk. The indirect mapping table is kept in memory whenever the pool is loaded, so there is minimal performance overhead when doing operations on the indirect vdev. The size of the in-memory mapping table will be reduced when its entries become “obsolete” because they are no longer used by any block pointers in the pool. An entry becomes obsolete when all the blocks that use it are freed. An entry can also become obsolete when all the snapshots that reference it are deleted, and the block pointers that reference it have been “remapped” in all filesystems/zvols (and clones). Whenever an indirect block is written, all the block pointers in it will be “remapped” to their new (concrete) locations if possible. This process can be accelerated by using the “zfs remap” command to proactively rewrite all indirect blocks that reference indirect (removed) vdevs. Note that when a device is removed, we do not verify the checksum of the data that is copied. This makes the process much faster, but if it were used on redundant vdevs (i.e. mirror or raidz vdevs), it would be possible to copy the wrong data, when we have the correct data on e.g. the other side of the mirror. Therefore, mirror and raidz devices can not be removed. 
Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Richard Laager <rlaager@wiktel.com> Reviewed by: Tim Chase <tim@chase2k.com> Approved by: Garrett D'Amore <garrett@damore.org> Author: Prashanth Sreenivasa <pks@delphix.com>	2018-02-21 16:51:02 +00:00
Andriy Gapon	cfb675a138	MFV r329718: 8520 7198 lzc_rollback_to should support rolling back to origin illumos/illumos-gate@95643f75d2 `95643f75d2` https://www.illumos.org/issues/8520 lzc_rollback_to() should support rolling back to a clone's origin. The current checks in zfs_ioc_rollback() would not allow that because the origin snapshot belongs to a different filesystem. The overly restrictive check was introduced in 7600, but it was not a regression as none of the existing tools provided a way to rollback to the origin. https://www.illumos.org/issues/7198 EINVAL is returned when a dataset does not have any snapshots, so there is nothing to roll back to. Although the code in zfs_do_rollback checks for that condition in advance, it's still possible that the snapshot(s) gets removed after the check and before the rollback sync task is executed. At the moment zfs command would crash when that happens. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Andriy Gapon <avg@FreeBSD.org> MFC after: 2 weeks	2018-02-21 15:12:14 +00:00
Andriy Gapon	754d27df02	MFV r329715: 8997 ztest assertion failure in zil_lwb_write_issue illumos/illumos-gate@f864f99efe `f864f99efe` https://www.illumos.org/issues/8997 When dmu_tx_assign is called from zil_lwb_write_issue, it's possible for either ERESTART or EIO to be returned. If ERESTART is returned, this will cause an assertion to fail directly in zil_lwb_write_issue, where the code assumes the return value is EIO if dmu_tx_assign returns a non-zero value. This can occur if the SPA is suspended when dmu_tx_assign is called, and most often occurs when running zloop. If EIO is returned, this can cause assertions to fail elsewhere in the ZIL code. For example, zil_commit_waiter_timeout contains the following logic: lwb_t *nlwb = zil_lwb_write_issue(zilog, lwb); ASSERT3S(lwb->lwb_state, !=, LWB_STATE_OPENED); In this case, if dmu_tx_assign returned EIO from within zil_lwb_write_issue, the lwb variable passed in will not be issued to disk. Thus, its lwb_state field will remain LWB_STATE_OPENED and this assertion will fail. zil_commit_waiter_timeout assumes that after it calls zil_lwb_write_issue, the lwb will be issued to disk, and doesn't handle the case where this is not true; i.e. it doesn't handle the case where dmu_tx_assign returns EIO. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Approved by: Robert Mustacchi <rm@joyent.com> Author: Prakash Surya <prakash.surya@delphix.com> MFC after: 3 weeks	2018-02-21 15:07:49 +00:00
Andriy Gapon	9d6810819c	MFV r329713: 8731 ASSERT3U(nui64s, <=, UINT16_MAX) fails for large blocks illumos/illumos-gate@a6c1eb3c08 `a6c1eb3c08` https://www.illumos.org/issues/8731 annotate_ecksum() asserts that nui64s, calculated as nui64s = size / sizeof (uint64_t), is not greater than UINT16_MAX. This restriction is needed because histograms of incorrectly set and cleared bits have 16 bit counters and if the buffer consists of too many 64-bit words, then a counter can potentially overflow producing an incorrect result. When the largest buffer size was 128KB the greatest value of nui64s was 16K, well within the limit. But now we have support for large buffers and for buffer sizes of 512KB and above the restriction is violated. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Andriy Gapon <avg@FreeBSD.org> MFC after: 2 weeks	2018-02-21 14:31:48 +00:00
Andriy Gapon	70639f9a5e	MFV r329710: 8966 Source file zfs_acl.c, function zfs_aclset_common contains a use after end of the lifetime of a local variable illumos/illumos-gate@82693e09cc `82693e09cc` https://www.illumos.org/issues/8966 Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Approved by: Richard Lowe <richlowe@richlowe.net> Author: WHR <msl0000023508@gmail.com> PR: 225162 Submitted by: WHR <msl0000023508@gmail.com> Reported by: WHR <msl0000023508@gmail.com> MFC after: 1 week	2018-02-21 14:17:07 +00:00
Alexander Motin	d8e89539c8	MFV r324198: 8081 Compiler warnings in zdb illumos/illumos-gate@3f7978d02b `3f7978d02b` https://www.illumos.org/issues/8081 zdb(8) is full of minor problems that generate compiler warnings. On FreeBSD, which uses -Werror, the only way to build it is to disable all compiler warnings. This makes it much harder to detect newly introduced bugs. We should clean up all the warnings. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Alan Somers <asomers@gmail.com>	2018-02-21 03:08:47 +00:00
Alexander Motin	03618fe74d	MFV r319737: 6939 add sysevents to zfs core for commands illumos/illumos-gate@ce1577b049 `ce1577b049` https://www.illumos.org/issues/6939 Originally created https://smartos.org/bugview/OS-4489 sysevents should be fired in the kernel from ZFS whenever a command is run that is logged in zpool history. Example output Terminal 1 root - gz sunos ~ # zfs create zones/foobar root - gz sunos ~ # zfs set quota=10g zones/foobar root - gz sunos ~ # zfs destroy zones/foobar Terminal 2 root - gz sunos ~ # sysevent EC_zfs nvlist version: 0 date = 2016-04-28T14:50:08.964Z vendor = SUNW publisher = zfs class = EC_zfs subclass = ESC_ZFS_history_event pid = 0 data = (embedded nvlist) nvlist version: 0 pool_name = zones pool_guid = 0x40c964e8f9a7a694 history_record = (embedded nvlist) nvlist version: 0 dsname = zones/foobar dsid = 0x1525 history internal str = internal_name = create history txg = 0x4c4ef3 Reviewed by: Patrick Mooney <patrick.mooney@joyent.com> Reviewed by: Joshua M. Clulow <jmc@joyent.com> Reviewed by: Josh Wilsdon <jwilsdon@joyent.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed by: Alan Somers <asomers@gmail.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Approved by: Matthew Ahrens <mahrens@delphix.com> Author: Dave Eddy <dave@daveeddy.com>	2018-02-21 02:19:42 +00:00
Alexander Motin	63e739af67	MFV r319736: 6396 remove SVM illumos/illumos-gate@5f10ef697f `5f10ef697f` https://www.illumos.org/issues/6396 LVM = SVM = Solaris Volume Manager is dead code and is not used on ZFS-based platforms. Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Reviewed by: Toomas Soome <tsoome@me.com> Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org> Author: Yuri Pankov <yuri.pankov@nexenta.com>	2018-02-21 00:24:54 +00:00
Alexander Motin	e5a4a83784	MFV r318941: 7446 zpool create should support efi system partition illumos/illumos-gate@7855d95b30 `7855d95b30` https://www.illumos.org/issues/7446 Since we support whole-disk configuration for boot pool, we also will need whole disk support with UEFI boot and for this, zpool create should create efi-system partition. I have borrowed the idea from Oracle Solaris, and am introducing the zpool create -B switch to provide a way to specify that the boot partition should be created. However, there is still a question: how big should the system partition be? For the time being, I have set the default size to 256MB (that's the minimum size for FAT32 with 4k blocks). To support custom size, the set on creation "bootsize" property is created and so the custom size can be set as: zpool create -B -o bootsize=34MB rpool c0t0d0 After pool is created, the "bootsize" property is read only. When -B switch is not used, the bootsize defaults to 0 and is shown in zpool get output with value ''. Older zfs/zpool implementations are ignoring this property. https://www.illumos.org/rb/r/219/ Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Reviewed by: Yuri Pankov <yuri.pankov@gmail.com> Approved by: Dan McDonald <danmcd@kebe.com> Author: Toomas Soome <tsoome@me.com> This commit makes no sense for FreeBSD, that is why I blocked the option, but it should be good to stay closer to upstream.	2018-02-21 00:18:57 +00:00
Alexander Motin	89dabdb4ea	MFC r316910: 7812 Remove gender specific language illumos/illumos-gate@48bbca8168 `48bbca8168` https://www.illumos.org/issues/7812 This change removes all gendered language that did not refer specifically to an individual person or pet. The convention taken was to use variations on "they" when referring to users and/or human beings, while using "it" when referring to code, functions, and/or libraries. Additionally, we took the liberty to fix up any whitespace issues that were found in any files that were already being modified. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: Chris Williamson <chris.williamson@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Igor Kozhukhov <igor@dilos.org> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Robert Mustacchi <rm@joyent.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Daniel Hoffman <dj.hoffman@delphix.com>	2018-02-20 05:07:21 +00:00
Alexander Motin	3542f1bd3a	MFV r307315: 7301 zpool export -f should be able to interrupt file freeing Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Author: Alek Pinchuk <alek@nexenta.com> Closes #175	2018-02-20 04:36:51 +00:00
Alexander Motin	4a0867c8d2	MFV r302649: 7016 arc_available_memory is not 32-bit safe illumos/illumos-gate@0dd053d7d8 `0dd053d7d8` https://www.illumos.org/issues/7016 upstream DLPX-39446 arc_available_memory is not 32-bit safe https://github.com/delphix/delphix-os/commit/6b353ea3b8a1610be22e71e657d051743c64190b related to this upstream: DLPX-38547 delphix engine hang https://github.com/delphix/delphix-os/commit/3183a567b3e8c62a74a65885ca60c86f3d693783 DLPX-38547 delphix engine hang (fix static global) https://github.com/delphix/delphix-os/commit/22ac551d8ef085ad66cc8f65e51ac372b12993b9 DLPX-38882 system hung waiting on free segment https://github.com/delphix/delphix-os/commit/cdd6beef7548cd3b12f0fc0328eeb3af540079c2 Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Gordon Ross <gordon.ross@nexenta.com> Author: Prakash Surya <prakash.surya@delphix.com>	2018-02-20 04:14:12 +00:00
Andriy Gapon	8cebd0e419	relax an assert in zfsctl_snapdir_lookup to match r323578 Since r323578 we may remove the last reference to a covered vnode with vrele() instead of vput(). So, v_usecount may be decremented before the vnode is locked and zfsctl_snapdir_lookup may "catch" the vnode with v_usecount of zero and v_holdcnt of one. PR: 225795 Reported by: asomers MFC after: 1 week	2018-02-19 08:55:22 +00:00
Alan Somers	4571b2776f	zfs: fix formatting in a log statement Submitted by: Dave Baukus <daveb@spectralogic.com> MFC after: 3 weeks Sponsored by: Spectra Logic Corp	2018-02-16 21:59:08 +00:00
Alan Somers	dfbc272d5d	Handle generic pathconf attributes in the .zfs ctldir MFC instructions: change the value of _PC_LINK_MAX to INT_MAX Reported by: jhb MFC after: 19 days X-MFC-With: 329265 Sponsored by: Spectra Logic Corp	2018-02-16 16:56:09 +00:00
Andriy Gapon	c945107d23	read-behind / read-ahead support for zfs_getpages() ZFS caches blocks it reads in its ARC, so in general the optional pages are not as useful as with filesystems that read the data directly into the target pages. But still the optional pages are useful to reduce the number of page faults and associated VM / VFS / ZFS calls. Another case that gets optimized (as a side effect) is paging in from a hole. ZFS DMU does not currently provide a convenient API to check for a hole. Instead it creates a temporary zero-filled block and allows accessing it as if it were a normal data block. Getting multiple pages one by one from a hole results in repeated creation and destruction of the temporary block (and an associated ARC header). Tested with fsx using various supported block sizes from 512 bytes to 128 KB and additionally 1 MB. Please note that in illumos and ZoL they do not do the range-locking in the page-in path. This is because ZFS has a double-caching problem between ARC and page cache and that requires zfs_read() and zfs_write() to consult pages in the page cache. So, in those functions they first lock a range and then lock pages corresponding to the range. While in the page-in (and maybe page-out) path they first lock the pages and then would lock the range. So, they would have a deadlock. I believe that FreeBSD does not have that problem, because the page-in deals only with invalid pages while zfs_read() and zfs_write() need to access only valid pages. They do not wait on a busy page unless it's already valid. Reviewed by: kib MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D14263	2018-02-16 06:59:35 +00:00
Andriy Gapon	113ce413ba	MFV r329313: 8857 zio_remove_child() panic due to already destroyed parent zio illumos/illumos-gate@d6e1c446d7 `d6e1c446d7` https://www.illumos.org/issues/8857 I had an OS panic on one of our servers: ffffff01809128c0 vpanic() ffffff01809128e0 mutex_panic+0x58(fffffffffb94c904, ffffff597dde7f80) ffffff0180912950 mutex_vector_enter+0x347(ffffff597dde7f80) ffffff01809129b0 zio_remove_child+0x50(ffffff597dde7c58, ffffff32bd901ac0, ffffff3373370908) ffffff0180912a40 zio_done+0x390(ffffff32bd901ac0) ffffff0180912a70 zio_execute+0x78(ffffff32bd901ac0) ffffff0180912b30 taskq_thread+0x2d0(ffffff33bae44140) ffffff0180912b40 thread_start+8() It panicked here: http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/zio.c#430 pio->io_lock is DEAD, thus a panic. Further analysis shows the "pio" (parent zio of "cio") has already been destroyed. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Youzhong Yang <youzhong@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: George Wilson <george.wilson@delphix.com> PR: 223803 Tested by: shiva.bhanujan@quorum.com MFC after: 2 weeks	2018-02-15 14:46:29 +00:00
Alan Somers	d64bae1f1a	Implement .vop_pathconf and .vop_getacl for the .zfs ctldir zfsctl_common_pathconf will report all the same variables that regular ZFS volumes report. zfsctl_common_getacl will report an ACL equivalent to 555, except that you can't read xattrs or edit attributes. Fixes a bug where "ls .zfs" will occasionally print something like: ls: .zfs/.: Operation not supported PR: 225793 Reviewed by: avg MFC after: 3 weeks Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D14365	2018-02-14 15:49:31 +00:00
Alexander Motin	bc55f7298e	Add sysctls for dnode block and indirect block shifts. MFC after: 2 weeks	2018-02-09 23:29:50 +00:00
Andriy Gapon	f11548e1da	remove a duplicate assignment There should be no functional change. MFC after: 1 week	2018-02-08 13:22:40 +00:00
Jeff Roberson	e2068d0bcd	Use per-domain locks for vm page queue free. Move paging control from global to per-domain state. Protect reservations with the free lock from the domain that they belong to. Refactor to make vm domains more of a first class object. Reviewed by: markj, kib, gallatin Tested by: pho Sponsored by: Netflix, Dell/EMC Isilon Differential Revision: https://reviews.freebsd.org/D14000	2018-02-06 22:10:07 +00:00
Andriy Gapon	be4be0e5ca	zfs: move a utility function, ioflags, closer to its consumers No functional change. MFC after: 1 week	2018-02-05 14:19:36 +00:00
Andriy Gapon	6c1e03251e	ZFS ARC: restore illumos uses of 'needfree' that were removed in r325851 This is purely a cosmetic change to have a more complete copy of ifdef-ed out illumos code. MFC after: 1 week	2018-02-02 12:57:33 +00:00
Andriy Gapon	1fc46a9ba0	zfs_rezget: drop cached pages before doing anything else We did that in the case of success to prevent the use of stale cached data, but it makes even less sense to keep the cached data when we fail. Ideally, we should call vgone() on the vnode in the case of zfs_rezget failure, but the current lock order prevents us from doing that. The change also rearranges the order of unlinked check and the size change check. While there, add missing SET_ERROR in one of the error paths. MFC after: 2 weeks	2018-01-31 14:44:51 +00:00
Alexander Motin	1dbc0fc352	MFV r328253: 8835 Speculative prefetch in ZFS not working for misaligned reads illumos/illumos-gate@5cb8d943bc https://www.illumos.org/issues/8835: Sequential reads not aligned to block size are not detected by ZFS prefetcher as sequential, killing prefetch and severely hurting performance. It is caused by dmu_zfetch() in case of misaligned sequential accesses being called with overlap of one block. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Allan Jude <allanjude@freebsd.org> Approved by: Gordon Ross <gwr@nexenta.com> Author: Alexander Motin <mav@FreeBSD.org>	2018-01-22 05:57:14 +00:00
Alexander Motin	4eb2803697	MFV r328251: 8652 Tautological comparisons with ZPROP_INVAL illumos/illumos-gate@4ae5f5f06c https://www.illumos.org/issues/8652: Clang and GCC prefer to use unsigned ints to store enums. With Clang, that causes tautological comparison warnings when comparing a zfs_prop_t or zpool_prop_t variable to the macro ZPROP_INVAL. It's likely that error handling code is being silently removed as a result. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Igor Kozhukhov <igor@dilos.org> Approved by: Gordon Ross <gwr@nexenta.com> Author: Alan Somers <asomers@gmail.com>	2018-01-22 05:52:39 +00:00
Alexander Motin	b5eb78f824	MFV r328247: 8959 Add notifications when a scrub is paused or resumed illumos/illumos-gate@301fd1d6f2 Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Gordon Ross <gwr@nexenta.com> Author: Sean Eric Fagan <sef@ixsystems.com>	2018-01-22 04:31:48 +00:00
Alexander Motin	abe7ff88e1	MFV r328245: 8856 arc_cksum_is_equal() doesn't take into account ABD-logic illumos/illumos-gate@01a059ee0c https://www.illumos.org/issues/8856: arc_cksum_is_equal() calls zio_push_transform() that requires abd_t* (second arg), but a void* is passed. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Gordon Ross <gwr@nexenta.com> Author: Roman Strashkin <roman.strashkin@nexenta.com>	2018-01-22 04:23:48 +00:00
Alexander Motin	226cd471b4	MFV r328229: 8930 zfs_zinactive: do not remove the node if the filesystem is readonly illumos/illumos-gate@93c618e0f4 https://www.illumos.org/issues/8930: We normally remove an unlinked node when its last user goes away and the node becomes inactive. However, we should not do that if the filesystem is mounted read-only including the case where it has its readonly property set. The node will remain on the unlinked queue, so it will not be leaked. One particular scenario is when we receive an incremental stream into a mounted read-only filesystem and that stream contains an unlinked file (still on the unlinked queue). If that file is opened before the receive and some time later after the receive it becomes inactive we would remove it and, thus, modify the read-only filesystem. As a result, the filesystem would diverge from its source and further incremental receives would not be possible (without forcing a rollback). Another related scenario, that may or may not be possible depending on an OS / VFS policy, is when an open file is unlinked, then the filesystem is remounted read-only, and then the file is closed. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Gordon Ross <gwr@nexenta.com> Author: Andriy Gapon <avg@FreeBSD.org>	2018-01-21 23:49:17 +00:00
Alexander Motin	272f833cde	MFV r328227: 8909 8585 can cause a use-after-free kernel panic illumos/illumos-gate@94ddd0900a https://www.illumos.org/issues/8909: There's a race condition that exists if `zil_free_lwb` races with either `zil_commit_waiter_timeout` and/or `zil_lwb_flush_vdevs_done`. Here's an example panic due to this bug: > ::status debugging crash dump vmcore.0 (64-bit) from ip-10-110-205-40 operating system: 5.11 dlpx-5.2.2.0_2017-12-04-17-28-32b6ba51fb (i86pc) image uuid: 4af0edfb-e58e-6ed8-cafc-d3e9167c7513 panic message: BAD TRAP: type=e (#pf Page fault) rp=ffffff0010555970 addr=60 occurred in module "zfs" due to a NULL pointer dereference dump content: kernel pages only > $c zio_shrink+0x12() zil_lwb_write_issue+0x30d(ffffff03dcd15cc0, ffffff03e0730e20) zil_commit_waiter_timeout+0xa2(ffffff03dcd15cc0, ffffff03d97ffcf8) zil_commit_waiter+0xf3(ffffff03dcd15cc0, ffffff03d97ffcf8) zil_commit+0x80(ffffff03dcd15cc0, 9a9) zfs_write+0xc34(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0) fop_write+0x5b(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0) write+0x250(42, fffffd7ff4832000, 2000) sys_syscall+0x177() If there's an outstanding lwb that's in `zil_commit_waiter_timeout` waiting to timeout, waiting on its waiter's CV, we must be sure not to call `zil_free_lwb`. If we end up calling `zil_free_lwb`, then that LWB may be freed and can result in a use-after-free situation where the stale lwb pointer stored in the `zil_commit_waiter_t` structure of the thread waiting on the waiter's CV is used. A similar situation can occur if an lwb is issued to disk, and thus in the `LWB_STATE_ISSUED` state, and `zil_free_lwb` is called while the disk is servicing that lwb. In this situation, the lwb will be freed by `zil_free_lwb`, which will result in a use-after-free situation when the lwb's zio completes, and `zil_lwb_flush_vdevs_done` is called. 
This race condition is prevented in `zil_close` by calling `zil_commit` before `zil_free_lwb` is called, which will ensure all outstanding (i.e. all lwb's in the `LWB_STATE_OPEN` and/or `LWB_STATE_ISSUED` states) reach the `LWB_STATE_DONE` state before the lwb's are freed (`zil_commit` will not return until all the lwb's are `LWB_STATE_DONE`). Further, this race condition is prevented in `zil_sync` by only calling `zil_free_lwb` for lwb's that do not have their `lwb_buf` pointer set. All lwb's not in the `LWB_STATE_DONE` state will have a non-null value for this pointer; the pointer is only cleared in `zil_lwb_flush_vdevs_done`, at which point the lwb's state will be changed to `LWB_STATE_DONE`. This race is present in `zil_suspend`, leading to this bug. At first glance, it would appear as though this would not be true because `zil_suspend` will call `zil_commit`, just like `zil_close`, but the problem is that `zil_suspend` will set the zilog's `zl_suspend` field prior to calling `zil_commit`. Further, in `zil_commit`, if `zl_suspend` is set, `zil_commit` will take a special branch of logic and use `txg_wait_synced` instead of performing the normal `zil_commit` logic. This call to `txg_wait_synced` might be good enough for the data to reach disk safely before it returns, but it does not ensure that all outstanding lwb's reach the `LWB_STATE_DONE` state before it returns. This is because, if there's an lwb "stuck" in `zil_commit_waiter_timeout`, waiting for its lwb to timeout, it will maintain a non-null value for its `lwb_buf` field and thus `zil_sync` will not free that lwb. Thus, even though the lwb's data is already on disk, the lwb will be left lingering, waiting on the CV, and will eventually timeout and be issued to disk even though the write is unnecessary. So, after `zil_commit` is called from `zil_suspend`, we incorrectly assume that there are not outstanding lwb's, and proceed to free all lwb's found on the zilog's lwb list. 
As a result, we free the lwb that will later be used by `zil_commit_waiter_timeout`. Reviewed by: John Kennedy <jwk404@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Igor Kozhukhov <igor@dilos.org> Approved by: Robert Mustacchi <rm@joyent.com> Author: Prakash Surya <prakash.surya@delphix.com>	2018-01-21 23:18:42 +00:00
Alexander Motin	d252c97fc0	MFV r328225: 8603 rename zilog's "zl_writer_lock" to "zl_issuer_lock" illumos/illumos-gate@cf07d3da99 https://www.illumos.org/issues/8603: To help make the ZIL's code more understandable, it was suggested that the zilog_t's "zl_writer_lock" field should be renamed to "zl_issuer_lock". Reviewed by: C Fraire <cfraire@me.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Prakash Surya <prakash.surya@delphix.com>	2018-01-21 23:11:20 +00:00
Alexander Motin	442814b7e7	MFV r328220: 8677 Open-Context Channel Programs illumos/illumos-gate@a3b2868063 https://www.illumos.org/issues/8677 We want to be able to run channel programs outside of synching context. This would greatly improve performance of channel program that just gather information, as we won't have to wait for synching context anymore. This feature should introduce the following: - A new command line flag in "zfs program" to specify our intention to run in open context. - A new flag/option within the channel program ioctl which selects the context. - Appropriate error handling whenever we try a channel program in open-context that contains zfs.sync* expressions. - Documentation for the new feature in the manual pages. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Chris Williamson <chris.williamson@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Serapheim Dimitropoulos <serapheim@delphix.com>	2018-01-21 23:02:05 +00:00
Andriy Gapon	e09304d8f3	zfs: no need to check that size of zfs_cmd_t is not greater than IOCPARM_MAX Nowadays we do not pass zfs_cmd_t directly through the ioctl interface. Instead a small zfs_iocparm_t object is passed and the command is explicitly copied in and out. So, the check has become irrelevant. MFC after: 3 weeks Sponsored by: Panzura	2018-01-21 11:19:18 +00:00
Mark Johnston	94a889089b	Use the thread's ucred struct when fetching jid or jailname. Reported by: mjg X-MFC with: r327888	2018-01-14 17:55:40 +00:00
Mark Johnston	224e0c2f61	Add "jid" and "jailname" variables to DTrace. These return the jail ID and jail name for the traced process, respectively, and are analogous to "zonename" on Solaris/illumos. "zonename" is now aliased to "jailname". Also add some stress tests for the new variables. Submitted by: Domagoj Stolfa <domagoj.stolfa@gmail.com> Reviewed by: dteske (previous version) MFC after: 2 weeks Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D13877	2018-01-12 19:59:46 +00:00
Andriy Gapon	be5060c116	zfs_mount: restore a bit of ifdef-out illumos code And correctly mark the end of the replacement FreeBSD code. MFC after: 1 week	2018-01-09 13:43:04 +00:00
Jeff Roberson	ad5b0f5b51	Fix arc after r326347 broke various memory limit queries. Use UMA features rather than kmem arena size to determine available memory. Initialize the UMA limit to LONG_MAX to avoid spurious wakeups on boot before the real limit is set. PR: 224330 (partial), 224080 Reviewed by: markj, avg Sponsored by: Netflix / Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D13494	2018-01-02 04:35:56 +00:00
Dimitry Andric	e42df78a1a	Remove obsolete register keyword from opensolaris's sysmacros.h. When compiling zfsd with recent clang, it leads to a warning about the register storage class being incompatible with C++17. MFC after: 3 days	2017-12-24 19:17:15 +00:00
John Baldwin	f27d3a8a72	Don't return early for non-failure for one of the EMLINK checks. r326987 enabled two #if 0'd-out EMLINK checks in zfs_link_create() for link overflow. However, one of the checks (when the vnode adding a link is a directory such as for mkdir) always returned even if the link did not overflow. Change this to only return early if it needs to report an EMLINK error. Reported by: db, shurd Sponsored by: Chelsio Communications	2017-12-19 23:54:44 +00:00
John Baldwin	b501cc5da6	Rework pathconf handling for FIFOs. On the one hand, FIFOs should respect other variables not supported by the fifofs vnode operation (such as _PC_NAME_MAX, _PC_LINK_MAX, etc.). These values are fs-specific and must come from a fs-specific method. On the other hand, filesystems that support FIFOs are required to support _PC_PIPE_BUF on directory vnodes that can contain FIFOs. Given this latter requirement, once the fs-specific VOP_PATHCONF method supports _PC_PIPE_BUF for directories, it is also suitable for FIFOs permitting a single VOP_PATHCONF method to be used for both FIFOs and non-FIFOs. To that end, retire all of the FIFO-specific pathconf methods from filesystems and change FIFO-specific vnode operation switches to use the existing fs-specific VOP_PATHCONF method. For fifofs, set its VOP_PATHCONF to VOP_PANIC since it should no longer be used. While here, move _PC_PIPE_BUF handling out of vop_stdpathconf() so that only filesystems supporting FIFOs will report a value. In addition, only report a valid _PC_PIPE_BUF for directories and FIFOs. Discussed with: bde Reviewed by: kib (part of a larger patch) MFC after: 1 month Sponsored by: Chelsio Communications Differential Revision: https://reviews.freebsd.org/D12572	2017-12-19 22:39:05 +00:00
John Baldwin	599afe53a8	Move NAME_MAX, LINK_MAX, and CHOWN_RESTRICTED out of vop_stdpathconf(). Having all filesystems fall through to default values isn't always correct and these values can vary for different filesystem implementations. Most of these changes just use the existing default values with a few exceptions: - Don't report CHOWN_RESTRICTED for ZFS since it doesn't do the exact permissions check this claims for chown(). - Use NANDFS_NAME_LEN for NAME_MAX for nandfs. - Don't report a LINK_MAX of 0 on smbfs. Now fail with EINVAL to indicate hard links aren't supported. Requested by: bde (though perhaps not this exact implementation) Reviewed by: kib (earlier version) MFC after: 1 month Sponsored by: Chelsio Communications	2017-12-19 19:51:36 +00:00
John Baldwin	697a86b6bf	Adjust ZFS' link count handling for ino64. - Define a ZFS_LINK_MAX as the ZFS version of LINK_MAX which is set to UINT64_MAX to match the on-disk format. - Enable the currently #if 0'd code to check for link overflows and return EMLINK. - Don't clamp the link count reported in stat() to LINK_MAX as that is still the 16-bit limit, but report the full link counts. Also, avoid possibly overflowing the reported link count to 0 when adjusting the link count to account for ".snapshot". - Update the LINK_MAX reported by pathconf() to report ZFS_LINK_MAX rather than LINK_MAX (but clamped to LONG_MAX for 32-bit systems). Reviewed by: avg (earlier version) Sponsored by: Chelsio Communications	2017-12-19 19:07:24 +00:00
Mark Johnston	7a177c2d5e	Unregister the ARC lowmem event handler earlier in arc_fini(). Otherwise a poorly timed lowmem event may attempt to acquire a destroyed lock. Unregister the handler before destroying the ARC reclaim thread. Reported by: gjb MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D13480	2017-12-17 18:21:40 +00:00
Mark Johnston	a981eff82e	MFV r326785: 8880 improve DTrace error checking illumos/illumos-gate@2cf374268f https://www.illumos.org/issues/8880 Reviewed by: Tim Kordas <tim.kordas@joyent.com> Reviewed by: Bryan Cantrill <bryan@joyent.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Approved by: Dan McDonald <danmcd@joyent.com> Author: Jerry Jelinek <jerry.jelinek@joyent.com> MFC after: 1 week	2017-12-12 22:08:34 +00:00
Mark Johnston	f74deabac8	Correct initialization of pc on powerpc. PR: 224293 Submitted by: Breno Leitao <breno.leitao@gmail.com> X-MFC with: r326774 Pointy hat: markj	2017-12-12 20:41:11 +00:00
Mark Johnston	5bab623438	Pass the trap frame to fasttrap hooks. The DTrace fasttrap entry points expect a struct reg containing the register values of the calling thread. Perform the conversion in fasttrap rather than in the trap handler: this reduces the number of ifdefs and avoids wasting stack space for traps that don't involve DTrace. MFC after: 2 weeks	2017-12-11 19:21:39 +00:00
Alan Somers	3613ea59c7	Fix assertion when ZFS fails to open certain devices "panic: vdev_geom_close_locked: cp->private is NULL" This panic will result if ZFS fails to open a device due to either of the following reasons: 1) The device's sector size is greater than 8KB. 2) ZFS wants to open the device RW, but it can't be opened for writing. The solution is to change the initialization order to ensure that the assertion will be satisfied. PR: 221066 Reported by: David NewHamlet <wheelcomplex@gmail.com> Reviewed by: avg MFC after: 3 weeks Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D13278	2017-11-30 15:36:06 +00:00
Alan Somers	fb20566033	Revert r326399 Accidentally committed wrong file Pointy hat to: asomers Sponsored by: Spectra Logic Corp	2017-11-30 15:34:55 +00:00
Alan Somers	de65823b48	Fix assertion when ZFS fails to open certain devices "panic: vdev_geom_close_locked: cp->private is NULL" This panic will result if ZFS fails to open a device due to either of the following reasons: 1) The device's sector size is greater than 8KB. 2) ZFS wants to open the device RW, but it can't be opened for writing. The solution is to change the initialization order to ensure that the assertion will be satisfied. PR: 221066 Reported by: David NewHamlet <wheelcomplex@gmail.com> Reviewed by: avg MFC after: 3 weeks Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D13278	2017-11-30 15:28:29 +00:00
Mark Johnston	e9f63df76d	Duplicate helpers after disabling inherited tracepoints during a fork. We may create probes in the nascent child process, so we first need to ensure that any inherited tracepoints are first removed. Otherwise the probe sites will not be in the state expected by fasttrap, and it won't be able to enable the probes. MFC after: 2 weeks	2017-11-23 14:29:07 +00:00
Andriy Gapon	7bcc2cfc86	zfs_write: fix problem with writes appearing to succeed when over quota The problem happens when the writes have offsets and sizes aligned with a filesystem's recordsize (maximum block size). In this scenario dmu_tx_assign() would fail because of being over the quota, but the uio would already be modified in the code path where we copy data from the uio into a borrowed ARC buffer. That makes an appearance of a partial write, so zfs_write() would return success and the uio would be modified consistently with writing a single block. That bug can result in a data loss because the writes over the quota would appear to succeed while the actual data is being discarded. This commit fixes the bug by ensuring that the uio is not changed until after all error checks are done. To achieve that the code now uses uiocopy() + uioskip() as in the original illumos design. We can do that now that uiocopy() has been updated in r326067 to use vn_io_fault_uiomove(). Reported by: mav Analyzed by: mav Reviewed by: mav Pointyhat to: avg (myself) MFC after: 1 week X-MFC after: r326067 X-Erratum: wanted	2017-11-21 18:28:14 +00:00
Mark Johnston	e9a2e17d1b	Avoid holding the process in uread() and uwrite(). In general, higher-level code will atomically verify that the process is not exiting and hold the process. In one case, we were using uwrite() to copy a probed instruction to a per-thread scratch space block, but copyout() can be used for this purpose instead; this change effectively reverts r227291. MFC after: 1 week	2017-11-16 07:25:12 +00:00
Baptiste Daroussin	c07d14deb5	remove the poor emulation of the IllumOS needfree global variable to prevent the ARC reclaim thread running longer than needed. Update the arc::needfree dtrace probe triggered in arc_lowmem() to also report the value we may want to free. Submitted by: Nikita Kozlov <nikita.kozlov at blade-group.com> Reviewed by: avg Approved by: avg MFC after: 3 weeks Sponsored by: blade Differential Revision: https://reviews.freebsd.org/D12163	2017-11-15 12:48:36 +00:00
Andriy Gapon	da1bfa506f	MFV r325609: 7531 Assign correct flags to prefetched buffers illumos/illumos-gate@2729521654 https://www.illumos.org/issues/7531 I found that some buffers that could be L2ARC eligible are not flagged as such, leading to some performance impact. As a test I ran the same IO workload 10 times in a row. It is a metadata only workload (files listing). l2arc_noprefetch=0. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: benrubson <ben.rubson@gmail.com> MFC after: 8 days	2017-11-09 18:22:42 +00:00
Andriy Gapon	885a5425f3	MFV r325607: 8607 zfs: variable set but not used illumos/illumos-gate@b852c2f543 https://www.illumos.org/issues/8607 Reviewed by: Yuri Pankov <yuripv@gmx.com> Reviewed by: Igor Kozhukhov <igor@dilos.org> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Toomas Soome <tsoome@me.com> MFC after: 1 week	2017-11-09 18:14:42 +00:00
Andriy Gapon	2e06836fee	MFV r325605: 8713 Buffer overflow in dsl_dataset_name() illumos/illumos-gate@f37ae9a714 https://www.illumos.org/issues/8713 If we're creating a pool with version >= SPA_VERSION_DSL_SCRUB (v11) we need to account for additional space needed by the origin dataset which will also be snapshotted: "poolname"+"/"+"$ORIGIN"+"@"+"$ORIGIN". Enforce this limit in pool_namecheck(). Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: loli10K <ezomori.nozomu@gmail.com> MFC after: 1 week	2017-11-09 18:12:21 +00:00
Andriy Gapon	96ed2690df	Disable posix_fallocate(2) for ZFS The generic (naive) implementation of posix_fallocate cannot provide the standard mandated guarantee that overwrites would never fail due to the lack of free space. The fundamental reason is the copy-on-write architecture of ZFS. Other features like compression and deduplication can also increase the size difference between the (pre-)allocated dummy content and the future content. So, until ZFS can properly implement the feature it's better to report that it is unsupported rather than providing an ersatz implementation. Please note that EINVAL is used to report that the underlying file system does not support the operation (POSIX.1-2008). illumos and ZoL seem to do the same. MFC after: 3 weeks Sponsored by: Panzura	2017-11-02 13:49:08 +00:00
Andriy Gapon	40d47bb4eb	vdev_geom_close: close errored consumer even if vdev_reopening is set If vdev_geom_close doesn't close the consumer, then the subsequent call to vdev_geom_open() would be just a NOP and would always return success. Thus, at present vdev_reopen() would always succeed for vdev_geom devices even if the underlying provider is in error state. The problem was introduced as a result of an optimization in rS308055. The most significant manifestation of the problem is that zio_vdev_io_done() --> vdev_probe() --> SPA_ASYNC_PROBE --> spa_async_probe() --> vdev_reopen() chain of calls and events becomes a NOP as well. This chain is invoked when zio_vdev_io_done() detects an "unexpected" error from the lower level I/O. Additionally, that call path may race with SPA_ASYNC_REMOVE path because of the asynchronous nature of them both. So, the SPA_ASYNC_PROBE may erroneously mark a vdev as being healthy after SPA_ASYNC_REMOVE marked it as removed. Reviewed by: asomers, mav MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D12731	2017-10-31 10:15:03 +00:00
Alan Somers	659058b06f	Fix the error message when creating a zpool on a too-small device Don't check for SPA_MINDEVSIZE in vdev_geom_attach when opening by path. It's redundant with the check in vdev_open, and failing to attach here results in the wrong error message being printed. However, still check for it in some other situations: * When opening by guids, so we don't get bogged down reading from slow devices like floppy drives. * In vdev_geom_read_pool_label for the same reason, because we iterate over all providers. * If the caller requests that we verify the guid, because then we'll have to read from the device before vdev_open verifies the size. PR: 222227 Reported by: Marie Helene Kvello-Aune <marieheleneka@gmail.com> Reviewed by: avg, mav MFC after: 3 weeks Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D12531	2017-10-23 23:05:29 +00:00
Andriy Gapon	13bacc7144	remove spa_sync_on assert from spa_async_thread_vd Unlike spa_async_thread that can get started only from spa_sync() spa_async_thread_vd can get started from other contexts. Additionally, spa_async_thread_vd does not really depend on spa sync being enabled. The incorrect assert could be triggered by importing a pool in the read-only mode and then disconnecting one of its disks. In this case spa_sync_on was false because the pool was read-only and spa_async_thread_vd was started to handle SPA_ASYNC_REMOVE event. Note: spa_async_thread_vd() currently exists only in FreeBSD, it was split out of spa_async_thread() in r253990. Discussed with: mav MFC after: 2 weeks	2017-10-19 16:36:07 +00:00
Andriy Gapon	e117882ba2	MFV r322235: 8067 zdb should be able to dump literal embedded block pointer illumos/illumos-gate@4923c69fdd FreeBSD note: the manual page is to be updated separately. https://www.illumos.org/issues/8067 Add an option to zdb to print a literal embedded block pointer supplied on the command line: zdb -E [-A] word0:word1:...:word15 Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Yuri Pankov <yuri.pankov@gmail.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com> MFC after: 3 weeks	2017-10-06 08:21:06 +00:00
Andriy Gapon	fed20fc736	really unbreak kernel builds on sparc64 and powerpc64 after r324163, ZFS Channel Programs This commit also reverts r324178 that did not fix the problem on powerpc64 where char is unsigned. MFC after: 4 weeks X-MFC with: r324163	2017-10-05 06:39:57 +00:00
Andriy Gapon	620c2c801b	MFV r323913: 8600 ZFS channel programs - snapshot illumos/illumos-gate@2840dce1a0 https://www.illumos.org/issues/8600 ZFS channel programs should be able to create snapshots. In addition to the base snapshot functionality, this will likely entail adding extra logic to handle edge cases which were formerly not possible, such as creating then destroying a snapshot in the same transaction sync. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Chris Williamson <chris.williamson@delphix.com> MFC after: 5 weeks X-MFC after: r324163	2017-10-02 11:32:08 +00:00
Andriy Gapon	a1f65a15ce	MFV r323912: 8592 ZFS channel programs - rollback illumos/illumos-gate@000cce6b6f https://www.illumos.org/issues/8592 ZFS channel programs should be able to perform a rollback. This logic will probably look pretty similar to zfs.sync.destroy(). Reviewed by: Chris Williamson <chris.williamson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Brad Lewis <brad.lewis@delphix.com> MFC after: 5 weeks X-MFC after: r324163	2017-10-02 11:23:31 +00:00
Andriy Gapon	8cdc315e9e	MFV r323795: 8604 Avoid unnecessary work search in VFS when unmounting snapshots illumos/illumos-gate@ed992b0aac https://www.illumos.org/issues/8604 Every time we want to unmount a snapshot (happens during snapshot deletion or renaming) we unnecessarily iterate through all the mountpoints in the VFS layer (see zfs_get_vfs). Ideally we would just put a hold on the snapshot and access its respective VFS resource directly. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Andy Stormont <astormont@racktopsystems.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Serapheim Dimitropoulos <serapheim@delphix.com> FreeBSD note: I added a FreeBSD specific function getzfsvfs_ref() which is like getzfsvfs() but returns a filesystem referenced, not busied. We want a busied filesystem in most cases, because we access its private data and, thus, we need to prevent the filesystem from being unmounted and its private data destroyed. But in some cases we can either get away with just a referenced filesystem or we must not busy the filesystem. Unmounting the filesystem is one of such cases. MFC after: 5 weeks X-MFC after: r324163	2017-10-02 11:15:32 +00:00
Andriy Gapon	55b73eb9c5	fix incorrect use of getzfsvfs_impl in r324163, ZFS Channel Programs getzfsvfs_impl() returns a referenced, not busied, filesystem, so the matching call is vfs_rel, not vfs_unbusy. MFC after: 4 weeks X-MFC with: r324163	2017-10-02 11:07:48 +00:00
Andriy Gapon	6a2f82bdf8	unbreak kernel builds on sparc64 and powerpc after r324163, ZFS Channel Programs The custom iscntrl() in ZFS Lua code expects a signed argument, so remove the harmful cast. Reported by: ian MFC after: 5 weeks X-MFC with: r324163	2017-10-01 20:12:30 +00:00
Andriy Gapon	3e52a05570	MFV r323794: 8605 zfs channel programs: zfs.exists undocumented and non-working illumos/illumos-gate@5f39f884e2 https://www.illumos.org/issues/8605 zfs.exists() in channel programs doesn't return any result, and should have a man page entry. Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Chris Williamson <chris.williamson@delphix.com> MFC after: 5 weeks X-MFC after: r324163	2017-10-01 16:51:05 +00:00
Andriy Gapon	529b326e63	MFV r323531: 8521 nvlist memory leak in get_clones_stat() and spa_load_best() illumos/illumos-gate@7d3000f774 https://www.illumos.org/issues/8521 Yuri reported this to the mailing list: doing a `reboot -d` on current illumos-gate HEAD gives the following ":: findleaks -dv" output: findleaks: maximum buffers => 301061 findleaks: actual buffers => 297587 findleaks: findleaks: potential pointers => 29289774 findleaks: dismissals => 26242305 (89.5%) findleaks: misses => 331153 ( 1.1%) findleaks: dups => 2419681 ( 8.2%) findleaks: follows => 296635 ( 1.0%) findleaks: findleaks: peak memory usage => 7353 kB findleaks: elapsed CPU time => 1.5 seconds findleaks: elapsed wall time => 2.0 seconds findleaks: CACHE LEAKED BUFCTL CALLER ffffff03d222b008 120 ffffff03ef7ceb78 nv_alloc_sys+0x1f ffffff03d222a448 123 ffffff03f4150cc8 nv_alloc_sys+0x1f ffffff03d222b448 5 ffffff03f28bd598 nv_alloc_sys+0x1f ffffff03d222b888 87 ffffff03f28c10f0 nv_alloc_sys+0x1f ffffff03d222c008 21 ffffff03f4139310 nv_alloc_sys+0x1f ffffff03d222b888 43 ffffff040ef3f3e8 nv_alloc_sys+0x1f ffffff03d222c008 120 ffffff03f4591e58 nv_alloc_sys+0x1f ffffff03d222b008 121 ffffff03f352c068 nv_alloc_sys+0x1f ffffff03d222a448 112 ffffff03f414e5f8 nv_alloc_sys+0x1f ffffff03d222b008 119 ffffff03ee92fdc0 nv_alloc_sys+0x1f ffffff03d222b888 46 ffffff03f28c1378 nv_alloc_sys+0x1f ffffff03d222b448 4 ffffff03f28c7708 nv_alloc_sys+0x1f ffffff03d222c008 20 ffffff03f2a6e7e8 nv_alloc_sys+0x1f Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Yuri Pankov <yuripv@gmx.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Pavel Zakharov <pavel.zakharov@delphix.com> MFC after: 5 weeks X-MFC after: r324163	2017-10-01 16:41:05 +00:00
Andriy Gapon	550374efe6	revert r324166, it has an unrelated change in it	2017-10-01 16:37:54 +00:00
Andriy Gapon	7d5c6491f0	MFV r323531: 8521 nvlist memory leak in get_clones_stat() and spa_load_best() illumos/illumos-gate@7d3000f774 https://www.illumos.org/issues/8521 Yuri reported this to the mailing list: doing a `reboot -d` on current illumos-gate HEAD gives the following ":: findleaks -dv" output: findleaks: maximum buffers => 301061 findleaks: actual buffers => 297587 findleaks: findleaks: potential pointers => 29289774 findleaks: dismissals => 26242305 (89.5%) findleaks: misses => 331153 ( 1.1%) findleaks: dups => 2419681 ( 8.2%) findleaks: follows => 296635 ( 1.0%) findleaks: findleaks: peak memory usage => 7353 kB findleaks: elapsed CPU time => 1.5 seconds findleaks: elapsed wall time => 2.0 seconds findleaks: CACHE LEAKED BUFCTL CALLER ffffff03d222b008 120 ffffff03ef7ceb78 nv_alloc_sys+0x1f ffffff03d222a448 123 ffffff03f4150cc8 nv_alloc_sys+0x1f ffffff03d222b448 5 ffffff03f28bd598 nv_alloc_sys+0x1f ffffff03d222b888 87 ffffff03f28c10f0 nv_alloc_sys+0x1f ffffff03d222c008 21 ffffff03f4139310 nv_alloc_sys+0x1f ffffff03d222b888 43 ffffff040ef3f3e8 nv_alloc_sys+0x1f ffffff03d222c008 120 ffffff03f4591e58 nv_alloc_sys+0x1f ffffff03d222b008 121 ffffff03f352c068 nv_alloc_sys+0x1f ffffff03d222a448 112 ffffff03f414e5f8 nv_alloc_sys+0x1f ffffff03d222b008 119 ffffff03ee92fdc0 nv_alloc_sys+0x1f ffffff03d222b888 46 ffffff03f28c1378 nv_alloc_sys+0x1f ffffff03d222b448 4 ffffff03f28c7708 nv_alloc_sys+0x1f ffffff03d222c008 20 ffffff03f2a6e7e8 nv_alloc_sys+0x1f Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Yuri Pankov <yuripv@gmx.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Pavel Zakharov <pavel.zakharov@delphix.com> MFC after: 5 weeks X-MFC after: r324163	2017-10-01 16:34:16 +00:00
Andriy Gapon	bda88d07d9	MFV r323530,r323533,r323534: 7431 ZFS Channel Programs, and followups 7431 ZFS Channel Programs illumos/illumos-gate@dfc115332c `dfc115332c` https://www.illumos.org/issues/7431 ZFS channel programs (ZCP) adds support for performing compound ZFS administrative actions via Lua scripts in a sandboxed environment (with time and memory limits). This initial commit includes both base support for running ZCP scripts, and a small initial library of API calls which support getting properties and listing, destroying, and promoting datasets. Testing: in addition to the included unit tests, channel programs have been in use at Delphix for several months for batch destroying filesystems. The dsl_destroy_snaps_nvl() call has also been replaced with Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> Author: Chris Williamson <chris.williamson@delphix.com> 8552 ZFS LUA code uses floating point math illumos/illumos-gate@916c8d8811 `916c8d8811` https://www.illumos.org/issues/8552 In the LUA interpreter used by "zfs program", the lua format() function accidentally includes support for '%f' and friends, which can cause compilation problems when building on platforms that don't support floating-point math in the kernel (e.g. sparc). Support for '%f' and friends (%f %e %E %g %G) should be removed, since there's no way to supply a floating-point value anyway (all numbers in ZFS LUA are int64_t's).
Reviewed by: Yuri Pankov <yuripv@gmx.com> Reviewed by: Igor Kozhukhov <igor@dilos.org> Approved by: Dan McDonald <danmcd@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com> 8590 memory leak in dsl_destroy_snapshots_nvl() illumos/illumos-gate@e6ab4525d1 `e6ab4525d1` https://www.illumos.org/issues/8590 In dsl_destroy_snapshots_nvl(), "snaps_normalized" is not freed after it is added to "arg". Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com> FreeBSD notes: - zfs-program.8 manual page is taken almost as is from the vendor repository, no FreeBSD-ification done - fixed multiple instances of NULL being used where an integer is expected - replaced ETIME and ECHRNG with ETIMEDOUT and EDOM respectively This commit adds a modified version of Lua 5.2.4 under sys/cddl/contrib/opensolaris/uts/common/fs/zfs/lua, mirroring the upstream. See README.zfs in that directory for the description of Lua customizations. See zfs-program.8 on how to use the new feature. MFC after: 5 weeks Relnotes: yes Differential Revision: https://reviews.freebsd.org/D12528	2017-10-01 16:11:07 +00:00
Mark Johnston	47f11baaca	Use C99 initializers for DTrace provider methods. This makes the definitions easier to read and more cscope-friendly. MFC after: 1 week	2017-09-27 17:46:38 +00:00
Andriy Gapon	443efc868c	fix r324011, MFV of r323535, 8585 improve batching done in zil_commit() I managed to commit an older version of the change. Plus, even the latest version was not ready for userland compilation. Reported by: "O. Hartmann" <ohartmann@walstatt.org>, cy MFC after: 1 week X-MFC with: r324011	2017-09-26 15:38:16 +00:00
Andriy Gapon	c13f1d82c8	MFV r323535: 8585 improve batching done in zil_commit() FreeBSD notes: - this MFV reverts FreeBSD commit r314549 to make the merge easier - at present our emulation of cv_timedwait_hires is rather poor, so I elected to use cv_timedwait_sbt directly Please see the differential revision for details. Unfortunately, I did not get any positive reviews, so there could be bugs in the FreeBSD-specific piece of the merge. Hence, the long MFC timeout. illumos/illumos-gate@1271e4b10d `1271e4b10d` https://www.illumos.org/issues/8585 The current implementation of zil_commit() can introduce significant latency, beyond what is inherent due to the latency of the underlying storage. The additional latency comes from two main problems: 1. When there's outstanding ZIL blocks being written (i.e. there's already a "writer thread" in progress), then any new calls to zil_commit() will block waiting for the currently outstanding ZIL blocks to complete. The blocks written for each "writer thread" is coined a "batch", and there can only ever be a single "batch" being written at a time. When a batch is being written, any new ZIL transactions will have to wait for the next batch to be written, which won't occur until the current batch finishes. As a result, the underlying storage may not be used as efficiently as possible. While "new" threads enter zil_commit() and are blocked waiting for the next batch, it's possible that the underlying storage isn't fully utilized by the current batch of ZIL blocks. In that case, it'd be better to allow these new threads to generate (and issue) a new ZIL block, such that it could be serviced by the underlying storage concurrently with the other ZIL blocks that are being serviced. 2. Any call to zil_commit() must wait for all ZIL blocks in its "batch" to complete, prior to zil_commit() returning.
The size of any given batch is proportional to the number of ZIL transactions in the queue at the time that the batch starts processing the queue; which doesn't occur until the previous batch completes. Thus, if there's a lot of transactions in the queue, the batch could be composed of many ZIL blocks, and each call to zil_commit() will have to wait for all of these writes to complete (even if the thread calling zil_commit() only cared about one of the transactions in the batch). Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Prakash Surya <prakash.surya@delphix.com> MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D12355	2017-09-26 11:04:08 +00:00
Ian Lepore	a78b4d1462	Use nstosbt() instead of multiplying by SBT_1NS to avoid roundoff errors. Differential Revision: https://reviews.freebsd.org/D11779	2017-09-25 15:03:27 +00:00
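The roundoff that the nstosbt() change above avoids can be illustrated with a small userland sketch. It assumes a 32.32 fixed-point time format in which one second is `1 << 32` units, so a per-nanosecond constant obtained by integer division truncates from about 4.29 to 4. The `DEMO_*` macros and both helper functions are hypothetical stand-ins written for this example; they are not the kernel's definitions and may round differently from the real nstosbt().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins; assume a 32.32 fixed-point time format. */
#define DEMO_SBT_1S	((int64_t)1 << 32)
#define DEMO_NS_PER_S	((int64_t)1000000000)
#define DEMO_SBT_1NS	(DEMO_SBT_1S / DEMO_NS_PER_S)	/* integer division yields 4 */

/* Naive conversion: multiply by the truncated per-nanosecond constant. */
static int64_t
demo_ns_to_sbt_naive(int64_t ns)
{
	return (ns * DEMO_SBT_1NS);
}

/*
 * More careful conversion for non-negative inputs below 2^31 seconds:
 * convert whole seconds exactly, then scale the sub-second remainder
 * before dividing so that less precision is lost.
 */
static int64_t
demo_ns_to_sbt_split(int64_t ns)
{
	int64_t sec = ns / DEMO_NS_PER_S;
	int64_t rem = ns % DEMO_NS_PER_S;

	assert(ns >= 0);
	return (sec * DEMO_SBT_1S + (rem << 32) / DEMO_NS_PER_S);
}
```

For an input of 1000000000 ns, the naive form returns 4000000000 while the split form returns 4294967296 (`1 << 32`), a shortfall of roughly 7%.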
Andriy Gapon	f94aa61c33	MFV r323917: 8648 Fix range locking in ZIL commit codepath illumos/illumos-gate@42b1411172 `42b1411172` https://www.illumos.org/issues/8648 I'm opening this bug to track integration of the following ZFS on Linux commit into illumos: commit `f763c3d1df` Author: LOLi <loli10K@users.noreply.github.com> Date: Mon Aug 21 17:59:48 2017 +0200 Fix range locking in ZIL commit codepath Since OpenZFS 7578 (`1b7c1e5`) if we have a ZVOL with logbias=throughput we will force WR_INDIRECT itxs in zvol_log_write() setting itx->itx_lr offset and length to the offset and length of the BIO from zvol_write()->zvol_log_write(): these offset and length are later used to take a range lock in zillog->zl_get_data function: zvol_get_data(). Now suppose we have a ZVOL with blocksize=8K and push 4K writes to offset 0: we will only be range-locking 0-4096. This means the ASSERTion we make in dbuf_unoverride() is no longer valid because now dmu_sync() is called from zilog's get_data functions holding a partial lock on the dbuf. Fix this by taking a range lock on the whole block in zvol_get_data(). Reviewed-by: Chunwei Chen <tuxoko@gmail.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: loli10K <ezomori.nozomu@gmail.com> Reviewed by: Igor Kozhukhov <igor@dilos.org> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Alexander Motin <mav@FreeBSD.org> Approved by: Robert Mustacchi <rm@joyent.com> Author: LOLi <loli10K@users.noreply.github.com> MFC after: 10 days	2017-09-22 08:27:27 +00:00
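The difference between locking only the written bytes and locking the whole block, as described in the message above, can be sketched as follows. This is an illustration under stated assumptions, not the zvol_get_data() code: `struct demo_range` and `demo_whole_block_range()` are hypothetical names, and the block size is assumed to be a power of two.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical range descriptor; not a ZFS type. */
struct demo_range {
	uint64_t off;
	uint64_t len;
};

/*
 * Return the range covering the whole block that contains `offset`.
 * `blocksize` must be a non-zero power of two.
 */
static struct demo_range
demo_whole_block_range(uint64_t offset, uint64_t blocksize)
{
	struct demo_range r;

	assert(blocksize != 0 && (blocksize & (blocksize - 1)) == 0);
	r.off = offset & ~(blocksize - 1);	/* round down to block start */
	r.len = blocksize;			/* cover the entire block */
	return (r);
}
```

With an 8K block size and a 4K write at offset 0, this yields the range 0-8192 instead of the partial 0-4096 range mentioned in the message.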
Andriy Gapon	1c6ea90df5	MFV r323914: 8661 remove "zil-cw2" dtrace probe illumos/illumos-gate@bd9d3f9046 `bd9d3f9046` https://www.illumos.org/issues/8661 The "zil-cw1" dtrace probe was previously removed in 8558, and the "zil-cw2" probe should have been removed in that patch as well. Unfortunately, the "zil-cw2" was not removed in 8558, so this bug is to track its removal. Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Igor Kozhukhov <igor@dilos.org> Approved by: Robert Mustacchi <rm@joyent.com> Author: Prakash Surya <prakash.surya@delphix.com> MFC after: 1 week	2017-09-22 08:21:14 +00:00
Alan Somers	cd037f075c	MFV r323789: 8473 scrub does not detect errors on active spares illumos/illumos-gate@554675eee7 `554675eee7` https://www.illumos.org/issues/8473 Scrubbing is supposed to detect and repair all errors in the pool. However, it wrongly ignores active spare devices. The problem can easily be reproduced in OpenZFS at git rev 0ef125d with these commands: truncate -s 64m /tmp/a /tmp/b /tmp/c sudo zpool create testpool mirror /tmp/a /tmp/b spare /tmp/c sudo zpool replace testpool /tmp/a /tmp/c /bin/dd if=/dev/zero bs=1024k count=63 oseek=1 conv=notrunc of=/tmp/c sync sudo zpool scrub testpool zpool status testpool # Will show 0 errors, which is wrong sudo zpool offline testpool /tmp/a sudo zpool scrub testpool zpool status testpool # Will show errors on /tmp/c, # which should've already been fixed FreeBSD head is partially affected: the first scrub will detect some errors, but the second scrub will detect more. Reviewed by: Andy Stormont <astormont@racktopsystems.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> MFC after: 1 week Sponsored by: Spectra Logic Corp	2017-09-20 16:31:00 +00:00
Andriy Gapon	aacd0b4bb2	add vfs_zfs.abd_chunk_size tunable It is reported that the default value of 4KB results in a substantial memory use overhead (at least, on some configurations). Using 1KB seems to reduce the overhead significantly. PR: 222377 Reported by: Sean Chittenden <sean@chittenden.org> MFC after: 1 week	2017-09-20 08:36:31 +00:00
Andriy Gapon	3d5487d981	fix memory leak in g_bio zone introduced in r320452, another ABD fallout I overlooked the fact that ZIO_IOCTL_PIPELINE does not include ZIO_STAGE_VDEV_IO_DONE stage. We do allocate a struct bio for an ioctl zio (a disk cache flush), but we never freed it. This change splits bio handling into two groups, one for normal read/write i/o that passes data around and, thus, needs the abd data transform; the other group is for "data-less" i/o such as trim and cache flush. PR: 222288 Reported by: Dan Nelson <dnelson@allantgroup.com> Tested by: Borja Marcos <borjam@sarenet.es> MFC after: 10 days	2017-09-20 08:27:21 +00:00
Andriy Gapon	8c9377cde7	MFV r323792: 8602 remove unused "dp_early_sync_tasks" field from "dsl_pool" structure illumos/illumos-gate@2bcb545854 `2bcb545854` https://www.illumos.org/issues/8602 When I landed the fix for 8558, I incorrectly added the "dp_early_sync_tasks" field to the "dsl_pool" structure. This field is used in DelphixOS, but not in illumos. It was incorrectly pulled into illumos, so this bug is to remove it from the structure. Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Prakash Surya <prakash.surya@delphix.com> MFC after: 1 week	2017-09-20 07:26:52 +00:00
Andriy Gapon	86261a95ed	slightly simplify zfs_vptocnp It's not necessary to look up the parent's ID to check if the node is the root node of the filesystem. MFC after: 2 weeks	2017-09-13 07:09:58 +00:00
Andriy Gapon	1a2ddb2997	fix a fallout from the ZTOV tightening, r323479 MFC after: 13 days X-MFC with: r323479	2017-09-12 13:21:14 +00:00
Andriy Gapon	bcab65cab5	zfsctl_snapdir_lookup should be able to handle an uncovered vnode The uncovered vnode is possible because there is no guarantee that its hold count would go to zero (and it would be inactivated and reclaimed) immediately after a covering filesystem is unmounted. So, such a vnode should be expected and it is possible to re-use it without any trouble. MFC after: 3 weeks Sponsored by: Panzura	2017-09-12 06:06:58 +00:00
Andriy Gapon	c09d0da8d1	zfs_ctldir: remove obsolete / bogus ARGSUSED lint directives None of the tagged functions had unused parameters. MFC after: 1 week	2017-09-12 06:05:30 +00:00
Andriy Gapon	65b38f7311	zfsvfs_hold: assert that the busied filesystem can not be unmounted This is a FreeBSD specific feature. MFC after: 3 weeks Sponsored by: Panzura	2017-09-12 06:04:50 +00:00
Andriy Gapon	d092f79489	zfs_get_vfs: reference a requested filesystem instead of vfs_busy-ing it The only consumer of zfs_get_vfs, zfs_unmount_snap, does not need the filesystem to be busy, it just needs a reference that it can pass to dounmount. Also, previously the code was racy as it unbusied the filesystem before taking a reference on it. Now the code should be simpler and safer. MFC after: 2 weeks Sponsored by: Panzura	2017-09-12 06:04:01 +00:00
Andriy Gapon	f7519dbb76	zfs: tighten debug versions of ZTOV and VTOZ MFC after: 2 weeks Sponsored by: Panzura	2017-09-12 06:02:21 +00:00
Andriy Gapon	970165f190	MFV r323111: 8569 problem with inline functions in abd.h illumos/illumos-gate@37e84ab74e `37e84ab74e` https://www.illumos.org/issues/8569 C [C99] has peculiar rules for inline functions that are different from the C++ rules. Unlike C++ where inline is "fire and forget", in C a programmer must pay attention to the function's storage class / visibility. The main problem is with the case where a compiler decides to not inline a call to the function declared as inline. Some relevant links: - http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka15831.html - http://www.drdobbs.com/the-new-c-inline-functions/184401540 The summary is that either the inline functions should be declared 'static inline' or one of the compilation units (.c files) must provide a callable externally visible function definition. In the former case, the compiler would automatically create a local non-inlined function instance in every compilation unit where it's needed. In the latter case the single external definition is used to satisfy any non-inlined calls in all compilation units. As things stand right now, we can get an undefined reference error under certain combinations of compilers and compiler options. For example, this is what I get on FreeBSD when compiling with clang 4.0.0 and -O1: In function `abd_free': /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/abd.c:385: undefined reference to `abd_is_linear' Reviewed by: Matt Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Andriy Gapon <avg@FreeBSD.org> MFC after: 1 week	2017-09-11 12:15:49 +00:00
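The first of the two options summarized in the message above (declaring the function `static inline`) can be sketched as below. The type `demo_abd_t`, the flag `DEMO_FLAG_LINEAR`, and the function `demo_abd_is_linear()` are hypothetical names chosen for illustration; they are not the real abd.h definitions, and the sketch does not claim to show which option the upstream fix took.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag and type for illustration; not the real abd_t. */
#define DEMO_FLAG_LINEAR	0x1u

typedef struct demo_abd {
	unsigned int flags;
} demo_abd_t;

/*
 * Option 1: `static inline`. Every compilation unit that includes this
 * definition gets its own local copy if the compiler declines to inline
 * a call, so no externally visible symbol is required at link time.
 *
 * Option 2 (not shown) keeps plain `inline` in the header and has exactly
 * one .c file provide the externally visible definition.
 */
static inline bool
demo_abd_is_linear(const demo_abd_t *abd)
{
	assert(abd != NULL);
	return ((abd->flags & DEMO_FLAG_LINEAR) != 0);
}
```

Because the function has internal linkage, building at a low optimization level or with inlining disabled cannot produce an undefined reference of the kind quoted in the message.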
Andriy Gapon	25625d8746	Revert r322601, Mark ZFS ABD inline functions static An alternative fix is to be merged from illumos shortly.	2017-09-11 12:08:20 +00:00
Andriy Gapon	3d9a0e564d	MFV r323110: 8558 lwp_create() returns EAGAIN on system with more than 80K ZFS filesystems illumos/illumos-gate@216d7723a1 `216d7723a1` https://www.illumos.org/issues/8558 On a system with more than 80K ZFS filesystems, we've seen cases where lwp_create() will start to fail by returning EAGAIN. The problem being, for each of those 80K ZFS filesystems, a taskq will be created for each dataset as part of the ZIL for each dataset. For each of these taskq's, a kernel thread will be created which results in 24KB being allocated for each thread. With enough of these 24KB allocations, we eventually exhaust the memory region set aside for these allocations. Currently, segkpsize is set to a value of 2GB, which means we can only support about 80K filesystems; 2GB / 24KB = ~80K. The lwp_create() failure comes into play due to the fact that LWP creation also allocates 24KB from this same region of memory. Thus, if we've exhausted this region of memory due to the number of ZIL taskq's, there won't be any memory available to allow the call to lwp_create() to succeed. FreeBSD note: I haven't created sysctl-s for the new ZIL clean parameters. Let's add them if anyone requires to tune them. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Prakash Surya <prakash.surya@delphix.com> MFC after: 3 weeks	2017-09-11 11:31:43 +00:00
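The capacity estimate in the message above (2GB / 24KB) can be reproduced with a one-line helper. This is only a worked version of that arithmetic, assuming binary units (2 GiB and 24 KiB); `demo_max_allocations()` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Number of fixed-size allocations that fit in a region (integer division). */
static uint64_t
demo_max_allocations(uint64_t region_bytes, uint64_t alloc_bytes)
{
	assert(alloc_bytes != 0);
	return (region_bytes / alloc_bytes);
}
```

Under those assumptions, `demo_max_allocations(2ULL << 30, 24ULL << 10)` evaluates to 87381, the same order of magnitude as the "~80K" figure the message rounds to.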
Andriy Gapon	90354c3200	MFV r323107: 8414 Implemented zpool scrub pause/resume illumos/illumos-gate@1702cce751 `1702cce751` FreeBSD note: rather than merging the zpool.8 update I copied the zpool scrub section from the illumos zpool.1m to FreeBSD zpool.8 almost verbatim. Now that the illumos page uses the mdoc format, it was an easier option. Perhaps the change is not in perfect compliance with the FreeBSD style, but I think that it is acceptable. https://www.illumos.org/issues/8414 This issue tracks the port of scrub pause from ZoL: https://github.com/zfsonlinux/zfs/pull/6167 Currently, there is no way to pause a scrub. Pausing may be useful when the pool is busy with other I/O to preserve bandwidth. Description This patch adds the ability to pause and resume scrubbing. This is achieved by maintaining a persistent on-disk scrub state. While the state is 'paused' we do not scrub any more blocks. We do however perform regular scan housekeeping such as freeing async destroyed and deadlist blocks while paused. Motivation and Context Scrub pausing can be an I/O intensive operation and people have been asking for the ability to pause a scrub for a while. This allows one to preserve scrub progress while freeing up bandwidth for other I/O. Reviewed by: George Melikov <mail@gmelikov.ru> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Alek Pinchuk <apinchuk@datto.com> MFC after: 2 weeks	2017-09-09 11:00:07 +00:00
Baptiste Daroussin	91f1caeccb	Add sysctls for arc shrinking and growing values The default value for arc_no_grow_shift may not be optimal when using several GiB ARC. Exposing it via sysctl allows users to tune it easily. Also expose arc_grow_retry via sysctl for the same reason. The default value of 60s might, in case of intensive load, be too long. Submitted by: Nikita Kozlov <nikita.kozlov@blade-group.com> Reviewed by: mav, manu, bapt MFC after: 2 weeks Sponsored by: blade Differential Revision: https://reviews.freebsd.org/D12144	2017-08-31 13:02:17 +00:00
Ed Maste	3c3d2ba6fe	zfs: do not advertise edonr which is not yet supported illumos 4185 ("add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R") was intentionally merged only partially in r289422, without adding support for skein, sha512 and edonr on FreeBSD. Support for skein and sha512 was added later on, but edonr is still not implemented in FreeBSD. Prior to this commit zfs(8) correctly rejected edonr, but with an error message that claimed support: fk@r500 ~ $zfs set checksum=edonr tank cannot set property for 'tank': 'checksum' must be one of 'on \| off \| fletcher2 \| fletcher4 \| sha256 \| sha512 \| skein \| edonr' PR: 204055 Submitted by: Fabian Keil Approved by: allanjude Obtained from: ElectroBSD MFC after: 1 week	2017-08-29 22:24:22 +00:00
John Baldwin	ac3b479ec8	Add a guard around _ILP32 for mips. This is already done for other architectures in this file and fixes the build with clang.	2017-08-21 17:45:06 +00:00
John Baldwin	b99836cea9	Mark ZFS ABD inline functions static. When built with -fno-inline-functions zfs.ko contains undefined references to these functions if they are only marked inline. Reviewed by: avg (earlier version) MFC after: 1 week Sponsored by: Chelsio Communications	2017-08-16 23:40:32 +00:00
Alan Somers	69b14f7acd	Fix some ZFS debugging messages sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c Be more careful about the use of provider names vs vdev names in ZFS_LOG statements. MFC after: 3 weeks Sponsored by: Spectra Logic Corp	2017-08-15 15:20:04 +00:00
Andriy Gapon	984d43cca5	MFV r322242: 8373 TXG_WAIT in ZIL commit path illumos/illumos-gate@d28671a3b0 `d28671a3b0` https://www.illumos.org/issues/8373 The code that writes ZIL blocks uses dmu_tx_assign(TXG_WAIT) to assign a transaction to a transaction group. That seems to be logically incorrect as writing of the ZIL block does not introduce any new dirty data. Also, when there is a lot of dirty data, the call can introduce significant delays into the ZIL commit path, thus affecting all synchronous writes. Additionally, ARC throttling may affect the ZIL writing. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Andriy Gapon <avg@FreeBSD.org> MFC after: 2 weeks	2017-08-08 11:26:03 +00:00
Andriy Gapon	2653426e89	MFV r322240: 8491 uberblock on-disk padding to reserve space for smoothly merging zpool checkpoint & MMP in ZFS illumos/illumos-gate@79c2b812ee `79c2b812ee` https://www.illumos.org/issues/8491 The zpool checkpoint feature in DxOS added a new field in the uberblock. The Multi-Modifier Protection Pull Request from ZoL adds two new fields in the uberblock (Reference: https://github.com/zfsonlinux/zfs/pull/6279). As these two changes come from two different sources and once upstreamed and deployed will introduce an incompatibility with each other we want to upstream a change that will reserve the padding for both of them so integration goes smoothly and everyone gets both features. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Olaf Faaland <faaland1@llnl.gov> Approved by: Gordon Ross <gwr@nexenta.com> Author: Serapheim Dimitropoulos <serapheim@delphix.com> MFC after: 3 weeks	2017-08-08 11:21:58 +00:00
Andriy Gapon	6f2f8727e3	MFV r322238: 7915 checks in l2arc_evict could use some cleaning up illumos/illumos-gate@267ae6c3a8 `267ae6c3a8` https://www.illumos.org/issues/7915 l2arc_evict() is strictly serialized with respect to l2arc_write_buffers() and l2arc_write_done(). Normally, l2arc_evict() and l2arc_write_buffers() are called from the same thread, so they can not be concurrent. Also, l2arc_write_buffers() uses zio_wait() on the parent zio of all cache zio-s. That ensures that l2arc_write_done() is completed before l2arc_write_buffers() returns. Finally, if a cache device is removed, then l2arc_evict() is called under SCL_ALL in the exclusive mode. That ensures that it can not be concurrent with the normal L2ARC accesses to the device (including writing and evicting buffers). Given the above, some checks and actions in l2arc_evict() do not make sense. For instance, it must never encounter the write head header let alone remove it from the buffer list. Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Approved by: Matthew Ahrens <mahrens@delphix.com> Author: Andriy Gapon <avg@FreeBSD.org> MFC after: 2 weeks	2017-08-08 11:19:14 +00:00
Andriy Gapon	3cf2ea1aea	MFV r322236: 8126 ztest assertion failed in dbuf_dirty due to dn_nlevels changing illumos/illumos-gate@dcb6872c56 `dcb6872c56` https://www.illumos.org/issues/8126 The sync thread is concurrently modifying dn_phys->dn_nlevels while dbuf_dirty() is trying to assert something about it, without holding the necessary lock. We need to move this assertion further down in the function, after we have acquired the dn_struct_rwlock. Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com> MFC after: 2 weeks	2017-08-08 11:14:40 +00:00
Andriy Gapon	98a9c8e68a	zfs: no need for __DECONST after abd constification in r322233 Note that vdev_label_write_pad2() is FreeBSD specific. MFC after: 2 weeks X-MFC after: r322233	2017-08-08 11:07:34 +00:00
Andriy Gapon	b9a4f29445	MFV r322232: 8426 mark immutable buffer arguments as such in abd.h illumos/illumos-gate@9b195260e2 `9b195260e2` https://www.illumos.org/issues/8426 abd_copy_from_buf and abd_cmp_buf do not modify their void *buf arguments, so qualify them with const. abd_copy_from_buf_off and abd_cmp_buf_off already had that type for the corresponding arguments. Reviewed by: Matt Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Andriy Gapon <avg@FreeBSD.org> MFC after: 2 weeks	2017-08-08 10:59:18 +00:00
Andriy Gapon	9c48e95dd9	MFV r322229: 7600 zfs rollback should pass target snapshot to kernel illumos/illumos-gate@77b171372e `77b171372e` https://www.illumos.org/issues/7600 At present, the kernel side code seems to blindly rollback to whatever happens to be the latest snapshot at the time when the rollback task is processed. The expected target's name should be passed to the kernel driver and the sync task should validate that the target exists and that it is the latest snapshot indeed. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Andriy Gapon <avg@FreeBSD.org> MFC after: 3 weeks	2017-08-08 10:52:01 +00:00
Andriy Gapon	8605a08bd2	MFV r322227: 8377 Panic in bookmark deletion illumos/illumos-gate@42418f9e73 `42418f9e73` https://www.illumos.org/issues/8377 The problem is that when dsl_bookmark_destroy_check() is executed from open context (the pre-check), it fills in dbda_success based on the existence of the bookmark. But the bookmark (or containing filesystem as in this case) can be destroyed before we get to syncing context. When we re-run dsl_bookmark_destroy_check() in syncing context, it will not add the deleted bookmark to dbda_success, intending for dsl_bookmark_destroy_sync() to not process it. But because the bookmark is still in dbda_success from the open-context call, we do try to destroy it. The fix is that dsl_bookmark_destroy_check() should not modify dbda_success when called from open context. Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com> MFC after: 2 weeks	2017-08-08 10:48:52 +00:00
Andriy Gapon	b4e4140d13	MFV r322223: 8378 crash due to bp in-memory modification of nopwrite block illumos/illumos-gate@b7edcb9408 `b7edcb9408` https://www.illumos.org/issues/8378 The problem is that zfs_get_data() supplies a stale zgd_bp to dmu_sync(), which we then nopwrite against. zfs_get_data() doesn't hold any DMU-related locks, so after it copies db_blkptr to zgd_bp, dbuf_write_ready() could change db_blkptr, and dbuf_write_done() could remove the dirty record. dmu_sync() then sees the stale BP and that the dbuf is not dirty, so it is eligible for nop-writing. The fix is for dmu_sync() to copy db_blkptr to zgd_bp after acquiring the db_mtx. We could still see a stale db_blkptr, but if it is stale then the dirty record will still exist and thus we won't attempt to nopwrite. Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com> MFC after: 2 weeks	2017-08-08 10:46:51 +00:00
Andriy Gapon	c6fb364293	MFV r322221: 7910 l2arc_write_buffers() may write beyond target_sz FreeBSD note: the essence of this change was committed to FreeBSD in r314274. This commit catches up with differences between what was committed to FreeBSD and what was committed to OpenZFS, mainly more logical variable names. illumos/illumos-gate@16a7e5ac11 `16a7e5ac11` https://www.illumos.org/issues/7910 It seems that the change in issue #6950 resurrected the problem that was earlier fixed by the change in issue #5219. Please also see the following FreeBSD bug report: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=216178 Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Andriy Gapon <avg@FreeBSD.org> MFC after: 2 weeks	2017-08-08 10:43:41 +00:00
Ruslan Bukin	ca20f8ec29	o Replace __riscv__ with __riscv o Replace __riscv64 with (__riscv && __riscv_xlen == 64) This is required to support new GCC 7.1 compiler. This is compatible with current GCC 6.1 compiler. RISC-V is extensible ISA and the idea here is to have built-in define per each extension, so together with __riscv we will have some subset of these as well (depending on -march string passed to compiler): __riscv_compressed __riscv_atomic __riscv_mul __riscv_div __riscv_muldiv __riscv_fdiv __riscv_fsqrt __riscv_float_abi_soft __riscv_float_abi_single __riscv_float_abi_double __riscv_cmodel_medlow __riscv_cmodel_medany __riscv_cmodel_pic __riscv_xlen Reviewed by: ngie Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D11901	2017-08-07 14:09:57 +00:00
Andriy Gapon	f6040f9e8e	spa_import_rootpool should be able to handle an imported root pool That is required to support reboot -r with a new root filesystem being on an already imported pool. PR: 210721 Reported by: Jan Bramkamp <crest_maintainer@rlwinm.de> MFC after: 2 weeks	2017-07-25 13:17:06 +00:00
Ed Maste	5c12f7c3e2	zfs: Fix a typo in the delay_min_dirty_percent sysctl description The description is FreeBSD-specific and was added in r266497 to fix PR189865. PR: 220825 Submitted by: Fabian Keil Obtained from: ElectroBSD MFC after: 1 week	2017-07-19 18:17:41 +00:00
Andriy Gapon	37ec52ca7a	fix a regression in r320452, ZFS ABD import I overlooked the fact that vdev_op_io_done hook is called even if the actual I/O is skipped, for example, in the case of a missing vdev. Arguably, this could be considered an issue in the zio pipeline engine, but for now I am adding defensive code to check for io_bp being NULL along with assertions that that happens only when it can be really expected. PR: 220691 Reported by: peter, cy Tested by: cy MFC after: 1 week X-MFC with: r320156, r320452	2017-07-18 07:41:38 +00:00
Justin Hibbits	8fe026c641	Make ZFS not crash on mount on 32-bit systems ZPL_VERSION is unsigned long long, not an int. With this change, a zpool can be created on a 32-bit system (tested on powerpcspe) and mounted correctly. Reviewed by: allanjude	2017-07-18 01:08:45 +00:00
Andriy Gapon	1db5f1724b	fix an architectural problem introduced in r320156, ZFS ABD import The implementation of ZFS refcount_t uses the emulated illumos mutex (the sx lock) and the waiting memory allocation when ZFS_DEBUG is enabled. This makes refcount_t unsuitable for use in GEOM g_up thread where sleeping is prohibited. When importing the ABD change I modified vdev_geom using illumos vdev_disk as an example. As a result, I added a call to abd_return_buf in vdev_geom_io_intr. The latter is called on g_up thread while the former uses refcount_t. This change fixes the problem by deferring the abd_return_buf call to the previously unused vdev_geom_io_done that is called on a ZFS zio taskqueue thread where sleeping is allowed. A side bonus of this change is that now a vdev zio has a pointer to its corresponding bio while the zio is active. Reported by: Shawn Webb <shawn.webb@hardenedbsd.org> Tested by: Shawn Webb <shawn.webb@hardenedbsd.org> MFC after: 1 week X-MFC with: r320156	2017-06-28 13:59:20 +00:00
Andriy Gapon	c20b00c6af	zfs: port vdev_file part of illumos change 3306 3306 zdb should be able to issue reads in parallel illumos/illumos-gate/31d7e8fa33fae995f558673adb22641b5aa8b6e1 https://www.illumos.org/issues/3306 The upstream change was made before we started to import upstream commits individually. It was imported into the illumos vendor area as r242733. That commit was MFV-ed in r260138, but as the commit message says vdev_file.c was left intact. This commit actually implements the parallel I/O for vdev_file using a taskqueue with multiple threads. This implementation does not depend on the illumos or FreeBSD bio interface at all, but uses zio_t to pass around all the relevant data. So, the code looks a bit different from the upstream. This commit also incorporates ZoL commit zfsonlinux/zfs/bc25c9325b0e5ced897b9820dad239539d561ec9 that fixed https://github.com/zfsonlinux/zfs/issues/2270 We need to use a dedicated taskqueue for exactly the same reason as ZoL as we do not implement TASKQ_DYNAMIC. Obtained from: illumos, ZFS on Linux MFC after: 2 weeks	2017-06-26 09:10:09 +00:00
Andriy Gapon	ee2d3c0a5b	fix gcc-specific fallout from r320156, MFV of r318946, ZFS ABD Reported by: jhibbits MFC after: 1 week X-MFC with: r320156	2017-06-23 08:42:53 +00:00
Andriy Gapon	3385c74539	MFV r319950: 5220 L2ARC does not support devices that do not provide 512B access FreeBSD note: the actual change has been in FreeBSD since r297848. This commit accounts for integration of that change with subsequent changes, especially r320156 (MFV of r318946) and r314274. illumos/illumos-gate@403a8da73c `403a8da73c` https://www.illumos.org/issues/5220 There are disk devices that have logical sector size larger than 512B, for example 4KB. That is, their physical sector size is larger than 512B and they do not provide emulation for 512B sector sizes. For such devices both a data offset and a data size must be properly aligned. L2ARC should arrange that because it uses physical I/O. zio_vdev_io_start() performs a necessary transformation if io_size is not aligned to vdev_ashift, but that is done only for logical I/O. Something similar should be done in L2ARC code. * a temporary write buffer should be allocated if the original buffer is not going to be compressed and its size is not aligned * size of a temporary compression buffer should be ashift aligned * for the reads, if a size of a target buffer is not sufficiently large and it is not aligned then a temporary read buffer should be allocated Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Andriy Gapon <avg@FreeBSD.org> MFC after: 3 weeks	2017-06-22 17:10:34 +00:00
Andriy Gapon	ae5ec64b88	MFV r319742: 8056 zfs send size estimate is inaccurate for some zvols illumos/illumos-gate@0255edcc85 `0255edcc85` https://www.illumos.org/issues/8056 The send size estimate for a zvol can be too low, if the size of the record headers (dmu_replay_record_t's) is a significant portion of the size. This is typically the case when the data is highly compressible, especially with embedded blocks. The problem is that dmu_adjust_send_estimate_for_indirects() assumes that blocks are the size of the "recordsize" property (128KB). However, for zvols, the blocks are the size of the "volblocksize" property (8KB). Therefore, we estimate that there will be 16x less record headers than there really will be. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Paul Dagnelie <pcd@delphix.com> MFC after: 3 weeks	2017-06-22 16:58:09 +00:00
Andriy Gapon	e70097b50f	MFV r318947: 7578 Fix/improve some aspects of ZIL writing. FreeBSD note: this commit removes small differences between what mav committed to FreeBSD in r308782 and what ended up committed to illumos after addressing all review comments. illumos/illumos-gate@c5ee46810f `c5ee46810f` https://www.illumos.org/issues/7578 After some ZIL changes 6 years ago zil_slog_limit got partially broken due to zl_itx_list_sz not updated when async itx'es upgraded to sync. Actually because of other changes about that time zl_itx_list_sz is not really required to implement the functionality, so this patch removes some unneeded broken code and variables. Original idea of zil_slog_limit was to reduce chance of SLOG abuse by single heavy logger, that increased latency for other (more latency critical) loggers, by pushing heavy log out into the main pool instead of SLOG. Beside huge latency increase for heavy writers, this implementation caused double write of all data, since the log records were explicitly prepared for SLOG. Since we now have I/O scheduler, I've found it can be much more efficient to reduce priority of heavy logger SLOG writes from ZIO_PRIORITY_SYNC_WRITE to ZIO_PRIORITY_ASYNC_WRITE, while still leave them on SLOG. Existing ZIL implementation had problem with space efficiency when it has to write large chunks of data into log blocks of limited size. In some cases efficiency stopped to almost as low as 50%. In case of ZIL stored on spinning rust, that also reduced log write speed in half, since head had to uselessly fly over allocated but not written areas. This change improves the situation by offloading problematic operations from z_log_write() to zil_lwb_commit(), which knows real situation of log blocks allocation and can split large requests into pieces much more efficiently. Also as side effect it removes one of two data copy operations done by ZIL code WR_COPIED case. While there, untangle and unify code of z_log_write() functions. 
Also zfs_log_write() alike to zvol_log_write() can now handle writes crossing block boundary, that may also improve efficiency if ZPL is made to do that. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Steven Hartland <steven.hartland@multiplay.co.uk> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Alexander Motin <mav@FreeBSD.org> MFC after: 3 weeks	2017-06-22 16:52:22 +00:00
Andriy Gapon	a4a2976d8a	fix several fallouts from r320156, ZFS ABD import All of the problems were related to the FreeBSD-only features. One was caused by a mismerge in the zfsbootcfg support code. All others were in the TRIM support code. MFC after: 1 week X-MFC with: r320156	2017-06-21 08:12:07 +00:00
Andriy Gapon	ebf3b53dac	fix several fallouts from r320156, ZFS ABD import All of the problems were related to the FreeBSD-only features. One was caused by a mismerge in the zfsbootcfg support code. All others were in the TRIM support code. Reported by: ken, O. Hartmann <ohartmann@walstatt.org>, Trond Endrestøl <Trond.Endrestol@fagskolen.gjovik.no> MFC after: 1 week X-MFC with: r320156	2017-06-21 08:10:45 +00:00
Andriy Gapon	f9cdbaba8d	MFV r318946: 8021 ARC buf data scatter-ization illumos/illumos-gate@770499e185 https://www.illumos.org/issues/8021 The ARC buf data project (known simply as "ABD" since its genesis in the ZoL community) changes the way the ARC allocates `b_pdata` memory from using linear `void *` buffers to using scatter/gather lists of fixed-size 1KB chunks. This improves ZFS's performance by helping to defragment the address space occupied by the ARC, in particular for cases where compressed ARC is enabled. It could also ease future work to allocate pages directly from `segkpm` for minimal-overhead memory allocations, bypassing the `kmem` subsystem. This is essentially the same change as the one which recently landed in ZFS on Linux, although they made some platform-specific changes while adapting this work to their codebase: 1. Implemented the equivalent of the `segkpm` suggestion for future work mentioned above to bypass issues that they've had with the Linux kernel memory allocator. 2. Changed the internal representation of the ABD's scatter/gather list so it could be used to pass I/O directly into Linux block device drivers. (This feature is not available in the illumos block device interface yet.) 
FreeBSD notes: - the actual (default) chunk size is 4KB (despite the text above saying 1KB) - we can try to reimplement ABDs, so that they are not permanently mapped into the KVA unless explicitly requested, especially on platforms with scarce KVA - we can try to use unmapped I/O and avoid intermediate allocation of a linear, virtual memory mapped buffer - we can try to avoid extra data copying by referring to chunks / pages in the original ABD Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Chris Williamson <chris.williamson@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Dan Kimmel <dan.kimmel@delphix.com> MFC after: 3 weeks	2017-06-20 17:39:24 +00:00
Andriy Gapon	42ce346fcc	revert r315852 which introduced zio_buf_alloc_nowait for use in vdev_queue_aggregate I think that the change is still good, but reconciling it with a planned merge of the ARC buf data scatter-ization is a bit more tedious than I can handle. MFC after: 17 days	2017-06-20 16:55:30 +00:00
Andriy Gapon	602cf4e4a7	MFV r319951: 8311 ZFS_READONLY is a little too strict illumos/illumos-gate@2889ec41c0 https://www.illumos.org/issues/8311 Description: There was a misunderstanding about the enforcement details of the "Read-only" flag introduced for SMB/CIFS compatibility, way back in 2007 in the Sun PSARC 2007/315 case. The original authors thought enforcement of the READONLY flag should work similarly to the IMMUTABLE flag. Unfortunately, that enforcement is incompatible with the expectations of Windows applications using this feature through the SMB service. Applications assume (and the MS File System Algorithms MS-FSA confirms they should) that an SMB client can: (a) Open an SMB handle on a file with read/write access, (b) Set the DOS attributes to include the READONLY flag, (c) continue to have write access via that handle. This access model is essentially the same as a Unix/POSIX application that creates a file (with read/write access), uses fchmod() to change the file mode to something not granting write access (i.e. 0444), and then continues to write that file using the open handle it got before the mode change. Currently, the SMB server works around this problem in a way that will become difficult to maintain as we implement support for SMB3 persistent handles, so SMB depends on this fix. I've written a test program that can be used to demonstrate this problem, and added it to zfs-tests (tests/functional/acl/cifs/cifs_attr_004_pos). It currently fails, but will pass when this problem is fixed. Steps to Reproduce: Run the test program on a ZFS file system. Expected Results: Pass Actual Results: Fail. 
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Approved by: Prakash Surya <prakash.surya@delphix.com> Author: Gordon Ross <gwr@nexenta.com> MFC after: 2 weeks	2017-06-14 16:55:47 +00:00
Andriy Gapon	7d506d0d57	MFV r319948: 5428 provide fts(), reallocarray(), and strtonum() illumos/illumos-gate@4585130b25 https://www.illumos.org/issues/5428 Most of the upstream change is not applicable to FreeBSD. Only the renaming of strtonum to zfs_strtonum is relevant to us. And we already had it partially done. Reviewed by: Robert Mustacchi <rm@joyent.com> Approved by: Joshua M. Clulow <josh@sysmgr.org> Author: Yuri Pankov <yuri.pankov@nexenta.com> MFC after: 1 week	2017-06-14 16:42:38 +00:00
Andriy Gapon	b8d341fe26	MFV r319945,r319946: 8264 want support for promoting datasets in libzfs_core illumos/illumos-gate@a4b8c9aa65 https://www.illumos.org/issues/8264 Oddly there is a lzc_clone function, but no lzc_promote function. Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan McDonald <danmcd@kebe.com> Approved by: Dan McDonald <danmcd@kebe.com> Author: Andrew Stormont <astormont@racktopsystems.com> MFC after: 1 week	2017-06-14 16:31:36 +00:00
Andriy Gapon	667002fa27	MFV r319741: 8156 dbuf_evict_notify() does not need dbuf_evict_lock illumos/illumos-gate@dbfd9f9300 `dbfd9f9300` https://www.illumos.org/issues/8156 dbuf_evict_notify() holds the dbuf_evict_lock while checking if it should do the eviction itself (because the evict thread is not able to keep up). This can result in massive lock contention. It isn't necessary to hold the lock, because if we make the wrong choice occasionally, nothing bad will happen. Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com> MFC after: 1 week	2017-06-09 15:28:57 +00:00
Andriy Gapon	5f9cf93878	MFV r319739: 8005 poor performance of 1MB writes on certain RAID-Z configurations illumos/illumos-gate@5b06278253 `5b06278253` https://www.illumos.org/issues/8005 RAID-Z requires that space be allocated in multiples of P+1 sectors, because this is the minimum size block that can have the required amount of parity. Thus blocks on RAIDZ1 must be allocated in a multiple of 2 sectors; on RAIDZ2 multiple of 3; and on RAIDZ3 multiple of 4. A sector is a unit of 2^ashift bytes, typically 512B or 4KB. To satisfy this constraint, the allocation size is rounded up to the proper multiple, resulting in up to 3 "pad sectors" at the end of some blocks. The contents of these pad sectors are not used, so we do not need to read or write these sectors. However, some storage hardware performs much worse (around 1/2 as fast) on mostly-contiguous writes when there are small gaps of non-overwritten data between the writes. Therefore, ZFS creates "optional" zio's when writing RAID-Z blocks that include pad sectors. If writing a pad sector will fill the gap between two (required) writes, we will issue the optional zio, thus doubling performance. The gap-filling performance improvement was introduced in July 2009. Writing the optional zio is done by the io aggregation code in vdev_queue.c. The problem is that it is also subject to the limit on the size of aggregate writes, zfs_vdev_aggregation_limit, which is by default 128KB. For a given block, if the amount of data plus padding written to a leaf device exceeds zfs_vdev_aggregation_limit, the optional zio will not be written, resulting in a ~2x performance degradation. The problem occurs only for certain values of ashift, compressed block size, and RAID-Z configuration (number of parity and data disks). It cannot occur with the default recordsize=128KB. If compression is enabled, all configurations with recordsize=1MB or larger will be impacted to some degree. 
The problem notably occurs with recordsize=1MB, compression=off, with 10 disks in a RAIDZ2 or RAIDZ3 group (with 512B or 4KB sectors). Therefore Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com> MFC after: 10 days	2017-06-09 15:27:22 +00:00
Andriy Gapon	9f141f8d71	MFV r319738: 8155 simplify dmu_write_policy handling of pre-compressed buffers illumos/illumos-gate@adaec86ad2 `adaec86ad2` https://www.illumos.org/issues/8155 When writing pre-compressed buffers, arc_write() requires that the compression algorithm used to compress the buffer matches the compression algorithm requested by the zio_prop_t, which is set by dmu_write_policy(). This makes dmu_write_policy() and its callers a bit more complicated. We can simplify this by making arc_write() trust the caller to supply the type of pre-compressed buffer that it wants to write, and override the compression setting in the zio_prop_t. Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com> MFC after: 10 days	2017-06-09 15:26:03 +00:00
Andriy Gapon	1628f75af1	zfs_lookup: fix bogus arguments to lookup of "snapshot" directory When a parent directory lookup is done at the root of a snapshot mounted under .zfs/snapshot directory, we need to look up that directory in the parent filesystem. We achieve that by doing a VOP_LOOKUP operation on a .zfs vnode with "snapshot" as a target name. But previously we also passed ISDOTDOT flag to the lookup and, because of that, the lookup actually returned the parent of the .zfs vnode, that is, a root vnode of the parent filesystem. Reported by: lev Tested by: lev MFC after: 3 days	2017-05-29 06:30:34 +00:00
Konstantin Belousov	03311f117b	Use whole mnt_stat.f_fsid bits for st_dev. Since ino64 expanded dev_t to 64bit, make VOP_GETATTR(9) provide all bits of mnt_stat.f_fsid as va_fsid for vnodes on filesystems which use f_fsid. In particular, NFSv3 and sometimes NFSv4, and ZFS use this method of reporting st_dev by stat(2). Provide a new helper vn_fsid() to avoid duplicating code to copy f_fsid to va_fsid. Note that the change is mostly cosmetic. Its motivation is to avoid sign-extension of f_fsid[0] into 64bit dev_t value which happens after dev_t becomes 64bit. Reviewed by: avg (zfs), rmacklem (nfs) (both for previous version) Sponsored by: The FreeBSD Foundation	2017-05-27 17:00:30 +00:00
Andriy Gapon	32ecf81aff	MFV r318944: 8265 Reserve send stream flag for large dnode feature illumos/illumos-gate@bc83969fdb https://www.illumos.org/issues/8265 Reserve bit 23 in the zfs send stream flags for the large dnode feature which has been implemented for Linux. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Brian Behlendorf <behlendorf1@llnl.gov> MFC after: 1 week	2017-05-26 12:08:38 +00:00
Andriy Gapon	a51eb0a964	MFV r318942: 8166 zpool scrub thinks it repaired offline device illumos/illumos-gate@2d2f193a21 `2d2f193a21` https://www.illumos.org/issues/8166 If we do a scrub while a leaf device is offline (via "zpool offline"), we will inadvertently clear the DTL (dirty time log) of the offline device, even though it is still damaged. When the device comes back online, we will incompletely resilver it, thinking that the scrub repaired blocks written before the scrub was started. The incomplete resilver can lead to data loss if there is a subsequent failure of a different leaf device. The fix is to never clear the DTL of offline devices. Note that if a device is onlined while a scrub is in progress, the scrub will be restarted. The problem can be worked around by running "zpool scrub" after "zpool online". See also https://github.com/zfsonlinux/zfs/issues/5806 Reviewed by: George Wilson george.wilson@delphix.com Reviewed by: Brad Lewis <brad.lewis@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Matthew Ahrens <mahrens@delphix.com>	2017-05-26 12:04:21 +00:00
Andriy Gapon	2cd05c2473	MFV r318934: 8070 Add some ZFS comments illumos/illumos-gate@40713f2b24 https://www.illumos.org/issues/8070 Add some ZFS comments left by various developers at different times Reviewed by: Yuri Pankov <yuri.pankov@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Alan Somers <asomers@gmail.com> MFC after: 1 week	2017-05-26 11:49:42 +00:00
Andriy Gapon	0a07ea0e2f	MFV r318931: 8063 verify that we do not attempt to access inactive txg illumos/illumos-gate@b7b2590dd9 `b7b2590dd9` https://www.illumos.org/issues/8063 A standard practice in ZFS is to keep track of "per-txg" state. Any of the 3 active TXG's (open, quiescing, syncing) can have different values for this state. We should assert that we do not attempt to modify other (inactive) TXG's. Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com> MFC after: 2 weeks	2017-05-26 11:37:11 +00:00
Andriy Gapon	28c5e43e36	MFV r318929: 7786 zfs`vdev_online() needs better notification about state changes illumos/illumos-gate@5f368aef86 `5f368aef86` https://www.illumos.org/issues/7786 Currently, vdev_online() will only post sysevent if previous state was "offline". It should also post the event when the state changes from "removed" or "faulted" to "healthy" or "degraded". This will fix the following scenario: - pull disk from slot A - check that hotspare has taken its place (if available) - insert disk into slot B - check that hotspare moved back to "avail" state (if spare was used) The problem here is that we don't get any ESC_ZFS_VDEV_* notification and fail to update the vdev FRU. Reviewed by: Matthew Ahrens mahrens@delphix.com Reviewed by: George Wilson george.wilson@delphix.com Approved by: Albert Lee <trisk@forkgnu.org> Author: Yuri Pankov <yuri.pankov@nexenta.com> MFC after: 1 week	2017-05-26 11:33:34 +00:00
Andriy Gapon	9c2a3c861f	MFV r318927: 8025 dbuf_read() creates unnecessary zio_root() for bonus buf illumos/illumos-gate@def4fac588 `def4fac588` https://www.illumos.org/issues/8025 dbuf_read() creates a zio_root() to track and wait for all the zio's that may happen as part of this call. However, if the blkptr_t for this buffer is NULL or a hole, we will not create any more zio's, so this zio_root() is unnecessary. This is always the case when calling dbuf_read() on a bonus buffer, because it has no blkptr (it's part of the containing dnode). For workloads that read a lot of bonus buffers (e.g. file creation and removal), creating and destroying these unnecessary zio's can decrease performance by around 3%. Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com>	2017-05-26 11:30:55 +00:00
Andriy Gapon	ebaf416f95	MFV r316929: 6914 kernel virtual memory fragmentation leads to hang illumos/illumos-gate@af868f46a5 `af868f46a5` https://www.illumos.org/issues/6914 FreeBSD note: only a ZFS part of the change is merged, changes to the VM subsystem are not ported (obviously). Also, now that FreeBSD has vmem(9) we don't have to ifdef-out the code that uses it. MFC after: 2 weeks	2017-05-26 11:23:16 +00:00
Andriy Gapon	8629ec8394	arc_init: make code closer to upstream by introducing 'allmem' variable All the differences in calculations are kept. A comment about arc_max being 1/2 of all memory is fixed to reflect the actual code that uses 5/8 as a factor. MFC after: 1 week	2017-05-26 11:05:56 +00:00
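
The RAID-Z allocation rule quoted in commit 5f9cf93878 above (space is allocated in multiples of P+1 sectors, giving up to 3 pad sectors per block) can be illustrated with a small sketch. This is illustrative Python, not the ZFS implementation: the function names are hypothetical, and `sectors` is assumed to be the block's total data-plus-parity sector count before rounding.

```python
def raidz_allocated_sectors(sectors: int, nparity: int) -> int:
    """Round a block's sector count up to a multiple of (nparity + 1).

    Hypothetical helper: `sectors` is assumed to already include parity.
    """
    unit = nparity + 1                    # RAIDZ1 -> 2, RAIDZ2 -> 3, RAIDZ3 -> 4
    return -(-sectors // unit) * unit     # ceiling division, then scale back up


def raidz_pad_sectors(sectors: int, nparity: int) -> int:
    """Number of unused pad sectors appended by the rounding."""
    return raidz_allocated_sectors(sectors, nparity) - sectors


if __name__ == "__main__":
    for nparity in (1, 2, 3):
        worst = max(raidz_pad_sectors(s, nparity) for s in range(1, 100))
        print(f"RAIDZ{nparity}: at most {worst} pad sector(s)")
```

The worst case is always `nparity` pad sectors, which matches the "up to 3" figure in the commit message for RAIDZ3.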


1762 Commits