freebsd-dev

Author	SHA1	Message	Date
Mark Johnston	772dd133c6	Avoid direct accesses of the vm_page wire_count field. No functional change intended. Sponsored by: Netflix	2019-08-28 18:01:54 +00:00
Jeff Roberson	cf27e0d125	Use an atomic reference count for paging in progress so that callers do not require the object lock. Reviewed by: markj Tested by: pho (as part of a larger branch) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D21311	2019-08-19 23:09:38 +00:00
Andriy Gapon	97b342a310	zfs_vget: fix vnode reference count leak in error path If vn_lock() failed, then the function returned the error but the vnode obtained via zfs_zget() was never released. MFC after: 10 days Sponsored by: Panzura	2019-08-17 09:23:03 +00:00
Andriy Gapon	5a75f51a1f	Revert r351076 and r351074 because of atomic_swap_64 on 32-bit platforms Trying to sort it out.	2019-08-15 15:27:58 +00:00
Andriy Gapon	30f7381b8e	MFV r351075: 10406 large_dnode changes broke zfs recv of legacy stream illumos/illumos-gate@811964cd9f `811964cd9f` https://www.illumos.org/issues/10406 The large dnode changes from 8423 caused problems in zfs recv for a legacy stream. This manifests when attempting to mount the received stream, but the problem is in the receive code. We missed the following commit from ZoL which fixes this. commit `da2feb42fb` Author: Tom Caputi <tcaputi@datto.com> Date: Thu Jun 28 17:55:11 2018 -0400 Fix 'zfs recv' of non large_dnode send streams Currently, there is a bug where older send streams without the DMU_BACKUP_FEATURE_LARGE_DNODE flag are not handled correctly. The code in receive_object() fails to handle cases where drro->drr_dn_slots is set to 0, which is always the case when the sending code does not support this feature flag. This patch fixes the issue by ensuring that a value of 0 is treated as DNODE_MIN_SLOTS. Author: Tom Caputi <tcaputi@datto.com> MFC after: 3 weeks X-MFC after: r351074	2019-08-15 15:11:20 +00:00
Andriy Gapon	93132b76cd	MFV r350898: 8423 8199 7432 Implement large_dnode pool feature 8423 8199 7432 Implement large_dnode pool feature 8423 Implement large_dnode pool feature 8199 multi-threaded dmu_object_alloc() 7432 Large dnode pool feature illumos/illumos-gate@54811da5ac `54811da5ac` https://www.illumos.org/issues/8423 https://www.illumos.org/issues/8199 https://www.illumos.org/issues/7432 ZoL issues: Improved dnode allocation #6564 Clean up large dnode code #6262 Fix dnode_hold() freeing dnode behavior #8172 Fix dnode allocation race #6414, #6439 Partial: Raw sends must be able to decrease nlevels #6821, #6864 Remove unnecessary txg syncs from receive_object() Closes #7197 This updates FreeBSD large_dnode code (that was imported from ZoL) to a version that was committed to illumos. It has some cleanups, improvements and fixes compared to what we have in FreeBSD now. I think that the most significant update is 8199 multi-threaded dmu_object_alloc(). Obtained from: illumos MFC after: 3 weeks	2019-08-15 14:57:27 +00:00
Andriy Gapon	4139761bb5	MFV r350896: 6585 sha512, skein, and edonr have an unenforced dependency on extensible dataset illumos/illumos-gate@892586e8a1 `892586e8a1` https://www.illumos.org/issues/6585 In any pool without the extensible dataset feature flag already enabled, creating a dataset with dedup set to use one of the new checksums would result in the following panic as soon as any data was added: panic[cpu0]/thread=ffffff0006761c40: feature_get_refcount(spa, feature, &refcount) != 48 (0x30 != 0x30), file: ../../common/fs/zfs/zfeature.c line 390 ffffff0006761830 fffffffffba8fbdd () ffffff0006761890 zfs:feature_do_action+11a () ffffff00067618c0 zfs:spa_feature_incr+1e () ffffff0006761920 zfs:dmu_object_zapify+b7 () ffffff00067619b0 zfs:dsl_dataset_activate_feature+97 () ffffff0006761a20 zfs:dsl_dataset_sync+ba () ffffff0006761ab0 zfs:dsl_pool_sync+153 () ffffff0006761b70 zfs:spa_sync+26e () ffffff0006761c20 zfs:txg_sync_thread+227 () ffffff0006761c30 unix:thread_start+8 () Inspection showed that feature->fi_feature was 7, which is the value of SPA_FEATURE_EXTENSIBLE_DATASET in the spa_feature enum. Testing shows that the panic can be prevented by explicitly setting extensible dataset as a dependency for the sha512, edonr, and skein feature flags. Alternatively, the new checksums code could possibly be changed to obviate the need for the dependency. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Richard Laager <rlaager@wiktel.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: ilovezfs <ilovezfs@icloud.com> Note that FreeBSD does not support edonr yet. MFC after: 2 weeks	2019-08-12 11:42:16 +00:00
Andriy Gapon	cfe94339f2	a stop gap fix for a race between dnode_hold and dnode_sync_free The race was introduced in r337669, the large dnode feature import from ZoL. The problem was debugged by ZoL developers and then, independently, on FreeBSD. The fix is an early proposal by Brian Behlendorf: `50f32ed74e` This fix never went into ZoL. A larger change that was committed later included a different solution because of the re-worked code. Ideally, we want to revert this fix and re-synchronize FreeBSD large dnode code with that in illumos (or newer ZoL). illumos has a later import of the feature from ZoL that does not have the bug. PR: 236480 Obtained from: Brian Behlendorf <behlendorf1@llnl.gov> Submitted by: ncrogers@gmail.com (patch adaptation) Reported by: ncrogers@gmail.com Tested by: ncrogers@gmail.com, Dennis Noordsij <dennis.noordsij@alumni.helsinki.fi>, Julien Cigar <julien@perdition.city> MFC after: 10 days	2019-08-12 10:30:00 +00:00
Toomas Soome	b1b9326846	loader: support com.delphix:removing We should support removing vdev from boot pool. Update loader zfs reader to support com.delphix:removing. Reviewed by: allanjude MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D18901	2019-08-08 18:08:13 +00:00
Xin LI	0ed1d6fb00	Allow Kernel to link in both legacy libkern/zlib and new sys/contrib/zlib, with an eventual goal to convert all legacy zlib callers to the new zlib version: * Move generic zlib shims that are not specific to zlib 1.0.4 to sys/dev/zlib. * Connect new zlib (1.2.11) to the zlib kernel module, currently built with Z_SOLO. * Prefix the legacy zlib (1.0.4) with 'zlib104_' namespace. * Convert sys/opencrypto/cryptodeflate.c to use new zlib. * Remove bundled zlib 1.2.3 from ZFS and adapt it to new zlib and make it depend on the zlib module. * Fix Z_SOLO build of new zlib. PR: 229763 Submitted by: Yoshihiro Ota <ota j email ne jp> Reviewed by: markm (sys/dev/zlib/zlib_kmod.c) Relnotes: yes Differential Revision: https://reviews.freebsd.org/D19706	2019-08-01 06:35:33 +00:00
Mark Johnston	61f2f0bae6	Fix FASTTRAPIOC_GETINSTR. This ioctl is used when a breakpoint is encountered while disassembling a symbol in the target process. Since only one DTrace consumer can toggle or enumerate fasttrap probes from a given process at a time, this ioctl does not appear to be used in practice.	2019-07-17 16:38:29 +00:00
Mark Johnston	eeacb3b02f	Merge the vm_page hold and wire mechanisms. The hold_count and wire_count fields of struct vm_page are separate reference counters with similar semantics. The remaining essential differences are that holds are not counted as a reference with respect to LRU, and holds have an implicit free-on-last unhold semantic whereas vm_page_unwire() callers must explicitly determine whether to free the page once the last reference to the page is released. This change removes the KPIs which directly manipulate hold_count. Functions such as vm_fault_quick_hold_pages() now return wired pages instead. Since r328977 the overhead of maintaining LRU for wired pages is lower, and in many cases vm_fault_quick_hold_pages() callers would swap holds for wirings on the returned pages anyway, so with this change we remove a number of page lock acquisitions. No functional change is intended. __FreeBSD_version is bumped. Reviewed by: alc, kib Discussed with: jeff Discussed with: jhb, np (cxgbe) Tested by: pho (previous version) Sponsored by: Netflix Differential Revision: https://reviews.freebsd.org/D19247	2019-07-08 19:46:20 +00:00
Alexander Motin	419110374a	Avoid extra taskq_dispatch() calls by DMU. DMU sync code calls taskq_dispatch() for each sublist of os_dirty_dnodes and os_synced_dnodes. Since the number of sublists by default is equal to number of CPUs, it will dispatch equal, potentially large, number of tasks, waking up many CPUs to handle them, even if only one or few of sublists actually have any work to do. This change adds check for empty sublists to avoid this.	2019-06-25 18:35:23 +00:00
Alexander Motin	3bae917061	Minimize aggsum_compare(&arc_size, arc_c) calls. For busy ARC situation when arc_size close to arc_c is desired. But then it is quite likely that aggsum_compare(&arc_size, arc_c) will need to flush per-CPU buckets to find exact comparison result. Doing that often in a hot path penalizes whole idea of aggsum usage there, since it replaces few simple atomic additions with dozens of lock acquisitions. Replacing aggsum_compare() with aggsum_upper_bound() in code increasing arc_p when ARC is growing (arc_size < arc_c) according to PMC profiles allows to save ~5% of CPU time in aggsum code during sequential write to 12 ZVOLs with 16KB block size on large dual-socket system. I suppose there some minor arc_p behavior change due to lower precision of the new code, but I don't think it is a big deal, since it should affect only very small window in time (aggsum buckets are flushed every second) and in ARC size (buckets are limited to 10 average ARC blocks per CPU). MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2019-06-14 20:04:28 +00:00
Alexander Motin	b3b3aa2e29	Alike to ZoL disable metaslab allocation tracing code. It is too generous to collect in production debug traces that can only be read with kernel debugger. Illumos includes special code in their mdb debugger to read it, we don't. MFC after: 1 week Sponsored by: iXsystems, Inc.	2019-06-14 19:57:32 +00:00
Alexander Motin	284e53a401	Properly align struct multilist_sublist to cache line. Manual Illumos alignment does not fit us due to different kmutex_t size. MFC after: 1 week Sponsored by: iXsystems, Inc.	2019-06-14 17:09:39 +00:00
Alexander Motin	913095dc56	Move write aggregation memory copy out of vq_lock. Memory copy is too heavy operation to do under the congested lock. Moving it out reduces congestion by many times to almost invisible. Since the original zio removed from the queue, and the child zio is not executed yet, I don't see why would the copy need protection. My guess it just remained like this from the time when lock was not dropped here, which was added later to fix lock ordering issue. Multi-threaded sequential write tests with both HDD and SSD pools with ZVOL block sizes of 4KB, 16KB, 64KB and 128KB all show major reduction of lock congestion, saving from 15% to 35% of CPU time and increasing throughput from 10% to 40%. Reviewed by: ahrens, behlendorf, ryao MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2019-06-13 01:21:32 +00:00
Alexander Motin	35251e9c28	Fix comparison signedness in arc_is_overflowing(). When ARC size is very small, aggsum_lower_bound(&arc_size) may return negative values, that due to unsigned comparison caused delays, waiting for arc_adjust() to "fix" it by calling aggsum_value(&arc_size). Use of signed comparison there fixes the problem. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2019-06-07 20:59:24 +00:00
Alexander Motin	61586dd647	Explicitly start ARC adjustment on limits change. While formally it is not necessary, but the sooner it start, the sooner it finish, and supposedly less disturbing for workload it will be. MFC after: 2 weeks	2019-06-07 19:03:17 +00:00
Andriy Gapon	d8b12a2162	Restore ARC MFU/MRU pressure Before r305323 (MFV r302991: 6950 ARC should cache compressed data) arc_read() code did this for access to a ghost buffer: arc_adapt() (from arc_get_data_buf()) arc_access(hdr, hash_lock) I.e., we first checked access to the MFU ghost/MRU ghost buffer and adapt MFU/MRU sizes (in arc_adapt()) and next move buffer from the ghost state to regular. After r305323 the sequence is different: arc_access(hdr, hash_lock); arc_hdr_alloc_pabd(hdr); I.e., we first move the buffer from the ghost state in arc_access() and then we check access to buffer in ghost state (in arc_hdr_alloc_pabd() -> arc_get_data_abd() -> arc_get_data_impl() -> arc_adapt()). This is incorrect: arc_adapt() never see access to the ghost buffer because arc_access() already migrated the buffer from the ghost state to regular. So, the fix is to restore a call to arc_adapt() before arc_access() and to suppress the call to arc_adapt() after arc_access(). Submitted by: Slawa Olhovchenkov <slw@zxy.spb.ru> MFC after: 2 weeks Sponsored by: Integros [integros.com] Differential Revision: https://reviews.freebsd.org/D19094	2019-06-07 06:35:42 +00:00
Mark Johnston	c080655467	Fix a race between fasttrap and the user breakpoint handler. When disabling the last enabled userspace probe, fasttrap clears the function pointers which hook in to the breakpoint handler. If a traced thread hit a fasttrap breakpoint before it was removed, we must ensure that it is able to call the hook; otherwise fasttrap will not consume the trap and SIGTRAP will be delivered to the thread. Synchronize with such threads by ensuring that they load the hook pointer with interrupts disabled, and by completing an SMP rendezvous after removing breakpoints and before clearing the pointers. Reported by: Alexander Alexeev <Alexander.Alexeev@dell.com> Tested by: Alexander Alexeev (earlier version) Reviewed by: cem, kib MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D20526	2019-06-06 16:03:25 +00:00
Mariusz Zaborski	8da024d941	dtrace: 64-bit register support The registers in illumos and FreeBSD are numbered differently. In illumos, the last 32-bit register defined is SS, and in FreeBSD it is GS. This off-by-one caused the uregs array to return the wrong 64-bit register on amd64. Reviewed by: markj MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D20363	2019-06-05 22:29:05 +00:00
Alexander Motin	0b5319dda0	MFV r348585: 9683 Allow bypassing devid in vdev_disk_open() illumos/illumos-gate@6fe4f3002c Reviewed by: Sara Hartse <sara.hartse@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Pavel Zakharov <pavel.zakharov@delphix.com> This is irrelevant to FreeBSD, just to reduce divergence.	2019-06-03 20:55:52 +00:00
Alexander Motin	9b048dd219	MFV r348583: 9847 leaking dd_clones (DMU_OT_DSL_CLONES) objects illumos/illumos-gate@17fb938fd6 Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Matthew Ahrens <mahrens@delphix.com>	2019-06-03 20:49:20 +00:00
Alexander Motin	07a5c938c9	MFV r348578: 9962 zil_commit should omit cache thrash illumos/illumos-gate@cab3a55e15 Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Patrick Mooney <patrick.mooney@joyent.com> Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com> Approved by: Joshua M. Clulow <josh@sysmgr.org> Author: Prakash Surya <prakash.surya@delphix.com>	2019-06-03 20:24:40 +00:00
Alexander Motin	a66a7143d4	MFV r348576: 9963 Seperate tunable for disabling ZIL vdev flush illumos/illumos-gate@f8fdf68125 Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Patrick Mooney <patrick.mooney@joyent.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Prakash Surya <prakash.surya@delphix.com>	2019-06-03 20:05:43 +00:00
Alexander Motin	c9719c9a6d	MFV r348573: 9993 zil writes can get delayed in zio pipeline illumos/illumos-gate@2258ad0b75 Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: George Wilson <george.wilson@delphix.com>	2019-06-03 19:25:53 +00:00
Alexander Motin	4d6afba5e0	MFV r348555: 9690 metaslab of vdev with no space maps was flushed during removal illumos/illumos-gate@4e75ba6826 Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Serapheim Dimitropoulos <serapheim@delphix.com>	2019-06-03 19:03:24 +00:00
Alexander Motin	1b61262505	MFC r348554: 9688 aggsum_fini leaks memory illumos/illumos-gate@29bf2d68be Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Reviewed by: Jorgen Lundman <lundman@lundman.net> Reviewed by: Igor Kozhukhov <igor@dilos.org> Approved by: Robert Mustacchi <rm@joyent.com> Author: Paul Dagnelie <pcd@delphix.com>	2019-06-03 19:00:24 +00:00
Alexander Motin	677ef2563d	MFV r348553: 9681 ztest failure in spa_history_log_internal due to spa_rename() illumos/illumos-gate@6aee0ad769 Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com>	2019-06-03 18:32:56 +00:00
Alexander Motin	74f7070445	MFV r348552: 9682 page fault in dsl_async_clone_destroy() while opening pool illumos/illumos-gate@ade2c82828 Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Sara Hartse <sara.hartse@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Serapheim Dimitropoulos <serapheim@delphix.com>	2019-06-03 17:56:44 +00:00
Alexander Motin	2eff60e998	MFV r348551: 9862 fix typo in comment in vdev_impl.h illumos/illumos-gate@84927f52bd Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Robert Mustacchi <rm@joyent.com> Author: Allan Jude <allanjude@freebsd.org>	2019-06-03 17:44:47 +00:00
Alexander Motin	bd2ae688a4	MFV r348550: 1700 Add SCSI UNMAP support illumos/illumos-gate@047c81d31d Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com> Reviewed by: Garrett D'Amore <garrett@damore.org> Reviewed by: Igor Kozhukhov <igor@dilos.org> Approved by: Dan McDonald <danmcd@joyent.com> Author: Saso Kiselkov <saso.kiselkov@nexenta.com> This is irrelevant to FreeBSD, just a diff reduction.	2019-06-03 17:43:32 +00:00
Alexander Motin	d40f6a585a	MFV r348548: 9617 too-frequent TXG sync causes excessive write inflation illumos/illumos-gate@7928f4baf4 Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com>	2019-06-03 17:40:11 +00:00
Alexander Motin	b3ed2d08e4	MFV r348537: 8601 memory leak in get_special_prop() illumos/illumos-gate@e19b450bec Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Sara Hartse <sara.hartse@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Robert Mustacchi <rm@joyent.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: John Gallagher <john.gallagher@delphix.com>	2019-06-03 17:29:57 +00:00
Alexander Motin	c066dcc074	MFV r348535: 9677 panic from zio_write_gang_block() when creating dump device on fragmented rpool illumos/illumos-gate@7341a7de4f Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Brad Lewis <brad.lewis@delphix.com>	2019-06-03 17:27:25 +00:00
Allan Jude	ad579d984f	Fix assertion in ZFS TRIM code Due to an attempt to check two conditions at once in a macro not designed as such, the assertion would always evaluate to true. #define VERIFY3_IMPL(LEFT, OP, RIGHT, TYPE) do { \ const TYPE __left = (TYPE)(LEFT); \ const TYPE __right = (TYPE)(RIGHT); \ if (!(__left OP __right)) \ assfail3(#LEFT " " #OP " " #RIGHT, \ (uintmax_t)__left, #OP, (uintmax_t)__right, \ __FILE__, __LINE__); \ _NOTE(CONSTCOND) } while (0) #define ASSERT3U(x, y, z) VERIFY3_IMPL(x, y, z, uint64_t) Mean that we compared: left = (type == ZIO_TYPE_FREE || psize) OP = "<=" right = (SPA_MAXBLOCKSIZE) If the type was not FREE, 0 is less than SPA_MAXBLOCKSIZE (16MB) If the type is ZIO_TYPE_FREE, 1 is less than SPA_MAXBLOCKSIZE The constraint on psize (physical size of the FREE operation) is never checked against SPA_MAXBLOCKSIZE Reported by: Ka Ho Ng <khng300@gmail.com> Reviewed by: kevans MFC after: 2 weeks Sponsored by: Klara Systems	2019-05-29 20:34:35 +00:00
Justin Hibbits	b2aea1ad8f	powerpc/dtrace: Fix fbt function probing for ELFv2 '.' function names exist only in ELFv1. ELFv2 does away with function descriptors, and look more like they do on powerpc(32) and most other platforms, as direct function pointers. Stop blacklisting regular function names in ELFv2. Submitted by: Brandon Bergren Differential Revision: https://reviews.freebsd.org/D20346	2019-05-27 03:18:56 +00:00
Alexander Motin	83d0c3846d	Allocate buffers smaller than ABD chunk size as linear. This allows to reduce memory waste by letting UMA to put multiple small buffers into one memory page slab. The page sharing means that UMA may not be able to free memory page when some of buffers are freed, but alternatively memory used by that buffer would just be wasted from the beginning. This change follows alike change in ZoL, but unlike Linux (according to my understanding of it from comments) FreeBSD never shares slabs bigger than one memory page, so this should be even less invasive than there. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2019-05-22 18:43:48 +00:00
Allan Jude	57e9b361ba	Fix typo in r348068	2019-05-21 21:39:03 +00:00
Allan Jude	4e7d8292ec	ZFS: Make deadman tunables no longer read-only This allows the user to enable, disable, and adjust the I/O deadman at runtime. This can be especially useful when a pool is backed by remote storage (such as iscsi, ggated, etc). PR: 221906 Submitted by: Fabian Keil <fk@fabiankeil.de> Obtained from: ElectroBSD MFC after: 1 week Sponsored by: Klara Systems Event: Waterloo Hackathon 2019	2019-05-21 21:26:18 +00:00
Conrad Meyer	e2e050c8ef	Extract eventfilter declarations to sys/_eventfilter.h This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h" in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header pollution substantially. EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c files into appropriate headers (e.g., sys/proc.h, powernv/opal.h). As a side effect of reduced header pollution, many .c files and headers no longer contain needed definitions. The remainder of the patch addresses adding appropriate includes to fix those files. LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by sys/mutex.h since r326106 (but silently protected by header pollution prior to this change). No functional change (intended). Of course, any out of tree modules that relied on header pollution for sys/eventhandler.h, sys/lock.h, or sys/mutex.h inclusion need to be fixed. __FreeBSD_version has been bumped.	2019-05-20 00:38:23 +00:00
Justin Hibbits	d69b94bab0	powerpc/dtrace: Actually fix stack traces Fix stack unwinding such that requesting N stack frames in lockstat will actually give you N frames, not anywhere from 0-3 as had been before. lockstat prints the mutex function instead of the caller as the reported locker, but the stack frame is detailed enough to find the real caller. MFC after: 2 weeks	2019-05-17 19:57:08 +00:00
Konstantin Belousov	7c5a46a1bc	Remove resolver_qual from DEFINE_IFUNC/DEFINE_UIFUNC macros. In all practical situations, the resolver visibility is static. Requested by: markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Approved by: so (emaste) Differential revision: https://reviews.freebsd.org/D20281	2019-05-16 22:20:54 +00:00
Alexander Motin	7763842174	Add mutex_destroy() missed in r334844. MFC after: 1 week Sponsored by: iXsystems, Inc.	2019-04-26 19:02:21 +00:00
Alexander Motin	32d8034f77	Fix minor mismerges. No functional change. MFC after: 1 week	2019-04-26 18:25:59 +00:00
Alexander Motin	48ecceba1e	Change the way FreeBSD GID inheritance is hacked. I believe previous ifdef caused NULL dereference in later zfs_log_create() on attempt to create file inside directory belonging to ephemeral group created on illumos, trying to write to log information about GID domain of the newly created file, inheriting the ephemeral GID. This patch reuses original illumos SGID code with exception that due to lack of ID mapping code on FreeBSD ephemeral GID will turn into GID_NOBODY by another ifdef inside zfs_fuid_map_id(). MFC after: 1 month Sponsored by: iXsystems, Inc.	2019-04-19 15:44:45 +00:00
Justin Hibbits	e9aae3496e	powerpc/dtrace: Fix dtrace powerpc asm, and simplify stack walking Fix some execution bugs in the dtrace powerpc asm. addme pulls in the carry flag which we don't want, and the result wasn't recorded anyways, so the following beq to check for exit condition wasn't checking the right condition. Simplify the stack walking in dtrace_isa.c, so there's only a single walker that handles both pc and sp. This should make it easier to follow, and any bugfix that may be needed for walking only needs to be made in one place instead of two now. MFC after: 2 weeks	2019-04-13 03:32:21 +00:00
Mariusz Zaborski	a1304030b8	Introduce funlinkat syscall that allows us to check if we are removing the file associated with the given file descriptor. Reviewed by: kib, asomers Reviewed by: cem, jilles, brooks (they reviewed previous version) Discussed with: pjd, and many others Differential Revision: https://reviews.freebsd.org/D14567	2019-04-06 09:34:26 +00:00
Conrad Meyer	a8a16c7128	Replace read_random(9) with more appropriate arc4rand(9) KPIs Reviewed by: ae, delphij Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D19760	2019-04-04 01:02:50 +00:00
Pawel Jakub Dawidek	af4f9e5f00	If the autoexpand pool property is turned on and vdev is healthy try to expand the pool automatically when we detect underlying GEOM provider size change. Obtained from: Fudo Security Tested in: AWS	2019-03-30 07:29:20 +00:00
Andriy Gapon	76a63ee510	Revert r345410, VOP_FSYNC change in ZFS vdev_file I overlooked the fact that that VOP_FSYNC() call is not a FreeBSD VFS call, but a macro that provides an illumos-compatible wrapper for the FreeBSD operation. PR: 236475 Reported by: lwhsu Pointyhat to: avg	2019-03-22 17:44:47 +00:00
Andriy Gapon	c1581f4f4d	ZFS vdev_file: use correct value for waitfor parameter of VOP_FSYNC PR: 236475 Reported by: asomers MFC after: 2 weeks	2019-03-22 09:11:45 +00:00
Mark Johnston	a4b59d3db6	Use an explicit comparison with VM_GUEST_NO. Reported by: jhb MFC with: r345359 Sponsored by: The FreeBSD Foundation	2019-03-21 20:07:50 +00:00
Mark Johnston	e362e590f9	Don't attempt to measure TSC skew when running as a VM guest. It simply doesn't work in general since VCPUs may migrate between physical cores. The approach used to measure skew also doesn't make much sense in a VM. PR: 218452 MFC after: 2 weeks Sponsored by: The FreeBSD Foundation	2019-03-21 02:52:22 +00:00
Alexander Motin	6bb46107d8	MFV r336930: 9284 arc_reclaim_thread has 2 jobs `arc_reclaim_thread()` calls `arc_adjust()` after calling `arc_kmem_reap_now()`; `arc_adjust()` signals `arc_get_data_buf()` to indicate that we may no longer be `arc_is_overflowing()`. The problem is, `arc_kmem_reap_now()` can take several seconds to complete, has no impact on `arc_is_overflowing()`, but due to how the code is structured, can impact how long the ARC will remain in the `arc_is_overflowing()` state. The fix is to use separate threads to: 1. keep `arc_size` under `arc_c`, by calling `arc_adjust()`, which improves `arc_is_overflowing()` 2. keep enough free memory in the system, by calling `arc_kmem_reap_now()` plus `arc_shrink()`, which improves `arc_available_memory()`. illumos/illumos-gate@de753e34f9 Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Dan McDonald <danmcd@joyent.com> Reviewed by: Tim Kordas <tim.kordas@joyent.com> Approved by: Garrett D'Amore <garrett@damore.org> Author: Brad Lewis <brad.lewis@delphix.com>	2019-03-15 18:59:04 +00:00
Simon J. Gerraty	f5fdf82d82	Add _PC_ACL_* to vop_stdpathconf This avoid EINVAL from tmpfs etc. Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D19512	2019-03-11 20:40:56 +00:00
Alexander Motin	aa8676f25d	Revert minor part of r344934. I tried to save some CPU time on hopeless aggregation attempts, but it seems the condition I added is overly strict, blocking also aggregation of optional I/Os in cases which previously were possible. Revert just to be safe. MFC after: 1 month	2019-03-11 17:39:09 +00:00
Alexander Motin	5ca679e3c4	MFV/ZoL: Disable LBA weighting on files and SSDs The LBA weighting makes sense on rotational media where the outer tracks have twice the bandwidth of the inner tracks. However, it is detrimental on nonrotational media such as solid state disks, where the only effect is to ensure that metaslabs enter the best-fit allocation behavior sooner, which is detrimental to performance. It also makes no sense on files where the underlying filesystem can arrange things however it wants. Author: Richard Yao <ryao@gentoo.org> Signed-off-by: Richard Yao <ryao@gentoo.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3712 zfsonlinux/zfs@fb40095f5f To reduce code divergence this merge replaces equivalent but different FreeBSD code detecting non-rotating medium vdevs. MFC after: 1 month	2019-03-08 21:13:45 +00:00
Alexander Motin	673544c3dd	Add separate aggregation limit for non-rotating media. Before sequential scrub patches ZFS never aggregated I/Os above 128KB. Sequential scrub bumped that to 1MB, which motivation I understand for spinning disks, since it should reduce number of head seeks. But for SSDs it makes much less sense to me, especially on FreeBSD, where due to MAXPHYS limitation device will likely still see bunch of 128KB I/Os instead of one large. Having more strict aggregation limit allows to avoid allocation of large memory buffer and memcpy to/from it, that is a serious problem when bandwidth reaches few GB/s. MFC after: 1 month Sponsored by: iXsystems, Inc.	2019-03-08 19:38:52 +00:00
Alexander Motin	3a3ba532e7	MFV/ZoL: Fix zfs_vdev_aggregation_limit bounds checking Update the bounds checking for zfs_vdev_aggregation_limit so that it has a floor of zero and a maximum value of the supported block size for the pool. Additionally add an early return when zfs_vdev_aggregation_limit equals zero to disable aggregation. For very fast solid state or memory devices it may be more expensive to perform the aggregation than to issue the IO immediately. Author: Brian Behlendorf <behlendorf1@llnl.gov> zfsonlinux/zfs@a58df6f536 MFV/ZoL: Cap maximum aggregate IO size Commit `8542ef8` allowed optional IOs to be aggregated beyond the specified aggregation limit. Since the aggregation limit was also used to enforce the maximum block size, setting `zfs_vdev_aggregation_limit=16777216` could result in an attempt to allocate an ABD larger than 16M. Author: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #6259 Closes #6270 zfsonlinux/zfs@2d678f779a	2019-03-08 18:49:27 +00:00
Alexander Motin	ede8782611	Improve entropy for ZFS taskqueue selection. I just found that at least on Skylake CPUs cpu_ticks() never returns odd values, only even, and possibly has even bigger step (176/2?), that makes its lower bits very bad entropy source, leaving half of taskqueues unused. Switch to sbinuptime(), closer to upstreams, mitigates the problem by the rate conversion working as kind of hash function. In case that is somehow not enough (timer rate is too low or too divisible) mix in curcpu. MFC after: 1 week	2019-03-07 22:56:39 +00:00
Alexander Motin	551b7d3a29	Add respective tunables to few ZFS sysctls. MFC after: 1 week	2019-03-07 01:24:08 +00:00
Pawel Jakub Dawidek	b8da50d526	Improve readability of the code by making it explicit where the 'c' variable starts. It is also more consistent with similar code in this file.	2019-03-01 05:54:13 +00:00
Mark Johnston	8e7127fd91	Fix fasttrap_sig{trap,segv}(). - Don't leak the ksiginfo structure. - Hold the proc lock when sending a signal in fasttrap_sigsegv(). MFC after: 1 week Sponsored by: The FreeBSD Foundation	2019-02-26 18:20:41 +00:00
Mark Johnston	5563c675b3	Revert r344587. The fasttrap_isa.h header is needed by libdtrace, not just the kernel.	2019-02-26 17:33:56 +00:00
Mark Johnston	df59ed0787	Remove illumos-specific code from the x86 fasttrap_isa.c. The file has not been touched upstream in over a decade, and the nature of the code means that a lot of FreeBSD-specific bits are required. Remove the dead code to improve readability. No functional change intended. Discussed with: cem MFC after: 1 week Sponsored by: The FreeBSD Foundation	2019-02-26 16:34:43 +00:00
Mark Johnston	6829dae12b	Remove stub fasttrap implementations. No platforms except i386, amd64 and powerpc implement fasttrap; the fasttrap files for other arches do not contain any code and bloat the output from cscope, so just remove them. MFC after: 1 week	2019-02-26 16:31:47 +00:00
Mark Johnston	f23e684bbf	Commit a missing piece of r344452. MFC with: r344452	2019-02-21 22:56:54 +00:00
Mark Johnston	4f1b715c84	Fix a tracepoint lookup race in fasttrap_pid_probe(). fasttrap hooks the userspace breakpoint handler; the hook looks up the breakpoint address in a hash table of tracepoints. It is possible for the tracepoint to be removed by a different thread in between the breakpoint trap and the hash table lookup, in which case SIGTRAP gets delivered to the target process. Fix the problem by adding a per-process generation counter that gets incremented when a tracepoint belonging to that process is removed. Then, when a lookup fails, the trapping instruction is restarted if the thread's counter doesn't match that of the process. Reviewed by: cem MFC after: 2 weeks Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D19273	2019-02-21 22:54:17 +00:00
Pawel Jakub Dawidek	2691ae3230	Simplify the code. No functional changes. Reviewed by: rpokala	2019-02-20 00:25:45 +00:00
Pawel Jakub Dawidek	91853b8546	Simplify the code.	2019-02-19 23:53:33 +00:00
Pawel Jakub Dawidek	01e21ead90	Correct typo in the comment.	2019-02-19 23:44:00 +00:00
Pawel Jakub Dawidek	99ab63b69d	Change assertion to log the incorrect io_type we've got.	2019-02-19 23:43:15 +00:00
Pawel Jakub Dawidek	36d43b5dfe	Garbage-collect no longer used variable.	2019-02-19 23:41:23 +00:00
Pawel Jakub Dawidek	11c8759337	The way ZFS searches for its vdevs is the following: first it looks for a vdev that has the same name as the one stored in metadata and that has all VDEV labels in place. If it cannot find a GEOM provider with the given name and all VDEV labels it will scan all GEOM providers for the best match (the most VDEV labels available), but here the name is ignored. In case the ZFS pool is created, eg. using GPT partition label: # zpool create tank /dev/gpt/tank everything works, and on every import ZFS will pick /dev/gpt/tank and not /dev/da0p4. The problem occurs when da0p4 is extended and ZFS is unable to find all VDEV labels in /dev/gpt/tank anymore (the VDEV labels stored at the end of the partition are now somewhere else). In this case it will scan all GEOM providers and will pick the first one with the best match, ie. da0p4. Fix this problem by checking the VDEV/provider name even if we get the same match. If the name is the same as the one we have in pool's metadata, prefer this GEOM provider. Reported by: oshogbo, Michal Mroz <m.mroz@fudosecurity.com> Tested by: Michal Mroz <m.mroz@fudosecurity.com> Obtained from: Fudo Security	2019-02-19 23:35:55 +00:00
Pawel Jakub Dawidek	d793cf7019	In the vdev_geom_open_by_path() function we assume that vdev path starts with "/dev/". Make sure this is the case.	2019-02-19 23:22:39 +00:00
Alexander Motin	ed0a3e8637	s/Maximal/Maximum/ in sysctl description. Submitted by: smh MFC after: 1 week	2019-02-04 20:09:22 +00:00
Alexander Motin	ef08154150	Add missed tunables/sysctls for some new vdev variables. While there, make few existing sysctls writeable, since there is no reason not to. MFC after: 1 week	2019-02-04 16:13:41 +00:00
Alexander Motin	54cde30f92	Remove BIO_ORDERED flag from BIO_FLUSH sent by ZFS. In all cases where ZFS sends BIO_FLUSH, it first waits for all related writes to complete, so its BIO_FLUSH does not care about strict ordering. Removal of one makes life much easier at least for NVMe driver, which hardware has no concept of request ordering, relying completely on software. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2019-01-30 17:39:44 +00:00
Mariusz Zaborski	db009dddfd	zfs: allow to change cache flush sysctl There is no reason for this variable to be tunable. This variable is used as a barrier in few places. Discussed with: pjd MFC after: 2 weeks Sponsored by: Fudo Security	2019-01-26 13:53:00 +00:00
Sean Eric Fagan	82e20c0a72	Change ZFS quotas to return EINVAL when not present (matches man page). UFS will return EINVAL when quotas are not enabled on a filesystem; ZFS' equivalent involves not having quotas (there is no way to enable or disable quotas as such). My initial implementation had it return ENOENT, but quotactl(2) indicates EINVAL is more appropriate. MFC after: 2 weeks Approved by: mav Reviewed by: markj Reported by: Emrion <kmachine@free.fr> Sponsored by: iXsystems Inc PR: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=234413	2019-01-11 02:53:46 +00:00
Matt Macy	27e05a1902	zfsboot: support newer ZFS versions declare v3 objset size/layout to fix userboot and possibly other loader issues - fix for userboot assertion failure in zfs_dev_close in free due to out of bounds write - fix for zfs_alloc / zfs_free mismatch assertion failure when booting GPT on BIOS	2019-01-03 22:49:11 +00:00
Andriy Gapon	4c325393f3	MFV r342532: 5882 Temporary pool names Note that this commit brings only formatting changes that were done during the final review of the illumos change, because FreeBSD got the main changes before illumos. illumos/illumos-gate@04e5635652 `04e5635652` https://www.illumos.org/issues/5882 This is an import of the temporary pool names functionality from ZoL: `e2282ef57e` `26b42f3f9d` `2f3ec90061` `00d2a8c92f` `83e9986f6e` `023bbe6f01` It is intended to assist the creation and management of virtual machines that have their rootfs on ZFS on hosts that also have their rootfs on ZFS. These situations cause SPA namespace collisions when the standard name rpool is used in both cases. The solution is either to give each guest pool a name unique to the host, which is not always desirable, or boot a VM environment containing an ISO image to install it, which is cumbersome. MFC after: 1 week Sponsored by: Panzura	2018-12-26 11:03:14 +00:00
Andriy Gapon	f050611e7f	MFV r342469: 9630 add lzc_rename and lzc_destroy to libzfs_core illumos/illumos-gate@049ba636fa `049ba636fa` https://www.illumos.org/issues/9630 Rename and destroy are very useful operations that deserve to be in libzfs_core. And they are not hard to implement too. MFC after: 2 weeks Relnotes: maybe	2018-12-26 10:37:41 +00:00
Mateusz Guzik	8ca79fbd4a	dtrace: fix userspace access on boxes with SMAP dtrace has its own routines which were not updated after SMAP support got implemented. Use ifunc just like for other routines. This in particular fixes ustack(). Reviewed by: markj Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D18542	2018-12-13 20:09:38 +00:00
Mateusz Guzik	cc426dd319	Remove unused argument to priv_check_cred. Patch mostly generated with coccinelle: @@ expression E1,E2; @@ - priv_check_cred(E1,E2,0) + priv_check_cred(E1,E2) Sponsored by: The FreeBSD Foundation	2018-12-11 19:32:16 +00:00
Toomas Soome	7aaf685ba7	zfs: we can boot from dataset with large_dnode enabled loader has been supporting large_dnode for some time, no need to block the feature for boot dataset. Reviewed by: avg MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D18391	2018-12-03 19:35:21 +00:00
Mark Johnston	6d2e2df764	Ensure that directory entry padding bytes are zeroed. Directory entries must be padded to maintain alignment; in many filesystems the padding was not initialized, resulting in stack memory being copied out to userspace. With the ino64 work there are also some explicit pad fields in struct dirent. Add a subroutine to clear these bytes and use it in the in-tree filesystems. The NFS client is omitted for now as it was fixed separately in r340787. Reported by: Thomas Barabosch, Fraunhofer FKIE Reviewed by: kib MFC after: 3 days Sponsored by: The FreeBSD Foundation	2018-11-23 22:24:59 +00:00
Alexander Motin	eecd0a1856	Revert r340096: 9952 Block size change during zfs receive drops spill block It was reported, and I easily reproduced it, that this change triggers panic when receiving replication stream with enabled embedded blocks, when short file compressing into one embedded block changes its block size. I am not sure that the problem is in this particular patch, not just triggered by it, but since investigation and fix will take some time, I've decided to revert this for now. PR: 198457, 233277	2018-11-21 18:18:57 +00:00
Mark Johnston	544e0a4f69	Use taskqueue_quiesce(9) to implement taskq_wait(). PR: 227784 Reviewed by: cem MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D17975	2018-11-21 17:19:08 +00:00
Justin Hibbits	cfebc0faa7	DTrace/powerpc: Fix FBT return probes The FBT function boundary prober was setting one return probe marker value, but the dtrace handler was expecting another. This causes a hang when tracing return probes.	2018-11-21 16:47:11 +00:00
Konstantin Belousov	1c4ca77890	Add d_off support for multiple filesystems. The d_off field has been added to the dirent structure recently. Currently filesystems don't support this feature. Support has been added and tested for zfs, ufs, ext2fs, fdescfs, msdosfs and unionfs. A stub implementation is available for cd9660, nandfs, udf and pseudofs but hasn't been tested. Motivation for this feature: our usecase is for a userspace nfs server (nfs-ganesha) with zfs. At the moment we cache direntry offsets by calling lseek once per entry, with this patch we can get the offset directly from getdirentries(2) calls which provides a significant speedup. Submitted by: Jack Halford <jack@gandi.net> Reviewed by: mckusick, pfg, rmacklem (previous versions) Sponsored by: Gandi.net MFC after: 1 week Differential revision: https://reviews.freebsd.org/D17917	2018-11-14 14:18:35 +00:00
Alexander Motin	1fcdb58634	Do not ignore arc_adjust() return value. This covers scenario when ARC may not shrink as fast as it could: 1. arc_size < arc_c and arc_adjust() does not evict anything, returning zero to arc_reclaim_thread(); 2. arc_available_memory() reports memory pressure, which can not be satisfied by arc_kmem_reap_now(); 3. arc_shrink() reduces arc_c and calls arc_adjust(), return of which is ignored; 4. even if the last arc_adjust() could not satisfy arc_size < arc_c, arc_reclaim_thread() will still go to sleep, since the first one returned zero. Reviewed by: allanjude, markj, sef MFC after: 2 weeks Sponsored by: iXsystems, Inc. Differential Revision: https://reviews.freebsd.org/D17927	2018-11-10 01:58:37 +00:00
Alexander Motin	b4d66a1739	9952 Block size change during zfs receive drops spill block Replication code in receive_object() falsely assumes that if received object block size is different from local, then it must be a new object and calls dmu_object_reclaim() to wipe it out. In most cases it is not a problem, since all dnode, bonus buffer and data block(s) are immediately rewritten any way, but the problem is that spill block (if used) is not. This means loss of ACLs, extended attributes, etc. This issue can be triggered in very simple way: 1. create 4KB file with 10+ ACL entries; 2. take snapshot and send it to different dataset; 3. append another 4KB to the file; 4. take another snapshot and send incrementally; 5. witness ACL loss on receive side. PR: 198457 Discussed with: mahrens MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2018-11-03 03:10:06 +00:00
Brooks Davis	1493c2ee62	Make vop_symlink take a const target path. This will enable callers to take const paths as part of syscall declaration improvements. Where doing so is easy and non-disruptive carry the const through implementations. In UFS the value is passed to an interface that must take non-const values. In ZFS, const poisoning would touch code shared with upstream and it's not worth adding diffs. Bump __FreeBSD_version for external API consumers. Reviewed by: kib (prior version) Obtained from: CheriBSD Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D17805	2018-11-02 14:42:36 +00:00
Konstantin Belousov	4f77f48884	Implement O_BENEATH and AT_BENEATH. Flags prevent open(2) and *at(2) vfs syscalls name lookup from escaping the starting directory. Supposedly the interface is similar to the same proposed Linux flags. Reviewed by: jilles (code, previous version of manpages), 0mp (manpages) Discussed with: allanjude, emaste, jonathan Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D17547	2018-10-25 22:16:34 +00:00
Justin Hibbits	97a9d3b5c5	powerpc/dtrace: Use explicit bit numbers to mask out PSL_EE There seems to be a race in CI, such that dtrace_asm.S might be assembled before the genassym is completed. This causes a build failure when PSL_EE doesn't exist, and is read as 0. Get around this by explicitly specifying the bits in the mask instead.	2018-10-21 02:08:57 +00:00
Alexander Motin	2cb74ed856	Skip VDEV_IO_DONE stage only for ZIO_TYPE_FREE. Device removal code uses zio_vdev_child_io() with ZIO_TYPE_NULL parent, that never happened before. It confused FreeBSD-specific TRIM code, which does not use VDEV_IO_DONE for logical ZIO_TYPE_FREE ZIOs. As a result of that stage being skipped, device removal ZIOs leaked references and memory that were supposed to be freed by VDEV_IO_DONE, making it stuck. It is a quick patch rather than a nice fix, but hopefully we'll be able to drop it altogether when the alternative TRIM implementation finally lands. PR: 228750, 229007 Discussed with: allanjude, avg, smh Approved by: re (delphij) MFC after: 5 days Sponsored by: iXsystems, Inc.	2018-10-15 21:59:24 +00:00
John Baldwin	73efa2fbd1	Various fixes for TLB management on RISC-V. - Remove the arm64-specific cpu_cache and cpu_tlb_flush* functions. Instead, add RISC-V specific inline functions in cpufunc.h for the fence.i and sfence.vma instructions. - Catch up to changes in the arm64 pmap and remove all the cpu_dcache_* calls, pmap_is_current, pmap_l3_valid_cacheable, and PTE_NEXT bits from pmap. - Remove references to the unimplemented riscv_setttb(). - Remove unused cpu_nullop. - Add a link to the SBI doc to sbi.h. - Add support for a 4th argument in SBI calls. It's not documented but it seems implied for the asid argument to SBI_REMOVE_SFENCE_VMA_ASID. - Pass the arguments from sbi_remote_sfence() to the SEE. BBL ignores them so this is just cosmetic. - Flush icaches on other CPUs when they resume from kdb in case the debugger wrote any breakpoints while the CPUs were paused in the IPI_STOP handler. - Add SMP vs UP versions of pmap_invalidate_ similar to amd64. The UP versions just use simple fences. The SMP versions use the sbi_remove_sfence*() functions to perform TLB shootdowns. Since we don't have a valid pm_active field in the riscv pmap, just IPI all CPUs for all invalidations for now. - Remove an extraneous TLB flush from the end of pmap_bootstrap(). - Don't do a TLB flush when writing new mappings in pmap_enter(), only if modifying an existing mapping. Note that for COW faults a TLB flush is only performed after explicitly clearing the old mapping as is done in other pmaps. - Sync the i-cache on all harts before updating the PTE for executable mappings in pmap_enter and pmap_enter_quick. Previously the i-cache was only sync'd after updating the PTE in pmap_enter. - Use sbi_remote_fence() instead of smp_rendezvous in pmap_sync_icache(). Reviewed by: markj Approved by: re (gjb, kib) Sponsored by: DARPA Differential Revision: https://reviews.freebsd.org/D17414	2018-10-15 18:56:54 +00:00
Mateusz Guzik	bca84f54ce	zfs: fix a panic after failed mount r338927("zfs: depessimize zfs_root with rmlocks") failed to error check the mount before caching root vnode. Results in crashes in rrw_enter_read_impl tracing back to zfs_mount. Reported by: Mike Tancsa Tested by: allanjude Approved by: re (kib)	2018-10-14 16:14:01 +00:00
Alexander Motin	178777f516	Avoid zero-sized kmem_alloc() in vdev_compact_children(). The device evacuation code adds a dependency that vdev_compact_children() be able to properly empty the vdev_child array by setting it to NULL and zeroing vdev_children. Under Linux, kmem_alloc() and related functions return a sentinel pointer rather than NULL for zero-sized allocations. This is a part of ZoL port of device removal patch: commit `a1d477c24c` Author: Matthew Ahrens <mahrens@delphix.com> Ported-by: Tim Chase <tim@chase2k.com> Approved by: re (kib) MFC after: 1 week	2018-10-12 16:55:28 +00:00
Alexander Motin	770ce5c3bf	Add ZIO_TYPE_FREE support for indirect vdevs. Upstream code expects only ZIO_TYPE_READ and some ZIO_TYPE_WRITE requests to removed (indirect) vdevs, while on FreeBSD there is also ZIO_TYPE_FREE (TRIM). ZIO_TYPE_FREE requests do not have the data buffers, so don't need the pointer adjustment. PR: 228750, 229007 Reviewed by: allanjude, sef Approved by: re (kib) MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D17523	2018-10-12 15:14:22 +00:00
Allan Jude	c79b58ccc5	Pull in a follow-on commit to resolve a deadlock in ZFS sequential resilver (r334844) MFV/ZoL: Fix deadlock in IO pipeline commit `a76f3d0437` Author: Brian Behlendorf <behlendorf1@llnl.gov> Date: Fri Mar 16 16:46:06 2018 -0700 Fix deadlock in IO pipeline In vdev_queue_aggregate() the zio_execute() bypass should not be called under the vdev queue lock. This can result in a deadlock as shown in the stack traces below. Drop the vdev queue lock then walk the parents of the aggregate IO to determine the list of component IOs to be bypassed. This can be done safely without holding the io_lock since the new aggregate IO has not yet been returned and its parents cannot change. --- THREAD 1 --- arc_read() zio_nowait() zio_vdev_io_start() vdev_queue_io() <--- mutex_enter(vq->vq_lock) vdev_queue_io_to_issue() vdev_queue_aggregate() zio_execute() vdev_queue_io_to_issue() vdev_queue_aggregate() zio_execute() zio_vdev_io_assess() zio_wait_for_children() <- mutex_enter(zio->io_lock) --- THREAD 2 --- (inverse order) arc_read() zio_change_priority() <- mutex_enter(zio->zio_lock) vdev_queue_change_io_priority() <- mutex_enter(vq->vq_lock) Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Reported by: ZFS Leadership Meeting Reviewed by: mav Approved by: re (kib) Obtained from: ZFS-on-Linux MFC after: 2 weeks Sponsored by: Klara Systems Differential Revision: https://reviews.freebsd.org/D17495	2018-10-10 22:59:15 +00:00
Allan Jude	bee8a18986	Add missing sysctls for tuning vdev queue depths for new I/O types This connects new tunables that were added but not exposed in: r329502 (zpool remove) r337007 (zpool initialize) Reviewed by: avg Approved by: re (kib) MFC after: 2 weeks Sponsored by: Klara Systems Differential Revision: https://reviews.freebsd.org/D17494	2018-10-10 22:55:31 +00:00
Allan Jude	cd00e3e1af	Resolve a hang in ZFS during vnode reclamation This is caused by a deadlock between zil_commit() and zfs_zget() Add a way for zfs_zget() to break out of the retry loop in the common case PR: 229614 Reported by: grembo, Andreas Sommer, many others Tested by: Andreas Sommer, Vicki Pfau Reviewed by: avg (no objection) Approved by: re (gjb) MFC after: 2 months Sponsored by: Klara Systems Differential Revision: https://reviews.freebsd.org/D17460	2018-10-10 19:39:47 +00:00
Alexander Motin	7a492ba93c	Remove extra thread_exit() call left after r329802. spa_condense_indirect_thread() is no longer a thread function, but just a callback for new zthr KPI. Submitted by: allanjude Approved by: re (gjb) MFC after: 3 days	2018-10-10 16:34:53 +00:00
Alexander Motin	f3b515aea5	Fix r336951 mismerge -- use of uninitialized variable. Reported by: tsoome Approved by: re (gjb) MFC after: 3 days	2018-10-08 15:19:03 +00:00
Alexander Motin	1f55b2a4b5	Add sysctls for dbuf metadata cache variables added in r336959. Approved by: re (gjb) MFC after: 1 week	2018-10-05 16:05:59 +00:00
Allan Jude	9d967dd27d	Avoid panic when adjusting priority of a read in the face of an IO error PR: 231516 Reported by: sbruno Approved by: re (rgrimes) Obtained from: ZFS-on-Linux X-MFC-with: 334844 Sponsored by: Klara Systems MFV/ZoL: Fix zio->io_priority failed (7 < 6) assert commit `c26cf0966d` Author: Tony Hutter <hutter2@llnl.gov> Date: Tue May 29 18:13:48 2018 -0700 Fix zio->io_priority failed (7 < 6) assert This fixes an assert in vdev_queue_change_io_priority(): VERIFY3(zio->io_priority < ZIO_PRIORITY_NUM_QUEUEABLE) failed (7 < 6) PANIC at vdev_queue.c:832:vdev_queue_change_io_priority() Reviewed-by: Tom Caputi <tcaputi@datto.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tony Hutter <hutter2@llnl.gov>	2018-09-29 01:26:07 +00:00
Mateusz Guzik	af534f8d99	zfs: depessimize zfs_root with rmlocks Currently vfs calls the root method on each absolute lookup and when crossing mount points. zfs_root ends up looking up the inode internally as if it was not instantiated which results in significant lock contention on systems like EPYC. Store the vnode in the mount point and protect the access with rmlocks. This is a temporary hack for 12.0. Sample result: before: make -s -j 128 buildkernel 2778.09s user 3319.45s system 8370% cpu 1:12.85 total after: make -s -j 128 buildkernel 3199.57s user 1772.78s system 8232% cpu 1:00.40 total Tested by: pho (zfs mount/unmount tests) Reviewed by: kib, mav, sef (different parts) Approved by: re (gjb) Differential Revision: https://reviews.freebsd.org/D17233	2018-09-25 17:58:06 +00:00
Alexander Motin	ae1b0b825a	MFV r338866: 9700 ZFS resilvered mirror does not balance reads illumos/illumos-gate@82f63c3c2b Reviewed by: Toomas Soome <tsoome@me.com> Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Matthew Ahrens <mahrens@delphix.com> Author: Jerry Jelinek <jerry.jelinek@joyent.com> Approved by: re (delphij)	2018-09-21 21:56:00 +00:00
Konstantin Belousov	a6ade1a07b	Fix ZFS VFS op quotactl to follow busy protocol. Reviewed by: avg, mckusick Tested by: pho Sponsored by: The FreeBSD Foundation Approved by: re (gjb) MFC after: 1 week Differential revision: https://reviews.freebsd.org/D17208	2018-09-19 14:38:01 +00:00
Mark Johnston	6368b4e471	Fix an nvpair leak in vdev_geom_read_config(). Also change the behaviour slightly: instead of freeing "config" if the last nvlist doesn't pass the tests, return the last config that did pass those tests. This matches the comment at the beginning of the function. PR: 230704 Diagnosed by: avg Reviewed by: asomers, avg Tested by: Mark Martinec <Mark.Martinec@ijs.si> Approved by: re (gjb) MFC after: 1 week Sponsored by: The FreeBSD Foundation Differential revision: https://reviews.freebsd.org/D17202	2018-09-17 16:16:57 +00:00
Eric van Gyzen	73511c241b	Set zfs_arc_meta_strategy to metadata only The previous default of "balanced" appears to have caused pathological behavior, including very poor performance and 100% CPU load in the arc_reclaim_thread. The symptoms appeared when the daily periodic run started. With this change, the system--and the ARC in particular--behaved normally during a manual daily periodic run. From Mark Johnston: The port of the balanced strategy is incomplete, since arc_prune_async() is a no-op on FreeBSD. (This also seems to imply that r337653 is a no-op.) After 12 is branched we can port the remaining bits and consider changing the default back. Submitted by: markj (essentially) Reviewed by: markj Approved by: re (gjb) Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D17156	2018-09-13 17:56:48 +00:00
Ruslan Bukin	378a495661	Add support for 'C'-compressed ISA extension to DTrace FBT provider. Approved by: re (kib) Sponsored by: DARPA, AFRL	2018-09-03 14:34:09 +00:00
Mark Johnston	a9d49f9e64	Fix the hash table lookup in fbt_destroy(). Reported and tested by: pho Approved by: re (kib) X-MFC with: r338359	2018-09-02 17:02:13 +00:00
Mark Johnston	d7965243c1	Re-compute the ARC size before computing the MFU target. This fixes an upstream regression introduced in r331404, causing overly aggressive reclamation of the ARC when under pressure. Diagnosed by: Paul <devgs@ukr.net> Approved by: re (gjb) MFC after: 3 days	2018-08-31 21:45:05 +00:00
Mark Johnston	f554293615	Re-add kstat.zfs.misc.arcstats.other_size under COMPAT_FREEBSD11. It is used by a number of applications, notably top(1). Reported by: netchild Reviewed by: allanjude Approved by: re (delphij) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D16943	2018-08-30 13:42:01 +00:00
Konstantin Belousov	f0165b1ca6	Remove {max/min}_offset() macros, use vm_map_{max/min}() inlines. Exposing max_offset and min_offset defines in public headers is causing clashes with variable names, for example when building QEMU. Based on the submission by: royger Reviewed by: alc, markj (previous version) Sponsored by: The FreeBSD Foundation (kib) MFC after: 1 week Approved by: re (marius) Differential revision: https://reviews.freebsd.org/D16881	2018-08-29 12:24:19 +00:00
Mark Johnston	394e8d20d9	Add a sysctl for the ZFS abd_scatter_enabled setting. Submitted by: Yamagi Burmeister <lists@yamagi.org> (original version) Approved by: re (rgrimes) MFC after: 3 days	2018-08-29 02:49:18 +00:00
Mark Johnston	c208cb9923	Allow multiple FBT probes to share a tracepoint. With GNU ifuncs, multiple FBT probes may correspond to the same instruction. fbt_invop() assumed that this could not happen and would return after the first probe found in the global FBT hash table, which might not be the one that's enabled. Fix the problem on x86 by linking probes that share a tracepoint and having each linked probe fire when the tracepoint is hit. PR: 230846 Approved by: re (gjb) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D16921	2018-08-28 20:21:36 +00:00
Alexander Motin	cb892a4117	Unblock speculative prefetcher also on pool creation. Fix at r331950 appeared to be incomplete, fixing only case of pool import, but not pool creation, leaving prefetcher still blocked for newly created pools. Approved by: re (gjb) MFC after: 1 week	2018-08-24 01:59:25 +00:00
Alexander Motin	2efb7660b5	Add dmu_tx_assign() error handling in zfs_unlinked_drain(). The error handling got lost during r334810, while according to the report an error may happen there when the dataset is over quota. In such a case just leave the node in the unlinked list to be freed sometime later. PR: 229887 Sponsored by: iXsystems, Inc.	2018-08-22 16:32:53 +00:00
Alexander Motin	6128ca8683	Create separate taskqueue to call zfs_unlinked_drain(). r334810 introduced zfs_unlinked_drain() dispatch to taskqueue on every deletion of a file with extended attributes. Using system_taskq for that with its multiple threads in case of multiple files deletion caused all available CPU threads to uselessly spin on busy locks, completely blocking the system. Use of single dedicated taskqueue is the only easy solution I've found, while it would be great if we could specify that some task should be executed only once at a time, but never in parallel, while many tasks could use different threads same time. Sponsored by: iXsystems, Inc.	2018-08-22 16:27:24 +00:00
Mark Johnston	b1a90834bb	Set arc_kmem_cache_reap_retry_ms to 0 and make it configurable. r329759 introduced this parameter, which controls the rate at which ZFS UMA zones are drained when the ARC reclaim thread is shrinking the ARC. The reclamation target is derived from the global free page count, and arc_shrink() only frees buffers back to UMA, so the free page count is not updated until the zones are drained. Thus, back-to-back calls to arc_shrink() within the arc_kmem_cache_reap_retry_ms interval do not provide immediate feedback to the arc_reclaim control loop, so we may free more of the ARC than needed to address a transient page shortage. As we do not implement the asynchronous zone draining added in r329759, disable the retry interval, restoring pre-r329759 behaviour. That is, we will drain the ZFS UMA zones before each attempt to shrink the ARC. Reviewed by: mav MFC after: 1 week Sponsored by: The FreeBSD Foundation	2018-08-21 16:37:37 +00:00
Matt Macy	d12e91d584	Make dnode definition uniform on !x86 gcc4 requires -fms-extensions to accept anonymous union members	2018-08-21 03:45:09 +00:00
Alexander Motin	cd2315086a	9751 Allocation throttling misplacing ditto blocks Relax allocation throttling for ditto blocks. Due to random imbalances in allocation it tends to push block copies to one vdev, that looks slightly better at the moment. Slightly less strict policy allows both improve data security and surprisingly write performance, since we don't need to touch extra metaslabs on each vdev to respect the min distance. Sponsored by: iXsystems, Inc.	2018-08-17 15:17:09 +00:00
Alexander Motin	a8e93e3cd7	9738 Fix third block copy allocations, broken at 9112. Use METASLAB_WEIGHT_CLAIM weight to allocate tertiary blocks. Previous use of METASLAB_WEIGHT_SECONDARY for that caused errors later on metaslab_activate_allocator() call, leading to massive load of unneeded metaslabs and write freezes. Reviewed by: Paul Dagnelie <pcd@delphix.com>	2018-08-17 15:00:41 +00:00
Alexander Motin	6d14f2c48f	Make vfs.zfs.zio.dva_throttle_enabled sysctl writable. Not sure what I thought originally, but as I see now runtime changes are working fine, and the code seems like even designed for this.	2018-08-16 18:44:50 +00:00
Jamie Gritton	284001a222	Put jail(2) under COMPAT_FREEBSD11. It has been the "old" way of creating jails since FreeBSD 7. Along with the system call, put the various security.jail.allow_foo and security.jail.foo_allowed sysctls partly under COMPAT_FREEBSD11 (or BURN_BRIDGES). These sysctls had two disparate uses: on the system side, they were global permissions for jails created via jail(2) which lacked fine-grained permission controls; inside a jail, they're read-only descriptions of what the current jail is allowed to do. The first use is obsolete along with jail(2), but keep them for the second-read-only use. Differential Revision: D14791	2018-08-16 18:40:16 +00:00
Alexander Motin	edc391e922	Add couple tunables/sysctl, missed in r336949.	2018-08-16 00:50:14 +00:00
Alexander Motin	8ce70dfcfa	Fix mismerge in r337196. ZoL made the same mistake and fixed it with a separate commit `863522b1f9`: dsl_scan_scrub_cb: don't double-account non-embedded blocks We were doing count_block() twice inside this function, once unconditionally at the beginning (intended to catch the embedded block case) and once near the end after processing the block. The double-accounting caused the "zpool scrub" progress statistics in "zpool status" to climb from 0% to 200% instead of 0% to 100%, and showed double the I/O rate it was actually seeing. This was apparently a regression introduced in commit `00c405b4b5`, which was an incorrect port of this OpenZFS commit: https://github.com/openzfs/openzfs/commit/d8a447a7 Reviewed by: Thomas Caputi <tcaputi@datto.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Steven Noonan <steven@uplinklabs.net> Closes #7720 Closes #7738 Reported by: sef	2018-08-15 21:01:57 +00:00
Matt Macy	fb8f55f586	MFV/ZoL: Add dbuf hash and dbuf cache kstats TODO: KSTAT_TYPE_NAMED support commit `5e021f56d3` Author: Giuseppe Di Natale <dinatale2@users.noreply.github.com> Date: Mon Jan 29 10:24:52 2018 -0800 Add dbuf hash and dbuf cache kstats Introduce kstats about the dbuf hash and dbuf cache to make it easier to inspect state. This should help with debugging and understanding of these portions of the codebase. Correct format of dbuf kstat file. Introduce a dbc column to dbufs kstat to indicate if a dbuf is in the dbuf cache. Introduce field filtering in the dbufstat python script. Introduce a no header option to the dbufstat python script. Introduce a test case to test basic mru->mfu list movement in the ARC. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov> Closes #6906	2018-08-12 03:15:30 +00:00
Matt Macy	13ae5c6ba8	MFV/ZoL: Fix stack dbuf_hold_impl() commit `fc5bb51f08` Author: Brian Behlendorf <behlendorf1@llnl.gov> Date: Thu Aug 26 10:52:00 2010 -0700 Fix stack dbuf_hold_impl() This commit preserves the recursive function dbuf_hold_impl() but moves the local variables and function arguments to the heap to minimize the stack frame size. Enough space is initially allocated on the stack for 20 levels of recursion. This technique was based on commit 34229a2f2ac07363f64ddd63e014964fff2f0671 which reduced stack usage of traverse_visitbp(). Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2018-08-12 02:24:18 +00:00
Matt Macy	6e3d1345d9	fix build DN_MAX_BONUSLEN -> DN_OLD_MAX_BONUSLEN	2018-08-12 02:12:44 +00:00
Matt Macy	0f5add2566	Restore legacy dnode_phys layout on tier 2 arches Evidently gcc4 doesn't support anonymous union members	2018-08-12 02:09:06 +00:00
Matt Macy	104ed324dd	MFV/ZoL: Fix stack noinline commit `60948de1ef` Author: Brian Behlendorf <behlendorf1@llnl.gov> Date: Thu Aug 26 10:58:36 2010 -0700 Fix stack noinline Certain functions must never be automatically inlined by gcc because they are stack heavy or called recursively. This patch flags all such functions I've found as 'noinline' to prevent gcc from making the optimization. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2018-08-12 01:29:30 +00:00
Matt Macy	71d48dbda3	MFV/ZoL: Fix PANIC: metaslab_free_dva(): bad DVA X:Y:Z commit `81edd3e834` Author: Peng <peng.hse@xtaotech.com> Date: Wed Jun 8 15:22:07 2016 +0800 Fix PANIC: metaslab_free_dva(): bad DVA X:Y:Z The following scenario can result in garbage in the dn_spill field. The db->db_blkptr must be set to NULL when DNODE_FLAG_SPILL_BLKPTR is clear to ensure the dn_spill field is cleared. Current txg = A. * A new spill buffer is created. Its dbuf is initialized with db_blkptr = NULL and it's dirtied. Current txg = B. * The spill buffer is modified. It's marked as dirty in this txg. * Additional changes make the spill buffer unnecessary because the xattr fits into the bonus buffer, so it's removed. The dbuf is undirtied in this txg, but it's still referenced and cannot be destroyed. Current txg = C. * Starts syncing of txg A * dbuf_sync_leaf() is called for the spill buffer. Since db_blkptr is NULL, dbuf_check_blkptr() is called. * The dbuf starts being written and it reaches the ready state (not done yet). * A new change makes the spill buffer necessary again. sa_build_layouts() ends up calling dbuf_find() to locate the dbuf. It finds the old dbuf because it has not been destroyed yet (it will be destroyed when the previous write is done and there are no more references). The old dbuf has db_blkptr != NULL. * txg A write is complete and the dbuf released. However it's still referenced, so it's not destroyed. Current txg = D. * Starts syncing of txg B * dbuf_sync_leaf() is called for the bonus buffer. Its contents are directly copied into the dnode, overwriting the blkptr area because, in txg B, the bonus buffer was big enough to hold the entire xattr. * At this point, the db_blkptr of the spill buffer used in txg C gets corrupted. Signed-off-by: Peng <peng.hse@xtaotech.com> Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3937	2018-08-12 01:17:32 +00:00
Matt Macy	6f06a36d47	MFV/ZoL: add dbuf stats NB: disabled pending the addition of KSTAT_TYPE_RAW support to the SPL commit `e0b0ca983d` Author: Brian Behlendorf <behlendorf1@llnl.gov> Date: Wed Oct 2 17:11:19 2013 -0700 Add visibility in to cached dbufs Currently there is no mechanism to inspect which dbufs are being cached by the system. There are some coarse counters in arcstats but they only give a rough idea of what's being cached. This patch aims to improve the current situation by adding a new dbufs kstat. When read this new kstat will walk all cached dbufs linked in to the dbuf_hash. For each dbuf it will dump detailed information about the buffer. It will also dump additional information about the referenced arc buffer and its related dnode. This provides a more complete view in to exactly what is being cached. With this generic infrastructure in place utilities can be written to post-process the data to understand exactly how the caching is working. For example, the data could be processed to show a list of all cached dnodes and how much space they're consuming. Or a similar list could be generated based on dnode type. Many other ways to interpret the data exist based on what kinds of questions you're trying to answer. Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Prakash Surya <surya1@llnl.gov>	2018-08-12 01:10:18 +00:00
Matt Macy	cc0fbbb92e	MFV/ZoL: Implement large_dnode pool feature commit `50c957f702` Author: Ned Bass <bass6@llnl.gov> Date: Wed Mar 16 18:25:34 2016 -0700 Implement large_dnode pool feature Justification ------------- This feature adds support for variable length dnodes. Our motivation is to eliminate the overhead associated with using spill blocks. Spill blocks are used to store system attribute data (i.e. file metadata) that does not fit in the dnode's bonus buffer. By allowing a larger bonus buffer area the use of a spill block can be avoided. Spill blocks potentially incur an additional read I/O for every dnode in a dnode block. As a worst case example, reading 32 dnodes from a 16k dnode block and all of the spill blocks could issue 33 separate reads. Now suppose those dnodes have size 1024 and therefore don't need spill blocks. Then the worst case number of blocks read is reduced from 33 to two--one per dnode block. In practice spill blocks may tend to be co-located on disk with the dnode blocks so the reduction in I/O would not be this drastic. In a badly fragmented pool, however, the improvement could be significant. ZFS-on-Linux systems that make heavy use of extended attributes would benefit from this feature. In particular, ZFS-on-Linux supports the xattr=sa dataset property which allows file extended attribute data to be stored in the dnode bonus buffer as an alternative to the traditional directory-based format. Workloads such as SELinux and the Lustre distributed filesystem often store enough xattr data to force spill blocks when xattr=sa is in effect. Large dnodes may therefore provide a performance benefit to such systems. Other use cases that may benefit from this feature include files with large ACLs and symbolic links with long target names. Furthermore, this feature may be desirable on other platforms in case future applications or features are developed that could make use of a larger bonus buffer area. 
Implementation -------------- The size of a dnode may be a multiple of 512 bytes up to the size of a dnode block (currently 16384 bytes). A dn_extra_slots field was added to the current on-disk dnode_phys_t structure to describe the size of the physical dnode on disk. The 8 bits for this field were taken from the zero filled dn_pad2 field. The field represents how many "extra" dnode_phys_t slots a dnode consumes in its dnode block. This convention results in a value of 0 for 512 byte dnodes which preserves on-disk format compatibility with older software. Similarly, the in-memory dnode_t structure has a new dn_num_slots field to represent the total number of dnode_phys_t slots consumed on disk. Thus dn->dn_num_slots is 1 greater than the corresponding dnp->dn_extra_slots. This difference in convention was adopted because, unlike on-disk structures, backward compatibility is not a concern for in-memory objects, so we used a more natural way to represent size for a dnode_t. The default size for newly created dnodes is determined by the value of a new "dnodesize" dataset property. By default the property is set to "legacy" which is compatible with older software. Setting the property to "auto" will allow the filesystem to choose the most suitable dnode size. Currently this just sets the default dnode size to 1k, but future code improvements could dynamically choose a size based on observed workload patterns. Dnodes of varying sizes can coexist within the same dataset and even within the same dnode block. For example, to enable automatically-sized dnodes, run # zfs set dnodesize=auto tank/fish The user can also specify literal values for the dnodesize property. These are currently limited to powers of two from 1k to 16k. The power-of-2 limitation is only for simplicity of the user interface. Internally the implementation can handle any multiple of 512 up to 16k, and consumers of the DMU API can specify any legal dnode value. 
The size of a new dnode is determined at object allocation time and stored as a new field in the znode in-memory structure. New DMU interfaces are added to allow the consumer to specify the dnode size that a newly allocated object should use. Existing interfaces are unchanged to avoid having to update every call site and to preserve compatibility with external consumers such as Lustre. The new interface names are given below. The versions of these functions that don't take a dnodesize parameter now just call the _dnsize() versions with a dnodesize of 0, which means use the legacy dnode size. New DMU interfaces: dmu_object_alloc_dnsize() dmu_object_claim_dnsize() dmu_object_reclaim_dnsize() New ZAP interfaces: zap_create_dnsize() zap_create_norm_dnsize() zap_create_flags_dnsize() zap_create_claim_norm_dnsize() zap_create_link_dnsize() The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The spa_maxdnodesize() function should be used to determine the maximum bonus length for a pool. These are a few noteworthy changes to key functions: * The prototype for dnode_hold_impl() now takes a "slots" parameter. When the DNODE_MUST_BE_FREE flag is set, this parameter is used to ensure the hole at the specified object offset is large enough to hold the dnode being created. The slots parameter is also used to ensure a dnode does not span multiple dnode blocks. In both of these cases, if a failure occurs, ENOSPC is returned. Keep in mind, these failure cases are only possible when using DNODE_MUST_BE_FREE. If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0. dnode_hold_impl() will check if the requested dnode is already consumed as an extra dnode slot by a large dnode, in which case it returns ENOENT. * The function dmu_object_alloc() advances to the next dnode block if dnode_hold_impl() returns an error for a requested object. 
This is because the beginning of the next dnode block is the only location it can safely assume to either be a hole or a valid starting point for a dnode. * dnode_next_offset_level() and other functions that iterate through dnode blocks may no longer use a simple array indexing scheme. These now use the current dnode's dn_num_slots field to advance to the next dnode in the block. This is to ensure we properly skip the current dnode's bonus area and don't interpret it as a valid dnode. zdb --- The zdb command was updated to display a dnode's size under the "dnsize" column when the object is dumped. For ZIL create log records, zdb will now display the slot count for the object. ztest ----- Ztest chooses a random dnodesize for every newly created object. The random distribution is more heavily weighted toward small dnodes to better simulate real-world datasets. Unused bonus buffer space is filled with non-zero values computed from the object number, dataset id, offset, and generation number. This helps ensure that the dnode traversal code properly skips the interior regions of large dnodes, and that these interior regions are not overwritten by data belonging to other dnodes. A new test visits each object in a dataset. It verifies that the actual dnode size matches what was stored in the ztest block tag when it was created. It also verifies that the unused bonus buffer space is filled with the expected data patterns. ZFS Test Suite -------------- Added six new large dnode-specific tests, and integrated the dnodesize property into existing tests for zfs allow and send/recv. Send/Receive ------------ ZFS send streams for datasets containing large dnodes cannot be received on pools that don't support the large_dnode feature. A send stream with large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be unrecognized by an incompatible receiving pool so that the zfs receive will fail gracefully. 
While not implemented here, it may be possible to generate a backward-compatible send stream from a dataset containing large dnodes. The implementation may be tricky, however, because the send object record for a large dnode would need to be resized to a 512 byte dnode, possibly kicking in a spill block in the process. This means we would need to construct a new SA layout and possibly register it in the SA layout object. The SA layout is normally just sent as an ordinary object record. But if we are constructing new layouts while generating the send stream we'd have to build the SA layout object dynamically and send it at the end of the stream. For sending and receiving between pools that do support large dnodes, the drr_object send record type is extended with a new field to store the dnode slot count. This field was repurposed from unused padding in the structure. ZIL Replay ---------- The dnode slot count is stored in the uppermost 8 bits of the lr_foid field. The bits were unused as the object id is currently capped at 48 bits. Resizing Dnodes --------------- It should be possible to resize a dnode when it is dirtied if the current dnodesize dataset property differs from the dnode's size, but this functionality is not currently implemented. Clearly a dnode can only grow if there are sufficient contiguous unused slots in the dnode block, but it should always be possible to shrink a dnode. Growing dnodes may be useful to reduce fragmentation in a pool with many spill blocks in use. Shrinking dnodes may be useful to allow sending a dataset to a pool that doesn't support the large_dnode feature. Feature Reference Counting -------------------------- The reference count for the large_dnode pool feature tracks the number of datasets that have ever contained a dnode of size larger than 512 bytes. The first time a large dnode is created in a dataset the dataset is converted to an extensible dataset. 
This is a one-way operation and the only way to decrement the feature count is to destroy the dataset, even if the dataset no longer contains any large dnodes. The complexity of reference counting on a per-dnode basis was too high, so we chose to track it on a per-dataset basis similarly to the large_block feature. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3542	2018-08-12 00:45:53 +00:00
Matt Macy	9f3a171221	Enable balanced arc pruning Taken from: commit `f604673836` Author: Brian Behlendorf <behlendorf1@llnl.gov> Date: Sat May 30 09:57:53 2015 -0500 Make arc_prune() asynchronous As described in the comment above arc_adapt_thread() it is critical that the arc_adapt_thread() function never sleep while holding a hash lock. This behavior was possible in the Linux implementation because the arc_prune() logic was implemented to be synchronous. Under illumos the analogous dnlc_reduce_cache() function is asynchronous. To address this the arc_do_user_prune() function has been reworked into two new functions as follows: * arc_prune_async() is an asynchronous implementation which dispatches the prune callback to be run by the system taskq. This makes it suitable to use in the context of the arc_adapt_thread(). * arc_prune() is a synchronous implementation which depends on the arc_prune_async() implementation but blocks until the outstanding callbacks complete. This is used in arc_kmem_reap_now() where it is safe, and expected, that memory will be freed. This patch additionally adds the zfs_arc_meta_strategy module option which allows the meta reclaim strategy to be configured. It defaults to a balanced strategy which has been proved to work well under Linux but the illumos meta-only strategy can be enabled. Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2018-08-11 22:01:52 +00:00
Matt Macy	d815f5ba09	buildworld fix: private appears to have special meaning on FreeBSD - revert to priv	2018-08-11 20:41:42 +00:00
Matt Macy	6b55e6fb04	Limit the amount of dnode metadata in the ARC In addition import most recent arc_prune_async implementation as dependency commit `25458cbef9` Author: Tim Chase <tim@chase2k.com> Date: Wed Jul 13 07:42:40 2016 -0500 Limit the amount of dnode metadata in the ARC Metadata-intensive workloads can cause the ARC to become permanently filled with dnode_t objects as they're pinned by the VFS layer. Subsequent data-intensive workloads may only benefit from about 25% of the potential ARC (arc_c_max - arc_meta_limit). In order to help track metadata usage more precisely, the other_size metadata arcstat has been replaced with dbuf_size, dnode_size and bonus_size. The new zfs_arc_dnode_limit tunable, which defaults to 10% of zfs_arc_meta_limit, defines the minimum number of bytes which is desirable to be consumed by dnodes. Attempts to evict non-metadata will trigger async prune tasks if the space used by dnodes exceeds this limit. The new zfs_arc_dnode_reduce_percent tunable specifies the amount by which the excess dnode space is attempted to be pruned as a percentage of the amount by which zfs_arc_dnode_limit is being exceeded. By default, it tries to unpin 10% of the dnodes. The problem of dnode metadata pinning was observed with the following testing procedure (in this example, zfs_arc_max is set to 4GiB): - Create a large number of small files until arc_meta_used exceeds arc_meta_limit (3GiB with default tuning) and arc_prune starts increasing. - Create a 3GiB file with dd. Observe arc_meta_used. It will still be around 3GiB. - Repeatedly read the 3GiB file and observe arc_meta_limit as before. It will continue to stay around 3GiB. With this modification, space for the 3GiB file is gradually made available as subsequent demands on the ARC are made. The previous behavior can be restored by setting zfs_arc_dnode_limit to the same value as the zfs_arc_meta_limit. 
Signed-off-by: Tim Chase <tim@chase2k.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Issue #4345 Issue #4512 Issue #4773 Closes #4858	2018-08-11 19:45:04 +00:00
Matt Macy	90df93417e	ZFS/MFV: Use cached feature info in spa_add_feature_stats() commit `417104bdd3` Author: Ned Bass <bass6@llnl.gov> Date: Thu Feb 26 12:24:11 2015 -0800 Use cached feature info in spa_add_feature_stats() Avoid issuing I/O to the pool when retrieving feature flags information. Trying to read the ZAPs from disk means that zpool clear would hang if the pool is suspended and recovery would require a reboot. To keep the feature stats resident in memory, we hang a cached nvlist off of the spa. It is built up from disk the first time spa_add_feature_stats() is called, and refreshed thereafter using the cached feature reference counts. spa_add_feature_stats() gets called at pool import time so we can be sure the cached nvlist will be available if the pool is later suspended. Signed-off-by: Ned Bass <bass6@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #3082	2018-08-10 23:42:11 +00:00
Matt Macy	648cfe57fd	Performance optimization of AVL tree comparator functions MFV: commit `ee36c709c3` Author: Gvozden Neskovic <neskovic@gmail.com> Date: Sat Aug 27 20:12:53 2016 +0200 perf: 2.75x faster ddt_entry_compare() First 256bits of ddt_key_t is a block checksum, which are expected to be close to random data. Hence, on average, comparison only needs to look at first few bytes of the keys. To reduce number of conditional jump instructions, the result is computed as: sign(memcmp(k1, k2)). Sign of an integer 'a' can be obtained as: `(0 < a) - (a < 0)` := {-1, 0, 1} , which is computed efficiently. Synthetic performance evaluation of original and new algorithm over 1G random keys on 2.6GHz Intel(R) Xeon(R) CPU E5-2660 v3: old 6.85789 s new 2.49089 s perf: 2.8x faster vdev_queue_offset_compare() and vdev_queue_timestamp_compare() Compute the result directly instead of using conditionals perf: zfs_range_compare() Speedup between 1.1x - 2.5x, depending on compiler version and optimization level. perf: spa_error_entry_compare() `bcmp()` is not suitable for comparator use. Use `memcmp()` instead. perf: 2.8x faster metaslab_compare() and metaslab_rangesize_compare() perf: 2.8x faster zil_bp_compare() perf: 2.8x faster mze_compare() perf: faster dbuf_compare() perf: faster compares in spa_misc perf: 2.8x faster layout_hash_compare() perf: 2.8x faster space_reftree_compare() perf: libzfs: faster avl tree comparators perf: guid_compare() perf: dsl_deadlist_compare() perf: perm_set_compare() perf: 2x faster range_tree_seg_compare() perf: faster unique_compare() perf: faster vdev_cache _compare() perf: faster vdev_uberblock_compare() perf: faster fuid _compare() perf: faster zfs_znode_hold_compare() Signed-off-by: Gvozden Neskovic <neskovic@gmail.com> Signed-off-by: Richard Elling <richard.elling@gmail.com> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #5033	2018-08-10 06:42:08 +00:00
Alexander Motin	f58c851d32	Reduce taskq and context-switch cost of zio pipe When doing a read from disk, ZFS creates 3 ZIO's: a zio_null(), the logical zio_read(), and then a physical zio. Currently, each of these results in a separate taskq_dispatch(zio_execute). On high-read-iops workloads, this causes a significant performance impact. By processing all 3 ZIO's in a single taskq entry, we reduce the overhead on taskq locking and context switching. We accomplish this by allowing zio_done() to return a "next zio to execute" to zio_execute(). This results in a ~12% performance increase for random reads, from 96,000 iops to 108,000 iops (with recordsize=8k, on SSD's). Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: George Wilson <george.wilson@delphix.com> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> External-issue: DLPX-59292 Closes #7736 zfsonlinux/zfs@62840030a7	2018-08-03 02:16:45 +00:00
Alexander Motin	07ddc55096	MFV r337223: 9580 Add a hash-table on top of nvlist to speed-up operations illumos/illumos-gate@2ec7644aab Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Serapheim Dimitropoulos <serapheim@delphix.com>	2018-08-03 01:52:25 +00:00
Alexander Motin	420a1ab349	MFV r337220: 8375 Kernel memory leak in nvpair code illumos/illumos-gate@843c2111b1 Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Reviewed by: Robert Mustacchi <rm@joyent.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com>	2018-08-03 01:30:03 +00:00
Alexander Motin	33bdecc5d0	MFV r337218: 7261 nvlist code should enforce name length limit illumos/illumos-gate@48dd5e630c Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Robert Mustacchi <rm@joyent.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Matthew Ahrens <mahrens@delphix.com>	2018-08-03 01:26:07 +00:00
Alexander Motin	c8c51346f5	MFV r337216: 7263 deeply nested nvlist can overflow stack illumos/illumos-gate@9ca527c3d3 Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Robert Mustacchi <rm@joyent.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Matthew Ahrens <mahrens@delphix.com>	2018-08-03 01:09:12 +00:00
Alexander Motin	0285589b38	MFV 337214: 9621 Make createtxg and guid properties public illumos/illumos-gate@e8d4a73c86 Reviewed by: Andy Stormont <astormont@racktopsystems.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Yuri Pankov <yuripv@yuripv.net> Approved by: Robert Mustacchi <rm@joyent.com> Author: Josh Paetzel <josh@tcbug.org>	2018-08-03 00:24:27 +00:00
Alexander Motin	54b838a67c	MFV r337212: 9465 ARC check for 'anon_size > arc_c/2' can stall the system illumos/illumos-gate@abe1fd01ce Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Don Brady <don.brady@delphix.com>	2018-08-03 00:14:36 +00:00
Alexander Motin	904eaff3f7	MFV r337210: 9577 remove zfs_dbuf_evict_key tsd The zfs_dbuf_evict_key TSD (thread-specific data) is not necessary - we can instead pass a flag down in a few places to prevent recursive dbuf eviction. Making this change has 3 benefits: 1. The code semantics are easier to understand. 2. On Linux, performance is improved, because creating/removing TSD values (by setting to NULL vs non-NULL) is expensive, and we do it very often. 3. According to Nexenta, the current semantics can cause a deadlock when concurrently calling dmu_objset_evict_dbufs() (which is rare today, but they are working on a "parallel unmount" change that triggers this more easily) illumos/illumos-gate@c2919acbea Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Andy Stormont <astormont@racktopsystems.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Matthew Ahrens <mahrens@delphix.com>	2018-08-03 00:01:48 +00:00
Alexander Motin	63ddf13e0d	MFV r337208: 9591 ms_shift can be incorrectly changed in MOS config for indirect vdevs that have been historically expanded illumos/illumos-gate@11f6a9680e Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <gwilson@zfsmail.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Reviewed by: Tim Chase <tim@chase2k.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Serapheim Dimitropoulos <serapheim@delphix.com>	2018-08-02 23:56:07 +00:00
Alexander Motin	4d696665fd	MFV r337206: 9338 moved dnode has incorrect dn_next_type illumos/illumos-gate@c7fbe46df9 Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com>	2018-08-02 23:50:03 +00:00
Alexander Motin	b9b2457429	MFV r337204: 9439 ZFS double-free due to failure to dirty indirect block illumos/illumos-gate@99a19144e8 Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com>	2018-08-02 23:46:30 +00:00
Alexander Motin	050b16aea8	MFV r337200: 9438 Holes can lose birth time info if a block has a mix of birth times Ultimately, the problem here is that when you truncate and write a file in the same transaction group, the dbuf for the indirect block will be zeroed out to deal with the truncation, and then written for the write. During this process, we will lose hole birth time information for any holes in the range. In the case where a dnode is being freed, we need to determine whether the block should be converted to a higher-level hole in the zio pipeline, and if so do it when the dnode is being synced out. illumos/illumos-gate@738e2a3ce3 Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Paul Dagnelie <pcd@delphix.com>	2018-08-02 23:43:01 +00:00
Alexander Motin	7420e614fe	Fix build after r337196 mismerge.	2018-08-02 23:40:28 +00:00
Alexander Motin	fc5b6d13a6	MFV r337197: 9456 ztest failure in zil_commit_waiter_timeout illumos/illumos-gate@b6031810da Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Approved by: Matt Ahrens <mahrens@delphix.com> Author: Prakash Surya <prakash.surya@delphix.com>	2018-08-02 23:25:49 +00:00
Alexander Motin	3a412471a4	MFV r337195: 9454 ::zfs_blkstats should count embedded blocks illumos/illumos-gate@dec267e7ea Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com>	2018-08-02 23:23:10 +00:00
Alexander Motin	afc0e55ebb	MFV r337193: 9424 ztest failure: "unprotected error in call to Lua API (Invalid value type 'function' for key 'error')" illumos/illumos-gate@fe3ba4d122 Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Don Brady <don.brady@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com>	2018-08-02 23:15:10 +00:00
Alexander Motin	01c0f8bef3	MFV r337190: 9486 reduce memory used by device removal on fragmented pools In the most fragmented real-world cases, this reduces memory used by the mapping from ~1GB to ~50MB of RAM per 1TB of storage removed. Less fragmented cases will typically also see around 50-100MB of RAM per 1TB of storage. illumos/illumos-gate@cfd63e1b1b Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Tim Chase <tim@chase2k.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com>	2018-08-02 21:59:46 +00:00
Alexander Motin	2bce9a5316	MFV r337182: 9330 stack overflow when creating a deeply nested dataset Datasets that are deeply nested (~100 levels) are impractical. We just put a limit of 50 levels to newly created datasets. Existing datasets should work without a problem. illumos/illumos-gate@5ac95da7d6 Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>	2018-08-02 21:19:35 +00:00
Alexander Motin	3f586250d8	9539 Make zvol operations use _by_dnode routines Continues what was started in 7801 add more by-dnode routines by fully converting zvols to avoid unnecessary dnode_hold() calls. This saves a small amount of CPU time and slightly improves latencies of operations on zvols. illumos/illumos-gate@8dfe5547fb Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Rick McNeal <rick.mcneal@nexenta.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Richard Yao <richard.yao@prophetstor.com>	2018-08-02 21:07:04 +00:00
Alexander Motin	5463ca24bf	MFV r337175: 9487 Free objects when receiving full stream as clone All objects after the last written or freed object are not supposed to exist after receiving the stream. We should free them accordingly, as if a freeobjects record for them had been included in the stream. zfsonlinux/zfs@48fbb9ddbf illumos/illumos-gate@7864b8192b Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Paul Dagnelie <pcd@delphix.com>	2018-08-02 20:33:13 +00:00
Alexander Motin	0485d5cba8	MFV r337171: 9464 txg_kick() fails to see that we are quiescing, forcing transactions to their next stages without leaving them accumulate changes Ideally we would like txg_kick() to get triggered only when we are sure that we are not syncing AND not quiescing any txg. This way we can kick an open TXG to the quiescing state when we are sure that there is nothing going on and we would benefit from the different states running concurrently. illumos/illumos-gate@fa41d87de9 Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Approved by: Dan McDonald <danmcd@joyent.com> Author: Serapheim Dimitropoulos <serapheim@delphix.com>	2018-08-02 20:18:49 +00:00
Alexander Motin	aba9f1124f	MFV r337167: 9442 decrease indirect block size of spacemaps Updates to indirect blocks of spacemaps can contribute significantly to write inflation. Therefore we want to reduce the indirect block size of spacemaps from 128K to 16K. illumos/illumos-gate@221813c13b Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Albert Lee <trisk@forkgnu.org> Reviewed by: Igor Kozhukhov <igor@dilos.org> Approved by: Dan McDonald <danmcd@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com>	2018-08-02 20:06:46 +00:00
Alexander Motin	3c09eaee95	MFV r337029: 9426 metaslab size can exceed offset addressable by spacemap metaslab size can exceed offset addressable by spacemap. The vdev can address up to 2^63 * SPA_MAXBLOCKSIZE (512). A metaslab can address up to 2^47 * 2^vdev_ashift. Therefore we may need to increase the number of metaslabs so that the maximum metaslab size is capped at the amount that can be addressed by the spacemap. This should happen in vdev_metaslab_set_size(). illumos/illumos-gate@b4bf0cf045 Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Don Brady <don.brady@delphix.com>	2018-08-01 03:21:17 +00:00
Alexander Motin	e74aee0b25	MFV r337027: 9328 zap code can take advantage of c99 9329 panic in zap_leaf_lookup() due to concurrent zapification illumos/illumos-gate@bf26014c55 Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com>	2018-08-01 03:07:33 +00:00
Alexander Motin	f12261798c	MFV r337022: 9403 assertion failed in arc_buf_destroy() when concurrently reading block with checksum error This assertion (VERIFY) failure was reported when reading a block. Turns out the problem is that if we get an i/o error (ECKSUM in this case), and there are multiple concurrent ARC reads of the same block (from different clones), then the ARC will put multiple buf's on the same ANON hdr, which isn't supposed to happen, and then causes a panic when we try to arc_buf_destroy() the buf. illumos/illumos-gate@fa98e487a9 Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Matt Ahrens <mahrens@delphix.com> Author: Matthew Ahrens <mahrens@delphix.com>	2018-08-01 02:39:44 +00:00
Alexander Motin	a5c4d9335b	MFV r337020: 9443 panic when scrub a v10 pool illumos/illumos-gate@bb1f424574 Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Igor Kozhukhov <igor@dilos.org> Approved by: Dan McDonald <danmcd@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com>	2018-07-31 23:00:58 +00:00
Alexander Motin	200c27a75d	MFV r337014: 9421 zdb should detect and print out the number of "leaked" objects 9422 zfs diff and zdb should explicitly mark objects that are on the deleted queue illumos/illumos-gate@20b5dafb42 Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Matt Ahrens <mahrens@delphix.com> Author: Paul Dagnelie <pcd@delphix.com>	2018-07-31 22:50:50 +00:00
Alexander Motin	0021e1c10c	MFV r336991, r337001: 9102 zfs should be able to initialize storage devices The first access to a disk block can incur a performance penalty on some platforms (e.g. AWS's EBS, VMware VMDKs). Therefore it is recommended that volumes be "thick provisioned", where supported by the platform (VMware). Thick provisioning is time consuming and often is ignored. If the thick provision step is omitted, customers will see suboptimal performance until we have written to all parts of the LUN. ZFS should be able to initialize any unused storage to remove any first-write penalty that exists. illumos/illumos-gate@094e47e980 Reviewed by: John Wren Kennedy <john.kennedy@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: George Wilson <george.wilson@delphix.com>	2018-07-31 21:06:04 +00:00
Alexander Motin	25f15f404d	MFV r336960: 9256 zfs send space estimation off by > 10% on some datasets illumos/illumos-gate@df477c0afa Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Paul Dagnelie <pcd@delphix.com>	2018-07-31 01:02:22 +00:00
Alexander Motin	9abeb6d79e	MFV r336958: 9337 zfs get all is slow due to uncached metadata This project's goal is to make read-heavy channel programs and zfs(1m) administrative commands faster by caching all the metadata that they will need in the dbuf layer. This will prevent the data from being evicted, so that any future call to i.e. zfs get all won't have to go to disk (very much). illumos/illumos-gate@adb52d9262 Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Thomas Caputi <tcaputi@datto.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Matthew Ahrens <mahrens@delphix.com>	2018-07-31 00:58:21 +00:00
Alexander Motin	d1cf4052d0	MFV r336955: 9236 nuke spa_dbgmsg We should use zfs_dbgmsg instead of spa_dbgmsg. Or at least, metaslab_condense() should call zfs_dbgmsg because it's important and rare enough to always log. It's possible that the message in zio_dva_allocate() would be too high-frequency for zfs_dbgmsg. illumos/illumos-gate@21f7c81cc1 Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Matthew Ahrens <mahrens@delphix.com>	2018-07-31 00:47:27 +00:00
Alexander Motin	656d46109d	MFV r336952: 9192 explicitly pass good_writes to vdev_uberblock/label_sync Currently vdev_label_sync and vdev_uberblock_sync take a zio_t and assume that its io_private is a pointer to the good_writes count. They should instead accept this argument explicitly. illumos/illumos-gate@a3b5583021 Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Matthew Ahrens <mahrens@delphix.com>	2018-07-31 00:37:45 +00:00
Alexander Motin	194000fa21	MFV r336950: 9290 device removal reduces redundancy of mirrors Mirrors are supposed to provide redundancy in the face of whole-disk failure and silent damage (e.g. some data on disk is not right, but ZFS hasn't detected the whole device as being broken). However, the current device removal implementation bypasses some of the mirror's redundancy. illumos/illumos-gate@3a4b1be953 Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Reviewed by: Sara Hartse <sara.hartse@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Tim Chase <tim@chase2k.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Matthew Ahrens <mahrens@delphix.com>	2018-07-31 00:25:39 +00:00
Alexander Motin	9cd6f162c0	MFV r336948: 9112 Improve allocation performance on high-end systems On high-end systems running async sequential write workloads, especially NUMA systems with flash or NVMe storage, one significant performance bottleneck is selecting a metaslab to do allocations from. This process can be parallelized, providing significant performance increases for these workloads. illumos/illumos-gate@f78cdc34af Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com> Reviewed by: Alexander Motin <mav@FreeBSD.org> Approved by: Gordon Ross <gwr@nexenta.com> Author: Paul Dagnelie <pcd@delphix.com>	2018-07-31 00:02:42 +00:00
Alexander Motin	6413a6d31f	MFV r336946: 9238 ZFS Spacemap Encoding V2 The current space map encoding has the following disadvantages: [1] Assuming 512 sector size each entry can represent at most 16MB for a segment. This makes the encoding very inefficient for large regions of space. [2] As vdev-wide space maps have started to be used by new features (i.e. device removal, zpool checkpoint) we've started imposing limits in the vdevs that can be used with them based on the maximum addressable offset (currently 64PB for a top-level vdev). The new encoding remains backwards compatible with the old one. The introduced two-word entry format, besides extending the limits imposed by the single-entry layout, also includes a vdev field and some extra padding after its prefix. The extra padding after the prefix is reserved for future usage (e.g. new prefixes for future encodings or new fields for flags). The new vdev field not only makes the space maps more self-descriptive, but also opens the doors for pool-wide space maps. One final important note is that the number of bits used for vdevs is reduced to 24 bits for blkptrs. That was decided as we don't know of any setups that use more than 16M vdevs for the time being and we wanted to fit the vdev field in the space map. In addition that gives us some extra bits in dva_t. illumos/illumos-gate@17f11284b4 Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <gwilson@zfsmail.com> Approved by: Gordon Ross <gwr@nexenta.com> Author: Serapheim Dimitropoulos <serapheim@delphix.com>	2018-07-30 23:47:38 +00:00
Alexander Motin	eb235f2f8e	MFV r336942: 9189 Add debug to vdev_label_read_config when txg check fails illumos/illumos-gate@b6bf6e1540 Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Approved by: Matt Ahrens <mahrens@delphix.com> Author: Pavel Zakharov <pavel.zakharov@delphix.com>	2018-07-30 22:03:29 +00:00
Allan Jude	68232f2053	ZFS: Reserve DMU_BACKUP_FEATURE flags for Native Encryption and ZSTD	2018-07-24 04:38:11 +00:00
Sean Eric Fagan	1bbaf1401c	Fix a couple of typos in r334844 noticed by Richard Kojedzinszky. Submitted by: Richard Kojedzinszky Reviewed by: sef Approved by: mav	2018-07-18 16:03:40 +00:00
Justin Hibbits	7f0df9ac2b	dtrace/powerpc: Correct register indices for non-indexed registers in the trapframe Fix an off-by-one error, LR starts at index 32, not index 33, and the others follow suit.	2018-07-16 19:47:29 +00:00
Sean Eric Fagan	072ffd4b20	Fix up some missed and mis-merges from the sequential scan code (r334844). Most of the changes involve moving some code around to reduce conflicts with future merges. One of the missing changes included a notification on scrub cancellation. Approved by: mav Sponsored by: iXsystems Inc	2018-07-10 20:11:32 +00:00
Sean Eric Fagan	aad5531e71	This exposes ZFS user and group quotas via the normal quotactl(2) mechanism. (Read-only at this point, however.) In particular, this is to allow rpc.rquotad to query quotas for NFS mounts, allowing users to see their quotas on the hosts using the datasets. The changes specifically: * Add new RPC entry points for querying quotas. * Changes the library routines to allow non-UFS quotas. * Changes rquotad to check for quotas on mounted filesystems, rather than being limited to entries in /etc/fstab * Lastly, adds a VFS entry-point for ZFS to query quotas. Note that this makes one unavoidable behavioural change: if quotas are enabled, then they can be queried, as opposed to the current method of checking for quotas being specified in fstab. (With ZFS, if there are user or group quotas, they're used, always.) Reviewed by: delphij, mav Approved by: mav Sponsored by: iXsystems Inc Differential Revision: https://reviews.freebsd.org/D15886	2018-07-05 22:56:13 +00:00
Matt Macy	1abd10a2ea	opensolaris compat: fix compile error when opensolaris/sys/types.h is included before stddef.h ptrdiff_t would be typedef'd twice	2018-07-03 23:45:02 +00:00
Sean Eric Fagan	69724399c4	This originated from ZFS On Linux, as `d4a72f2386` During scans (scrubs or resilvers), it sorts the blocks in each transaction group by block offset; the result can be a significant improvement. (On my test system just now, which I put some effort to introduce fragmentation into the pool since I set it up yesterday, a scrub went from 1h2m to 33.5m with the changes.) I've seen similar ratios on production systems. Approved by: Alexander Motin Obtained from: ZFS On Linux Relnotes: Yes (improved scrub performance, with tunables) Differential Revision: https://reviews.freebsd.org/D15562	2018-06-08 17:38:28 +00:00
Benno Rice	b3b11d6400	Break recursion involving getnewvnode and zfs_rmnode. When we're at our vnode limit, getnewvnode will call into the vnode LRU cache to free up vnodes. If the vnode we try to recycle is a ZFS vnode we end up, eventually, in zfs_rmnode. If the ZFS vnode we're recycling represents something with extended attributes, zfs_rmnode will call zfs_zget which will attempt to allocate another vnode. If the next vnode we try to recycle is also a ZFS vnode representing something with extended attributes we can recurse further. This ends up being unbounded and can end up overflowing the stack. In order to avoid this, restructure zfs_rmnode to simply add the extended attribute directory's object ID to the unlinked set, thus not requiring the allocation of a vnode. We then schedule a task that calls zfs_unlinked_drain which will do the work of properly marking the vnodes for unlinking. zfs_unlinked_drain is also called on mount so these will be cleaned up there. Reviewed by: avg, mav Sponsored by: iXsystems, Inc. Differential Revision: https://reviews.freebsd.org/D15342	2018-06-07 18:59:32 +00:00
Justin Hibbits	a1a990d8a4	Revert r326083, it doesn't behave as expected. Even though there do appear to be more artificial frames, with 12, stack traces no longer list at all. Revert until a better, more stable value can be determined.	2018-06-03 03:53:11 +00:00
Justin Hibbits	5e91185bb1	Protect dtrace_getpcstack() from a NULL stack pointer in a trap frame Found when trying to use lockstat on a POWER9, the stack pointer (r1) could be NULL, and result in a NULL pointer dereference, crashing the kernel.	2018-05-30 03:48:27 +00:00
Hans Petter Selasky	4f2190efce	Fix 32-bit buildworld for i386 after r334320. The 64-bit atomics defined for i386 are currently only available in the kernel space. Found by: cy@ MFC after: 1 week Sponsored by: Mellanox Technologies	2018-05-29 13:43:16 +00:00
Hans Petter Selasky	43bb1274d0	Implement atomic_add_64() and atomic_subtract_64() for the i386 target. While at it add missing _acq_ and _rel_ variants for 64-bit atomic operations under i386. Reviewed by: kib @ MFC after: 1 week Sponsored by: Mellanox Technologies	2018-05-29 11:59:02 +00:00
Andriy Gapon	620b779158	fix zfs_getpages crash when called from sendfile, followup to r329363 It turns out that sendfile_swapin() has an optimization where it may insert pointers to bogus_page into the page array that it passes to VOP_GETPAGES. That happens to work with buffer cache, because it extensively uses bogus_page internally, so it has the necessary checks. However, ZFS did not expect bogus_page as VOP_GETPAGES(9) does not document such a (ab)use of bogus_page. So, this commit adds checks and handling of bogus_page. I expect that use of bogus_page with VOP_GETPAGES will get documented sooner rather than later. Reported by: Andrew Reilly <areilly@bigpond.net.au>, delphij Tested by: Andrew Reilly <areilly@bigpond.net.au> Requested by: many MFC after: 1 week	2018-05-25 07:29:52 +00:00
Andriy Gapon	873c2703d8	Fix 'zpool create -t <tempname>' Creating a pool with a temporary name fails when we also specify custom dataset properties: this is because we mistakenly call zfs_set_prop_nvlist() on the "real" pool name which, as expected, cannot be found because the SPA is present in the namespace with the temporary name. Fix this by specifying the correct pool name when setting the dataset properties. Author: loli10K <ezomori.nozomu@gmail.com> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Obtained from: ZFS on Linux, zfsonlinux/zfs@4ceb8dd6fd MFC after: 1 week	2018-05-15 13:27:29 +00:00
Mark Johnston	5f05bda607	DTrace aarch64: Avoid calling unwind_frame() in the probe context. unwind_frame() may be instrumented by FBT, leading to recursion into dtrace_probe(). Manually inline unwind_frame() as we do with stack unwinding code for other architectures. Submitted by: Domagoj Stolfa Reviewed by: manu MFC after: 1 week Sponsored by: DARPA / AFRL Differential Revision: https://reviews.freebsd.org/D15359	2018-05-12 15:35:26 +00:00
Matt Macy	cbd92ce62e	Eliminate the overhead of gratuitous repeated reinitialization of cap_rights - Add macros to allow preinitialization of cap_rights_t. - Convert most commonly used code paths to use preinitialized cap_rights_t. A 3.6% speedup in fstat was measured with this change. Reported by: mjg Reviewed by: oshogbo Approved by: sbruno MFC after: 1 month	2018-05-09 18:47:24 +00:00
Jamie Gritton	0e5c6bd436	Make it easier for filesystems to count themselves as jail-enabled, by doing most of the work in a new function prison_add_vfs in kern_jail.c Now a jail-enabled filesystem need only mark itself with VFCF_JAIL, and the rest is taken care of. This includes adding a jail parameter like allow.mount.foofs, and a sysctl like security.jail.mount_foofs_allowed. Both of these used to be a static list of known filesystems, with predefined permission bits. Reviewed by: kib Differential Revision: D14681	2018-05-04 20:54:27 +00:00
Andriy Gapon	ca7019d2ac	opensolaris system_taskq does not need to run at maximum priority In fact, this taskqueue should use "boring" threads, nothing special about them. MFC after: 2 weeks	2018-05-04 07:28:01 +00:00
Ed Maste	3804f572a3	zfs_ioctl: avoid out-of-bound read admbugs: 796 Submitted by: Domagoj Stolfa <ds815@cam.ac.uk> Reported by: Ilja Van Sprundel <ivansprundel@ioactive.com> Reviewed by: avg MFC after: 1 day	2018-05-04 00:56:41 +00:00
Mateusz Guzik	9d68f7741f	systrace: track it like sdt probes While here predict false. Note the code is wrong (regardless of this change). Dereference of the pointer can race with module unload. A fix would set the probe to a nop stub instead of NULL.	2018-04-27 15:16:34 +00:00
Mateusz Guzik	7cd794214a	dtrace: depessimize dtmalloc when dtrace is active Each malloc/free was testing dtrace_malloc_enabled and forcing extra reads from the malloc type struct to see if perhaps a dtmalloc probe was on. Treat it like lockstat and sdt: have a global boolean.	2018-04-24 01:06:20 +00:00
Mateusz Guzik	4c5209cb21	lockstat: track lockstat just like sdt probes In particular flip the frequently tested var to bool.	2018-04-24 01:04:10 +00:00
Alexander Motin	bbbac409fe	9433 Fix ARC hit rate When the compressed ARC feature was added in commit `d3c2ae1` the method of reference counting in the ARC was modified. As part of this accounting change the arc_buf_add_ref() function was removed entirely. This would have been fine but the arc_buf_add_ref() function served a second undocumented purpose of updating the ARC access information when taking a hold on a dbuf. Without this logic in place a cached dbuf would not migrate its associated arc_buf_hdr_t to the MFU list. This would negatively impact the ARC hit rate, particularly on systems with a small ARC. This change reinstates the missing call to arc_access() from dbuf_hold() by implementing a new arc_buf_access() function. Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Tim Chase <tim@chase2k.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed-by: George Melikov <mail@gmelikov.ru> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>	2018-04-16 00:54:58 +00:00
Andriy Gapon	81f187e576	allow ZFS pool to have temporary name for duration of current import The change adds -t <name> option to zpool create and -t option to zpool import in its form with an old name and a new name. This allows to import (or create) a pool under a name that's different from its real, permanent name without affecting that name. This is useful when working with VM images or images of other physical systems if they happen to have a ZFS pool with the same name as the host system. The changes come from ZoL with some small tweaks. The porting has been done by julian. The change is being submitted to OpenZFS: https://github.com/openzfs/openzfs/pull/600 Submitted by: julian Reviewed by: smh MFC after: 2 weeks Sponsored by: Panzura (porting) Differential Revision: https://reviews.freebsd.org/D14972	2018-04-12 10:37:26 +00:00
Mark Johnston	87c1cb45cd	Correct a comment. Submitted by: Domagoj Stolfa X-MFC with: r332364 Sponsored by: DARPA, AFRL	2018-04-10 14:07:02 +00:00
Mark Johnston	b32171ea5a	Set zfs_arc_free_target to v_free_target. Page daemon output is now regulated by a PID controller with a setpoint of v_free_target. Moreover, the page daemon now wakes up regularly rather than waiting for a wakeup from another thread. This means that the free page count is unlikely to drop below the old zfs_arc_free_target value, and as a result the ARC was not readily freeing pages under memory pressure. Address the immediate problem by updating zfs_arc_free_target to match the page daemon's new behaviour. Reported and tested by: truckman Discussed with: jeff X-MFC with: r329882 Differential Revision: https://reviews.freebsd.org/D14994	2018-04-10 13:56:06 +00:00
Mark Johnston	8593136428	Assert that dtrace_probe() doesn't re-enter itself. This helps catch cases where an instrumented function is called while in probe context. Submitted by: Domagoj Stolfa <domagoj.stolfa@gmail.com> MFC after: 2 weeks Sponsored by: DARPA/AFRL Differential Revision: https://reviews.freebsd.org/D14863	2018-04-10 13:47:09 +00:00
Alexander Motin	8b26d76a50	9434 Speculative prefetch is blocked by device removal code. Device removal code does not set spa_indirect_vdevs_loaded for pools that never experienced device removal. At least one visible consequence of this is a completely blocked speculative prefetcher. This patch sets the variable in such situations.	2018-04-03 21:16:41 +00:00
Alexander Motin	849a7ce2d5	MFV r331712: 9280 Assertion failure while running removal_with_ganging test with 4K devices illumos/illumos-gate@243952c7ee Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> Author: Matt Ahrens <Matt.Ahrens@delphix.com>	2018-03-28 23:17:29 +00:00
Alexander Motin	a96086ed50	MFV 331710: 9188 increase size of dbuf cache to reduce indirect block decompression illumos/illumos-gate@268bbb2a2f With compressed ARC (6950) we use up to 25% of our CPU to decompress indirect blocks, under a workload of random cached reads. To reduce this decompression cost, we would like to increase the size of the dbuf cache so that more indirect blocks can be stored uncompressed. If we are caching entire large files of recordsize=8K, the indirect blocks use 1/64th as much memory as the data blocks (assuming they have the same compression ratio). We suggest making the dbuf cache be 1/32nd of all memory, so that in this scenario we should be able to keep all the indirect blocks decompressed in the dbuf cache. (We want it to be more than the 1/64th that the indirect blocks would use because we need to cache other stuff in the dbuf cache as well.) In real world workloads, this won't help as dramatically as the example above, but we think it's still worth it because the risk of decreasing performance is low. The potential negative performance impact is that we will be slightly reducing the size of the ARC (by ~3%). Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com> Reviewed by: Allan Jude <allanjude@freebsd.org> Reviewed by: Igor Kozhukhov <igor@dilos.org> Approved by: Garrett D'Amore <garrett@damore.org> Author: George Wilson <george.wilson@delphix.com>	2018-03-28 23:05:48 +00:00
Alexander Motin	9311abdd7e	MFV r331708: 9321 arc_loan_compressed_buf() can increment arc_loaned_bytes by the wrong value illumos/illumos-gate@9be12bd737 arc_loan_compressed_buf() increments arc_loaned_bytes by psize unconditionally In the case of zfs_compressed_arc_enabled=0, when the buf is returned via arc_return_buf(), if ARC_BUF_COMPRESSED(buf) is false, then arc_loaned_bytes is decremented by lsize, not psize. Switch to using arc_buf_size(buf), instead of psize, which will return psize or lsize, depending on the result of ARC_BUF_COMPRESSED(buf). Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> Author: Allan Jude <allanjude@freebsd.org>	2018-03-28 22:50:05 +00:00
Alexander Motin	5c4561f332	MFV r331706: 9235 rename zpool_rewind_policy_t to zpool_load_policy_t illumos/illumos-gate@5dafeea3eb We want to be able to pass various settings during import/open of a pool, which are not only related to rewind. Instead of adding a new policy and duplicate a bunch of code, we should just rename rewind_policy to a more generic term like load_policy. For instance, we'd like to set spa->spa_import_flags from the nvlist, rather from a flags parameter passed to spa_import as in some cases we want those flags not only for the import case, but also for the open case. One such flag could be ZFS_IMPORT_MISSING_LOG (as used in zdb) which would allow zfs to open a pool when logs are missing. Reviewed by: Matt Ahrens <matt@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Pavel Zakharov <pavel.zakharov@delphix.com>	2018-03-28 22:29:06 +00:00
Alexander Motin	6cc8a8260f	MFV 331704: 9191 dump vdev tree to zfs_dbgmsg when spa load fails due to missing log devices illumos/illumos-gate@ccef24b493 Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Matt Ahrens <matt@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Pavel Zakharov <pavel.zakharov@delphix.com>	2018-03-28 22:10:06 +00:00
Alexander Motin	2221f0d8af	MFV 331702: 9187 racing condition between vdev label and spa_last_synced_txg in vdev_validate illumos/illumos-gate@d1de72cfa2 ztest failed with uncorrectable IO error despite having the fix for #7163. Both sides of the mirror have CANT_OPEN_BAD_LABEL, which also distinguishes it from that issue. Definitely seems like a racing condition between the vdev_validate and spa_sync: 1. Thread A (spa_sync): vdev label is updated to latest txg 2. Thread B (vdev_validate): vdev label's txg is compared to spa_last_synced_txg and is ahead. 3. Thread A (spa_sync): spa_last_synced_txg is updated to latest txg. Solution: do not check txg in vdev_validate unless config lock is held. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Pavel Zakharov <pavel.zakharov@delphix.com>	2018-03-28 22:07:31 +00:00
Alexander Motin	0b0c76bc58	MFV r331695, 331700: 9166 zfs storage pool checkpoint illumos/illumos-gate@8671400134 The idea of Storage Pool Checkpoint (aka zpool checkpoint) deals with exactly that. It can be thought of as a “pool-wide snapshot” (or a variation of extreme rewind that doesn’t corrupt your data). It remembers the entire state of the pool at the point that it was taken and the user can revert back to it later or discard it. Its generic use case is an administrator that is about to perform a set of destructive actions to ZFS as part of a critical procedure. She takes a checkpoint of the pool before performing the actions, then rewinds back to it if one of them fails or puts the pool into an unexpected state. Otherwise, she discards it. With the assumption that no one else is making modifications to ZFS, she basically wraps all these actions into a “high-level transaction”. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>	2018-03-28 22:01:27 +00:00
Andriy Gapon	f4043145f2	ZFS vn_rele_async: catch up with the use of refcount(9) for the vnode use count It's not sufficient nor required to use the vnode interlock when checking if we are going to drop the last use count as the code in vputx() uses refcount (atomic) operations for both checking and decrementing the use count. Apply the same method to vn_rele_async(). While here, remove vn_rele_inactive(), a wrapper around vrele() that didn't add any value. Also, the change required making vfs_refcount_release_if_not_last() public. I've made vfs_refcount_acquire_if_not_zero() public as well. They are in sys/refcount.h now. While making the move I've dropped the vfs_ prefix. Reviewed by: mjg MFC after: 2 weeks Sponsored by: Panzura Differential Revision: https://reviews.freebsd.org/D14869	2018-03-28 08:55:31 +00:00
John Baldwin	d41e41f9f0	Remove very old and unused signal information codes. These have been supplanted by the MI signal information codes in <sys/signal.h> since 7.0. The FPE_*_TRAP ones were deprecated even earlier in 1999. PR: 226579 (exp-run) Reviewed by: kib Differential Revision: https://reviews.freebsd.org/D14637	2018-03-27 20:57:51 +00:00
Andriy Gapon	f3fe7e5eff	zfs: fix mismatch between format specifier and type vdev_dbgmsg_print_tree printed vdev_id of uint64_t type with %u format specifier. That caused subsequent parameters to be incorrectly read from the stack and led to a crash when a wrong value was interpreted as a string pointer. This should be upstreamed. Reported by: pho MFC after: 3 days	2018-03-23 09:42:47 +00:00
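The class of bug described in the commit above can be shown in userland C. This is a minimal sketch, assuming a hypothetical helper `format_vdev_line()`; it is not the vdev_dbgmsg_print_tree() code, and the actual fix may have used a different specifier or a cast.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: format a 64-bit id followed by a string argument.
 */
static int
format_vdev_line(char *buf, size_t len, uint64_t id, const char *type)
{
	assert(buf != NULL && type != NULL);
	/*
	 * Wrong: "%u" consumes an unsigned int, so on some ABIs the
	 * arguments that follow are fetched from the wrong location:
	 *     snprintf(buf, len, "vdev %u: %s", id, type);
	 */
	/* Right: PRIu64 matches uint64_t regardless of platform. */
	return (snprintf(buf, len, "vdev %" PRIu64 ": %s", id, type));
}
```

Compilers can flag the mismatch at build time when format checking (for example `-Wformat` in GCC and Clang) is enabled.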
Alexander Motin	4148c56f78	Reduce struct aggsum_bucket padding to fit into one cache line. Reported by: mjg	2018-03-23 02:50:38 +00:00
Alexander Motin	e76e77a972	MFV r331407: 9213 zfs: sytem typo illumos/illumos-gate@edc8ef7d92 Reviewed by: C Fraire <cfraire@me.com> Reviewed by: Andy Fiddaman <omnios@citrus-it.co.uk> Approved by: Joshua M. Clulow <josh@sysmgr.org> Author: Toomas Soome <tsoome@me.com>	2018-03-23 02:30:29 +00:00
Alexander Motin	f222611ab0	MFV r331405: 9084 spa_*_ashift must ignore spare devices illumos/illumos-gate@b037f3dbd6 Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Prakash Surya <prakash.surya@delphix.com>	2018-03-23 02:24:52 +00:00
Alexander Motin	b8436536c9	MFV r331400: 8484 Implement aggregate sum and use for arc counters In pursuit of improving performance on multi-core systems, we should implement fanned out counters and use them to improve the performance of some of the arc statistics. These stats are updated extremely frequently, and can consume a significant amount of CPU time. Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Paul Dagnelie <pcd@delphix.com>	2018-03-23 02:15:05 +00:00
Mark Johnston	1de56ac728	Revert part of r331264: disable interrupts before disabling WP. We might otherwise be preempted, leaving WP disabled while another thread runs on the CPU. Reported by: kib X-MFC with: r331264	2018-03-20 21:36:35 +00:00
Mark Johnston	7a79ce2e38	Make use of the KPI added in r331252. MFC after: 2 weeks	2018-03-20 21:16:26 +00:00
Ed Maste	fc2a8776a2	Rename assym.s to assym.inc assym is only to be included by other .s files, and should never actually be assembled by itself. Reviewed by: imp, bdrewery (earlier) Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D14180	2018-03-20 17:58:51 +00:00
Mark Johnston	95099bbad1	Fix an access of an uninitialized variable in dtrace_probe(). Reported by: Coverity, via cem MFC after: 3 days	2018-03-18 17:01:50 +00:00
Andriy Gapon	289c14e811	MFV r330973: 9164 assert: newds == os->os_dsl_dataset illumos/illumos-gate@5f5913bb83 `5f5913bb83` https://www.illumos.org/issues/9164 This issue has been reported by Alan Somers as https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225877 dmu_objset_refresh_ownership() first disowns a dataset (and releases it) and then owns it again. There is an assert that the new dataset object is the same as the old dataset object. When running ZFS Test Suite on FreeBSD we see this panic from zpool_upgrade_007_pos test: panic: solaris assert: newds == os->os_dsl_dataset (0xfffff80045f4c000 == 0xfffff80021ab4800) I see that the old dataset has dsl_dataset_evict_async() pending in ds_dbu.dbu_tqent and its ds_dbuf is NULL. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Don Brady <don.brady@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Andriy Gapon <avg@FreeBSD.org> PR: 225877 Reported by: asomers MFC after: 1 week	2018-03-15 08:49:21 +00:00
Steven Hartland	a59e720e65	Prevent ZFS TRIM breaking VTOC8 partitions Update the ZFS TRIM code to ensure it respects VTOC8 partition headers as documented by the ZFS On-Disk Specification section 1.3 Before this a zpool create on a VTOC8 partitioned device would overwrite the partition metadata. Reported by: marius Reviewed by: marius agv MFC after: 1 week Sponsored by: Multiplay	2018-03-14 21:21:03 +00:00
Andriy Gapon	4a24755d68	MFV r330591: 8984 fix for 6764 breaks ACL inheritance illumos/illumos-gate@e9bacc6d1a `e9bacc6d1a` https://www.illumos.org/issues/8984 Consider a directory configured as: drwx-ws---+ 2 henson cpp 3 Jan 23 12:35 dropbox/ user:henson:rwxpdDaARWcC--:f-i----:allow owner@:--------------:f-i----:allow group@:--------------:f-i----:allow everyone@:--------------:f-i----:allow owner@:rwxpdDaARWcC--:-di----:allow group:cpp:-wx-----------:-------:allow owner@:rwxpdDaARWcC--:-------:allow A new file created in this directory ends up looking like: rw-r--r-+ 1 astudent cpp 0 Jan 23 12:39 testfile user:henson:rw-pdDaARWcC--:------I:allow owner@:--------------:------I:allow group@:--------------:------I:allow everyone@:--------------:------I:allow owner@:rw-p--aARWcCos:-------:allow group@:r-----a-R-c--s:-------:allow everyone@:r-----a-R-c--s:-------:allow with extraneous group@ and everyone@ entries allowing read access that shouldn't exist. Per Albert Lee on the zfs mailing list: "aclinherit=passthrough/passthrough-x should still ignore the requested mode when an inheritable ACE for owner@ group@, or everyone@ is present in the parent directory. It appears there was an oversight in my fix for https://www.illumos.org/issues/6764 which made calling zfs_acl_chmod from zfs_acl_inherit unconditional. I think the parent ACL check for aclinherit=passthrough needs to be reintroduced in zfs_acl_inherit." We have a large number of faculty who use dropbox directories like the example to have students submit projects. All of these directories are now allowing Reviewed by: Sam Zaydel <szaydel@racktopsystems.com> Reviewed by: Paul B. Henson <henson@acm.org> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Approved by: Matthew Ahrens <mahrens@delphix.com> Author: Dominik Hassler <hadfl@omniosce.org> PR: 216886 MFC after: 2 weeks	2018-03-07 13:49:26 +00:00
Mark Johnston	1aa8a926b8	Unbreak amd64 FBT after r330539. X-MFC with: r330539	2018-03-06 15:51:59 +00:00
Andriy Gapon	f3b7b054dd	add ZFS_ENTER protection to .zfs/snapshot vnode operations that need it Those operations, zfsctl_snapdir_readdir and zfsctl_snapdir_getattr, access the filesystem's objset and it can be unstable during operations like receive and rollback. MFC after: 2 weeks	2018-02-27 14:08:54 +00:00
Alexander Motin	060bac1db8	Add sysctls/tunables for dbuf cache size. MFC after: 2 weeks	2018-02-27 01:36:43 +00:00
Alan Somers	7d3761dc72	Don't declare __assfail as static It gets called by dmu_buf_init_user, which is inline but not static. So it needs global linkage itself. Reported by: GCC-6 MFC after: 17 days X-MFC-With: 329722	2018-02-25 14:29:43 +00:00
Alan Somers	92bd443160	Implement CTASSERT using _Static_assert Prevents warnings about "unused typedef" with GCC-6 Reported by: GCC-6 MFC after: 18 days X-MFC-With: 329722	2018-02-24 16:01:21 +00:00
Andriy Gapon	de2cb430ad	another rework of getzfsvfs / getzfsvfs_impl code This change is designed to account for yet another difference between illumos and FreeBSD VFS. In FreeBSD a filesystem driver is supposed to clean up mnt_data in its VFS_UNMOUNT method because it's the last call into the driver before a struct mount object is destroyed. The VFS drains all references to the object before destroying it, but for the driver it's already as good as gone. In contrast, illumos VFS provides another method, VFS_FREEVFS, that is called when all references are drained. So, the driver can keep its data after VFS_UNMOUNT and clean it up in VFS_FREEVFS after all references are gone. This is what ZFS does on illumos. So there a reference to a filesystem is sufficient to guarantee that the ZFS specific data, aka zfsvfs_t, stays around (even if the filesystem gets unmounted). In FreeBSD we need to vfs_busy the filesystem to get the same guarantee. vfs_ref guarantees only that the struct mount is kept. The following rules should be observed in getzfsvfs / getzfsvfs_impl on FreeBSD: - if we need access to zfsvfs_t then we must use vfs_busy - if we only need to access struct mount (aka vfs_t), then vfs_ref is enough - when illumos code actually needs only the vfs_t, they still can pass the zfsvfs_t and get the vfs_t from it; that can work in FreeBSD if the filesystem is busied, but when it's just referenced then we have to pass the vfs_t explicitly - we cannot call vfs_busy while holding a dataset because that creates a LOR with dp_config_rwlock As a result: - getzfsvfs_impl now only references the filesystem, same as in illumos, but unlike illumos it has to return the vfs_t - the consumers are updated to account for the change - getzfsvfs busies the filesystem (and drops the reference from getzfsvfs_impl) Also, zfs_unmount_snap() now gets a busied filesystem, references it and then unbusies it essentially reverting actions done in getzfsvfs. 
This is needed because the code may perform some checks that require the zfsvfs_t. So, those are done before the unbusying. MFC after: 2 weeks	2018-02-22 13:06:27 +00:00
Andriy Gapon	8d69fe5cc8	followup to r329556, completely remove the covered vnode assert vrele() acquires the vnode lock only if the hold count drops to zero. In other scenarios it needs only the interlock. So, zfsctl_snapdir_lookup() can race with vfs_mount_destroy() -> vrele() such that the lookup adds a new reference and then vrele() drops the mountpoint's reference and only then we check the reference count. It would be just one in this case. In fact, the assert should have been removed in r323483 when the code learned how to deal with the uncovered vnode. PR: 225795 MFC after: 4 days X-MFC with: r329556	2018-02-22 11:41:00 +00:00
Alexander Motin	dd9ceab333	MFV r329803: 9080 recursive enter of vdev_indirect_rwlock from vdev_indirect_remap() illumos/illumos-gate@bdfded42e6 A scenario came up where a callback executed by vdev_indirect_remap() on a vdev, calls vdev_indirect_remap() on the same vdev and tries to reacquire vdev_indirect_rwlock that was already acquired from the first call to vdev_indirect_remap(). The specific scenario is that we want to remap a block pointer that is snapshotted but its dataset's remap_deadlist is not cached. So in order to add it we issue a read through a vdev_indirect_remap() on the same vdev, which brings up the aforementioned issue. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org> Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>	2018-02-22 03:54:59 +00:00
Alexander Motin	064827be34	MFV r329799, r329800: 9079 race condition in starting and ending condensing thread for indirect vdevs illumos/illumos-gate@667ec66f1b The timeline of the race condition is the following: [1] Thread A is about to finish condensing the first vdev in spa_condense_indirect_thread(), so it calls the spa_condense_indirect_complete_sync() sync task which sets the spa_condensing_indirect field to NULL. Waiting for the sync task to finish, thread A sleeps until the txg is done. When this happens, thread A will acquire spa_async_lock and set spa_condense_thread to NULL. [2] While thread A waits for the txg to finish, thread B which is running spa_sync() checks whether it should condense the second vdev in vdev_indirect_should_condense() by checking the spa_condensing_indirect field which was set to NULL by spa_condense_indirect_thread() from thread A. So it goes on and tries to spawn a new condensing thread in spa_condense_indirect_start_sync() and the aforementioned assertion fails because thread A has not set spa_condense_thread to NULL (which is basically the last thing it does before returning). The main issue here is that we rely on both spa_condensing_indirect and spa_condense_thread to signify whether a condensing thread is running. Ideally we would only use one throughout the codebase. In addition, for managing spa_condense_thread we currently use spa_async_lock which basically ties condensing to scrubbing when it comes to pausing and resuming those actions during spa export. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org> Author: Serapheim Dimitropoulos <serapheim@delphix.com>	2018-02-22 03:49:06 +00:00
Alexander Motin	1ea10a60f9	MFV r329793, r329795: 9075 Improve ZFS pool import/load process and corrupted pool recovery illumos/illumos-gate@6f7938128a Some work has been done lately to improve the debugability of the ZFS pool load (and import) process. This includes: https://www.illumos.org/issues/7638: Refactor spa_load_impl into several functions https://www.illumos.org/issues/8961: SPA load/import should tell us why it failed https://www.illumos.org/issues/7277: zdb should be able to print zfs_dbgmsg's To iterate on top of that, there's a few changes that were made to make the import process more resilient and crash free. One of the first tasks during the pool load process is to parse a config provided from userland that describes what devices the pool is composed of. A vdev tree is generated from that config, and then all the vdevs are opened. The Meta Object Set (MOS) of the pool is accessed, and several metadata objects that are necessary to load the pool are read. The exact configuration of the pool is also stored inside the MOS. Since the configuration provided from userland is external and might not accurately describe the vdev tree of the pool at the txg that is being loaded, it cannot be relied upon to safely operate the pool. For that reason, the configuration in the MOS is read early on. In the past, the two configurations were compared together and if there was a mismatch then the load process was aborted and an error was returned. The latter was a good way to ensure a pool does not get corrupted, however it made the pool load process needlessly fragile in cases where the vdev configuration changed or the userland configuration was outdated. Since the MOS is stored in 3 copies, the configuration provided by userland doesn't have to be perfect in order to read its contents. Hence, a new approach has been adopted: The pool is first opened with the untrusted userland configuration just so that the real configuration can be read from the MOS. 
The trusted MOS configuration is then used to generate a new vdev tree and the pool is re-opened. When the pool is opened with an untrusted configuration, writes are disabled to avoid accidentally damaging it. During reads, some sanity checks are performed on block pointers to see if each DVA points to a known vdev; when the configuration is untrusted, instead of panicking the system if those checks fail we simply avoid issuing reads to the invalid DVAs. This new two-step pool load process now allows rewinding pools across vdev tree changes such as device replacement, addition, etc. Loading a pool from an external config file in a clustering environment also becomes much safer now since the pool will import even if the config is outdated and didn't, for instance, register a recent device addition. With this code in place, it became relatively easy to implement a long-sought-after feature: the ability to import a pool with missing top level (i.e. non-redundant) devices. Note that since this almost guarantees some loss of data, this feature is for now restricted to a read-only import. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org> Author: Pavel Zakharov <pavel.zakharov@delphix.com>	2018-02-22 03:15:35 +00:00
Alexander Motin	613b0d87da	8942 zfs promote .../%recv should be an error illumos/illumos-gate@add927f8c8 Reported on the ZFSonLinux https://github.com/zfsonlinux/zfs/issues/4843, fixed by https://github.com/zfsonlinux/zfs/pull/6339: If we are in the middle of an incremental zfs receive, the child .../%recv will exist. If you concurrently run zfs promote .../%recv, it will "work", but then zfs gets confused. For example, there's no obvious way to destroy the containing filesystem (because it is now a clone of its invisible child). Attempting to do this promote should be an error. We could fix this by having zfs_ioc_promote() check if zc_name contains a %, similar to zfs_ioc_rename(). Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: loli10K <ezomori.nozomu@gmail.com>	2018-02-22 01:42:13 +00:00
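The name check suggested in the commit above amounts to rejecting any dataset name that contains a `%`. A minimal sketch follows; `sketch_check_promote_name()` is a hypothetical stand-in, not the real zfs_ioc_promote(), and the choice of `EINVAL` as the error is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * Hypothetical stand-in for the check: names containing '%' denote
 * internal datasets such as .../%recv and are refused.
 */
static int
sketch_check_promote_name(const char *name)
{
	assert(name != NULL);
	if (strchr(name, '%') != NULL)
		return (EINVAL);	/* assumed error code */
	return (0);
}
```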
Alexander Motin	a33ba3dbde	MFV r329776: 8477 Assertion failed in vdev_state_dirty(): spa_writeable(spa) illumos/illumos-gate@f4c1745bd6 Illumos 4080 allows "zpool clear" to work on readonly pools: I don't think this is the intended behaviour, we shouldn't be allowed to clear readonly pools. Probably. A fix is already in the ZFS on Linux repository to address this issue: `92e43c1718` Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: loli10K <ezomori.nozomu@gmail.com>	2018-02-22 01:00:46 +00:00
Alexander Motin	eea9be67e6	MFV r329774: 8408 dsl_props_set_sync_impl() does not handle nested nvlists correctly illumos/illumos-gate@85723e5eec When iterating over the input nvlist in dsl_props_set_sync_impl() we don't preserve the nvpair name before looking up ZPROP_VALUE, so when we later go to process it nvpair_name() is always "value" instead of the actual property name. This results in a couple of bugs in the recv code: - received properties are not restored correctly when failing to receive an incremental send stream - received properties are not completely replaced by the new ones when successfully receiving an incremental send stream This was discovered on ZFS on Linux (fixed in `5f1346c299`) Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: loli10K <ezomori.nozomu@gmail.com>	2018-02-22 00:55:25 +00:00
Alexander Motin	756595f675	MFV r329770: 9035 zfs: this statement may fall through illumos/illumos-gate@46ac8fdfc5 Reviewed by: Yuri Pankov <yuripv@yuripv.net> Reviewed by: Andy Fiddaman <omnios@citrus-it.co.uk> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Toomas Soome <tsoome@me.com>	2018-02-22 00:47:38 +00:00
Alexander Motin	502d18a8f1	MFV r329766: 8962 zdb should work on non-idle pools illumos/illumos-gate@e144c4e6c9 Currently `zdb` consistently fails to examine non-idle pools as it fails during the `spa_load()` process. The main problem seems to be that `spa_load_verify()` fails as can be seen below: $ sudo zdb -d -G dcenter zdb: can't open 'dcenter': I/O error ZFS_DBGMSG(zdb): spa_open_common: opening dcenter spa_load(dcenter): LOADING disk vdev '/dev/dsk/c4t11d0s0': best uberblock found for spa dcenter. txg 40824950 spa_load(dcenter): using uberblock with txg=40824950 spa_load(dcenter): UNLOADING spa_load(dcenter): RELOADING spa_load(dcenter): LOADING disk vdev '/dev/dsk/c3t10d0s0': best uberblock found for spa dcenter. txg 40824952 spa_load(dcenter): using uberblock with txg=40824952 spa_load(dcenter): FAILED: spa_load_verify failed [error=5] spa_load(dcenter): UNLOADING This change makes `spa_load_verify()` a dry run when run from `zdb`. This is done by creating a global flag in zfs and then setting it in `zdb`. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andy Stormont <astormont@racktopsystems.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Pavel Zakharov <pavel.zakharov@delphix.com>	2018-02-22 00:42:12 +00:00
Alexander Motin	fa607d017d	MFV r329762: 8961 SPA load/import should tell us why it failed illumos/illumos-gate@3ee8c80c74 When we fail to open or import a storage pool, we typically don't get any additional diagnostic information, just "no pool found" or "can not import". While there may be no additional user-consumable information, we should at least make this situation easier to debug/diagnose for developers and support. For example, we could start by using `zfs_dbgmsg()` to log each thing that we try when importing, and which things failed. E.g. "tried uberblock of txg X from label Y of device Z". Also, we could log each of the stages that we go through in `spa_load_impl()`. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Pavel Zakharov <pavel.zakharov@delphix.com>	2018-02-22 00:03:14 +00:00
Alexander Motin	df331510ab	MFV r329760: 7638 Refactor spa_load_impl into several functions illumos/illumos-gate@1fd3785ff6 spa_load_impl has grown out of proportions. It is currently over 700 lines long and makes it very hard to follow or debug the import process even for experienced ZFS developers. The objective is to split it up in a series of well commented functions. Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Approved by: Dan McDonald <danmcd@joyent.com> Author: Pavel Zakharov <pavel.zakharov@delphix.com>	2018-02-21 23:38:30 +00:00
Alexander Motin	b17bfcde3d	9018 Replace kmem_cache_reap_now() with kmem_cache_reap_soon() illumos/illumos-gate@36a64e6284 To prevent kmem_cache reaping from blocking other system resources, turn kmem_cache_reap_now() (which blocks) into kmem_cache_reap_soon(). Callers to kmem_cache_reap_soon() should use kmem_cache_reap_active(), which exploits #9017's new taskq_empty(). Reviewed by: Bryan Cantrill <bryan@joyent.com> Reviewed by: Dan McDonald <danmcd@joyent.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Yuri Pankov <yuripv@yuripv.net> Author: Tim Kordas <tim.kordas@joyent.com> FreeBSD does not use taskqueue for kmem caches reaping, so this change is less dramatic than it is on Illumos, just limiting reaping to 1 time per second. It may possibly be improved later, if needed.	2018-02-21 23:15:06 +00:00
Alexander Motin	d208c07cf3	MFV r329753: 8809 libzpool should leverage work done in libfakekernel illumos/illumos-gate@f06dce2c1f Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Gordon Ross <gordon.w.ross@gmail.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Andrew Stormont <astormont@racktopsystems.com>	2018-02-21 21:18:04 +00:00


2332 Commits