freebsd-dev

Author	SHA1	Message	Date
Andriy Gapon	2cd05c2473	MFV r318934: 8070 Add some ZFS comments illumos/illumos-gate@40713f2b24 `40713f2b24` https://www.illumos.org/issues/8070 Add some ZFS comments left by various developers at different times Reviewed by: Yuri Pankov <yuri.pankov@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Alan Somers <asomers@gmail.com> MFC after: 1 week	2017-05-26 11:49:42 +00:00
Andriy Gapon	0a07ea0e2f	MFV r318931: 8063 verify that we do not attempt to access inactive txg illumos/illumos-gate@b7b2590dd9 `b7b2590dd9` https://www.illumos.org/issues/8063 A standard practice in ZFS is to keep track of "per-txg" state. Any of the 3 active TXG's (open, quiescing, syncing) can have different values for this state. We should assert that we do not attempt to modify other (inactive) TXG's. Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com> MFC after: 2 weeks	2017-05-26 11:37:11 +00:00
Andriy Gapon	28c5e43e36	MFV r318929: 7786 zfs`vdev_online() needs better notification about state changes illumos/illumos-gate@5f368aef86 `5f368aef86` https://www.illumos.org/issues/7786 Currently, vdev_online() will only post sysevent if previous state was "offline". It should also post the event when the state changes from "removed" or "faulted" to "healthy" or "degraded". This will fix the following scenario: - pull disk from slot A - check that hotspare has taken its place (if available) - insert disk into slot B - check that hotspare moved back to "avail" state (if spare was used) The problem here is that we don't get any ESC_ZFS_VDEV_* notification and fail to update the vdev FRU. Reviewed by: Matthew Ahrens mahrens@delphix.com Reviewed by: George Wilson george.wilson@delphix.com Approved by: Albert Lee <trisk@forkgnu.org> Author: Yuri Pankov <yuri.pankov@nexenta.com> MFC after: 1 week	2017-05-26 11:33:34 +00:00
Andriy Gapon	9c2a3c861f	MFV r318927: 8025 dbuf_read() creates unnecessary zio_root() for bonus buf illumos/illumos-gate@def4fac588 `def4fac588` https://www.illumos.org/issues/8025 dbuf_read() creates a zio_root() to track and wait for all the zio's that may happen as part of this call. However, if the blkptr_t for this buffer is NULL or a hole, we will not create any more zio's, so this zio_root() is unnecessary. This is always the case when calling dbuf_read() on a bonus buffer, because it has no blkptr (it's part of the containing dnode). For workloads that read a lot of bonus buffers (e.g. file creation and removal), creating and destroying these unnecessary zio's can decrease performance by around 3%. Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Prashanth Sreenivasa <pks@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com>	2017-05-26 11:30:55 +00:00
Andriy Gapon	ebaf416f95	MFV r316929: 6914 kernel virtual memory fragmentation leads to hang illumos/illumos-gate@af868f46a5 `af868f46a5` https://www.illumos.org/issues/6914 FreeBSD note: only a ZFS part of the change is merged, changes to the VM subsystem are not ported (obviously). Also, now that FreeBSD has vmem(9) we don't have to ifdef-out the code that uses it. MFC after: 2 weeks	2017-05-26 11:23:16 +00:00
Andriy Gapon	8629ec8394	arc_init: make code closer to upstream by introducing 'allmem' variable All the differences in calculations are kept. A comment about arc_max being 1/2 of all memory is fixed to reflect the actual code that uses 5/8 as a factor. MFC after: 1 week	2017-05-26 11:05:56 +00:00
Andriy Gapon	cf781c9b60	zfs_putpages: assert that sa_bulk_update() must succeed Same as the upstream does in r316927. MFC after: 1 week	2017-05-26 10:37:55 +00:00
Andriy Gapon	04b7c6b337	MFV r316928: 7256 low probability race in zfs_get_data illumos/illumos-gate@0c94e1af67 `0c94e1af67` https://www.illumos.org/issues/7256 error = dmu_sync(zio, lr->lr_common.lrc_txg, zfs_get_done, zgd); ASSERT(error \|\| lr->lr_length <= zp->z_blksz); It's possible, although extremely rare, that the zfs_get_done() callback is executed before dmu_sync() returns. In that case the znode's range lock is dropped and the znode is unreferenced. Thus, the assertion can access some invalid or wrong data via the zp pointer. size variable caches the correct value of z_blksz and can be safely used here. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Andriy Gapon <andriy.gapon@clusterhq.com> MFC after: 1 week	2017-05-26 10:31:05 +00:00
Andriy Gapon	7a94dd7aee	MFC r316924: 8061 sa_find_idx_tab can be declared more type-safely illumos/illumos-gate@7f0bdb4257 `7f0bdb4257` https://www.illumos.org/issues/8061 sa_find_idx_tab() is declared as taking and returning "void *" parameters. These can be declared to be the specific types. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Chris Williamson <chris.williamson@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Matthew Ahrens <mahrens@delphix.com> MFC after: 1 week	2017-05-26 10:27:35 +00:00
Andriy Gapon	8816c0bb48	MFV r316925: 6101 attempt to lzc_create() a filesystem under a volume results in a panic illumos/illumos-gate@b127fe3c05 `b127fe3c05` https://www.illumos.org/issues/6101 lzc_create(), or more correctly, zfs_ioc_create() does not reject an attempt to create a filesystem as a child of a volume, instead it proceeds to a crash. A crash stack obtained on FreeBSD: page fault while in kernel mode zap_leaf_lookup() fzap_lookup() zap_lookup_norm() zap_lookup() zfs_get_zplprop() zfs_fill_zplprops_impl() zfs_ioc_create() zfsdev_ioctl() devfs_ioctl_f() kern_ioctl() sys_ioctl() This crash happened with a kernel without debugging assertions. The immediate cause of crash appears to an attempt to interpret a zvol object as a zap object. For filesystems: #define MASTER_NODE_OBJ 1 For zvols: #define ZVOL_OBJ 1ULL #define ZVOL_ZAP_OBJ 2ULL So, I see two problems here: 1. an attempt to create a filesystem under a zvol should be rejected as early as possible, maybe in zfs_fill_zplprops() 2. maybe zap_lookup / zap_lockdir should reject objects that are not of one of the zap object types Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Andriy Gapon <avg@FreeBSD.org> MFC after: 2 weeks	2017-05-24 22:34:54 +00:00
Andriy Gapon	e73f9f8a49	MFV r316923: 8026 retire zfs_throttle_delay and zfs_throttle_resolution illumos/illumos-gate@6b03625981 `6b03625981` https://www.illumos.org/issues/8026 zfs_throttle_delay and zfs_throttle_resolution became disused since the new write throttling mechanism was introduced. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Andriy Gapon <avg@FreeBSD.org> MFC after: 1 week	2017-05-24 22:32:56 +00:00
Andriy Gapon	9fe5e04dfc	MFC r316921: 8027 tighten up dsl_pool_dirty_delta illumos/illumos-gate@313ae1e182 `313ae1e182` https://www.illumos.org/issues/8027 dsl_pool_dirty_delta() should not wake up waiters when dp->dp_dirty_total == zfs_dirty_data_max, because they wait for dp_dirty_total to fall strictly below the threshold. It's probably very rare for that condition to occur, but it's better to have more accurate code. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Andriy Gapon <avg@FreeBSD.org> MFC after: 1 week	2017-05-24 22:27:48 +00:00
Andriy Gapon	e1b8f10a5e	MFV r316920: 8023 Panic destroying a metaslab deferred range tree illumos/illumos-gate@3991b535a8 `3991b535a8` https://www.illumos.org/issues/8023 $C ffffff0011bc0970 vpanic() ffffff0011bc0a00 strlog() ffffff0011bc0a30 range_tree_destroy+0x72(ffffff043769ad00) ffffff0011bc0a70 metaslab_fini+0xd5(ffffff0449acf380) ffffff0011bc0ab0 vdev_metaslab_fini+0x56(ffffff0462bae800) ffffff0011bc0af0 spa_unload+0x9b(ffffff03e3dac000) ffffff0011bc0b70 spa_export_common+0x115(ffffff047f4b4000, 2, 0, 0, 0) ffffff0011bc0b90 spa_destroy+0x1d(ffffff047f4b4000) ffffff0011bc0bd0 zfs_ioc_pool_destroy+0x20(ffffff047f4b4000) ffffff0011bc0c80 zfsdev_ioctl+0x4d7(11400000000, 5a01, 8040190, 100003, ffffff03e1956b10, ffffff0011bc0e68) ffffff0011bc0cc0 cdev_ioctl+0x39(11400000000, 5a01, 8040190, 100003, ffffff03e1956b10, ffffff0011bc0e68) ffffff0011bc0d10 spec_ioctl+0x60(ffffff03d9153b00, 5a01, 8040190, 100003, ffffff03e1956b10, ffffff0011bc0e68, 0) ffffff0011bc0da0 fop_ioctl+0x55(ffffff03d9153b00, 5a01, 8040190, 100003, ffffff03e1956b10, ffffff0011bc0e68, 0) ffffff0011bc0ec0 ioctl+0x9b(3, 5a01, 8040190) ffffff0011bc0f10 _sys_sysenter_post_swapgs+0x149() Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: George Wilson <george.wilson@delphix.com> MFC after: 2 weeks	2017-05-24 22:25:26 +00:00
Andriy Gapon	5386d7295a	MFV r316917: 7968 multi-threaded spa_sync() illumos/illumos-gate@94c2d0eb22 `94c2d0eb22` https://www.illumos.org/issues/7968 spa_sync() iterates over all the dirty dnodes and processes each of them by calling dnode_sync(). If there are many dirty dnodes (e.g. because we created or removed a lot of files), the single thread of spa_sync() calling dnode_sync() can become a bottleneck. Additionally, if many dnodes are dirtied concurrently in open context (e.g. due to concurrent file creation), the os_lock will experience lock contention via dnode_setdirty(). The solution is to track dirty dnodes on a multilist_t, and for spa_sync() to use separate threads to process each of the sublists in the multilist. On the concurrent file creation microbenchmark, the performance improvement from dnode_setdirty() is up to 7%. Additionally, the wall clock time spent in spa_sync() is reduced to 15%-40% of the single-threaded case. In terms of cost/ reward, once the other bottlenecks are addressed, fixing this bug will provide a medium-large performance gain and require a medium amount of effort to implement. Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Dan McDonald <danmcd@omniti.com> Author: Matthew Ahrens <mahrens@delphix.com> MFC after: 3 weeks	2017-05-24 22:21:24 +00:00
Andriy Gapon	2ba631553c	MFV r316916: 7970 zfs_arc_num_sublists_per_state should be common to all multilists illumos/illumos-gate@10fbdecb05 `10fbdecb05` https://www.illumos.org/issues/7970 The global tunable zfs_arc_num_sublists_per_state is used by the ARC and the dbuf cache, and other users are planned. We should change this tunable to be common to all multilists. Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Dan McDonald <danmcd@omniti.com> Author: Matthew Ahrens <mahrens@delphix.com> MFC after: 3 weeks	2017-05-24 22:15:16 +00:00
Andriy Gapon	1d7634429c	MFC r316915: 7801 add more by-dnode routines (lint) illumos/illumos-gate@411be58a6e `411be58a6e` MFC after: 24 days X-MFC with: r318823	2017-05-24 21:52:20 +00:00
Andriy Gapon	31fd119cc2	MFC r316914: 7801 add more by-dnode routines illumos/illumos-gate@b0c42cd470 `b0c42cd470` https://www.illumos.org/issues/7801 Add _by_dnode() routines for accessing objects given their dnode_t , this is more efficient than accessing the object by (objset_t *, uint64_t object). This change converts some but not all of the existing consumers. As performance-sensitive code paths are discovered they should be converted to use these routines. Ported from: `0eef1bde31` Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: bzzz77 <bzzz.tomas@gmail.com> MFC after: 24 days	2017-05-24 21:49:21 +00:00
Andriy Gapon	5aab788866	MFC r316913: 7869 panic in bpobj_space(): null pointer dereference illumos/illumos-gate@a3905a4592 `a3905a4592` https://www.illumos.org/issues/7869 The issue fixed by this patch is a race condition in the deadlist code. A thread executing an administrative command that uses `dsl_deadlist_space_range()` holds the lock of the whole `deadlist_t` to protect the access of all its entries that the deadlist contains in an avl tree. Sync threads trying to insert a new entry in the deadlist (through `dsl_deadlist_insert()` -> `dle_enqueue()`) do not hold the deadlist lock at that moment. If the `dle_bpobj` is the empty bpobj (our sentinel value), we close and reopen it. Between these two operations, it is possible for the `dsl_deadlist_space_range()` thread to dereference that bpobj which is `NULL` during that window. Threads should hold the a deadlist's `dl_lock` when they manipulate its internal data so scenarios like the one above are avoided. In addition, threads should also hold the bpobj lock whenever they are allocating the subobj list of a bpobj, and not just when they actually insert the subobj to the list. This way we can avoid potential memory leaks. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: George Melikov <mail@gmelikov.ru> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Dan McDonald <danmcd@omniti.com> Author: Serapheim Dimitropoulos <serapheim@delphix.com> MFC after: 2 weeks	2017-05-24 21:45:52 +00:00
Andriy Gapon	930f1af491	MFC r316912: 7793 ztest fails assertion in dmu_tx_willuse_space illumos/illumos-gate@61e255ce72 `61e255ce72` https://www.illumos.org/issues/7793 Background information: This assertion about tx_space_* verifies that we are not dirtying more stuff than we thought we would. We “need” to know how much we will dirty so that we can check if we should fail this transaction with ENOSPC/EDQUOT, in dmu_tx_assign(). While the transaction is open (i.e. between dmu_tx_assign() and dmu_tx_commit() — typically less than a millisecond), we call dbuf_dirty() on the exact blocks that will be modified. Once this happens, the temporary accounting in tx_space_* is unnecessary, because we know exactly what blocks are newly dirtied; we call dnode_willuse_space() to track this more exact accounting. The fundamental problem causing this bug is that dmu_tx_hold_() relies on the current state in the DMU (e.g. dn_nlevels) to predict how much will be dirtied by this transaction, but this state can change before we actually perform the transaction (i.e. call dbuf_dirty()). This bug will be fixed by removing the assertion that the tx_space_ accounting is perfectly accurate (i.e. we never dirty more than was predicted by dmu_tx_hold_()). By removing the requirement that this accounting be perfectly accurate, we can also vastly simplify it, e.g. removing most of the logic in dmu_tx_count_(). The new tx space accounting will be very approximate, and may be more or less than what is actually dirtied. It will still be used to determine if this transaction will put us over quota. Transactions that are marked by dmu_tx_mark_netfree() will be excepted from this check. We won’t make an attempt to determine how much space will be freed by the transaction — this was rarely accurate enough to determine if a transaction should be permitted when we are over quota, which is why dmu_tx_mark_netfree() was introduced in 2014. We also won’t attempt to give “credit” when overwriting existing blocks, if those blocks may be freed. This allows us to remove the do_free_accounting logic in dbuf_dirty(), and associated routines. This Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com> MFC after: 3 weeks	2017-05-24 21:43:34 +00:00
Andriy Gapon	3a9c923927	MFC r316907: 1300 filename normalization doesn't work for removes illumos/illumos-gate@1c17160ac5 `1c17160ac5` https://www.illumos.org/issues/1300 FreeBSD note: recent FreeBSD was not affected by the issue fixed as the name cache is completely bypassed when normalization is enabled. The change is imported for the sake of ZAP infrastructure modifications. Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Kevin Crowe <kevin.crowe@nexenta.com> MFC after: 3 weeks	2017-05-24 21:29:31 +00:00
Konstantin Belousov	6992112349	Commit the 64-bit inode project. Extend the ino_t, dev_t, nlink_t types to 64-bit ints. Modify struct dirent layout to add d_off, increase the size of d_fileno to 64-bits, increase the size of d_namlen to 16-bits, and change the required alignment. Increase struct statfs f_mntfromname[] and f_mntonname[] array length MNAMELEN to 1024. ABI breakage is mitigated by providing compatibility using versioned symbols, ingenious use of the existing padding in structures, and by employing other tricks. Unfortunately, not everything can be fixed, especially outside the base system. For instance, third-party APIs which pass struct stat around are broken in backward and forward incompatible ways. Kinfo sysctl MIBs ABI is changed in backward-compatible way, but there is no general mechanism to handle other sysctl MIBS which return structures where the layout has changed. It was considered that the breakage is either in the management interfaces, where we usually allow ABI slip, or is not important. Struct xvnode changed layout, no compat shims are provided. For struct xtty, dev_t tty device member was reduced to uint32_t. It was decided that keeping ABI compat in this case is more useful than reporting 64-bit dev_t, for the sake of pstat. Update note: strictly follow the instructions in UPDATING. Build and install the new kernel with COMPAT_FREEBSD11 option enabled, then reboot, and only then install new world. Credits: The 64-bit inode project, also known as ino64, started life many years ago as a project by Gleb Kurtsou (gleb). Kirk McKusick (mckusick) then picked up and updated the patch, and acted as a flag-waver. Feedback, suggestions, and discussions were carried by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles), and Rick Macklem (rmacklem). Kris Moore (kris) performed an initial ports investigation followed by an exp-run by Antoine Brodin (antoine). Essential and all-embracing testing was done by Peter Holm (pho). The heavy lifting of coordinating all these efforts and bringing the project to completion were done by Konstantin Belousov (kib). Sponsored by: The FreeBSD Foundation (emaste, kib) Differential revision: https://reviews.freebsd.org/D10439	2017-05-23 09:29:05 +00:00
Mark Johnston	de3a96e3b1	Ensure that profile and tick probes provide a non-zero PC value. The idle thread may process callouts while reloading the timer in cpu_activeclock(). In this case, provide a representative value, &cpu_idle, instead of 0 for args[0] so that the active thread can be more easily identified from the probe. This addresses intermittent failures of the profile-n/tst.argtest.d test. MFC after: 2 weeks Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D10651	2017-05-15 21:44:40 +00:00
Alan Somers	7ac72c256f	vdev_geom may associate multiple vdevs per g_consumer vdev_geom.c currently uses the g_consumer's private field to point to a vdev_t. That way, a GEOM event can cause a change to a ZFS vdev. For example, when you remove a disk, the vdev's status will change to REMOVED. However, vdev_geom will sometimes attach multiple vdevs to the same GEOM consumer. If this happens, then geom events will only be propagated to one of the vdevs. Fix this by storing a linked list of vdevs in g_consumer's private field. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c * g_consumer.private now stores a linked list of vdev pointers associated with the consumer instead of just a single vdev pointer. * Change vdev_geom_set_physpath's signature to more closely match vdev_geom_set_rotation_rate * Don't bother calling g_access in vdev_geom_set_physpath. It's guaranteed that we've already accessed the consumer by the time we get here. * Don't call vdev_geom_set_physpath in vdev_geom_attach. Instead, call it in vdev_geom_open, after we know that the open has succeeded. PR: 218634 Reviewed by: gibbs MFC after: 1 week Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D10391	2017-05-11 16:26:56 +00:00
Justin Hibbits	675cad71e7	Fix stack tracing in dtrace for powerpc The current method only sort of works, and usually doesn't work reliably. Also, on Book-E the return address from DEBUG exceptions is not the sentinel addresses, so it won't exit the loop correctly. Fix this by better handling trap frames during unwinding, and using the common trap handler for debug traps, as the code in that segment is identical between the two. MFC after: 1 week	2017-05-11 00:23:51 +00:00
Justin Hibbits	679ea09441	Fix the encoded instruction for FBT traps on powerpc r314370 changed EXC_DTRACE to a different instruction, but neglected to make the same change to fbt, so dtrace didn't actually pick it up, resulting in entering KDB instead of trapping for dtrace. MFC after: 1 week	2017-05-10 03:47:22 +00:00
Justin Hibbits	0440a7f539	Fix check for fbt_excluded() in powerpc fbt_excluded() returns 1 if the symbol is to be excluded. Every other arch has this correct, powerpc was the only broken one MFC after: 1 week	2017-05-10 03:20:20 +00:00
Mark Johnston	23bff6073b	Fix a harmless LOR in dtrace_load(). MFC after: 1 week	2017-05-01 17:01:00 +00:00
Josh Paetzel	ba13ab83f2	Fix misport of compressed ZFS send/recv from 317414 Reported by: Michael Jung <mikej@mikej.com> Reviewed by: avg	2017-05-01 12:56:12 +00:00
Mark Johnston	babf030fd6	Get rid of some ifdef soup in the fasttrap ioctl handler. No functional change intended. MFC after: 1 week	2017-04-28 22:25:22 +00:00
Josh Paetzel	285d85ab04	MFV 316905 7740 fix for 6513 only works in hole punching case, not truncation illumos/illumos-gate@7de35a3ed0 `7de35a3ed0` https://www.illumos.org/issues/7740 The problem is that dbuf_findbp will return ENOENT if the block it's trying to find is beyond the end of the file. If that happens, we assume there is no birth time, and so we lose that information when we write out new blkptrs. We should teach dbuf_findbp to look for things that are beyond the current end, but not beyond the absolute end of the file. To verify, create a large file, truncate it to a short length, and then write beyond the end. Check with zdb to make sure that there are no holes with birth time zero (will appear as gaps). Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Paul Dagnelie <pcd@delphix.com>	2017-04-28 02:11:29 +00:00
Josh Paetzel	358f157522	MFV 316900 7743 per-vdev-zaps have no initialize path on upgrade illumos/illumos-gate@555da5111b `555da5111b` https://www.illumos.org/issues/7743 When loading a pool that had been created before the existance of per-vdev zaps, on a system that knows about per-vdev zaps, the per-vdev zaps will not be allocated and initialized. This appears to be because the logic that would have done so, in spa_sync_config_object(), is not reached under normal operation. It is only reached if spa_config_dirty_list is non-empty. The fix is to add another `AVZ_ACTION_` enum that will allow this code to be reached when we detect that we're loading an old pool, even when there are no dirty configs. Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Don Brady <don.brady@intel.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Paul Dagnelie <pcd@delphix.com>	2017-04-27 23:31:38 +00:00
Josh Paetzel	8ad5797208	MFV 316898 7613 ms_freetree[4] is only used in syncing context illumos/illumos-gate@5f14577801 `5f14577801` https://www.illumos.org/issues/7613 metaslab_t:ms_freetree[TXG_SIZE] is only used in syncing context. We should replace it with two trees: the freeing tree (ranges that we are freeing this syncing txg) and the freed tree (ranges which have been freed this txg). Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Matthew Ahrens <mahrens@delphix.com>	2017-04-27 22:00:03 +00:00
Josh Paetzel	e3cb0e99f8	MFV 316897 7586 remove #ifdef __lint hack from dmu.h illumos/illumos-gate@4ba5b96163 `4ba5b96163` https://www.illumos.org/issues/7586 The #ifdef __lint in dmu.h is ugly, and it would be nice not to duplicate it if we add other inline functions into header files in ZFS, especially since it is difficult to make any other solution work across all compilation targets. We should switch to disabling the lint flags that are failing instead. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Dan Kimmel <dan.kimmel@delphix.com>	2017-04-27 21:11:57 +00:00
Josh Paetzel	011275233c	MFV 316896 7580 ztest failure in dbuf_read_impl illumos/illumos-gate@1a01181fdc `1a01181fdc` https://www.illumos.org/issues/7580 We need to prevent any reader whenever we're about the zero out all the blkptrs. To do this we need to grab the dn_struct_rwlock as writer in dbuf_write_children_ready and free_children just prior to calling bzero. Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: George Wilson <george.wilson@delphix.com>	2017-04-27 16:38:28 +00:00
Josh Paetzel	fa88c78914	MFV 316895 7606 dmu_objset_find_dp() takes a long time while importing pool illumos/illumos-gate@7588687e6b `7588687e6b` https://www.illumos.org/issues/7606 When importing a pool with a large number of filesystems within the same parent filesystem, we see that dmu_objset_find_dp() takes a long time. It is called from 3 places: spa_check_logs(), spa_ld_claim_log_blocks(), and spa_load_verify(). There are several ways to improve performance here: 1. We don't really need to do spa_check_logs() or spa_ld_claim_log_blocks() if the pool was closed cleanly. 2. spa_load_verify() uses dmu_objset_find_dp() to check that no datasets have too long of names. 3. dmu_objset_find_dp() is slow because it's doing zap_value_search() (which is O(N sibling datasets)) to determine the name of each dsl_dir when it's opened. In this case we actually know the name when we are opening it, so we can provide it and avoid the lookup. This change implements fix #3 from the above list; i.e. make dmu_objset_find_dp() provide the name of the dataset so that we don't have to search for it. Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prashanth Sreenivasa <prashksp@gmail.com> Approved by: Gordon Ross <gordon.w.ross@gmail.com> Author: Matthew Ahrens <mahrens@delphix.com>	2017-04-27 15:10:45 +00:00
Josh Paetzel	c78abb8b50	MFV 316894 7252 7628 compressed zfs send / receive illumos/illumos-gate@5602294fda `5602294fda` https://www.illumos.org/issues/7252 This feature includes code to allow a system with compressed ARC enabled to send data in its compressed form straight out of the ARC, and receive data in its compressed form directly into the ARC. https://www.illumos.org/issues/7628 We should have longer, more readable versions of the ZFS send / recv options. 7628 create long versions of ZFS send / receive options Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Reviewed by: David Quigley <dpquigl@davequigley.com> Reviewed by: Thomas Caputi <tcaputi@datto.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Dan Kimmel <dan.kimmel@delphix.com>	2017-04-25 17:57:43 +00:00
Josh Paetzel	ef18459108	MFV 316891 7386 zfs get does not work properly with bookmarks illumos/illumos-gate@edb901aab9 `edb901aab9` https://www.illumos.org/issues/7386 The zfs get command does not work with the bookmark parameter while it works properly with both filesystem and snapshot: # zfs get -t all -r creation rpool/test NAME PROPERTY VALUE SOURCE rpool/test creation Fri Sep 16 15:00 2016 - rpool/test@snap creation Fri Sep 16 15:00 2016 - rpool/test#bkmark creation Fri Sep 16 15:00 2016 - # zfs get -t all -r creation rpool/test@snap NAME PROPERTY VALUE SOURCE rpool/test@snap creation Fri Sep 16 15:00 2016 - # zfs get -t all -r creation rpool/test#bkmark cannot open 'rpool/test#bkmark': invalid dataset name # The zfs get command should be modified to work properly with bookmarks too. Reviewed by: Simon Klinkert <simon.klinkert@gmail.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Approved by: Matthew Ahrens <mahrens@delphix.com> Author: Marcel Telka <marcel@telka.sk>	2017-04-21 19:53:52 +00:00
Josh Paetzel	36064ac2d5	MFV 316871 7490 real checksum errors are silenced when zinject is on illumos/illumos-gate@6cedfc397d `6cedfc397d` https://www.illumos.org/issues/7490 When zinject is on, error codes from zfs_checksum_error() can be overwritten due to an incorrect and overly-complex if condition. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Pavel Zakharov <pavel.zakharov@delphix.com>	2017-04-21 00:24:59 +00:00
Josh Paetzel	9a625bd31c	MFV 316870 7448 ZFS doesn't notice when disk vdevs have no write cache illumos/illumos-gate@295438ba32 `295438ba32` https://www.illumos.org/issues/7448 I built a SmartOS image with all the NVMe commits including 7372 (support NVMe volatile write cache) and repeated my dd testing: > #!/bin/bash > for i in `seq 1 1000`; do > dd if=/dev/zero of=file00 bs=1M count=102400 oflag=sync & > dd if=/dev/zero of=file01 bs=1M count=102400 oflag=sync & > wait > rm file00 file01 > done > Previously each dd command took ~145 seconds to finish, now it takes ~400 seconds. Eventually I figured out it is 7372 that causes unnecessary nvme_bd_sync() executions which wasted CPU cycles. If a NVMe device doesn't support a write cache, the nvme_bd_sync function will return ENOTSUP to indicate this to upper layers. It seems this returned value is ignored by ZFS, and as such this bug is not really specific to NVMe. In vdev_disk_io_start() ZFS sends the flush to the disk driver (blkdev) with a callback to vdev_disk_ioctl_done(). As nvme filled in the bd_sync_cache function pointer, blkdev will not return ENOTSUP, as the nvme driver in general does support cache flush. Instead it will issue an asynchronous flush to nvme and immediately return 0, and hence ZFS will not set vdev_nowritecache here. The nvme driver will at some point process the cache flush command, and if there is no write cache on the device it will return ENOTSUP, which will be delivered to the vdev_disk_ioctl_done() callback. This function will not check the error code and not set nowritecache. The right place to check the error code from the cache flush is in zio_vdev_io_assess(). This would catch both cases, synchronous and asynchronous cache flushes. This would also be independent of the implementation detail that some drivers can return ENOTSUP immediately. Reviewed by: Dan Fields <dan.fields@nexenta.com> Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Hans Rosenfeld <hans.rosenfeld@nexenta.com> Obtained from: Illumos	2017-04-21 00:17:54 +00:00
Josh Paetzel	47e222432b	MFV 316868 7430 Backfill metadnode more intelligently illumos/illumos-gate@af346df588 `af346df588` https://www.illumos.org/issues/7430 Description and patch from brought over from the following ZoL commit: https:// github.com/zfsonlinux/zfs/commit/68cbd56e182ab949f58d004778d463aeb3f595c6 Only attempt to backfill lower metadnode object numbers if at least 4096 objects have been freed since the last rescan, and at most once per transaction group. This avoids a pathology in dmu_object_alloc() that caused O(N^2) behavior for create-heavy workloads and substantially improves object creation rates. As summarized by @mahrens in #4636: "Normally, the object allocator simply checks to see if the next object is available. The slow calls happened when dmu_object_alloc() checks to see if it can backfill lower object numbers. This happens every time we move on to a new L1 indirect block (i.e. every 32 * 128 = 4096 objects). When re-checking lower object numbers, we use the on-disk fill count (blkptr_t:blk_fill) to quickly skip over indirect blocks that don?t have enough free dnodes (defined as an L2 with at least 393,216 of 524,288 dnodes free). Therefore, we may find that a block of dnodes has a low (or zero) fill count, and yet we can?t allocate any of its dnodes, because they've been allocated in memory but not yet written to disk. In this case we have to hold each of the dnodes and then notice that it has been allocated in memory. The end result is that allocating N objects in the same TXG can require CPU usage proportional to N^2." Add a tunable dmu_rescan_dnode_threshold to define the number of objects that must be freed before a rescan is performed. Don't bother to export this as a module option because testing doesn't show a compelling reason to change it. The vast majority of the performance gain comes from limit the rescan to at most once per TXG. Reviewed by: Alek Pinchuk <alek@nexenta.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Gordon Ross <gordon.w.ross@gmail.com> Author: Ned Bass <bass6@llnl.gov> Obtained from: Illumos	2017-04-21 00:12:47 +00:00
Gleb Smirnoff	83c9dea1ba	- Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter in place. To do per-cpu stats, convert all fields that previously were maintained in the vmmeters that sit in pcpus to counter(9). - Since some vmmeter stats may be touched at very early stages of boot, before we have set up UMA and we can do counter_u64_alloc(), provide an early counter mechanism: o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter. o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter, so that at early stages of boot, before counters are allocated we already point to a counter that can be safely written to. o For sparc64 that required a whole dummy pcpu[MAXCPU] array. Further related changes: - Don't include vmmeter.h into pcpu.h. - vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit, to match kernel representation. - struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion. This is based on benno@'s 4-year old patch: https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html Reviewed by: kib, gallatin, marius, lidl Differential Revision: https://reviews.freebsd.org/D10156	2017-04-17 17:34:47 +00:00
Gleb Smirnoff	9ed01c32e0	All these files need sys/vmmeter.h, but now they got it implicitly included via sys/pcpu.h.	2017-04-17 17:07:00 +00:00
Andriy Gapon	656074ea60	rename vfs.zfs.debug_flags to vfs.zfs.debugflags While the former name is easier to read, the "_flags" suffix has a special meaning for loader(8) and, thus, it was impossible to set the knob via loader.conf(5). The loader interpreted the setting as flags that should be passed to a kernel module named "vfs.zfs.debug". Discussed with: smh MFC after: 2 weeks	2017-04-14 15:35:07 +00:00
Alan Somers	d255847d9e	Fix vdev_geom_attach_by_guids for partitioned disks When opening a vdev whose path is unknown, vdev_geom must find a geom provider with a label whose guids match the desired vdev. However, due to partitioning, it is possible that two non-synonomous providers will share some labels. For example, if the first partition starts at the beginning of the drive, then ada0 and ada0p1 will share the first label. More troubling, if the last partition runs to the end of the drive, then ada0p3 and ada0 will share the last label. If vdev_geom opens ada0 when it should've opened ada0p3, then the pool won't be readable. If it opens ada0 when it should've opened ada0p1, then it will corrupt some other partition when it writes the 3rd and 4th labels. The easiest way to reproduce this problem is to install a mirrored root pool with the default partition layout, then swap the positions of the two boot drives and reboot. Whether the bug manifests depends on the order in which geom lists its providers, which is arbitrary. Fix this situation by modifying the search algorithm to prefer geom providers that have all four labels intact. If no such provider exists, then open whichever provider has the most. Reviewed by: mav MFC after: 3 weeks Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D10365	2017-04-13 14:51:34 +00:00
Patrick Kelsey	67d955aab4	Corrected misspelled versions of rendezvous. The MFC will include a compat definition of smp_no_rendevous_barrier() that calls smp_no_rendezvous_barrier(). Reviewed by: gnn, kib MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D10313	2017-04-09 02:00:03 +00:00
Steven Hartland	3e856909b7	Fix expandsz 16.0E vals and vdev_min_asize of RAIDZ children When a member of a RAIDZ has been replaced with a device smaller than the original, then the top level vdev can report its expand size as 16.0E. The reduced child asize causes the RAIDZ to have a vdev_asize lower than its vdev_max_asize which then results in an underflow during the calculation of the parents expand size. Fix this by updating the vdev_asize if it shrinks, which is already protected by a check against vdev_min_asize so should always be safe. Also for RAIDZ vdevs, ensure that the sum of their child vdev_min_asize is always greater than the parents vdev_min_size. Fixes: https://www.illumos.org/issues/7885 MFC after: 2 weeks Sponsored by: Multiplay	2017-04-03 13:11:28 +00:00
Josh Paetzel	e106234416	MFV: 315989 7603 xuio_stat_wbuf_* should be declared (void) illumos/illumos-gate@99aa8b5505 `99aa8b5505` https://www.illumos.org/issues/7603 The funcs are declared k&r style, where the args are not specified: void xuio_stat_wbuf_copied(); They should be declared to take no arguments: void xuio_stat_wbuf_copied(void); Need to change both .c and .h. Author: Prashanth Sreenivasa <pks@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Robert Mustacchi <rm@joyent.com> Approved by: Richard Lowe <richlowe@richlowe.net>	2017-03-27 17:27:46 +00:00
Alexander Motin	3aef5b286a	MFV r315290, r315291: 7303 dynamic metaslab selection illumos/illumos-gate@8363e80ae7 https://github.com/illumos/illumos-gate/commit/8363e80ae72609660f6090766ca8c2c18 https://www.illumos.org/issues/7303 This change introduces a new weighting algorithm to improve metaslab selection. The new weighting algorithm relies on the SPACEMAP_HISTOGRAM feature. As a result, the metaslab weight now encodes the type of weighting algorithm used (size-based vs segment-based). This also introduce a new allocation tracing facility and two new dcmds to help debug allocation problems. Each zio now contains a zio_alloc_list_t structure that is populated as the zio goes through the allocations stage. Here's an example of how to use the tracing facility: > c5ec000::print zio_t io_alloc_list \| ::walk list \| ::metaslab_trace MSID DVA ASIZE WEIGHT RESULT VDEV - 0 400 0 NOT_ALLOCATABLE ztest.0a - 0 400 0 NOT_ALLOCATABLE ztest.0a - 0 400 0 ENOSPC ztest.0a - 0 200 0 NOT_ALLOCATABLE ztest.0a - 0 200 0 NOT_ALLOCATABLE ztest.0a - 0 200 0 ENOSPC ztest.0a 1 0 400 1 x 8M 17b1a00 ztest.0a > 1ff2400::print zio_t io_alloc_list \| ::walk list \| ::metaslab_trace MSID DVA ASIZE WEIGHT RESULT VDEV - 0 200 0 NOT_ALLOCATABLE mirror-2 - 0 200 0 NOT_ALLOCATABLE mirror-0 1 0 200 1 x 4M `112ae00` mirror-1 - 1 200 0 NOT_ALLOCATABLE mirror-2 - 1 200 0 NOT_ALLOCATABLE mirror-0 1 1 200 1 x 4M 112b000 mirror-1 - 2 200 0 NOT_ALLOCATABLE mirror-2 If the metaslab is using segment-based weighting then the WEIGHT column will display the number of segments available in the bucket where the allocation attempt was made. Author: George Wilson <george.wilson@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Chris Siden <christopher.siden@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Don Brady <don.brady@intel.com> Approved by: Richard Lowe <richlowe@richlowe.net>	2017-03-24 09:37:00 +00:00
Andriy Gapon	16b46572fa	zfs_putpages: use TXG_WAIT Explicit looping using TXG_NOWAIT is more verbose and may harm performance under heavy load because of multiple waits. MFC after: 1 week	2017-03-23 09:13:21 +00:00
Andriy Gapon	3d775e193e	zfs: add zio_buf_alloc_nowait and use it in vdev_queue_aggregate This way we can avoid blocking the whole queue in the low memory situations. It's better to sacrifice some I/O performance by not doing the aggregation than to add an indefinite wait for more memory. Reviewed by: smh MFC after: 2 weeks Sponsored by: Panzura Differential Revision: https://reviews.freebsd.org/D9999	2017-03-23 08:59:17 +00:00
Steven Hartland	c76da62acf	Reduce ARC fragmentation threshold As ZFS can request up to SPA_MAXBLOCKSIZE memory block e.g. during zfs recv, update the threshold at which we start agressive reclamation to use SPA_MAXBLOCKSIZE (16M) instead of the lower zfs_max_recordsize which defaults to 1M. PR: 194513 Reviewed by: avg, mav MFC after: 1 month Sponsored by: Multiplay Differential Revision: https://reviews.freebsd.org/D10012	2017-03-17 12:34:57 +00:00
Mark Johnston	9fc47d244c	Fix a backwards comparison in the code to dump a DTrace debug buffer. PR: 217739 MFC after: 1 week	2017-03-13 18:43:00 +00:00
Andriy Gapon	520758a51d	zfs: provide a special vptocnp method for the .zfs vnode vop_stdvptocnp() doesn't work properly if .zfs directory is hidden. Reported by: swills, des Tested by: des MFC after: 1 week MFC with: r314048	2017-03-11 16:00:49 +00:00
Andriy Gapon	1a3c849840	MFV r314911: 7867 ARC space accounting leak illumos/illumos-gate@6de76ce2a9 `6de76ce2a9` https://www.illumos.org/issues/7867 It seems that in the case where arc_hdr_free_pdata() sees HDR_L2_WRITING() we would fail to update the ARC space statistics. In the normal case those statistics are updated in arc_free_data_buf(). But in the arc_hdr_free_on_write() path we don't do that. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Andriy Gapon <avg@FreeBSD.org> MFC after: 10 days	2017-03-08 13:52:45 +00:00
Andriy Gapon	7e4b3a6fa2	MFV r314910: 7843 get_clones_stat() is suboptimal for lots of clones illumos/illumos-gate@c5bde7273e `c5bde7273e` https://www.illumos.org/issues/7843 get_clones_stat() could be very slow if a snapshot has many (thousands) clones. Clone names are added to an nvlist that's created with NV_UNIQUE_NAME. So, each time a new name is appended to the list, the whole list is searched linearly to see if that name is not already in the list. That results in the quadratic complexity. That should be easy to fix as we know in advance that we should not get any duplicate names, so we can drop NV_UNIQUE_NAME when creating the list. Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Andriy Gapon <avg@FreeBSD.org> MFC after: 1 week Sponsored by: ClusterHQ	2017-03-08 13:48:26 +00:00
Martin Matuska	0132c9cd4a	Fix null pointer dereference in zfs_freebsd_setacl(). Prevents unprivileged users from panicking the kernel by calling __acl_delete_*() on files or directories inside a ZFS mount. MFC after: 3 days	2017-03-02 23:23:28 +00:00
Alexander Motin	6d1ccf40cc	Execute last ZIO of log commit synchronously. For short transactions overhead of context switch can be too large. Skipping it gives significant latency reduction. For large ones, including multiple ZIOs, latency is less critical, while throughput there may become limited by checksumming speed of single CPU core. To get best of both cases, execute last ZIO directly from calling thread context to save latency, while all others (if there are any) enqueue to taskqueues in traditional way. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2017-03-02 07:55:47 +00:00
Alexander Motin	e93f9c7708	Completely skip cache flushing for not supporting log devices. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2017-03-02 07:50:06 +00:00
Andrey V. Elsukov	19b60f70c0	Do not invoke the resize event when previous provider's size was zero. This is similar to r303637 fix for geom_disk. Reported by: avg Tested by: avg MFC after: 1 week	2017-03-01 18:03:32 +00:00
Josh Paetzel	b98d22744f	MFV 314276 7570 tunable to allow zvol SCSI unmap to return on commit of txn to ZIL illumos/illumos-gate@1c9272b861 `1c9272b861` https://www.illumos.org/issues/7570 Based on the discovery that every unmap waits for the commit of the txn to the ZIL, introducing a very high latency to unmap commands, this behavior was made into a tunable zvol_unmap_sync_enabled and set to false. The net impact of this change is that by default SCSI unmap commands will result in space being freed within the zvol (today they are ignored and returned with good status). However, unlike the code today, instead of 18+ms per unmap, they take about 30us. With the testing done on NTFS against a Win2k12 target, the new behavior should work seamlessly. Files on the zvol that have already been set with the zfree application will continue to write 0's when deleted, and any new files created since zvol creation will send unmap commands when deleted. This behavior exists today, but with this change the unmap commands will be processed and result in reclaim of space. Author: Stephen Blinick <stephen.blinick@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Approved by: Robert Mustacchi <rm@joyent.com>	2017-02-25 20:01:17 +00:00
Andriy Gapon	9211bb327f	l2arc: try to fix write size calculation broken by Compressed ARC commit While there, make a change to not evict a first buffer outside the requested eviciton range. To do: - give more consistent names to the size variables - upstream to OpenZFS PR: 216178 Reported by: lev Tested by: lev MFC after: 2 weeks	2017-02-25 17:03:48 +00:00
Andriy Gapon	1e1065b60f	zfs: call spa_deadman on a taskqueue thread callout(9) prohibits callout functions from sleeping. illumos mutexes are emulated using sx(9). spa_deadman() calls vdev_deadman() and the latter acquires vq_lock. As a result we can get a more confusing panic instead of a specific panic or no panic: sleepq_add: td 0xfffff80019669960 to sleep on wchan 0xfffff8001cff4d88 with sleeping prohibited This change adds another level of indirection where the deadman callout schedules spa_deadman() to be executed on taskqueue_thread. While there, use callout_schedule(0 instead of callout_reset() in spa_sync(). Discussed with: mav MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D9762	2017-02-25 16:45:53 +00:00
Josh Paetzel	029c0bfdbd	MFV 314243 6676 Race between unique_insert() and unique_remove() causes ZFS fsid change illumos/illumos-gate@40510e8eba `40510e8eba` https://www.illumos.org/issues/6676 The fsid of zfs filesystems might change after reboot or remount. The problem seems to be caused by a race between unique_insert() and unique_remove(). The unique_remove() is called from dsl_dataset_evict() which is now an asynchronous thread. In a case the dsl_dataset_evict() thread is very slow and calls unique_remove() too late we will end up with changed fsid on zfs mount. This problem is very likely caused by #5056. Steps to Reproduce Note: I'm able to reproduce this always on a single core (virtual) machine. On multicore machines it is not so easy to reproduce. # uname -a SunOS openindiana 5.11 illumos-633aa80 i86pc i386 i86pc Solaris # zfs create rpool/TEST # FS=$(echo ::fsinfo \| mdb -k \| grep TEST \| awk '{print $1}') # echo $FS::print vfs_t vfs_fsid \| mdb -k vfs_fsid = { vfs_fsid.val = [ 0x54d7028a, 0x70311508 ] } # zfs umount rpool/TEST # zfs mount rpool/TEST # FS=$(echo ::fsinfo \| mdb -k \| grep TEST \| awk '{print $1}') # echo $FS::print vfs_t vfs_fsid \| mdb -k vfs_fsid = { vfs_fsid.val = [ 0xd9454e49, 0x6b36d08 ] } # Impact The persistent fsid (filesystem id) is essential for proper NFS functionality. If the fsid of a filesystem changes on remount (or after reboot) the NFS clients might not be able to automatically recover from such event and the manual remount of the NFS filesystems on every NFS client might be needed. Author: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com> Reviewed by: Dan Vatca <dan.vatca@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com>	2017-02-25 14:45:54 +00:00
Andriy Gapon	7fa27112f3	zfs: clean up unused files and definitions MFC after: 1 month X-MFC after: r314048	2017-02-24 07:53:56 +00:00
Toomas Soome	84a6eddc43	loader: update symlink support in zfs reader As the current zfs file system is providing symlink via system attributes, need to update the code accordingly. Note, as the zfsboot code does not free the memory at this time, the object list will put some stress on the boot2 heap, eventually we should address the issue. Reviewed by: allanjude, smh Approved by: allanjude (mentor) Differential Revision: https://reviews.freebsd.org/D9706	2017-02-22 22:00:50 +00:00
Andriy Gapon	b93763e55d	zfs: move zio_taskq_basedc under SYSDC That knob is useless without SDC (or alike) scheduling class support. That is, it's unused on FreeBSD. MFC after: 4 days	2017-02-21 21:11:58 +00:00
Andriy Gapon	2b1bedaf06	zfs: lower priority of zio_write_issue threads by four The difference of one was insignificant because zio_write_issue threads ended up on the same run queues as other zio threads. See sys/priority.h and sys/runq.h for more details. Add a comment describing FreeBSD priority considerations and restore the illumos variant of the code for comparison. Obtained from: Panzura MFC after: 2 weeks Sponsored by: Panzura	2017-02-21 21:09:21 +00:00
Andriy Gapon	47c8e3d912	reimplement zfsctl (.zfs) support The current code is written on top of GFS, a library with the generic support for writing filesystems, which was ported from illumos. Because of significant differences between illumos VFS and FreeBSD VFS models, both the GFS and zfsctl code were heavily modified to work on FreeBSD. Nonetheless, they still contain quite a few ugly hacks and bugs. This is a reimplementation of the zfsctl code where the VFS-specific bits are written from scratch and only the code that interacts with the rest of ZFS is reused. Some highlights. We use two types of nodes, static and on-demand. The static nodes are used for permanent directories like .zfs, .zfs/snapshot, etc. The on-demand nodes are used for ephemeral directories that act as snapshot mount points. Initially only static nodes are created. Their vnodes are instantiated when they are looked up. The on-demand nodes and vnodes are instantiated as needed and the nodes are destroyed as soon as the corresponding vnodes are reclaimed. We also try very hard to ensure that uncovered snapshot vnodes do not linger. They are supposed to become inactive as soon as they are uncovered and we try to recycle them immediately. When a filesystem is unmounted all snapshots under .zfs are unmounted first, then all vnodes are flushed and finally the static .zfs nodes are destroyed. There are some changes outside of zfsctl code too. z_ctldir is never used directly (as it is an opaque pointer), zfsctl_root() has to be used instead. The function returns a locked vnode now, so it accepts a lock flags parameter. The function can also fail now, e.g. during force unmounting, whereas previously it was infallible. zfsctl_root_lookup() is retired, instead of it VOP_LOOKUP() on the .zfs vnode (obtained with zfsctl_root) is used. Some ideas are picked from an independent work by will. Reviewed by: asomers, smh MFC after: 1 month Relnotes: maybe Differential Revision: https://reviews.freebsd.org/D7421	2017-02-21 17:47:08 +00:00
Josh Paetzel	aedc925301	MVF: 313876 7504 kmem_reap hangs spa_sync and administrative tasks illumos/illumos-gate@405a5a0f5c https://github.com/illumos/illumos-gate/commit/405a5a0f5c3ab36cb76559467d1a62ba648bd80 https://www.illumos.org/issues/7504 We see long spa_sync(). We are waiting to hold dp_config_rwlock for writer. Some other thread holds dp_config_rwlock for reader, then calls arc_get_data_buf(), which finds that arc_is_overflowing()==B_TRUE. So it waits (while holding dp_config_rwlock for reader) for arc_reclaim_thread to signal arc_reclaim_waiters_cv. Before signaling, arc_reclaim_thread does arc_kmem_reap_now(), which takes ~seconds. Author: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com>	2017-02-17 17:52:12 +00:00
Mark Johnston	7174af791e	Directly include needed headers rather than relying on pollution. We get machine/cpu.h via kmem.h -> proc.h -> _vm_domain.h -> seq.h. Reported by: Ryan Libby Sponsored by: Dell EMC Isilon X-MFC with: r313841	2017-02-17 03:27:20 +00:00
Mark Johnston	a11ac730a7	Prevent CPU migration when checking the DTrace nofault flag on x86. dtrace_trap() consumes page and protection faults triggered by code running in DTrace probe context. Such faults occur with interrupts disabled and are detected using a per-CPU flag. Regular faults cause dtrace_trap() to be called with interrupts enabled, and nothing was ensuring that the flag was read from the correct CPU. This may result in dtrace_trap() consuming unrelated page and protection faults when DTrace is enabled, causing the fault handler to return without actually having handled the fault. Diagnosed by: Ryan Libby <rlibby@gmail.com> MFC after: 3 days Sponsored by: Dell EMC Isilon	2017-02-16 23:05:20 +00:00
Josh Paetzel	c53cc7187c	MFV 313786 7500 Simplify dbuf_free_range by removing dn_unlisted_l0_blkid illumos/illumos-gate@653af1b809 `653af1b809` https://www.illumos.org/issues/7500 With the integration of: commit 0f6d88aded0d165f5954688a9b13bac76c38da84 Author: Alex Reece <alex@delphix.com> Date: Sat Jul 26 13:40:04 2014 -0800 4873 zvol unmap calls can take a very long time for larger datasets the dnode's dn_bufs field was changed from a list to a tree. As a result, the dn_unlisted_l0_blkid field is no longer necessary. Author: Stephen Blinick <stephen.blinick@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Gordon Ross <gordon.w.ross@gmail.com>	2017-02-16 19:00:09 +00:00
Mark Johnston	28180eff9e	Use pget() instead of pfind() in fasttrap_pid_{enable,disable}(). Suggested by: mjg MFC after: 1 week	2017-02-15 06:07:01 +00:00
Mark Johnston	2fce30fa8f	Check for an exiting process when enabling PID provider probes. MFC after: 1 week	2017-02-15 01:35:26 +00:00
Andriy Gapon	909bacfc59	remove l2_padding_needed statistic from zfs arc It became obsolete when the Compressed ARC support was committed. MFC after: 1 week	2017-02-12 19:45:30 +00:00
Andriy Gapon	e776c4054f	check remaining space in zfs implementations of vptocnp PR: 216939 Submitted by: Iouri V. Ivliev <fbsd@any.com.ru> MFC after: 1 week	2017-02-12 19:40:59 +00:00
Alan Somers	6159fb2f9c	Fix setting birthtime in ZFS sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c * In zfs_freebsd_setattr, if the caller wants to set the birthtime, set the bits that zfs_settattr expects * In zfs_setattr, if XAT_CREATETIME is set, set xoa_createtime, expected by zfs_xvattr_set. The two levels of indirection seem excessive, but it minimizes diffs vs OpenZFS. * In zfs_setattr, check for overflow of va_birthtime (from delphij) * Remove red herring in zfs_getattr sys/cddl/contrib/opensolaris/uts/common/sys/vnode.h * Un-booby-trap some macros New tests are under review at https://github.com/pjd/pjdfstest/pull/6 Reviewed by: avg MFC after: 3 weeks Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D9353	2017-02-09 21:30:53 +00:00
George V. Neville-Neil	c499408f8b	Fix the ifdef protection and remove superfluous extern statements Reported by: Konstantin Belousov MFC after: 2 weeks Sponsored by: DARPA, AFRL	2017-02-07 01:21:18 +00:00
Mark Johnston	9613442e83	Ensure that the DOF string length is divisible by 2. It is an ASCII encoding of a hexadecimal representation of the DOF file used to enable anonymous tracing, so its length should always be even. MFC after: 1 week	2017-02-05 02:47:34 +00:00
Mark Johnston	e801af6fba	Use PC-relative relocations for USDT probe sites on i386 and amd64. When recording probe site addresses in the output DOF file, dtrace -G needs to emit relocations for the .SUNW_dof section in order to obtain the addresses of functions containing probe sites. DTrace expects the addresses to be relative to the base address of the final ELF file, and the amd64 USDT implementation was relying on some unspecified and incorrect behaviour in the base system GNU ld to achieve this. This change reimplements the probe site relocation handling to allow USDT to be used with lld and newer GNU binutils. Specifically, it makes use of R_X86_64_PC64/R_386_PC32 relocations to obtain the probe site address relative to the DOF file address, and adds and uses a new DOF relocation type which computes the final probe site address using these relative offsets. Reported by and discussed with: Rafael Espíndola MFC after: 1 month Differential Revision: https://reviews.freebsd.org/D9374	2017-02-05 02:39:12 +00:00
George V. Neville-Neil	c613d0c2ba	Files which implement the new random number system code for DTrace Submitted by: Graeme Jenkinson MFC after: 2 weeks Sponsored by: DARPA, AFRL	2017-02-03 22:40:13 +00:00
George V. Neville-Neil	00bb01a40c	Replace the implementation of DTrace's RAND subroutine for generating low-quality random numbers with a modern implementation (xoroshiro128+) that is capable of generating better quality randomness without compromising performance. Submitted by: Graeme Jenkinson Reviewed by: markj MFC after: 2 weeks Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D9051	2017-02-03 22:26:19 +00:00
Mark Johnston	b3b5bfeb22	Sync the x86 dis_tables.c with upstream. This corresponds to the following illumos issues: 5755 want support for Intel FMA instrs 5756 want support for Intel BMI1 instrs 5757 want support for Intel BMI2 instrs 5758 want support for Intel AVX2 instrs 7204 Want broadwell rdseed and adx support 7208 Want stac/clac disasm support 7733 Need SHA Instruction dis support 7756 dis can't handle x86 SSE 3 instructions 7757 want avx2 disasm tests 7758 want SSE 4.1 disasm tests MFC after: 2 weeks	2017-02-03 03:22:47 +00:00
Baptiste Daroussin	b4b4b5304b	Revert crap accidentally committed	2017-01-28 16:31:23 +00:00
Baptiste Daroussin	814aaaa7da	Revert r312923 a better approach will be taken later	2017-01-28 16:30:14 +00:00
Mark Johnston	da5320b9d0	Fix an off-by-one in an assertion on fasttrap tracepoint sizes. FASTTRAP_MAX_INSTR_SIZE is the largest valid value of a tracepoint, so correct the assertion accordingly. This limit was hit with a 15-byte NOP. Reported by: bdrewery MFC after: 1 week Sponsored by: Dell EMC Isilon	2017-01-27 17:58:41 +00:00
Mark Johnston	61ef24a5a3	Fix initialization of "p" after r312658. CID: 1369410	2017-01-25 16:35:57 +00:00
Mark Johnston	792e2f09ee	Remove the DTRACEHIOC_ADD ioctl. This ioctl has been considered legacy by upstream since the DTrace code was first imported, and is unused. The removal also allows some simplification of dtrace_helper_slurp(). Also remove a bogus copyout in the DTRACEHIOC_ADDDOF handler. Due to a bug, it would overwrite an in-memory copy of the DOF header rather than the passed-in DOF helper. Moreover, DTRACEHIOC_ADDDOF already copies the helper back out automatically since its argument has the IOC_OUT attribute.	2017-01-23 02:21:06 +00:00
Josh Paetzel	f2be81e92c	MFV 312436 6569 large file delete can starve out write ops illumos/illumos-gate@ff5177ee8b `ff5177ee8b` https://www.illumos.org/issues/6569 The core issue I've found is that there is no throttle for how many deletes get assigned to one TXG. As a results when deleting large files we end up filling consecutive TXGs with deletes/frees, then write throttling other (more important) ops. There is an easy test case for this problem. Try deleting several large files (at least 1/2 TB) while you do write ops on the same pool. What we've seen is performance of these write ops (let's call it sideload I/O) would drop to zero. More specifically the problem is that dmu_free_long_range_impl() can/will fill up all of the dirty data in the pool "instantly", before many of the sideload ops can get in. So sideload performance will be impacted until all the files are freed. The solution we have tested at Nexenta (with positive results) creates a relatively simple throttle for how many "free" ops we let into one TXG. However this solution exposes other problems that should also be addressed. If we are to slow down freeing of data that means one has to wait even longer (assuming vnode ref count of 1) to get shell back after an rm or for NFS thread to finish the free-ing op. To avoid this the proposed solution is to call zfs_inactive() async for "large" files. Async freeing then begs for the reclaimed space to be accounted for in the zpool's "freeing" prop. The other issue with having a longer delete is the inability to export/unmount for a longer period of time. The proposed solution is to interrupt freeing of blocks when a fs is unmounted. Author: Alek Pinchuk <alek@nexenta.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Reviewed by: avg Differential Revision: D9008	2017-01-20 15:01:04 +00:00
Andrew Turner	ae69172343	Use the kernel stack in the ARM FBT DTrace provider. This is used to find the fifth argument to functions being traced, however there was an error where the userspace stack was being used. This may be invalid leading to a kernel panic if this address is unmapped. Submitted by: Graeme Jenkinson <graeme.jenkinson@cl.cam.ac.uk> MFC after: 1 week Differential Revision: https://reviews.freebsd.org/D9229	2017-01-18 13:27:24 +00:00
Mark Johnston	d01e6ad41b	Have DTrace handle faults when dereferencing a lock object pointer. MFC after: 1 week	2017-01-11 01:18:06 +00:00
Mark Johnston	4153c9b932	Ignore LC_SLEEPABLE when testing whether a mutex is adaptive. MFC after: 1 week	2017-01-11 01:15:55 +00:00
Mateusz Guzik	619ce4d72e	Revert r309619 "ifndef atomic_cas_* in cddl code" It was a temporary change to ease an import of native atomic_cas primitives. Instead, atomic_fcmpset was devised with different semantics. See r311168.	2017-01-03 21:02:30 +00:00
Mark Johnston	91371de1fa	Remove the "unused" DIF subroutine index left after r308582. These indices are input to a build-time script that generates code to validate subroutine names.	2017-01-03 00:24:12 +00:00
Mark Johnston	c71c814a97	Remove an obsolete pragma from dtrace.h. It triggers a compiler warning and has been removed upstream. MFC after: 1 week	2016-12-27 23:31:32 +00:00
George V. Neville-Neil	805e1842c8	Remove extra DOF_SEC_XLIMPORT from the DOF_SEC_ISLOADABLE macro MFC after: 2 weeks Sponsored by: DARPA, AFRL	2016-12-16 20:44:14 +00:00
Alexander Motin	c5f74c4873	Revert r310023 for now. After another look my new variable mapping was not exactly right.	2016-12-15 08:03:16 +00:00
Alexander Motin	d686b07132	Reduce diff from Illumos by better variables mapping.	2016-12-13 16:20:10 +00:00
Alexander Motin	2823b6467a	Postpone ZVOL media/block size caching till first open. At least on FreeBSD there are no legal way to access media or get its size without opening device/provider first. Postponing this caching allows to skip several disk seeks per ZVOL/snapshot during import. For HDD pool with 1 ZVOL in dev mode with 1000 snapshots this reduces pool import time from 40 to 10 seconds. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2016-12-11 19:50:39 +00:00
Alexander Motin	2fb5d72d58	Add missed vfs.zfs.zfetch.max_idistance sysctl.	2016-12-10 21:19:27 +00:00
Mark Johnston	f99a517272	Don't create FBT probes for lock owner methods. These functions may be called in DTrace probe context, so they cannot be safely traced. Moreover, they are currently only used by DTrace, so their corresponding FBT probes are not particularly useful. MFC after: 2 weeks	2016-12-10 03:13:11 +00:00
Mark Johnston	8bb9b7f17a	Consistently use fbt_excluded() on all architectures. MFC after: 2 weeks	2016-12-10 03:11:05 +00:00
Alexander Motin	9373759d13	Fix spa_alloc_tree sorting by offset in r305331. Original commit "7090 zfs should improve allocation order" declares alloc queue sorted by time and offset. But in practice io_offset is always zero, so sorting happened only by time, while order of writes with equal time was completely random. On Illumos this did not affected much thanks to using high resolution timestamps. On FreeBSD due to using much faster but low resolution timestamps it caused bad data placement on disks, affecting further read performance. This change switches zio_timestamp_compare() from comparing uninitialized io_offset to really populated io_bookmark values. I haven't decided yet what to do with timestampts, but on simple tests this change gives the same peformance results by just making code to work as declared. MFC after: 1 week	2016-12-08 15:58:03 +00:00
George V. Neville-Neil	af463464cf	Fix a kernel panic in DTrace's rw_iswriter subroutine. On FreeBSD the sense of rw_write_held() and rw_iswriter() were reversed, probably due to a cut and paste error. Using rw_iswriter() would cause the kernel to panic. Reviewed by: markj MFC after: 2 weeks Sponsored by: DARPA, AFRL Differential Revision: https://reviews.freebsd.org/D8718	2016-12-07 07:27:47 +00:00
Mateusz Guzik	ef32958e5d	ifndef atomic_cas_* in cddl code in preparation for native implementations This is a temporary change to not require all architectures to import at once. Discussed with: jhb	2016-12-06 14:08:49 +00:00
Andriy Gapon	0451d4e97b	MFV r309249: 3821 Race in rollback, zil close, and zil flush Note: there was a merge conflict resolved by me. illumos/illumos-gate@43297f973a `43297f973a` https://www.illumos.org/issues/3821 We recently had nodes with some of the latest zfs bits panic on us in a rollback-heavy environment. The following is from my preliminary analysis: Let's look at where we died: > $C ffffff01ea6b9a10 taskq_dispatch+0x3a(0, fffffffff7d20450, ffffff5551dea920, 1) ffffff01ea6b9a60 zil_clean+0xce(ffffff4b7106c080, 7e0f1) ffffff01ea6b9aa0 dsl_pool_sync_done+0x47(ffffff4313065680, 7e0f1) ffffff01ea6b9b70 spa_sync+0x55f(ffffff4310c1d040, 7e0f1) ffffff01ea6b9c20 txg_sync_thread+0x20f(ffffff4313065680) ffffff01ea6b9c30 thread_start+8() If we dig in we can find that this dataset corresponds to a zone: > ffffff4b7106c080::print zilog_t zl_os->os_dsl_dataset->ds_dir->dd_myname zl_os->os_dsl_dataset->ds_dir->dd_myname = [ "8ffce16a-13c2-4efa-a233- 9e378e89877b" ] Okay so we have a null taskq pointer. That only happens during the calls to zil_open and zil_close. If we poke around we can see that we're actually in midst of a rollback: > ::pgrep zfs \| ::printf "0x%x %s\\n" proc_t . p_user.u_psargs 0xffffff43262800a0 zfs rollback zones/15714eb6-f5ea-469f-ac6d- 4b8ab06213c2@marlin_init 0xffffff54e22a1028 zfs rollback zones/8ffce16a-13c2-4efa-a233- 9e378e89877b@marlin_init 0xffffff4362f3a058 zfs rollback zones/0ddb8e49-ca7e-42e1-8fdc- 4ac4ba8fe9f8@marlin_init 0xffffff5748e8d020 zfs rollback zones/426357b5-832d-4430-953e- 10cd45ff8e9f@marlin_init 0xffffff436b867008 zfs rollback zones/8f36bf37-8a9c-4a44-995c- 6d1b2751e6f5@marlin_init 0xffffff4381ad4090 zfs rollback zones/6c8eca18-fbd6-46dd-ac24- 2ed45cd0da70@marlin_init Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Approved by: Richard Lowe <richlowe@richlowe.net> Author: George Wilson <george.wilson@delphix.com> MFC after: 3 weeks	2016-11-28 15:14:31 +00:00
Andriy Gapon	69bac03666	MFV r308990: 7181 race between zfs_mount and zfs_ioc_rollback illumos/illumos-gate@90f2c094b3 `90f2c094b3` https://www.illumos.org/issues/7181 zfsvfs_setup() is called in both zfs_mount and zfs_resume_fs paths. dmu_objset_set_user(zfsvfs->z_os, zfsvfs) is called early in zfsvfs_setup() before the setup is actually completed, thus an under-constructed zfsvfs becomes visible. Additionally, there is nothing to serialize the two call paths. As a result two threads can step on each other's toes. assertion failed: zilog->zl_clean_taskq == NULL, file: ../../common/fs/zfs/zil.c, line: 1772 > $c vpanic() 0xfffffffffbdf6928() zil_open+0x45(ffffff1bbc5dd000, fffffffff7993880) zfsvfs_setup+0x84(ffffffb378d77000, 0) zfs_resume_fs+0x132(ffffffb378d77000, ffffffb37ddcf000) zfs_ioc_rollback+0x96(ffffffb37ddcf000, ffffff01dcdc4cd0, ffffff01aa091000) zfsdev_ioctl+0x215(10a00000000, 5a19, 80465f8, 100003, ffffff01ab318368, ffffff0004b59e58) cdev_ioctl+0x39(10a00000000, 5a19, 80465f8, 100003, ffffff01ab318368, ffffff0004b59e58) spec_ioctl+0x60(ffffff0197737700, 5a19, 80465f8, 100003, ffffff01ab318368, ffffff0004b59e58) fop_ioctl+0x55(ffffff0197737700, 5a19, 80465f8, 100003, ffffff01ab318368, ffffff0004b59e58) ioctl+0x9b(7, 5a19, 80465f8) sys_syscall32+0x1f7() > ffffff1bbc5dd000::print objset_t os_zil os_zil = 0xffffff1c053cf7c0 > 0xffffff1c053cf7c0::print zilog_t zl_clean_taskq Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Gordon Ross <gordon.w.ross@gmail.com> Author: Andriy Gapon <andriy.gapon@clusterhq.com> MFC after: 2 weeks	2016-11-24 10:34:42 +00:00
Andriy Gapon	b55ae64b50	MFV r308988: 7199, 7200 dsl_dataset_rollback_sync may try to free already free blocks 7199 dsl_dataset_rollback_sync may try to free already free blocks 7200 no blocks must be born in a txg after a snaphot is created illumos/illumos-gate@bfaed0b91e `bfaed0b91e` https://www.illumos.org/issues/7199 dsl_dataset_rollback_sync may try to free already freed blocks when it calls dsl_destroy_head_sync_impl to destroy a temporary clone. That happens if a snapshot to which we are rolling back and from which the clone is created has some ZIL records. https://www.illumos.org/issues/7200 No new blocks must be born in a dataset in the same TXG after a snapshot of the dataset is taken. Those blocks would have the same blk_birth as the dataset's ds_prev_snap_txg and as such they would be presumed to belong o the snapshot while in fact they do not. All the datasets must be clean before sync tasks are run, so the described scenario may happen only if one of the sync tasks dirties the dataset and another sync task takes its snapshot. Then, there will be another sync pass because of the dirty data and the new blocks will be born in the same TXG when the data is written out. It seems that almost all of the existing sync tasks modify only MOS and do not dirty any objsets. The only exception that I've been able to identify so far is the rollback which can modify an objset when it zeroes out the objset's ZIL. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Brad Lewis <brad.lewis@delphix.com> Approved by: Gordon Ross <gordon.w.ross@gmail.com> Author: Andriy Gapon <andriy.gapon@clusterhq.com> MFC after: 3 weeks	2016-11-24 10:29:21 +00:00
Andriy Gapon	239c22b73d	MFV r308987: 7180 potential race between zfs_suspend_fs+zfs_resume_fs and zfs_ioc_rename illumos/illumos-gate@690041b9ca `690041b9ca` https://www.illumos.org/issues/7180 If a filesystem is not unmounted while the rename is being performed, then, for example, a concurrect zfs rollback may call zfs_suspend_fs followed by zfs_resume_fs on the same filesystem. The latter takes the filesystem's name as an argument. If the filesystem name changes as a result of the rename, then dmu_objset_hold(osname, zfsvfs, &os) call in zfs_resume_fs would fail resulting in a kernel panic. So far I have been able to reproduce this problem on FreeBSD where zfs rename has -u option that skips the unmounting before doing the renaming. But I think that in theory the same problem can occur on illumos as well, because the unmounting is done in userland before invoking the rename ioctl and there could be a race with, e.g., zfs mount. panic: solaris assert: dmu_objset_hold(osname, zfsvfs, &zfsvfs->z_os) == 0 (0x2 == 0x0), file: /usr/devel/svn/head/sys/cddl/contrib/opensolaris/uts/common/fs/ zfs/zfs_vfsops.c, line: 2210 KDB: stack backtrace: db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe004df30710 vpanic() at vpanic+0x182/frame 0xfffffe004df30790 panic() at panic+0x43/frame 0xfffffe004df307f0 assfail3() at assfail3+0x2c/frame 0xfffffe004df30810 zfs_resume_fs() at zfs_resume_fs+0xb9/frame 0xfffffe004df30860 zfs_ioc_rollback() at zfs_ioc_rollback+0x61/frame 0xfffffe004df308a0 zfsdev_ioctl() at zfsdev_ioctl+0x65c/frame 0xfffffe004df30940 devfs_ioctl_f() at devfs_ioctl_f+0x156/frame 0xfffffe004df309a0 kern_ioctl() at kern_ioctl+0x246/frame 0xfffffe004df30a00 sys_ioctl() at sys_ioctl+0x171/frame 0xfffffe004df30ae0 amd64_syscall() at amd64_syscall+0x2db/frame 0xfffffe004df30bf0 Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe004df30bf0 Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> MFC after: 2 weeks	2016-11-24 10:21:22 +00:00
Andriy Gapon	d15b9428bb	further fix zfs_lock() diagnostics It was very wrong to look at the vnode and znode internals without having locked the vnode first. Reported by: pho Tested by: pho MFC after: 1 week X-MFC with: r308887	2016-11-24 09:00:51 +00:00
George V. Neville-Neil	cdaa8777f7	Add tunable to disable destructive dtrace Submitted by: Joerg Pernfuss <code.jpe@gmail.com> Reviewed by: rstone, markj MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D8624	2016-11-23 22:50:20 +00:00
Alan Cox	bba39b9ae3	Remove PG_CACHED-related fields from struct vmmeter, because they are no longer used. More precisely, they are always zero because the code that decremented and incremented them no longer exists. Bump __FreeBSD_version to mark this change. Reviewed by: kib, markj Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8583	2016-11-22 18:13:46 +00:00
Andriy Gapon	17055fcda7	fix unsafe modification of zfs_vnodeops when DIAGNOSTIC is enabled The idea was to avoid a false assertion in zfs_lock, but it was implemented very dangerously and incorrectly. Reported by: pho Tested by: pho MFC after: 1 week	2016-11-20 14:00:50 +00:00
Andriy Gapon	2ec31e84cc	zfs: fix up after the removal of PG_CACHED pages in r308691 PR: 214629 Reported by: mshirk@daemon-security.com Reviewed by: alc Tested by: Shawn Webb <shawn.webb@hardenedbsd.org> X-MFC with: 308691	2016-11-19 08:12:57 +00:00
Mark Johnston	188011dbf2	Support fetching RFLAGS in fasttrap_getreg(). MFC after: 1 week	2016-11-18 03:11:11 +00:00
Alexander Motin	14b5719f6a	After some ZIL changes 6 years ago zil_slog_limit got partially broken due to zl_itx_list_sz not updated when async itx'es upgraded to sync. Actually because of other changes about that time zl_itx_list_sz is not really required to implement the functionality, so this patch removes some unneeded broken code and variables. Original idea of zil_slog_limit was to reduce chance of SLOG abuse by single heavy logger, that increased latency for other (more latency critical) loggers, by pushing heavy log out into the main pool instead of SLOG. Beside huge latency increase for heavy writers, this implementation caused double write of all data, since the log records were explicitly prepared for SLOG. Since we now have I/O scheduler, I've found it can be much more efficient to reduce priority of heavy logger SLOG writes from ZIO_PRIORITY_SYNC_WRITE to ZIO_PRIORITY_ASYNC_WRITE, while still leave them on SLOG. Existing ZIL implementation had problem with space efficiency when it has to write large chunks of data into log blocks of limited size. In some cases efficiency stopped to almost as low as 50%. In case of ZIL stored on spinning rust, that also reduced log write speed in half, since head had to uselessly fly over allocated but not written areas. This change improves the situation by offloading problematic operations from z_log_write() to zil_lwb_commit(), which knows real situation of log blocks allocation and can split large requests into pieces much more efficiently. Also as side effect it removes one of two data copy operations done by ZIL code WR_COPIED case. While there, untangle and unify code of z_log_write() functions. Also zfs_log_write() alike to zvol_log_write() can now handle writes crossing block boundary, that may also improve efficiency if ZPL is made to do that. Sponsored by: iXsystems, Inc.	2016-11-17 21:01:27 +00:00
Alexander Motin	eb9bfc257d	Revert r307392: I've found a way to avoid big allocations completely.	2016-11-17 20:44:51 +00:00
Alan Cox	7667839a7e	Remove most of the code for implementing PG_CACHED pages. (This change does not remove user-space visible fields from vm_cnt or all of the references to cached pages from comments. Those changes will come later.) Reviewed by: kib, markj Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8497	2016-11-15 18:22:50 +00:00
Mark Johnston	375c8b20dc	Remove the DTrace printt and typeref actions. These are FreeBSD-specific and were added in r178576 to provide the ability to pretty-print instances of compound types. However, the print action has long since been augmented to provide this functionality with a simpler interface. Discussed with: gnn Differential Revision: https://reviews.freebsd.org/D8478	2016-11-12 19:26:12 +00:00
Bryan Drewery	28323add09	Fix improper use of "its". Sponsored by: Dell EMC Isilon	2016-11-08 23:59:41 +00:00
Oleksandr Tymoshenko	d30e308465	Fix include order as required post r308415	2016-11-07 20:02:18 +00:00
Alexander Motin	8acf168aab	Fix ZIL records ordering when ZVOL opened both with and without FSYNC. Before this an earlier writes to a ZVOL opened without FSYNC could get to ZIL after later writes to the same ZVOL opened with FSYNC. Fix this by replicating functionality of ZPL (zv_sync_cnt equivalent to z_sync_cnt), marking all log records sync if anybody opened the ZVOL with FSYNC. MFC after: 2 weeks	2016-11-01 16:03:31 +00:00
Alexander Motin	2d1d8f4c8f	Pass to zvol_log_truncate() same sync values as to zvol_log_write(). Surplus marking of TX_TRUNCATE records as sync could result in putting them into ZIL before previous writes if ones were async. MFC after: 2 weeks	2016-11-01 12:47:19 +00:00
Alexander Motin	74a148f46f	Add sysctls for zfs_immediate_write_sz and zvol_immediate_write_sz.	2016-10-29 23:25:12 +00:00
Andriy Gapon	97371ba2a9	zfsbootcfg: a simple tool to set next boot (one time) options for zfsboot (gpt)zfsboot will read one-time boot directives from a special ZFS pool area. The area was previously described as "Boot Block Header", but currently it is know as Pad2, marked as reserved and is zeroed out on pool creation. The new code interprets data in this area, if any, using the same format as boot.config. The area is immediately wiped out. Failure to parse the directives results in a reboot right after the cleanup. Otherwise the boot sequence proceeds as usual. zfsbootcfg writes zfsboot arguments specified on its command line to the Pad2 area of a disk identified by vfs.zfs.boot.primary_pool and vfs.zfs.boot.primary_vdev kenv variables that are set by loader during boot. Please see the manual page for more. Thanks to all who reviewed, contributed and made suggestions! There are many potential improvements to the feature, please see the review for details. Reviewed by: wblock (docs) Discussed with: jhb, tsoome MFC after: 3 weeks Relnotes: yes Differential Revision: https://reviews.freebsd.org/D7612	2016-10-29 14:09:32 +00:00
Alexander Motin	471cf6ce7d	Add vdev_reopening support to vdev_geom. It allows to avoid extra GEOM providers flapping without significant need. Since GEOM got resize support, we don't need to reopen provider to get new size. If provider was orphaned and no longer valid, ZFS should already know that, and in such case reopen should be done in full as expected. MFC after: 2 weeks	2016-10-28 17:05:14 +00:00
Alexander Motin	f106f43aa2	Matching GUIDs, handle possible race on vdev detach. In case of vdev detach, causing top level mirror vdev destruction, leaf vdev changes its GUID to one of the destroyed mirror, that creates race condition when GUID in vdev label may not match one in the pool config. This change replicates logic nuance of vdev_validate() by adding special exception, matching the vdev GUID against the top level vdev GUID. Since this exception is not completely reliable (may give false positives if we fail to erase label on detached vdev), use it only as last resort. Quick way to reproduce this scenario now is detach vdev from a pool with enabled autoextend. During vdev detach autoextend logic tries to reopen remaining vdev, that always fails now since in-memory configuration is already updated, while on-disk labels are not yet. MFC after: 2 weeks	2016-10-28 16:21:31 +00:00
Alexander Motin	4be4cba048	Improve few debugging log messages.	2016-10-28 15:30:10 +00:00
Andriy Gapon	539fc86f2e	3746 ZRLs are racy illumos/illumos-gate@260af64db7 `260af64db7` https://www.illumos.org/issues/3746 From the original change log: It was possible for a reference to be added even with the lock held, and for references added just after a lock release to be lost. This bug was also independently found and reported in wesunsolve.net issues 6985013 6995524. In zrl_add(), always use an atomic operation to update the refcount. The mutex in the ZRL only guarantees that wakeups occur for waiters on the lock. It offers no protection against concurrent updates of the refcount. The only refcount transition that is safe to perform without an atomic operation is from ZRL_LOCKED back to 0, since this can only be performed by the thread which has the ZRL locked. Authored by: Will Andrews <will@freebsd.org> Reviewed by: Boris Protopopov <bprotopopov@hotmail.com> Reviewed by: Pavel Zakharov <pavel.zakha@gmail.com> Reviewed by: Yuri Pankov <yuri.pankov@gmail.com> Reviewed by: Justin T. Gibbs <gibbs@scsiguy.com> Approved by: Matt Ahrens <mahrens@delphix.com> Author: Youzhong Yang <yyang@mathworks.com> PR: 204037 MFC after: 1 week	2016-10-27 07:38:07 +00:00
Alexander Motin	f0cbbdecbc	Fix panic after ZVOL renamed to name invalid for DEVFS. MFC after: 2 weeks	2016-10-24 12:24:24 +00:00
Alexander Motin	9be66df1e1	Add vfs.zfs.zil_log_limit sysctl. It is at least partially broken now, but that is another question.	2016-10-16 18:49:15 +00:00
Alexander Motin	a059d8ccbc	Optimize ZIL itx memory allocation on FreeBSD. These allocations can reach up to 128KB, while FreeBSD kernel allocator can cache allocations only up to 64KB. To avoid expensive allocations for each large ZIL write use caching zio_buf_alloc() allocator instead. To make it possible de-inline few instances of zil_itx_destroy().	2016-10-16 10:43:12 +00:00
Alexander Motin	1899e205d1	MFV r307314: 6988 spa_sync() spends half its time in dmu_objset_do_userquota_updates Using a benchmark which creates 2 million files in one TXG, I observe that the thread running spa_sync() is on CPU almost the entire time we are syncing, and therefore can be a performance bottleneck. About 50% of the time in spa_sync() is in dmu_objset_do_userquota_updates(). The problem is that dmu_objset_do_userquota_updates() calls zap_increment_int(DMU_USERUSED_OBJECT) once for every file that was modified (or created). In this benchmark, all the files are owned by the same user/group, so all 2 million calls to zap_increment_int() are modifying the same entry in the zap. The same issue exists for the DMU_GROUPUSED_OBJECT. We should keep an in-memory map from user to space delta while we are syncing, and when we finish, iterate over the in-memory map and modify the ZAP once per entry. This reduces the number of calls to zap_increment_int() from "number of objects modified" to "number of owners/groups of modified files". This reduced the time spent in spa_sync() in the file create benchmark by ~33%, from 11 seconds to 7 seconds. Closes #107 Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: Ned Bass <bass6@llnl.gov> Reviewed by: Jinshan Xiong <jinshan.xiong@intel.com> Author: Matthew Ahrens <mahrens@delphix.com> openzfs/openzfs@5fc46359c5	2016-10-14 12:03:04 +00:00
Alexander Motin	b3a8b04807	MFV r307313: 5120 zfs should allow large block/gzip/raidz boot pool (loader project) Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Toomas Soome <tsoome@me.com> openzfs/openzfs@c8811bd3e2 FreeBSD still does not support booting from gzip-compressed datasets, so keep one chunk of this commit out.	2016-10-14 12:01:33 +00:00
Konstantin Belousov	5975e53d40	Fix a race in vm_page_busy_sleep(9). Suppose that we have an exclusively busy page, and a thread which can accept shared-busy page. In this case, typical code waiting for the page xbusy state to pass is again: VM_OBJECT_WLOCK(object); ... if (vm_page_xbusied(m)) { vm_page_lock(m); VM_OBJECT_WUNLOCK(object); <---1 vm_page_busy_sleep(p, "vmopax"); goto again; } Suppose that the xbusy state owner locked the object, unbusied the page and unlocked the object after we are at the line [1], but before we executed the load of the busy_lock word in vm_page_busy_sleep(). If it happens that there is still no waiters recorded for the busy state, the xbusy owner did not acquired the page lock, so it proceeded. More, suppose that some other thread happen to share-busy the page after xbusy state was relinquished but before the m->busy_lock is read in vm_page_busy_sleep(). Again, that thread only needs vm_object lock to proceed. Then, vm_page_busy_sleep() reads busy_lock value equal to the VPB_SHARERS_WORD(1). In this case, all tests in vm_page_busy_sleep(9) pass and we are going to sleep, despite the page being share-busied. Update check for m->busy_lock == VPB_UNBUSIED in vm_page_busy_sleep(9) to also accept shared-busy state if we only wait for the xbusy state to pass. Merge sequential if()s with the same 'then' clause in vm_page_busy_sleep(). Note that the current code does not share-busy pages from parallel threads, the only way to have more that one sbusy owner is right now is to recurse. Reported and tested by: pho (previous version) Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D8196	2016-10-13 14:41:05 +00:00
Konstantin Belousov	f71d08566c	Limit scope of the optimization in r306608 to dounmount() caller only. Other uses of cache_purgevfs() do rely on the cache purge for correct operations, when paths are invalidated without unmount. Reported and tested by: jkim Discussed with: mjg Sponsored by: The FreeBSD Foundation	2016-10-07 11:38:28 +00:00
Andriy Gapon	6f98c83306	implement zfs_vptocnp() using z_parent property This should allow vn_fullpath() to work even when vfs name cache is disabled for zfs, which is the case when zfs properties like casesensitivity and normalization are set non-default values. The new code should be 100% reliable for directories and "mostly" reliable for files, that is, when hardlinks across directories are not used. Reported by: Frederic Chardon <chardon.frederic@gmail.com> Reviewed by: kib (vfs contract) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D8146	2016-10-07 06:29:24 +00:00
Andriy Gapon	9ba3abc30e	zfs: fix a wrong assertion for extended attributes For the extended attributes the order between z_teardown_lock and the vnode lock is different. The bug was triggered only with DIAGNOSTIC turned on. This fix is developed in cooperation with avos. PR: 213112 Reported by: avos Tested by: avos MFC after: 1 week	2016-10-04 08:09:25 +00:00
Mark Johnston	4538cee5bf	Allow tracing of functions prefixed by "__". This restriction was inherited from upstream but is not relevant on FreeBSD. Furthermore, it hindered the tracing of locking primitive subroutines. MFC after: 1 week	2016-10-02 00:35:00 +00:00
Alexander Motin	863ef2ca62	Add #ifdef _KERNEL around send_holes_without_birth_time sysctl. Reported by: avg@	2016-09-29 17:48:53 +00:00
Alexander Motin	226a11f81e	MFV r306423: 7402 Create tunable to ignore hole_birth feature Until we can resolve the numerous hole_birth bugs that have cropped up recently, and come up with a way going forwards to protect users from corruption, we should disable the hole_birth feature. Using a tunable allows those who are confident that their data is correct to continue to take advantage of the feature. Closes #188 Reviewed by: Matthew Ahrens <mahrens@delphix.com> Author: Paul Dagnelie <pcd@delphix.com>	2016-09-29 00:00:37 +00:00
Alexander Motin	bb97118138	MFV r306422: 7254 ztest failed assertion in ztest_dataset_dirobj_verify: dirobjs + 1 == usedobjs dsl_dataset_space is looking at the ds_bp's fill count while dmu_objset_write_ready() is concurrently modifying it. This fix adds an rrwlock to protect the ds_bp. Closes #180 Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Author: Paul Dagnelie <pcd@delphix.com>	2016-09-28 23:54:47 +00:00
Mark Johnston	9e579a58c3	Move implementations of uread() and uwrite() to the illumos compat layer. MFC after: 1 week	2016-09-24 21:40:14 +00:00
Andriy Gapon	d26312a4e4	fix vnode lock assertion for extended attributes directory Background. In ZFS a file with extended attributes has a special directory associated with it where each extended attribute is a file. The attribute's name is a file name and its value is a file content. When the ownership of a file with extended attributes is changed, ZFS also changes ownership of the special directory. This is where the bug was hit. The bug was introduced in r209158. Nota bene. ZFS vnode locks are typically acquired before z_teardown_lock (i.e., before ZFS_ENTER). But this is not the case for the vnodes that represent the extended attribute directory and files. Those are always locked after ZFS_ENTER. This is confusing and fragile. PR: 212702 Reported by: Christian Fuss to FreeNAS Tested by: mav MFC after: 1 week	2016-09-24 08:13:15 +00:00
Mark Johnston	36f5d07745	Re-check the systrace probe ID before calling dtrace_probe(). Otherwise there exists a narrow window during which a syscall probe can be disabled and cause a concurrently-running thread to call dtrace_probe() with an invalid probe ID. Reported by: ngie MFC after: 1 week Sponsored by: Dell EMC Isilon	2016-09-22 23:22:53 +00:00
Allan Jude	c2b475d0ee	MFV r268120: 4936 lz4 could theoretically overflow a pointer with a certain input illumos/illumos-gate@58d0718061 Reviewed by: delphij MFC after: 2 weeks Sponsored by: ScaleEngine Inc. Differential Revision: https://reviews.freebsd.org/D7850	2016-09-11 17:48:06 +00:00
Alexander Motin	20e45e033c	Switch random_get_pseudo_bytes() shim to arc4rand(). Our shim for Solaris random_get_bytes() uses read_random(), that looks reasonable, since it guaranties reliably seeded random data. On the other side Solaris random_get_pseudo_bytes() does not provide this guarantie, and its original Solaris implementation is equivalent to our arc4rand(), using software crypto without stressing slower hardware RNG.	2016-09-10 09:37:41 +00:00
Alexander Motin	4605bf63c4	MFV r305562: 7259 DS_FIELD_LARGE_BLOCKS is unused The DS_FIELD_LARGE_BLOCKS macro has been unused since the integration of this patch: commit ca0cc3918a1789fa839194af2a9245f801a06b1a Author: Matthew Ahrens <mahrens@delphix.com> Date: Fri Jul 24 09:53:55 2015 -0700 5959 clean up per-dataset feature count code Reviewed by: Toomas Soome <tsoome@me.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> This patch simply removes this macro from dsl_dataset.h. Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Author: Matthew Ahrens <mahrens@delphix.com>	2016-09-07 20:09:24 +00:00
Alexander Motin	de1fdddeda	MFV r305560: 7278 tuning zfs_arc_max does not impact arc_c_min When changing zfs_arc_max (e.g. as zdb does), it may be set to less than the default arc_c_min. arc_c_min should decrease to not be more than arc_c_max, but it doesn't; therefore tuning of arc_c_max is ineffective. Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Author: Matthew Ahrens <mahrens@delphix.com> openzfs/openzfs@608764bead	2016-09-07 20:05:10 +00:00
Andriy Gapon	1a82707cd7	fix zfs pool creation accidentally broken by r305331 The upstream change introduced a new load state, SPA_LOAD_CREATE, and vdev_geom code needs to be aware of it. Tested by: cy MFC after: 1 week X-MFC with: r305331	2016-09-06 06:09:12 +00:00
Alexander Motin	9b9258a12a	Missed FreeBSD-specific piece of r305338.	2016-09-03 11:17:33 +00:00
Alexander Motin	d7e781bda3	MFC r305337: 7004 dmu_tx_hold_zap() does dnode_hold() 7x on same object Using a benchmark which has 32 threads creating 2 million files in the same directory, on a machine with 16 CPU cores, I observed poor performance. I noticed that dmu_tx_hold_zap() was using about 30% of all CPU, and doing dnode_hold() 7 times on the same object (the ZAP object that is being held). dmu_tx_hold_zap() keeps a hold on the dnode_t the entire time it is running, in dmu_tx_hold_t:txh_dnode, so it would be nice to use the dnode_t that we already have in hand, rather than repeatedly calling dnode_hold(). To do this, we need to pass the dnode_t down through all the intermediate calls that dmu_tx_hold_zap() makes, making these routines take the dnode_t* rather than an objset_t* and a uint64_t object number. In particular, the following routines will need to have analogous *_by_dnode() variants created: dmu_buf_hold_noread() dmu_buf_hold() zap_lookup() zap_lookup_norm() zap_count_write() zap_lockdir() zap_count_write() This can improve performance on the benchmark described above by 100%, from 30,000 file creations per second to 60,000. (This improvement is on top of that provided by working around the object allocation issue. Peak performance of ~90,000 creations per second was observed with 8 CPUs; adding CPUs past that decreased performance due to lock contention.) The CPU used by dmu_tx_hold_zap() was reduced by 88%, from 340 CPU-seconds to 40 CPU-seconds. Sponsored by: Intel Corp. Closes #109 Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Ned Bass <bass6@llnl.gov> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Author: Matthew Ahrens <mahrens@delphix.com> openzfs/openzfs@d3e523d489	2016-09-03 11:00:29 +00:00
Alexander Motin	4ad4b70e77	MFV r305336: 7247 zfs receive of deduplicated stream fails This resolves two 'zfs recv' issues. First, when receiving into an existing filesystem, a snapshot created during the receive process is not added to the guid->dataset map for the stream, resulting in failed lookups for deduped streams when a WRITE_BYREF record refers to a snapshot received earlier in the stream. Second, the newly created snapshot was also not set properly, referencing the snapshot before the new receiving dataset rather than the existing filesystem. Closes #159 Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Author: Chris Williamson <chris.williamson@delphix.com> openzfs/openzfs@b09697c8c1	2016-09-03 10:59:05 +00:00
Alexander Motin	070da3f779	MFV r305335: 7003 zap_lockdir() should tag hold zap_lockdir() / zap_unlockdir() should take a "void *tag" argument which tags the hold on the zap. This will help diagnose programming errors which misuse the hold on the ZAP. Sponsored by: Intel Corp. Closes #108 Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Author: Matthew Ahrens <mahrens@delphix.com> openzfs/openzfs@0780b3eab5	2016-09-03 10:58:14 +00:00
Alexander Motin	d3ec2cdb4a	MFV r304157: 7230 add assertions to dmu_send_impl() to verify that stream includes BEGIN and END records illumos/illumos-gate@12b90ee2d3 https://github.com/illumos/illumos-gate/commit/12b90ee2d3b10689fc45f4930d2392f5f e1d9cfa https://www.illumos.org/issues/7230 A test failure occurred where a send stream had only a BEGIN record. This should not be possible if the send returns without error. Prevented this from happening in the future by adding an assertion to dmu_send_impl() to verify that if the function returns 0 (success) both a BEGIN and END record are present. Did this by adding flags to dmu_sendarg_t (indicating whether BEGIN o r END records sent), having dump_record() set flags appropriately, adding VERIFY statement to dmu_send_impl(). Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matt Krantz <matt.krantz@delphix.com>	2016-09-03 10:10:58 +00:00
Alexander Motin	7aafc9d4c8	MFV r304156: 7235 remove unused func dsl_dataset_set_blkptr illumos/illumos-gate@bd56f80007 https://github.com/illumos/illumos-gate/commit/bd56f80007857b960e0981ed0797ad8ec 844a96b https://www.illumos.org/issues/7235 The function dsl_dataset_set_blkptr() is unused. We should remove it. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com>	2016-09-03 10:09:23 +00:00
Alexander Motin	c9fa25c110	MFV r304155: 7090 zfs should improve allocation order and throttle allocations illumos/illumos-gate@0f7643c737 https://github.com/illumos/illumos-gate/commit/0f7643c7376dd69a08acbfc9d1d7d548b 10c846a https://www.illumos.org/issues/7090 When write I/Os are issued, they are issued in block order but the ZIO pipelin e will drive them asynchronously through the allocation stage which can result i n blocks being allocated out-of-order. It would be nice to preserve as much of the logical order as possible. In addition, the allocations are equally scattered across all top-level VDEVs but not all top-level VDEVs are created equally. The pipeline should be able t o detect devices that are more capable of handling allocations and should allocate more blocks to those devices. This allows for dynamic allocation distribution when devices are imbalanced as fuller devices will tend to be slower than empty devices. The change includes a new pool-wide allocation queue which would throttle and order allocations in the ZIO pipeline. The queue would be ordered by issued time and offset and would provide an initial amount of allocation of work to each top-level vdev. The allocation logic utilizes a reservation system to reserve allocations that will be performed by the allocator. Once an allocatio n is successfully completed it's scheduled on a given top-level vdev. Each top- level vdev maintains a maximum number of allocations that it can handle (mg_alloc_queue_depth). The pool-wide reserved allocations (top-levels * mg_alloc_queue_depth) are distributed across the top-level vdevs metaslab groups and round robin across all eligible metaslab groups to distribute the work. As top-levels complete their work, they receive additional work from the pool-wide allocation queue until the allocation queue is emptied. Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: George Wilson <george.wilson@delphix.com>	2016-09-03 10:04:37 +00:00
Alexander Motin	0b51a59fc7	MFV r303078: 7086 ztest attempts dva_get_dsize_sync on an embedded blockpointer illumos/illumos-gate@926549256b https://github.com/illumos/illumos-gate/commit/926549256b71acd595f69b236779ff6b7 8fa08ef https://www.illumos.org/issues/7086 In dbuf_dirty(), we need to grab the dn_struct_rwlock before looking at the db_blkptr, to prevent it from being changed by syncing context. Otherwise we may see that ztest got a segfault from this stack: libzpool.so.1`dva_get_dsize_sync+0x98(872f000, `b32b240`, fed7811b, 0, b4cda20, 0) libzpool.so.1`bp_get_dsize+0x60(872f000, `b32b240`, 0, 97cb780, 9d4c1a8, 0) libzpool.so.1`dbuf_dirty+0x9b3(ce0a100, 97cb780, 9, fecd2530) libzpool.so.1`dmu_buf_will_dirty+0xc3(ce0a100, 97cb780, ea293d6c, 1) libzpool.so.1`zap_lockdir+0x1a0(8aaa3c0, 1, 0, 97cb780, 1, 1) libzpool.so.1`zap_remove_norm+0x30(8aaa3c0, 1, 0, 8728b10, 0, 97cb780) libzpool.so.1`zap_remove+0x29(8aaa3c0, 1, 0, 8728b10, 97cb780, a) ztest_replay_remove+0x225(ea294588, 8728ae8, 0, 38010000, 0, 0) ztest_remove+0x9f(ea294588, ea293f50, 4, 3) ztest_object_init+0x78(ea294588, ea293f50, 4e0, 1) ztest_dmu_object_alloc_free+0x71(ea294588, 13) ztest_dmu_objset_create_destroy+0x224(80cef08, 13, 0, 805d36c, 9017ad44, 0) ztest_execute+0x89(a, 807c720, 13, 0) ztest_thread+0xea(13, 0, 0, 0) libc.so.1`_thrp_setup+0x88(f0983240) libc.so.1`_lwp_start(f0983240, 0, 0, 0, 0, 0) Looking into it a bit, we see that this is an embedded blockpointer, so BP_GET_NDVAS should have returned 0: b32b240::blkptr EMBEDDED [L0 ZAP_OTHER] et=0 LZ4 size=200L/4aP birth=80L Instead, it looks like another thread is modifying this blockpointer: b32b240::ugrep \| ::whatis f47a0e0c is in [ stack tid=0x19f ] ebd6ec40 is in [ stack tid=0x226 ] ea293bd0 is in [ stack tid=0x244 ] ea293be4 is in [ stack tid=0x244 ] Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com>	2016-09-03 08:43:43 +00:00
Alexander Motin	84c3781ac9	MFV r303077: 7072 zfs fails to expand if lun added when os is in shutdown state illumos/illumos-gate@c39a2aae1e `c39a2aae1e` https://www.illumos.org/issues/7072 upstream: 38733 zfs fails to expand if lun added when os is in shutdown state DLPX-36910 spares and caches should not display expandable space DLPX-39262 vdev_disk_open spam zfs_dbgmsg buffer Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: George Wilson <george.wilson@delphix.com>	2016-09-03 08:42:12 +00:00
Alexander Motin	efa0867fb0	MFV r302991: 6950 ARC should cache compressed data illumos/illumos-gate@dcbf3bd6a1 `dcbf3bd6a1` https://www.illumos.org/issues/6950 When reading compressed data from disk, the ARC should keep the compressed block cached and only decompress it when consumers access the block. The uncompressed data should be short-lived allowing the ARC to cache a much larger amount of data. The DMU would also maintain a smaller cache of uncompressed blocks to minimize the impact of decompressing frequently accessed blocks. Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Don Brady <don.brady@intel.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: George Wilson <george.wilson@delphix.com>	2016-09-03 08:30:51 +00:00
Alexander Motin	c543b519be	MFV r304158: 7136 ESC_VDEV_REMOVE_AUX ought to always include vdev information 7115 6922 generates ESC_ZFS_VDEV_REMOVE_AUX a bit too often illumos/illumos-gate@b72b6bb10a https://github.com/illumos/illumos-gate/commit/b72b6bb10ad55121a1b352c6f68ebdc8e 20c9086 https://www.illumos.org/issues/7136 6922 added ESC_ZFS_VDEV_REMOVE_AUX and ESC_ZFS_VDEV_REMOVE_DEV sysevents whenever an aux device gets removed from a pool. However, those sysevents will be created without the vdev_guid and vdev_path fields. It would be better to always populate those fields. https://www.illumos.org/issues/7115 The addition of spa_event_notify in vdev removal code (see #6922) causes event s to be generated even if the spare failed to be removed with EBUSY. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Approved by: Robert Mustacchi <rm@joyent.com> Author: Alan Somers <asomers@gmail.com>	2016-09-01 18:37:11 +00:00
Alexander Motin	25584d12e7	MFV r302993: 7104 increase indirect block size illumos/illumos-gate@4b5c8e93ca https://github.com/illumos/illumos-gate/commit/4b5c8e93cab28d3c65ba9d407fd8f46e3 be1db1c https://www.illumos.org/issues/7104 The current default indirect block size is 16KB. We can improve performance by increasing it to 128KB. This is especially helpful for any workload that needs to read most of the metadata, e.g. scrub/resilver, file deletion, filesystem deletion, and zfs send. We also need to fix a few space estimation errors to make the tests pass. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com>	2016-09-01 18:33:39 +00:00
Alexander Motin	dd7f7cb7ac	MFV r302992: 7071 lzc_snapshot does not fill in errlist on ENOENT illumos/illumos-gate@25f7d993ad https://github.com/illumos/illumos-gate/commit/25f7d993adbfb3452ac4625b379167074 6d35ae3 https://www.illumos.org/issues/7071 upstream DLPX-40482 lzc_snapshot does not fill in errlist on ENOENT Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com>	2016-09-01 18:25:49 +00:00
Alexander Motin	3d1e0e0830	MFV r302662: 6447 handful of nvpair cleanups illumos/illumos-gate@759e89be35 https://github.com/illumos/illumos-gate/commit/759e89be359f2af635e4122d147df56bc e948773 https://www.illumos.org/issues/6447 I got a patch from someone who uses nvpair code outside of illumos. It fixes a couple of gcc warnings/bugs for him. 1. silence uninitialized use warnings 2. add parentheses around assignment used as truth value 3. fix printf format specifier (ll is for integers only) 4. strstr, strspn, strcspn, and strcmp are declared in string.h, not strings.h. 5. avoid scanning integer into boolean variable Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Reviewed by: Andy Stormont <astormont@racktopsystems.com> Reviewed by: Garrett D'Amore <garrett@damore.org> Approved by: Robert Mustacchi <rm@joyent.com> Author: Steve Dougherty <sdougherty@barracuda.com>	2016-09-01 15:17:39 +00:00
Alexander Motin	3421688c2d	MFV r302661: 7082 bptree_iterate() passes wrong args to zfs_dbgmsg() illumos/illumos-gate@10e67aa0db https://github.com/illumos/illumos-gate/commit/10e67aa0db0823d5464aafdd681f3c966 155c68e https://www.illumos.org/issues/7082 upstream DLPX-40542 bptree_iterate() passes wrong args to zfs_dbgmsg() Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Matthew Ahrens <mahrens@delphix.com>	2016-09-01 15:10:40 +00:00
Alexander Motin	41b9077ef6	MFV r302660: 6314 buffer overflow in dsl_dataset_name illumos/illumos-gate@9adfa60d48 https://github.com/illumos/illumos-gate/commit/9adfa60d484ce2435f5af77cc99dcd4e6 92b6660 https://www.illumos.org/issues/6314 Callers of dsl_dataset_name pass a buffer of size ZFS_MAXNAMELEN, but dsl_dataset_name copies the datasets' name PLUS the snapshot name to it, resulting in a max of 2 * ZFS_MAXNAMELEN + '@'. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Matthew Ahrens <mahrens@delphix.com>	2016-09-01 15:08:27 +00:00
Alexander Motin	e12a269749	MFV r302659: 6931 lib/libzfs: cleanup gcc warnings illumos/illumos-gate@88f61dee20 `88f61dee20` https://www.illumos.org/issues/6931 need cleanup: CERRWARN += -_gcc=-Wno-switch CERRWARN += -_gcc=-Wno-parentheses CERRWARN += -_gcc=-Wno-unused-function Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Igor Kozhukhov <ikozhukhov@gmail.com>	2016-09-01 14:57:06 +00:00
Alexander Motin	a95a9fe945	MFV r302651: 7054 dmu_tx_hold_t should use refcount_t to track space illumos/illumos-gate@0c779ad424 https://github.com/illumos/illumos-gate/commit/0c779ad424a92a84d1e07d47cab7f8009 189202b https://www.illumos.org/issues/7054 upstream: ee0003de7d3e598499be7ac3fe6b61efcc47cb7f DLPX-40399 dmu_tx_hold_t should use refcount_t to track space Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Matthew Ahrens <mahrens@delphix.com>	2016-09-01 14:38:25 +00:00
Alexander Motin	96bf48b8cb	MFV r302648: 7019 zfsdev_ioctl skips secpolicy when FKIOCTL is set Note that the bulk of the upstream change is not applicable to FreeBSD and the affected files are not even in the vendor area. illumos/illumos-gate@45b1747515 `45b1747515` https://www.illumos.org/issues/7019 Currently zfsdev_ioctl, when confronted by a request with the FKIOCTL flag set, skips all processing of secpolicy functions. This means that ZFS is not doing any kind of verification of the credentials or access rights of the caller and assuming that (as it is an in-kernel client) all such checks have already been done. This turns out to be quite a dangerous assumption, especially with respect to sdev. In general I don't think it's particularly reasonable to offload this enforcement of access rights onto other kernel subsystems when ZFS has some particular local semantics in this area (delegated datasets etc) and does not provide any kind of API to allow other subsystems to avoid code duplication when doing it. ZFS should apply its normal access policy to requests from within the kernel, and callers should take care to give it the correct credentials and call it from the correct context in order to get the results they need. You can observe the currently unfortunate consequences of this bug in any non- global zone that has access to /dev/zvol or any subset of it via sdev profiles. In particular, a zone used to contain a KVM or similar which has a single zvol passed through to it using a <device match= block in its zone XML. Even though sdev makes something of an attempt to control for whether the caller should have access to nodes in /dev/zvol, it doesn't do this correctly, or really at all in the lookup call path. So, if we have a zone that's been given access to any part of /dev/zvol, it can simply look up the full path to any other zvol on the entire system, and the node will appear and be able to be used. Reviewed by: Robert Mustacchi <rm@joyent.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Alex Wilson <alex.wilson@joyent.com>	2016-09-01 14:24:54 +00:00
Alexander Motin	13876b47d7	MFV r302647: 6922 Emit ESC_ZFS_VDEV_REMOVE_AUX after removing an aux device illumos/illumos-gate@63364b0ee2 https://github.com/illumos/illumos-gate/commit/63364b0ee2604783e7a55f84258888677 68eafa4 https://www.illumos.org/issues/6922 ZFS does not do a config_sync after removing an aux (spare, log, or cache) device. AFAICT this isn't being done because it is slow and was deemed unnecessary. However, it should be such a rare operation that speed doesn't matter, and not doing it results in two problems: 1) It is theoretically possible to remove an aux device from one pool and attach it to another, then lose power. When power is restored, both pools woul d think that they own the aux device. 2) Removal of the aux device doesn't send any useful sysevents to userland. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Alan Somers <asomers@gmail.com>	2016-09-01 14:17:30 +00:00
Alexander Motin	1c7d88abed	MFV r302646: 6980 6902 causes zfs send to break due to 32-bit/64-bit struct mismatch illumos/illumos-gate@ea4a67f462 https://github.com/illumos/illumos-gate/commit/ea4a67f462de0a39a9adea8197bcdef84 9de5371 https://www.illumos.org/issues/6980 doing zfs send -i snap1 snap2 >testfile results in internal error: Invalid argument Abort (core dumped) Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com>	2016-09-01 14:06:30 +00:00
Alexander Motin	4536fd9bed	MFV r302643: 6902 speed up listing of snapshots if requesting name only and sorting by name This was our change from the beginning, so just reduce the upstream diff.	2016-09-01 13:29:53 +00:00
Alexander Motin	5fd28943d6	MFV r302642: 6876 Stack corruption after importing a pool with a too-long name illumos/illumos-gate@c971037baa `c971037baa` https://www.illumos.org/issues/6876 Calling dsl_dataset_name on a dataset with a 256 byte buffer is asking for trouble. We should check every dataset on import, using a 1024 byte buffer and checking each time to see if the dataset's new name is longer than 256 bytes. Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Paul Dagnelie <pcd@delphix.com>	2016-09-01 13:04:36 +00:00
Alexander Motin	9007a8679a	Fix kernel panic when inheriting properties without default. There are two writable hidden properties "iscsioptions" and "stmf_sbd_lu", that have no default string value. Attempt to unset them or replicate caused kernel panic. This simple bandaid seems fixes the problem nicely. MFC after: 2 weeks	2016-08-31 11:55:31 +00:00
Toomas Soome	f1624ed8c4	Bug 212114 - loader: zio_checksum_verify() must test spa for NULL pointer The issue was introduced with adding support for salted checksums, and was revealed by bhyve userboot.so. During pool discovery the loader is reading pool label from disks, and at that time the spa structure is not yet set up, so the NULL pointer is passed for spa. This condition must be checked to avoid the corruption of the memory and NULL pointer dereference. PR: 212114 Reported by: tsoome@freebsd.com Reviewed by: allanjude Approved by: allanjude (mentor) Differential Revision: https://reviews.freebsd.org/D7634	2016-08-24 16:30:15 +00:00
Toomas Soome	2c55d0903d	Add SHA512, skein, large blocks support for loader zfs. Updated sha512 from illumos. Using skein from freebsd crypto tree. Since loader itself is using 64MB memory for heap, updated zfsboot to use same, and this also allows to support zfs large blocks. Note, adding additional features does increate zfsboot code, therefore this update does increase zfsboot code to 128k, also I have ported gptldr.S update to zfsldr.S to support 64k+ code. With this update, boot1.efi has almost reached the current limit of the size set for it, so one of the future patches for boot1.efi will need to increase the limit. Currently known missing zfs features in boot loader are edonr and gzip support. Reviewed by: delphij, imp Approved by: imp (mentor) Obtained from: sha256.c update and skein_zfs.c stub from illumos. Differential Revision: https://reviews.freebsd.org/D7418	2016-08-18 00:37:07 +00:00
Mark Johnston	59ceeddecf	MFV r301526: 7035 string-related subroutines should validate input earlier Reviewed by: Alex Wilson <alex.wilson@joyent.com> Reviewed by: Bryan Cantrill <bryan@joyent.com> Approved by: Matthew Ahrens <mahrens@delphix.com> Author: Patrick Mooney <pmooney@pfmooney.com> illumos/illumos-gate@771e39c3b1 MFC after: 2 weeks	2016-08-16 02:25:19 +00:00
Mark Johnston	f66200ee22	MFV r301525: 7033 ustack helper should fault on bad return values Reviewed by: Patrick Mooney <patrick.mooney@joyent.com> Reviewed by: Bryan Cantrill <bryan@joyent.com> Approved by: Matthew Ahrens <mahrens@delphix.com> Author: Alex Wilson <alex.wilson@joyent.com> illumos/illumos-gate@a2f72b65eb MFC after: 2 weeks	2016-08-16 02:20:02 +00:00
Mark Johnston	4aea8f31b1	MFV r301524: 7034 negative record sizes should be rejected Reviewed by: Patrick Mooney <patrick.mooney@joyent.com> Reviewed by: Bryan Cantrill <bryan@joyent.com> Approved by: Matthew Ahrens <mahrens@delphix.com> Author: Alex Wilson <alex.wilson@joyent.com> illumos/illumos-gate@0b8049bfb0 MFC after: 2 weeks	2016-08-16 02:18:34 +00:00
Mark Johnston	b7125fa9cd	MFV r296989: 6734 dtrace_canstore_statvar() fails for some valid static variables Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Bryan Cantrill <bryan@joyent.com> illumos/illumos-gate@d65f2bb4e5 MFC after: 2 weeks	2016-08-16 02:16:54 +00:00
Andriy Gapon	96762fe314	fix a zfs cross-device rename crash introduced in r303763 The problem was that 'zfsvfs' variable was not initialized if the error was detected, but in the exit path the variable was dereferenced before the error code was checked. Reported by: np MFC after: 3 days X-MFC with: r303763	2016-08-09 06:11:24 +00:00
Justin Hibbits	161c415133	Two fixups for dtrace * Use the right incantation to get the next stack pointer. Since powerpc uses special frames for traps, dereferencing the stack pointer straight up won't get us the next stack pointer in every case. * Clear EE using the correct instruction sequence. The PowerISA states that 'andi.' ANDs the register with 0\|\|<imm>, instead of sign extending or filling out the unavailable bits with 1. Even if it did sign extend, PSL_EE is 0x8000, so ~PSL_EE is 0x7fff, and the upper bits would be cleared. Use rlwinm in the 32-bit case, and a two-rotate sequence in the 64-bit case, the latter chosen to follow the output generated by gcc. MFC after: 1 week	2016-08-06 15:06:19 +00:00
Andriy Gapon	4fb51b52ef	fix .zfs-related cases in zfs_lookup that were broken by r303763 The problem is that the special .zfs nodes are not represented by znodes but by special gfs-based nodes. r303763 changed interface of zfs_dirlook such that started operating on znodes rather than on vnodes and, thus, the function became unsuitable for handling .zfs entities. The solution is to move the handling of the special cases to zfs_lookup, the only consumer of zfs_dirlook. I already had this solution applied in D7421, but for different reasons. Reported by: asomers MFC after: 3 days X-MFC with: r303763	2016-08-06 11:02:07 +00:00
Andriy Gapon	f79bc17233	zfs: honour and make use of vfs vnode locking protocol ZFS POSIX Layer is originally written for Solaris VFS which is very different from FreeBSD VFS. Most importantly many things that FreeBSD VFS manages on behalf of all filesystems are implemented in ZPL in a different way. Thus, ZPL contains code that is redundant on FreeBSD or duplicates VFS functionality or, in the worst cases, badly interacts / interferes with VFS. The most prominent problem is a deadlock caused by the lock order reversal of vnode locks that may happen with concurrent zfs_rename() and lookup(). The deadlock is a result of zfs_rename() not observing the vnode locking contract expected by VFS. This commit removes all ZPL internal locking that protects parent-child relationships of filesystem nodes. These relationships are protected by vnode locks and the code is changed to take advantage of that fact and to properly interact with VFS. Removal of the internal locking allowed all ZPL dmu_tx_assign calls to use TXG_WAIT mode. Another victim, disputable perhaps, is ZFS support for filesystems with mixed case sensitivity. That support is not provided by the OS anyway, so in ZFS it was a buch of dead code. To do: - replace ZFS_ENTER mechanism with VFS managed / visible mechanism - replace zfs_zget with zfs_vget[f] as much as possible - get rid of not really useful now zfs_freebsd_* adapters - more cleanups of unneeded / unused code - fix / replace .zfs support PR: 209158 Reported by: many Tested by: many (thank you all!) MFC after: 5 days Sponsored by: HybridCluster / ClusterHQ Differential Revision: https://reviews.freebsd.org/D6533	2016-08-05 06:23:06 +00:00
Ruslan Bukin	98f50c44e3	Update RISC-V port to Privileged Architecture Version 1.9. Sponsored by: DARPA, AFRL Sponsored by: HEIF5	2016-08-02 14:50:14 +00:00
Allan Jude	4deb8929ea	Make boot code and loader check for unsupported ZFS feature flags OpenZFS uses feature flags instead of a zpool version number to track features since the split from Oracle. In addition to avoiding confusion on ZFS vs OpenZFS version numbers, this also allows features to be added to different operating systems that use OpenZFS in different order. The previous zfs boot code (gptzfsboot) and loader (zfsloader) blindly tries to read the pool, and if failed provided only a vague error message. With this change, both the boot code and loader check the MOS features list in the ZFS label and compare it against the list of features that the loader supports. If any unsupported feature is active, the pool is not considered as a candidate for booting, and a helpful diagnostic message is printed to the screen. Features that are merely enabled via zpool upgrade, but not in use, do not block booting from the pool. Submitted by: Toomas Soome <tsoome@me.com> Reviewed by: delphij, mav Relnotes: yes Differential Revision: https://reviews.freebsd.org/D6857	2016-08-01 19:37:43 +00:00
Enji Cooper	4fcae4df7e	Conditionalize code which defines sysctls per _KERNEL #ifdef guard This resolves several issues when compiling libzpool (userspace library), i.e. -Wimplicit-function-declaration and -Wmissing-declarations issues. MFC after: 2 weeks Reported by: clang Tested with: clang 3.8.1, gcc 4.2.1, gcc 5.3.0 Sponsored by: EMC / Isilon Storage Division	2016-07-31 06:34:49 +00:00
Mark Johnston	57185c52de	Restore an ifdef that should not have been removed in r303535. X-MFC-With: r303535	2016-07-30 07:05:32 +00:00
Mark Johnston	6d1ffb50fc	Include fasttrap handling for DATAMODEL_ILP32 when compiling for amd64. MFC after: 1 month	2016-07-30 03:11:53 +00:00
Ruslan Bukin	573d53050e	Remove unused variables.	2016-07-29 12:29:17 +00:00
Mark Johnston	e86e17af79	Merge {amd64,i386}/instr_size.c into x86_instr_size.c. Also reduce the diff between us and upstream: the input data model will always be DATAMODEL_NATIVE because of a bug (p_model is never set but is always initialized to 0), so we don't need to override the caller anyway. This change is also necessary to support the pid provider for 32-bit processes on amd64. MFC after: 2 weeks	2016-07-20 00:02:10 +00:00
Andriy Gapon	70e3da3892	MFV r302645: 6878 Add scrub completion info to "zpool history" illumos/illumos-gate@1825bc56e5 `1825bc56e5` https://www.illumos.org/issues/6878 Summary of changes: * Replace generic "scan done" message with "scan aborted, restarting", "scan cancelled", or "scan done" * Log number of errors using spa_get_errlog_size * Refactor scan restarting check into static function Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Nav Ravindranath <nav@delphix.com> MFC after: 2 weeks	2016-07-14 11:53:39 +00:00
Andriy Gapon	39a6b17491	MFV r302650: 6940 Cannot unlink directories when over quota illumos/illumos-gate@99189164df `99189164df` https://www.illumos.org/issues/6940 Similar to #6334, but this time with empty directories: $ zfs create tank/quota $ zfs set quota=10M tank/quota $ zfs snapshot tank/quota@snap1 $ zfs set mountpoint=/mnt/tank/quota tank/quota $ mkdir /mnt/tank/quota/dir # create an empty directory $ mkfile 11M /mnt/tank/quota/11M /mnt/tank/quota/11M: initialized 9830400 of 11534336 bytes: Disc quota exceeded $ rmdir /mnt/tank/quota/dir # now unlink the empty directory rmdir: directory "/mnt/tank/quota/dir": Disc quota exceeded From user perspective, I would expect that ZFS is always able to remove files and directories even when the quota is exceeded. Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Simon Klinkert <simon.klinkert@gmail.com> MFC after: 2 weeks	2016-07-14 11:51:01 +00:00
Andriy Gapon	fe0cc75230	MFV r302644: 6513 partially filled holes lose birth time illumos/illumos-gate@8df0bcf0df `8df0bcf0df` https://www.illumos.org/issues/6513 If a ZFS object contains a hole at level one, and then a data block is created at level 0 underneath that l1 block, l0 holes will be created. However, these l0 holes do not have the birth time property set; as a result, incremental sends will not send those holes. Fix is to modify the dbuf_read code to fill in birth time data. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Boris Protopopov <bprotopopov@hotmail.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Paul Dagnelie <pcd@delphix.com> MFC after: 3 weeks	2016-07-14 11:48:42 +00:00
Andriy Gapon	e7ed92bbbc	MFV r302641: 6844 dnode_next_offset can detect fictional holes illumos/illumos-gate@11ceac77ea `11ceac77ea` https://www.illumos.org/issues/6844 dnode_next_offset is used in a variety of places to iterate over the holes or allocated blocks in a dnode. It operates under the premise that it can iterate over the blockpointers of a dnode in open context while holding only the dn_struct_rwlock as reader. Unfortunately, this premise does not hold. When we create the zio for a dbuf, we pass in the actual block pointer in the indirect block above that dbuf. When we later zero the bp in zio_write_compress, we are directly modifying the bp. The state of the bp is now inconsistent from the perspective of dnode_next_offset: the bp will appear to be a hole until zio_dva_allocate finally finishes filling it in. In the meantime, dnode_next_offset can detect a hole in the dnode when none exists. I was able to experimentally demonstrate this behavior with the following setup: 1. Create a file with 1 million dbufs. 2. Create a thread that randomly dirties L2 blocks by writing to the first L0 block under them. 3. Observe dnode_next_offset, waiting for it to skip over a hole in the middle of a file. 4. Do dnode_next_offset in a loop until we skip over such a non-existent hole. The fix is to ensure that it is valid to iterate over the indirect blocks in a dnode while holding the dn_struct_rwlock by passing the zio a copy of the BP and updating the actual BP in dbuf_write_ready while holding the lock. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Boris Protopopov <bprotopopov@hotmail.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Alex Reece <alex@delphix.com> MFC after: 3 weeks	2016-07-14 11:42:53 +00:00
Andriy Gapon	875e6e5b04	MFV r302640: 6874 rollback and receive need to reset ZPL state to what's on disk illumos/illumos-gate@1fdcbd00c9 `1fdcbd00c9` https://www.illumos.org/issues/6874 When we do a clone swap (caused by "zfs rollback" or "zfs receive"), the ZPL doesn't completely reload the state from the DMU; some values remain cached in the zfsvfs_t. steps to reproduce: ``` #!/bin/bash -x zfs destroy -R test/fs zfs destroy -R test/recvd zfs create test/fs zfs snapshot test/fs@a zfs set userquota@$USER=1m test/fs zfs snapshot test/fs@b zfs send test/fs@a \| zfs recv test/recvd zfs send -i @a test/fs@b \| zfs recv test/recvd zfs userspace test/recvd 1. should show 1m quota dd if=/dev/urandom of=/test/recvd/file bs=1k count=1024 sync dd if=/dev/urandom of=/test/recvd/file2 bs=1k count=1024 2. should fail with ENOSPC sync zfs unmount test/recvd zfs mount test/recvd zfs userspace test/recvd 3. if bug above, now shows 1m quota dd if=/dev/urandom of=/test/recvd/file3 bs=1k count=1024 4. if bug above, now fails with ENOSPC ``` Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> Author: Matthew Ahrens <mahrens@delphix.com> MFC after: 3 weeks	2016-07-14 11:39:36 +00:00
Andriy Gapon	ac3623e090	re-apply r299908: zfsctl_snapdir_lookup: clear VV_ROOT of snapshot's root The change has been undone in r301275 on the assumption that it was no longer required. But that was incorrect, because in this case (and only in this case) the snapshot root vnode is looked up before z_parent is fixed up. MFC after: 5 days	2016-07-13 15:16:51 +00:00
Mark Johnston	ca1ef36cf4	Avoid truncating the return value of DTrace predicates. Predicates are DIF objects whose return value is compared with zero to determine whether the corresponding probe body is to be executed. The return value itself is the contents of a 64-bit DIF register, but it was being truncated to an int before the comparison. This meant that a predicate such as /0x100000000/ would evaluate to false. Reported by: rwatson MFC after: 3 days	2016-07-09 22:41:21 +00:00
Steven Hartland	ae8420ed72	Fix ZFS ARC min / max tunable Due to ARC initial configuration not being done and kmem information not being available we need to blindly set zfs_arc_max and zfs_arc_min when configured via the tunable. This fixes vfs.zfs.arc_(min\|max) configuration via loader.conf broken by r302265. Approved by: re(gjb) MFC after: 1 week	2016-07-06 23:49:19 +00:00
Nathan Whitehorn	96c85efb4b	Replace a number of conflations of mp_ncpus and mp_maxid with either mp_maxid or CPU_FOREACH() as appropriate. This fixes a number of places in the kernel that assumed CPU IDs are dense in [0, mp_ncpus) and would try, for example, to run tasks on CPUs that did not exist or to allocate too few buffers on systems with sparse CPU IDs in which there are holes in the range and mp_maxid > mp_ncpus. Such circumstances generally occur on systems with SMT, but on which SMT is disabled. This patch restores system operation at least on POWER8 systems configured in this way. There are a number of other places in the kernel with potential problems in these situations, but where sparse CPU IDs are not currently known to occur, mostly in the ARM machine-dependent code. These will be fixed in a follow-up commit after the stable/11 branch. PR: kern/210106 Reviewed by: jhb Approved by: re (glebius)	2016-07-06 14:09:49 +00:00
Alexander Motin	e36599916f	Revert r299454 and r299448. Those changes were found confusing FreeBSD libc ACL code, that doesn't differentiate ACL for directories and files, and report ACLs for all directories created after those patches as non-trivial. On the other side these changes were considered wrong from POSIX and NFSv4 points of view. Until further investigation done upstream, revert those changes locally in preparation for FreeBSD 11.0 release. Approved by: re (hrs)	2016-06-30 14:55:49 +00:00
Steven Hartland	f535b2d7d8	Allow ZFS ARC min / max to be tuned at runtime Prior to this change ZFS ARC min / max could only be changed using boot time tunables, this allows the values to be tuned at runtime using the sysctls: * vfs.zfs.arc_max * vfs.zfs.arc_min When adjusting ZFS ARC minimum the memory used will only reduce to the new minimum given memory pressure. Reviewed by: allanjude Approved by: re (gjb) MFC after: 2 weeks Relnotes: yes Sponsored by: Multiplay Differential Revision: https://reviews.freebsd.org/D5907	2016-06-29 07:55:45 +00:00
Andriy Gapon	0f7dcde977	fix deadlock-prone code in getzfsvfs() getzfsvfs() called vfs_busy() in the waiting mode while having a hold on a pool (via a call to dmu_objset_hold). In other words, dp_config_rwlock was held in the shared mode while a thread could be sleeping in vfs_busy(). The pool's txg sync thread needs to take dp_config_rwlock in the exclusive mode for some actions, e.g., for executing sync tasks. If the sync thread gets blocked, then any thread waiting for its sync task to get executed is also blocked. Which, in turn, could mean that vfs_busy() will keep waiting indefinitely. The solution is to use vfs_ref() in the locked section and to call vfs_busy() only after dropping other locks. Note that a reference on a struct mount object does not prevent an associated zfsvfs_t object from being destroyed. So, we have to be careful to operate only on the struct mount object until we successfully vfs_busy it. Approved by: re (gjb) MFC after: 2 weeks	2016-06-23 07:01:54 +00:00
Alan Somers	54edbcfb69	Fix uninitialized variable from r300881 sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c Initialize needs_update in vdev_geom_set_physpath PR: 210409 Reported by: kp Reviewed by: kp Approved by: re (hrs) MFC after: 4 weeks X-MFC-With: 300881 Sponsored by: Spectra Logic Corp	2016-06-21 15:27:16 +00:00
Konstantin Belousov	cacbedfc46	Fix gcc build. Reported andt tested by: swills Sponsored by: The FreeBSD Foundation Approved by: re (gjb)	2016-06-18 20:20:00 +00:00
Konstantin Belousov	4a0d95f810	Use vnlru_free(9) to implement dnlc_reduce_cache(). This apparently puts ARC back under the limits after the vnode pressure rework in r291244, in particular due to the kmem exhaustion. Based on patch by: mckusick Reviewed by: avg, mckusick Tested by: allanjude, madpilot Sponsored by: The FreeBSD Foundation Approved by: re (gjb)	2016-06-17 17:34:28 +00:00
Andriy Gapon	55be2f79e5	l2arc: reset b_tmp_cdata to NULL in the case of unset b_daddr The change is in arc_buf_l2_cdata_free(). Without this we can trip the assertion in arc_hdr_realloc() if INVARIANTS option is enabled. Approved by: re (kib) MFC after: 1 week	2016-06-13 18:39:13 +00:00
Andriy Gapon	aa14503e8e	zfs_vptocnp: check for an invalid znode ... which can arise after the receive or rollback and failed zfs_rezget(). Approved by: re (kib) MFC after: 1 week	2016-06-13 10:53:34 +00:00
Andriy Gapon	a68789426a	zfs: set VROOT / VV_ROOT consistently and in a single place This is a followup to r300131. A filesystem's root vnode can be reached not only through VSF_ROOT, but by other means as well. For example, via a dot-dot lookup. Also, a root vnode can get reclaimed and then re-created. For these reasons it was insufficient to clear VV_ROOT flag from a root vnode of a snapshot mounted under .zfs in zfsctl_snapdir_lookup(). So, now we set the flag in zfs_znode_sa_init() only if a vnode represent a root of a filesystem or a standalone snapshot. That is, the flag is not set for snapshots mounted under .zfs. MFC after: 2 weeks	2016-06-03 14:37:18 +00:00
Andriy Gapon	d1cf30f4f1	zfs_root: fix a potential root vnode reference leak It could happen in an unlikely case that we fail to lock the root vnode with requested flags (which appear to never include LK_NOWAIT). MFC after: 1 week	2016-06-03 14:22:12 +00:00
Alan Somers	b1a6b8dcd2	Improve the English in a comment sys/cddl/contrib/opensolaris/uts/common/sys/acl.h: Improve the english in a comment. No functional changes Submitted by: gibbs MFC after: 4 weeks Sponsored by: Spectra Logic Corp	2016-06-01 22:21:42 +00:00
Andrew Turner	1cb290d2f2	Set oldfp so the check for fp == oldfp works as expected. Obtained from: ABT Systems Ltd Sponsored by: The FreeBSD Foundation	2016-05-31 11:32:09 +00:00
Allan Jude	0144ad3e78	Connect the SHA-512t256 and Skein hashing algorithms to ZFS Support for the new hashing algorithms in ZFS was introduced in r289422 However it was disconnected because FreeBSD lacked implementations of SHA-512 (truncated to 256 bits), and Skein. These implementations were introduced in r300921 and r300966 respectively This commit connects them to ZFS and enabled these new checksum algorithms This new algorithms are not supported by the boot blocks, so do not use them on your root dataset if you boot from ZFS. Relnotes: yes Sponsored by: ScaleEngine Inc.	2016-05-31 04:12:14 +00:00
Bryan Drewery	fdd9048a63	Avoid more literal-suffix errors with C++11	2016-05-29 00:40:29 +00:00
Alan Somers	7a0c41d5d7	zfsd(8), the ZFS fault management daemon Add zfsd, which deals with hard drive faults in ZFS pools. It manages hotspares and replements in drive slots that publish physical paths. cddl/usr.sbin/zfsd Add zfsd(8) and its unit tests cddl/usr.sbin/Makefile Add zfsd to the build lib/libdevdctl A C++ library that helps devd clients process events lib/Makefile share/mk/bsd.libnames.mk share/mk/src.libnames.mk Add libdevdctl to the build. It's a private library, unusable by out-of-tree software. etc/defaults/rc.conf By default, set zfsd_enable to NO etc/mtree/BSD.include.dist Add a directory for libdevdctl's include files etc/mtree/BSD.tests.dist Add a directory for zfsd's unit tests etc/mtree/BSD.var.dist Add /var/db/zfsd/cases, where zfsd stores case files while it's shut down. etc/rc.d/Makefile etc/rc.d/zfsd Add zfsd's rc script sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c Fix the resource.fs.zfs.statechange message. It had a number of problems: It was only being emitted on a transition to the HEALTHY state. That made it impossible for zfsd to take actions based on drives getting sicker. It compared the new state to vdev_prevstate, which is the state that the vdev had the last time it was opened. That doesn't make sense, because a vdev can change state multiple times without being reopened. vdev_set_state contains logic that will change the device's new state based on various conditions. However, the statechange event was being posted _before_ that logic took effect. Now it's being posted after. Submitted by: gibbs, asomers, mav, allanjude Reviewed by: mav, delphij Relnotes: yes Sponsored by: Spectra Logic Corp, iX Systems Differential Revision: https://reviews.freebsd.org/D6564	2016-05-28 17:43:40 +00:00
Enji Cooper	dfdbdb0c82	Fix up r300870 The sys/types.h fix I proposed was only tested with zfs(4), not with libzpool, which is where the build failure actually existed Remove vm/vm_pageout.h from arc.c and zfs_vnops.c because they're both unneeded MFC after: 1 week X-MFC with: r300865, r300870 In collaboration with: kib Submitted by: alc Sponsored by: EMC / Isilon Storage Division	2016-05-27 22:56:00 +00:00
Alan Somers	151746b244	Avoid issuing spa config updates for physical path when not necessary ZFS's configuration needs to be updated whenever the physical path for a device changes, but not when a new device is introduced. This is because new devices necessarily cause config updates, but only if they are actually accepted into the pool. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c Split vdev_geom_set_physpath out of vdev_geom_attrchanged. When setting the vdev's physical path, only request a config update if the physical path has changed. Don't request it when opening a device for the first time, because the config sync will happen anyway upstack. sys/geom/geom_dev.c Split g_dev_set_physpath and g_dev_set_media out of g_dev_attrchanged Submitted by: will, asomers MFC after: 4 weeks Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D6428	2016-05-27 22:32:44 +00:00
Enji Cooper	765daefd68	Unbreak the zfs(4) build vm/vm_pageout.h grew a dependency on the bool typedef in r300865 arc.c didn't include sys/types.h, which included the definition for the typedef Other items (ofed, drm2) might need to be chased for this commit. X-MFC with: r300865 MFC after: 1 week Pointyhat to: alc Sponsored by: EMC / Isilon Storage Division	2016-05-27 20:33:38 +00:00
Ruslan Bukin	bdca9b1d52	Correct the implementation of dtrace_interrupt_disable/enable. Pointed out by: andrew Sponsored by: DARPA, AFRL Sponsored by: HEIF5	2016-05-27 17:58:10 +00:00
Andrew Turner	12aac6b55b	Fix dtrace_interrupt_disable and dtrace_interrupt_enable by having the former return the current status for the latter to use. Without this we could enable interrupts when they shouldn't be. It's still not quite right as it should only update the bits we care about, bit should be good enough until the correct fix can be tested. PR: 204270 Obtained from: ABT Systems Ltd Sponsored by: The FreeBSD Foundation	2016-05-27 12:02:12 +00:00
Bryan Drewery	74c5734766	Use netinet/in.h to avoid include/arpa dependency for DIRDEPS_BUILD. Sponsored by: EMC / Isilon Storage Division	2016-05-26 23:20:17 +00:00
Bjoern A. Zeeb	9c759b587f	Try to unbreak the build after r300611 by including the header defining VM_MIN_KERNEL_ADDRESS. Sponsored by: DARPA/AFRL	2016-05-24 17:38:27 +00:00
Ruslan Bukin	fed1ca4b71	Add initial DTrace support for RISC-V. Sponsored by: DARPA, AFRL Sponsored by: HEIF5	2016-05-24 16:41:37 +00:00
Andrew Turner	0d0da76911	Mark all memory before the kernel as toxic to DTrace. Obtained from: ABT Systems Ltd Sponsored by: The FreeBSD Foundation	2016-05-24 13:57:23 +00:00
Andriy Gapon	fabe7e4ecc	add vop_print methods to vnode operatios of various zfsctl node types This should help with diagnostics of zfsctl problems. MFC after: 2 weeks	2016-05-18 13:21:29 +00:00
Andriy Gapon	e34c8d727b	move zfsctl_freebsd_root_lookup right next to zfsctl_root_lookup That makes it easier to reason about the code. MFC after: 5 weeks	2016-05-18 08:29:39 +00:00
Andriy Gapon	a4bbed22d2	zfsctl_common_fid: remove redundant assignment "Reinterpret cast" to zfid_short_t and assignment of zf_len do the job already. MFC after: 1 week	2016-05-18 08:26:09 +00:00
Andriy Gapon	e6d4eefe2a	zfsctl: tighten an assertion and remove an unused definition There are only two entries under .zfs and 'shares' has an ID of a special persistent object in its filesystem. MFC after: 1 week	2016-05-18 08:23:39 +00:00
Andriy Gapon	439e9b6804	zfs_root: no need to set the root flag here That was both redundant as zfs_znode_sa_init() already does the job and insufficient as the root vnode can be reached via other means. MFC after: 1 weeks	2016-05-18 08:19:41 +00:00
Andriy Gapon	74a3df2b1f	zfsctl_freebsd_root_lookup: gfs_vop_lookup may return a doomed vnode gfs code is (almsot) completely agnostic of FreeBSD VFS locking, so it does not handle doomed but not yet dead vnodes and may return them. Check for those vnodes here and retry a lookup. Note that ZFS and gfs have additional protections that ensure that a parent vnode of the current vnode is never doomed. The fixed problem is an occasional failure to lookup a 'snapshot' or 'shares' directories under .zfs. Note that for the above reason all uses of zfsctl_root_lookup() are better be replaced with VOP_LOOKUP. MFC after: 5 weeks	2016-05-18 08:02:49 +00:00
Alan Somers	5f7b3969e9	Speed up vdev_geom_open_by_guids Speedup is hard to measure because the only time vdev_geom_open_by_guids gets called on many drives at the same time is during boot. But with vdev_geom_open hacked to always call vdev_geom_open_by_guids, operations like "zpool create" speed up by 65%. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c * Read all of a vdev's labels in parallel instead of sequentially. * In vdev_geom_read_config, don't read the entire label, including the uberblock. That's a waste of RAM. Just read the vdev config nvlist. Reduces the IO and RAM involved with tasting from 1MB to 448KB. Reviewed by: avg MFC after: 4 weeks Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D6153	2016-05-17 15:17:23 +00:00
Andriy Gapon	857a214d03	zfs_ioc_rename: fix a reversed condition FreeBSD zfs_ioc_rename() has an option, not present upstream, that allows to rename snapshots without unmounting them first. I am not sure what is a rationale for that option, but its actual behavior was the opposite of the intended behavior. That is, by default the snapshots were not unmounted. The option was introduced as part of a large update from upstream in r248498. One of the consequences was a havoc under .zfs/snapshot after the rename. The snapshots got new names but were mounted on top of directories with old names, so readdir would list the new names, but lookup would still find the old mounts. PR: 209093 Reported by: Frédéric VANNIÈRE <f.vanniere@planet-work.com> MFC after: 5 days	2016-05-17 07:56:05 +00:00
Andriy Gapon	afe674f089	do not destroy 'snapdir' when it becomes inactive That was just wrong. In fact, we can safely keep this static entry when it's inactive. Now the destructive action is moved to the reclaim method and the function is renamed from zfsctl_snapdir_inactive(0 to zfsctl_snapdir_reclaim(). Also, we can use gfs_vop_reclaim() instead of gfs_dir_inactive() + kmem_free(). Lastly, we can just assert that the node does not any children when it is reclaimed, even on the force unmount. That's because zfs_umount() does an extra vflush() pass which should destroy all snapshot-mountpoint vnodes that are the snapdir's children. MFC after: 5 weeks	2016-05-16 15:48:56 +00:00
Andriy Gapon	9c3e205296	try to recycle "snap" vnodes as soon as possible Those vnodes should not linger. "Stale" nodes may get out of synchronization with actual snapshots. For example if we destroy a snapshot and create a new one with the same name. Or when we rename a snapshot. While there fix the argument type for zfsctl_snapshot_reclaim(). Also, its original argument can be passed to gfs_vop_reclaim() directly. Bug 209093 could be related although I have not specifically verified that. Referencing just in case. PR: 209093 MFC after: 5 weeks	2016-05-16 15:37:41 +00:00
Andriy Gapon	0ab1aa90fa	fix locking in zfsctl_root_lookup Dropping the root vnode's lock after VFS_ROOT() didn't really help the fact that we acquired the lock while holding its child's, .zfs, lock while performing the operaiton. So, directly use zfs_zget() to get the root vnode. While there simplify the code in zfsctl_freebsd_root_lookup. We know that .zfs is always exclusively locked. We know that there is already a reference on *vpp, so no need for an extra one. Account for the fact that .. lookup may ask for a different lock type, not necessarily LK_EXCLUSIVE. And handle a possible failure to acquire the lock given the lock flags. MFC after: 5 weeks	2016-05-16 15:28:39 +00:00
Andriy Gapon	705e6b8170	gfs_lookup_dot() does not have to acquire any locks In fact, that was dangerous. For example, zfsctl_snapshot_reclaim() calls gfs_dir_lookup() on ".." path and that ends up calling gfs_lookup_dot() which violated locking order by acquiring the parent's directory vnode lock after the child's vnode lock. Also, the previous behavior was inconsistent as gfs_dir_lookup() returned a locked vnode for . and .. lookups, but not for any other. Now gfs_lookup_dot() just references a resulting vnode and the locking is done in its consumers, where necessary. Note that we do not enable shared locking support for any gfs / zfsctl vnodes. This commit partially reverts r273641. MFC after: 5 weeks	2016-05-16 15:13:16 +00:00
Andriy Gapon	7223645bd1	avoid deadlock between zfsctl_snapdir_lookup and zfsctl_snapshot_reclaim The former acquired a snap vnode lock while holding sd_lock while the latter does the opposite. The solution is drop sd_lock before acquiring the vnode lock. That should be okay as we are still holding a lock on the 'snapshot' directory in the exclusive mode. That lock ensures that there are no concurrent lookups in the directory and thus no concurrent mount attempts. But now we have to account for the possibility that the snap vnode might get reclaim after we drop sd_lock and before we can get the node lock. So, check for that case and retry. MFC after: 5 weeks	2016-05-16 15:03:52 +00:00
Andriy Gapon	c6cd01d924	fix a vnode reference leak caused by illumos compat traverse() This commit partially reverts r273641 which introduced the leak. It did so to accomodate for some consumers of traverse() that expected the starting vnode to stay as-is. But that introduced the leak in the case when a mounted filesystem was found and its root vnode was returned. r299914 removed the troublesome consumers and now there is no reason to keep the starting vnode. So, now the new rules are: - if there is no mounted filesystem, then nothing is changed - otherwise the starting vnode is always released - the root vnode of the mounted filesystem is returned locked and referenced in the case of success MFC after: 5 weeks X-MFC after: r299914	2016-05-16 12:15:19 +00:00
Andriy Gapon	20ec8b0f9b	fix up r299902: mount_snapshot requires that the covered vnode is locked Previously that was not strictly enforced. MFC after: 4 weeks X-MFC with: r299902	2016-05-16 11:48:43 +00:00
Andriy Gapon	cf7aa80bbd	zfsctl_ops_snapshot: remove methods should never be called We pretend that snapshots mounted under .zfs are part of the original filesystem and we try very hard to hide vnodes on top of which the snapshots are mounted. Given that I believe that the removed operations should never be called. They might have been called previously because of issues fixed in r299906, r299908 and r299913. MFC after: 5 weeks	2016-05-16 07:24:30 +00:00
Andriy Gapon	cb68fd3513	zfsctl_snapdir_lookup: always clear VV_ROOT flag of snapshot's root VV_ROOT Previosuly we did that only if the snapshot was mounted earlier, its root vnode got recycled and then we accessed it again. We never cleared the flag for a freshly mounted snapshot. That was very inconsistent and probably a source of some bugs. Or maybe that painted over some bugs which might get revealed now. We should consistently clear the flag because we try very hard to pretend that snapshots auto-mounted under .zfs are part of their original filesystem. In other words, we try to hide the fact that they are different filesystems / mountpoints. MFC after: 5 weeks	2016-05-16 06:49:09 +00:00
Andriy Gapon	4df590b5b6	add zfs_vptocnp with special handling for snapshots under .zfs The logic is similar to that already present in zfs_dirlook() to handle a dot-dot lookup on a root vnode of a snapshot mounted under .zfs/snapshot/. illumos does not have an equivalent of vop_vptocnp, so there only the lookup had to be patched up. MFC after: 4 weeks	2016-05-16 06:40:51 +00:00
Andriy Gapon	a03fb1cf6a	mount_snapshot: consolidate all error handling This makes sure that the original vnode is always unlocked and released if any error happens. MFC after: 4 weeks	2016-05-16 06:30:25 +00:00
Andriy Gapon	3055925d42	zfsctl: fix several problems with reference counts * Remove excessive references on a snapshot mountpoint vnode. zfsctl_snapdir_lookup() called VN_HOLD() on a vnode returned from zfsctl_snapshot_mknode() and the latter also had a call to VN_HOLD() on the same vnode. On top of that gfs_dir_create() already returns the vnode with the use count of 1 (set in getnewvnode). So there was 3 references on the vnode. * mount_snapshot() should keep a reference to a covered vnode. That reference is owned by the mountpoint (mounted snapshot filesystem). * Remove cryptic manipulations of a covered vnode in zfs_umount(). FreeBSD dounmount() already does the right thing and releases the covered vnode. PR: 207464 Reported by: dustinwenz@ebureau.com Tested by: Howard Powell <hpowell@lighthouseinstruments.com> MFC after: 3 weeks	2016-05-16 06:24:04 +00:00
John Baldwin	fdce57a042	Add an EARLY_AP_STARTUP option to start APs earlier during boot. Currently, Application Processors (non-boot CPUs) are started by MD code at SI_SUB_CPU, but they are kept waiting in a "pen" until SI_SUB_SMP at which point they are released to run kernel threads. SI_SUB_SMP is one of the last SYSINIT levels, so APs don't enter the scheduler and start running threads until fairly late in the boot. This change moves SI_SUB_SMP up to just before software interrupt threads are created allowing the APs to start executing kernel threads much sooner (before any devices are probed). This allows several initialization routines that need to perform initialization on all CPUs to now perform that initialization in one step rather than having to defer the AP initialization to a second SYSINIT run at SI_SUB_SMP. It also permits all CPUs to be available for handling interrupts before any devices are probed. This last feature fixes a problem on with interrupt vector exhaustion. Specifically, in the old model all device interrupts were routed onto the boot CPU during boot. Later after the APs were released at SI_SUB_SMP, interrupts were redistributed across all CPUs. However, several drivers for multiqueue hardware allocate N interrupts per CPU in the system. In a system with many CPUs, just a few drivers doing this could exhaust the available pool of interrupt vectors on the boot CPU as each driver was allocating N * mp_ncpu vectors on the boot CPU. Now, drivers will allocate interrupts on their desired CPUs during boot meaning that only N interrupts are allocated from the boot CPU instead of N * mp_ncpu. Some other bits of code can also be simplified as smp_started is now true much earlier and will now always be true for these bits of code. This removes the need to treat the single-CPU boot environment as a special case. As a transition aid, the new behavior is available under a new kernel option (EARLY_AP_STARTUP). This will allow the option to be turned off if need be during initial testing. I plan to enable this on x86 by default in a followup commit in the next few days and to have all platforms moved over before 11.0. Once the transition is complete, the option will be removed along with the !EARLY_AP_STARTUP code. These changes have only been tested on x86. Other platform maintainers are encouraged to port their architectures over as well. The main things to check for are any uses of smp_started in MD code that can be simplified and SI_SUB_SMP SYSINITs in MD code that can be removed in the EARLY_AP_STARTUP case (e.g. the interrupt shuffling). PR: kern/199321 Reviewed by: markj, gnn, kib Sponsored by: Netflix	2016-05-14 18:22:52 +00:00
Enji Cooper	622282b3b8	Include arpa/inet.h to get the htonl(3) definition MFC after: 2 weeks Reported by: clang Sponsored by: EMC / Isilon Storage Division	2016-05-13 11:15:33 +00:00
Conrad Meyer	3b56262303	compat/opensolaris: Don't redefined off64_t if already defined A follow-up to r299456. Reported by: gjb Sponsored by: EMC / Isilon Storage Division	2016-05-11 16:05:32 +00:00
Alexander Motin	c59a902fa3	MFV r299453: 6765 zfs_zaccess_delete() comments do not accurately reflect delete permissions for ACLs Reviewed by: Gordon Ross <gwr@nexenta.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Author: Kevin Crowe <kevin.crowe@nexenta.com> openzfs/openzfs@a40149b935	2016-05-11 13:53:29 +00:00
Alexander Motin	0eb65a5367	MFV r299451: 6764 zfs issues with inheritance flags during chmod(2) with aclmode=passthrough Reviewed by: Gordon Ross <gwr@nexenta.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Author: Albert Lee <trisk@nexenta.com> openzfs/openzfs@1bcf0d240b	2016-05-11 13:50:34 +00:00
Alexander Motin	85a69dbf66	MFV r299449: 6763 aclinherit=restricted masks inherited permissions by group perms (groupmask) Reviewed by: Gordon Ross <gwr@nexenta.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Author: Albert Lee <trisk@nexenta.com> openzfs/openzfs@eebb483d0c	2016-05-11 13:48:15 +00:00

... 3 4 5 6 7 ...

1934 Commits