freebsd-dev

Author	SHA1	Message	Date
Alexander Motin	3d1e0e0830	MFV r302662: 6447 handful of nvpair cleanups illumos/illumos-gate@759e89be35 https://github.com/illumos/illumos-gate/commit/759e89be359f2af635e4122d147df56bc e948773 https://www.illumos.org/issues/6447 I got a patch from someone who uses nvpair code outside of illumos. It fixes a couple of gcc warnings/bugs for him. 1. silence uninitialized use warnings 2. add parentheses around assignment used as truth value 3. fix printf format specifier (ll is for integers only) 4. strstr, strspn, strcspn, and strcmp are declared in string.h, not strings.h. 5. avoid scanning integer into boolean variable Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Reviewed by: Andy Stormont <astormont@racktopsystems.com> Reviewed by: Garrett D'Amore <garrett@damore.org> Approved by: Robert Mustacchi <rm@joyent.com> Author: Steve Dougherty <sdougherty@barracuda.com>	2016-09-01 15:17:39 +00:00
Alexander Motin	3421688c2d	MFV r302661: 7082 bptree_iterate() passes wrong args to zfs_dbgmsg() illumos/illumos-gate@10e67aa0db https://github.com/illumos/illumos-gate/commit/10e67aa0db0823d5464aafdd681f3c966 155c68e https://www.illumos.org/issues/7082 upstream DLPX-40542 bptree_iterate() passes wrong args to zfs_dbgmsg() Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Matthew Ahrens <mahrens@delphix.com>	2016-09-01 15:10:40 +00:00
Alexander Motin	41b9077ef6	MFV r302660: 6314 buffer overflow in dsl_dataset_name illumos/illumos-gate@9adfa60d48 https://github.com/illumos/illumos-gate/commit/9adfa60d484ce2435f5af77cc99dcd4e6 92b6660 https://www.illumos.org/issues/6314 Callers of dsl_dataset_name pass a buffer of size ZFS_MAXNAMELEN, but dsl_dataset_name copies the datasets' name PLUS the snapshot name to it, resulting in a max of 2 * ZFS_MAXNAMELEN + '@'. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Matthew Ahrens <mahrens@delphix.com>	2016-09-01 15:08:27 +00:00
Alexander Motin	e12a269749	MFV r302659: 6931 lib/libzfs: cleanup gcc warnings illumos/illumos-gate@88f61dee20 `88f61dee20` https://www.illumos.org/issues/6931 need cleanup: CERRWARN += -_gcc=-Wno-switch CERRWARN += -_gcc=-Wno-parentheses CERRWARN += -_gcc=-Wno-unused-function Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Igor Kozhukhov <ikozhukhov@gmail.com>	2016-09-01 14:57:06 +00:00
Alexander Motin	a95a9fe945	MFV r302651: 7054 dmu_tx_hold_t should use refcount_t to track space illumos/illumos-gate@0c779ad424 https://github.com/illumos/illumos-gate/commit/0c779ad424a92a84d1e07d47cab7f8009 189202b https://www.illumos.org/issues/7054 upstream: ee0003de7d3e598499be7ac3fe6b61efcc47cb7f DLPX-40399 dmu_tx_hold_t should use refcount_t to track space Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Matthew Ahrens <mahrens@delphix.com>	2016-09-01 14:38:25 +00:00
Alexander Motin	96bf48b8cb	MFV r302648: 7019 zfsdev_ioctl skips secpolicy when FKIOCTL is set Note that the bulk of the upstream change is not applicable to FreeBSD and the affected files are not even in the vendor area. illumos/illumos-gate@45b1747515 `45b1747515` https://www.illumos.org/issues/7019 Currently zfsdev_ioctl, when confronted by a request with the FKIOCTL flag set, skips all processing of secpolicy functions. This means that ZFS is not doing any kind of verification of the credentials or access rights of the caller and assuming that (as it is an in-kernel client) all such checks have already been done. This turns out to be quite a dangerous assumption, especially with respect to sdev. In general I don't think it's particularly reasonable to offload this enforcement of access rights onto other kernel subsystems when ZFS has some particular local semantics in this area (delegated datasets etc) and does not provide any kind of API to allow other subsystems to avoid code duplication when doing it. ZFS should apply its normal access policy to requests from within the kernel, and callers should take care to give it the correct credentials and call it from the correct context in order to get the results they need. You can observe the currently unfortunate consequences of this bug in any non- global zone that has access to /dev/zvol or any subset of it via sdev profiles. In particular, a zone used to contain a KVM or similar which has a single zvol passed through to it using a <device match= block in its zone XML. Even though sdev makes something of an attempt to control for whether the caller should have access to nodes in /dev/zvol, it doesn't do this correctly, or really at all in the lookup call path. So, if we have a zone that's been given access to any part of /dev/zvol, it can simply look up the full path to any other zvol on the entire system, and the node will appear and be able to be used. Reviewed by: Robert Mustacchi <rm@joyent.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Alex Wilson <alex.wilson@joyent.com>	2016-09-01 14:24:54 +00:00
Alexander Motin	13876b47d7	MFV r302647: 6922 Emit ESC_ZFS_VDEV_REMOVE_AUX after removing an aux device illumos/illumos-gate@63364b0ee2 https://github.com/illumos/illumos-gate/commit/63364b0ee2604783e7a55f84258888677 68eafa4 https://www.illumos.org/issues/6922 ZFS does not do a config_sync after removing an aux (spare, log, or cache) device. AFAICT this isn't being done because it is slow and was deemed unnecessary. However, it should be such a rare operation that speed doesn't matter, and not doing it results in two problems: 1) It is theoretically possible to remove an aux device from one pool and attach it to another, then lose power. When power is restored, both pools woul d think that they own the aux device. 2) Removal of the aux device doesn't send any useful sysevents to userland. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Alan Somers <asomers@gmail.com>	2016-09-01 14:17:30 +00:00
Alexander Motin	1c7d88abed	MFV r302646: 6980 6902 causes zfs send to break due to 32-bit/64-bit struct mismatch illumos/illumos-gate@ea4a67f462 https://github.com/illumos/illumos-gate/commit/ea4a67f462de0a39a9adea8197bcdef84 9de5371 https://www.illumos.org/issues/6980 doing zfs send -i snap1 snap2 >testfile results in internal error: Invalid argument Abort (core dumped) Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com>	2016-09-01 14:06:30 +00:00
Alexander Motin	4536fd9bed	MFV r302643: 6902 speed up listing of snapshots if requesting name only and sorting by name This was our change from the beginning, so just reduce the upstream diff.	2016-09-01 13:29:53 +00:00
Alexander Motin	5fd28943d6	MFV r302642: 6876 Stack corruption after importing a pool with a too-long name illumos/illumos-gate@c971037baa `c971037baa` https://www.illumos.org/issues/6876 Calling dsl_dataset_name on a dataset with a 256 byte buffer is asking for trouble. We should check every dataset on import, using a 1024 byte buffer and checking each time to see if the dataset's new name is longer than 256 bytes. Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Paul Dagnelie <pcd@delphix.com>	2016-09-01 13:04:36 +00:00
Alexander Motin	9007a8679a	Fix kernel panic when inheriting properties without default. There are two writable hidden properties "iscsioptions" and "stmf_sbd_lu", that have no default string value. Attempt to unset them or replicate caused kernel panic. This simple bandaid seems fixes the problem nicely. MFC after: 2 weeks	2016-08-31 11:55:31 +00:00
Mark Johnston	59ceeddecf	MFV r301526: 7035 string-related subroutines should validate input earlier Reviewed by: Alex Wilson <alex.wilson@joyent.com> Reviewed by: Bryan Cantrill <bryan@joyent.com> Approved by: Matthew Ahrens <mahrens@delphix.com> Author: Patrick Mooney <pmooney@pfmooney.com> illumos/illumos-gate@771e39c3b1 MFC after: 2 weeks	2016-08-16 02:25:19 +00:00
Mark Johnston	f66200ee22	MFV r301525: 7033 ustack helper should fault on bad return values Reviewed by: Patrick Mooney <patrick.mooney@joyent.com> Reviewed by: Bryan Cantrill <bryan@joyent.com> Approved by: Matthew Ahrens <mahrens@delphix.com> Author: Alex Wilson <alex.wilson@joyent.com> illumos/illumos-gate@a2f72b65eb MFC after: 2 weeks	2016-08-16 02:20:02 +00:00
Mark Johnston	4aea8f31b1	MFV r301524: 7034 negative record sizes should be rejected Reviewed by: Patrick Mooney <patrick.mooney@joyent.com> Reviewed by: Bryan Cantrill <bryan@joyent.com> Approved by: Matthew Ahrens <mahrens@delphix.com> Author: Alex Wilson <alex.wilson@joyent.com> illumos/illumos-gate@0b8049bfb0 MFC after: 2 weeks	2016-08-16 02:18:34 +00:00
Mark Johnston	b7125fa9cd	MFV r296989: 6734 dtrace_canstore_statvar() fails for some valid static variables Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Bryan Cantrill <bryan@joyent.com> illumos/illumos-gate@d65f2bb4e5 MFC after: 2 weeks	2016-08-16 02:16:54 +00:00
Andriy Gapon	96762fe314	fix a zfs cross-device rename crash introduced in r303763 The problem was that 'zfsvfs' variable was not initialized if the error was detected, but in the exit path the variable was dereferenced before the error code was checked. Reported by: np MFC after: 3 days X-MFC with: r303763	2016-08-09 06:11:24 +00:00
Andriy Gapon	4fb51b52ef	fix .zfs-related cases in zfs_lookup that were broken by r303763 The problem is that the special .zfs nodes are not represented by znodes but by special gfs-based nodes. r303763 changed interface of zfs_dirlook such that started operating on znodes rather than on vnodes and, thus, the function became unsuitable for handling .zfs entities. The solution is to move the handling of the special cases to zfs_lookup, the only consumer of zfs_dirlook. I already had this solution applied in D7421, but for different reasons. Reported by: asomers MFC after: 3 days X-MFC with: r303763	2016-08-06 11:02:07 +00:00
Andriy Gapon	f79bc17233	zfs: honour and make use of vfs vnode locking protocol ZFS POSIX Layer is originally written for Solaris VFS which is very different from FreeBSD VFS. Most importantly many things that FreeBSD VFS manages on behalf of all filesystems are implemented in ZPL in a different way. Thus, ZPL contains code that is redundant on FreeBSD or duplicates VFS functionality or, in the worst cases, badly interacts / interferes with VFS. The most prominent problem is a deadlock caused by the lock order reversal of vnode locks that may happen with concurrent zfs_rename() and lookup(). The deadlock is a result of zfs_rename() not observing the vnode locking contract expected by VFS. This commit removes all ZPL internal locking that protects parent-child relationships of filesystem nodes. These relationships are protected by vnode locks and the code is changed to take advantage of that fact and to properly interact with VFS. Removal of the internal locking allowed all ZPL dmu_tx_assign calls to use TXG_WAIT mode. Another victim, disputable perhaps, is ZFS support for filesystems with mixed case sensitivity. That support is not provided by the OS anyway, so in ZFS it was a buch of dead code. To do: - replace ZFS_ENTER mechanism with VFS managed / visible mechanism - replace zfs_zget with zfs_vget[f] as much as possible - get rid of not really useful now zfs_freebsd_* adapters - more cleanups of unneeded / unused code - fix / replace .zfs support PR: 209158 Reported by: many Tested by: many (thank you all!) MFC after: 5 days Sponsored by: HybridCluster / ClusterHQ Differential Revision: https://reviews.freebsd.org/D6533	2016-08-05 06:23:06 +00:00
Ruslan Bukin	98f50c44e3	Update RISC-V port to Privileged Architecture Version 1.9. Sponsored by: DARPA, AFRL Sponsored by: HEIF5	2016-08-02 14:50:14 +00:00
Enji Cooper	4fcae4df7e	Conditionalize code which defines sysctls per _KERNEL #ifdef guard This resolves several issues when compiling libzpool (userspace library), i.e. -Wimplicit-function-declaration and -Wmissing-declarations issues. MFC after: 2 weeks Reported by: clang Tested with: clang 3.8.1, gcc 4.2.1, gcc 5.3.0 Sponsored by: EMC / Isilon Storage Division	2016-07-31 06:34:49 +00:00
Mark Johnston	57185c52de	Restore an ifdef that should not have been removed in r303535. X-MFC-With: r303535	2016-07-30 07:05:32 +00:00
Mark Johnston	6d1ffb50fc	Include fasttrap handling for DATAMODEL_ILP32 when compiling for amd64. MFC after: 1 month	2016-07-30 03:11:53 +00:00
Andriy Gapon	70e3da3892	MFV r302645: 6878 Add scrub completion info to "zpool history" illumos/illumos-gate@1825bc56e5 `1825bc56e5` https://www.illumos.org/issues/6878 Summary of changes: * Replace generic "scan done" message with "scan aborted, restarting", "scan cancelled", or "scan done" * Log number of errors using spa_get_errlog_size * Refactor scan restarting check into static function Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Nav Ravindranath <nav@delphix.com> MFC after: 2 weeks	2016-07-14 11:53:39 +00:00
Andriy Gapon	39a6b17491	MFV r302650: 6940 Cannot unlink directories when over quota illumos/illumos-gate@99189164df `99189164df` https://www.illumos.org/issues/6940 Similar to #6334, but this time with empty directories: $ zfs create tank/quota $ zfs set quota=10M tank/quota $ zfs snapshot tank/quota@snap1 $ zfs set mountpoint=/mnt/tank/quota tank/quota $ mkdir /mnt/tank/quota/dir # create an empty directory $ mkfile 11M /mnt/tank/quota/11M /mnt/tank/quota/11M: initialized 9830400 of 11534336 bytes: Disc quota exceeded $ rmdir /mnt/tank/quota/dir # now unlink the empty directory rmdir: directory "/mnt/tank/quota/dir": Disc quota exceeded From user perspective, I would expect that ZFS is always able to remove files and directories even when the quota is exceeded. Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Simon Klinkert <simon.klinkert@gmail.com> MFC after: 2 weeks	2016-07-14 11:51:01 +00:00
Andriy Gapon	fe0cc75230	MFV r302644: 6513 partially filled holes lose birth time illumos/illumos-gate@8df0bcf0df `8df0bcf0df` https://www.illumos.org/issues/6513 If a ZFS object contains a hole at level one, and then a data block is created at level 0 underneath that l1 block, l0 holes will be created. However, these l0 holes do not have the birth time property set; as a result, incremental sends will not send those holes. Fix is to modify the dbuf_read code to fill in birth time data. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Boris Protopopov <bprotopopov@hotmail.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Paul Dagnelie <pcd@delphix.com> MFC after: 3 weeks	2016-07-14 11:48:42 +00:00
Andriy Gapon	e7ed92bbbc	MFV r302641: 6844 dnode_next_offset can detect fictional holes illumos/illumos-gate@11ceac77ea `11ceac77ea` https://www.illumos.org/issues/6844 dnode_next_offset is used in a variety of places to iterate over the holes or allocated blocks in a dnode. It operates under the premise that it can iterate over the blockpointers of a dnode in open context while holding only the dn_struct_rwlock as reader. Unfortunately, this premise does not hold. When we create the zio for a dbuf, we pass in the actual block pointer in the indirect block above that dbuf. When we later zero the bp in zio_write_compress, we are directly modifying the bp. The state of the bp is now inconsistent from the perspective of dnode_next_offset: the bp will appear to be a hole until zio_dva_allocate finally finishes filling it in. In the meantime, dnode_next_offset can detect a hole in the dnode when none exists. I was able to experimentally demonstrate this behavior with the following setup: 1. Create a file with 1 million dbufs. 2. Create a thread that randomly dirties L2 blocks by writing to the first L0 block under them. 3. Observe dnode_next_offset, waiting for it to skip over a hole in the middle of a file. 4. Do dnode_next_offset in a loop until we skip over such a non-existent hole. The fix is to ensure that it is valid to iterate over the indirect blocks in a dnode while holding the dn_struct_rwlock by passing the zio a copy of the BP and updating the actual BP in dbuf_write_ready while holding the lock. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Boris Protopopov <bprotopopov@hotmail.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Alex Reece <alex@delphix.com> MFC after: 3 weeks	2016-07-14 11:42:53 +00:00
Andriy Gapon	875e6e5b04	MFV r302640: 6874 rollback and receive need to reset ZPL state to what's on disk illumos/illumos-gate@1fdcbd00c9 `1fdcbd00c9` https://www.illumos.org/issues/6874 When we do a clone swap (caused by "zfs rollback" or "zfs receive"), the ZPL doesn't completely reload the state from the DMU; some values remain cached in the zfsvfs_t. steps to reproduce: ``` #!/bin/bash -x zfs destroy -R test/fs zfs destroy -R test/recvd zfs create test/fs zfs snapshot test/fs@a zfs set userquota@$USER=1m test/fs zfs snapshot test/fs@b zfs send test/fs@a \| zfs recv test/recvd zfs send -i @a test/fs@b \| zfs recv test/recvd zfs userspace test/recvd 1. should show 1m quota dd if=/dev/urandom of=/test/recvd/file bs=1k count=1024 sync dd if=/dev/urandom of=/test/recvd/file2 bs=1k count=1024 2. should fail with ENOSPC sync zfs unmount test/recvd zfs mount test/recvd zfs userspace test/recvd 3. if bug above, now shows 1m quota dd if=/dev/urandom of=/test/recvd/file3 bs=1k count=1024 4. if bug above, now fails with ENOSPC ``` Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> Author: Matthew Ahrens <mahrens@delphix.com> MFC after: 3 weeks	2016-07-14 11:39:36 +00:00
Andriy Gapon	ac3623e090	re-apply r299908: zfsctl_snapdir_lookup: clear VV_ROOT of snapshot's root The change has been undone in r301275 on the assumption that it was no longer required. But that was incorrect, because in this case (and only in this case) the snapshot root vnode is looked up before z_parent is fixed up. MFC after: 5 days	2016-07-13 15:16:51 +00:00
Mark Johnston	ca1ef36cf4	Avoid truncating the return value of DTrace predicates. Predicates are DIF objects whose return value is compared with zero to determine whether the corresponding probe body is to be executed. The return value itself is the contents of a 64-bit DIF register, but it was being truncated to an int before the comparison. This meant that a predicate such as /0x100000000/ would evaluate to false. Reported by: rwatson MFC after: 3 days	2016-07-09 22:41:21 +00:00
Steven Hartland	ae8420ed72	Fix ZFS ARC min / max tunable Due to ARC initial configuration not being done and kmem information not being available we need to blindly set zfs_arc_max and zfs_arc_min when configured via the tunable. This fixes vfs.zfs.arc_(min\|max) configuration via loader.conf broken by r302265. Approved by: re(gjb) MFC after: 1 week	2016-07-06 23:49:19 +00:00
Alexander Motin	e36599916f	Revert r299454 and r299448. Those changes were found confusing FreeBSD libc ACL code, that doesn't differentiate ACL for directories and files, and report ACLs for all directories created after those patches as non-trivial. On the other side these changes were considered wrong from POSIX and NFSv4 points of view. Until further investigation done upstream, revert those changes locally in preparation for FreeBSD 11.0 release. Approved by: re (hrs)	2016-06-30 14:55:49 +00:00
Steven Hartland	f535b2d7d8	Allow ZFS ARC min / max to be tuned at runtime Prior to this change ZFS ARC min / max could only be changed using boot time tunables, this allows the values to be tuned at runtime using the sysctls: * vfs.zfs.arc_max * vfs.zfs.arc_min When adjusting ZFS ARC minimum the memory used will only reduce to the new minimum given memory pressure. Reviewed by: allanjude Approved by: re (gjb) MFC after: 2 weeks Relnotes: yes Sponsored by: Multiplay Differential Revision: https://reviews.freebsd.org/D5907	2016-06-29 07:55:45 +00:00
Andriy Gapon	0f7dcde977	fix deadlock-prone code in getzfsvfs() getzfsvfs() called vfs_busy() in the waiting mode while having a hold on a pool (via a call to dmu_objset_hold). In other words, dp_config_rwlock was held in the shared mode while a thread could be sleeping in vfs_busy(). The pool's txg sync thread needs to take dp_config_rwlock in the exclusive mode for some actions, e.g., for executing sync tasks. If the sync thread gets blocked, then any thread waiting for its sync task to get executed is also blocked. Which, in turn, could mean that vfs_busy() will keep waiting indefinitely. The solution is to use vfs_ref() in the locked section and to call vfs_busy() only after dropping other locks. Note that a reference on a struct mount object does not prevent an associated zfsvfs_t object from being destroyed. So, we have to be careful to operate only on the struct mount object until we successfully vfs_busy it. Approved by: re (gjb) MFC after: 2 weeks	2016-06-23 07:01:54 +00:00
Alan Somers	54edbcfb69	Fix uninitialized variable from r300881 sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c Initialize needs_update in vdev_geom_set_physpath PR: 210409 Reported by: kp Reviewed by: kp Approved by: re (hrs) MFC after: 4 weeks X-MFC-With: 300881 Sponsored by: Spectra Logic Corp	2016-06-21 15:27:16 +00:00
Konstantin Belousov	cacbedfc46	Fix gcc build. Reported andt tested by: swills Sponsored by: The FreeBSD Foundation Approved by: re (gjb)	2016-06-18 20:20:00 +00:00
Konstantin Belousov	4a0d95f810	Use vnlru_free(9) to implement dnlc_reduce_cache(). This apparently puts ARC back under the limits after the vnode pressure rework in r291244, in particular due to the kmem exhaustion. Based on patch by: mckusick Reviewed by: avg, mckusick Tested by: allanjude, madpilot Sponsored by: The FreeBSD Foundation Approved by: re (gjb)	2016-06-17 17:34:28 +00:00
Andriy Gapon	55be2f79e5	l2arc: reset b_tmp_cdata to NULL in the case of unset b_daddr The change is in arc_buf_l2_cdata_free(). Without this we can trip the assertion in arc_hdr_realloc() if INVARIANTS option is enabled. Approved by: re (kib) MFC after: 1 week	2016-06-13 18:39:13 +00:00
Andriy Gapon	aa14503e8e	zfs_vptocnp: check for an invalid znode ... which can arise after the receive or rollback and failed zfs_rezget(). Approved by: re (kib) MFC after: 1 week	2016-06-13 10:53:34 +00:00
Andriy Gapon	a68789426a	zfs: set VROOT / VV_ROOT consistently and in a single place This is a followup to r300131. A filesystem's root vnode can be reached not only through VSF_ROOT, but by other means as well. For example, via a dot-dot lookup. Also, a root vnode can get reclaimed and then re-created. For these reasons it was insufficient to clear VV_ROOT flag from a root vnode of a snapshot mounted under .zfs in zfsctl_snapdir_lookup(). So, now we set the flag in zfs_znode_sa_init() only if a vnode represent a root of a filesystem or a standalone snapshot. That is, the flag is not set for snapshots mounted under .zfs. MFC after: 2 weeks	2016-06-03 14:37:18 +00:00
Andriy Gapon	d1cf30f4f1	zfs_root: fix a potential root vnode reference leak It could happen in an unlikely case that we fail to lock the root vnode with requested flags (which appear to never include LK_NOWAIT). MFC after: 1 week	2016-06-03 14:22:12 +00:00
Alan Somers	b1a6b8dcd2	Improve the English in a comment sys/cddl/contrib/opensolaris/uts/common/sys/acl.h: Improve the english in a comment. No functional changes Submitted by: gibbs MFC after: 4 weeks Sponsored by: Spectra Logic Corp	2016-06-01 22:21:42 +00:00
Allan Jude	0144ad3e78	Connect the SHA-512t256 and Skein hashing algorithms to ZFS Support for the new hashing algorithms in ZFS was introduced in r289422 However it was disconnected because FreeBSD lacked implementations of SHA-512 (truncated to 256 bits), and Skein. These implementations were introduced in r300921 and r300966 respectively This commit connects them to ZFS and enabled these new checksum algorithms This new algorithms are not supported by the boot blocks, so do not use them on your root dataset if you boot from ZFS. Relnotes: yes Sponsored by: ScaleEngine Inc.	2016-05-31 04:12:14 +00:00
Bryan Drewery	fdd9048a63	Avoid more literal-suffix errors with C++11	2016-05-29 00:40:29 +00:00
Alan Somers	7a0c41d5d7	zfsd(8), the ZFS fault management daemon Add zfsd, which deals with hard drive faults in ZFS pools. It manages hotspares and replements in drive slots that publish physical paths. cddl/usr.sbin/zfsd Add zfsd(8) and its unit tests cddl/usr.sbin/Makefile Add zfsd to the build lib/libdevdctl A C++ library that helps devd clients process events lib/Makefile share/mk/bsd.libnames.mk share/mk/src.libnames.mk Add libdevdctl to the build. It's a private library, unusable by out-of-tree software. etc/defaults/rc.conf By default, set zfsd_enable to NO etc/mtree/BSD.include.dist Add a directory for libdevdctl's include files etc/mtree/BSD.tests.dist Add a directory for zfsd's unit tests etc/mtree/BSD.var.dist Add /var/db/zfsd/cases, where zfsd stores case files while it's shut down. etc/rc.d/Makefile etc/rc.d/zfsd Add zfsd's rc script sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c Fix the resource.fs.zfs.statechange message. It had a number of problems: It was only being emitted on a transition to the HEALTHY state. That made it impossible for zfsd to take actions based on drives getting sicker. It compared the new state to vdev_prevstate, which is the state that the vdev had the last time it was opened. That doesn't make sense, because a vdev can change state multiple times without being reopened. vdev_set_state contains logic that will change the device's new state based on various conditions. However, the statechange event was being posted _before_ that logic took effect. Now it's being posted after. Submitted by: gibbs, asomers, mav, allanjude Reviewed by: mav, delphij Relnotes: yes Sponsored by: Spectra Logic Corp, iX Systems Differential Revision: https://reviews.freebsd.org/D6564	2016-05-28 17:43:40 +00:00
Enji Cooper	dfdbdb0c82	Fix up r300870 The sys/types.h fix I proposed was only tested with zfs(4), not with libzpool, which is where the build failure actually existed Remove vm/vm_pageout.h from arc.c and zfs_vnops.c because they're both unneeded MFC after: 1 week X-MFC with: r300865, r300870 In collaboration with: kib Submitted by: alc Sponsored by: EMC / Isilon Storage Division	2016-05-27 22:56:00 +00:00
Alan Somers	151746b244	Avoid issuing spa config updates for physical path when not necessary ZFS's configuration needs to be updated whenever the physical path for a device changes, but not when a new device is introduced. This is because new devices necessarily cause config updates, but only if they are actually accepted into the pool. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c Split vdev_geom_set_physpath out of vdev_geom_attrchanged. When setting the vdev's physical path, only request a config update if the physical path has changed. Don't request it when opening a device for the first time, because the config sync will happen anyway upstack. sys/geom/geom_dev.c Split g_dev_set_physpath and g_dev_set_media out of g_dev_attrchanged Submitted by: will, asomers MFC after: 4 weeks Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D6428	2016-05-27 22:32:44 +00:00
Enji Cooper	765daefd68	Unbreak the zfs(4) build vm/vm_pageout.h grew a dependency on the bool typedef in r300865 arc.c didn't include sys/types.h, which included the definition for the typedef Other items (ofed, drm2) might need to be chased for this commit. X-MFC with: r300865 MFC after: 1 week Pointyhat to: alc Sponsored by: EMC / Isilon Storage Division	2016-05-27 20:33:38 +00:00
Ruslan Bukin	fed1ca4b71	Add initial DTrace support for RISC-V. Sponsored by: DARPA, AFRL Sponsored by: HEIF5	2016-05-24 16:41:37 +00:00
Andriy Gapon	fabe7e4ecc	add vop_print methods to vnode operatios of various zfsctl node types This should help with diagnostics of zfsctl problems. MFC after: 2 weeks	2016-05-18 13:21:29 +00:00
Andriy Gapon	e34c8d727b	move zfsctl_freebsd_root_lookup right next to zfsctl_root_lookup That makes it easier to reason about the code. MFC after: 5 weeks	2016-05-18 08:29:39 +00:00
Andriy Gapon	a4bbed22d2	zfsctl_common_fid: remove redundant assignment "Reinterpret cast" to zfid_short_t and assignment of zf_len do the job already. MFC after: 1 week	2016-05-18 08:26:09 +00:00
Andriy Gapon	e6d4eefe2a	zfsctl: tighten an assertion and remove an unused definition There are only two entries under .zfs and 'shares' has an ID of a special persistent object in its filesystem. MFC after: 1 week	2016-05-18 08:23:39 +00:00
Andriy Gapon	439e9b6804	zfs_root: no need to set the root flag here That was both redundant as zfs_znode_sa_init() already does the job and insufficient as the root vnode can be reached via other means. MFC after: 1 weeks	2016-05-18 08:19:41 +00:00
Andriy Gapon	74a3df2b1f	zfsctl_freebsd_root_lookup: gfs_vop_lookup may return a doomed vnode gfs code is (almsot) completely agnostic of FreeBSD VFS locking, so it does not handle doomed but not yet dead vnodes and may return them. Check for those vnodes here and retry a lookup. Note that ZFS and gfs have additional protections that ensure that a parent vnode of the current vnode is never doomed. The fixed problem is an occasional failure to lookup a 'snapshot' or 'shares' directories under .zfs. Note that for the above reason all uses of zfsctl_root_lookup() are better be replaced with VOP_LOOKUP. MFC after: 5 weeks	2016-05-18 08:02:49 +00:00
Alan Somers	5f7b3969e9	Speed up vdev_geom_open_by_guids Speedup is hard to measure because the only time vdev_geom_open_by_guids gets called on many drives at the same time is during boot. But with vdev_geom_open hacked to always call vdev_geom_open_by_guids, operations like "zpool create" speed up by 65%. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c * Read all of a vdev's labels in parallel instead of sequentially. * In vdev_geom_read_config, don't read the entire label, including the uberblock. That's a waste of RAM. Just read the vdev config nvlist. Reduces the IO and RAM involved with tasting from 1MB to 448KB. Reviewed by: avg MFC after: 4 weeks Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D6153	2016-05-17 15:17:23 +00:00
Andriy Gapon	857a214d03	zfs_ioc_rename: fix a reversed condition FreeBSD zfs_ioc_rename() has an option, not present upstream, that allows to rename snapshots without unmounting them first. I am not sure what is a rationale for that option, but its actual behavior was the opposite of the intended behavior. That is, by default the snapshots were not unmounted. The option was introduced as part of a large update from upstream in r248498. One of the consequences was a havoc under .zfs/snapshot after the rename. The snapshots got new names but were mounted on top of directories with old names, so readdir would list the new names, but lookup would still find the old mounts. PR: 209093 Reported by: Frédéric VANNIÈRE <f.vanniere@planet-work.com> MFC after: 5 days	2016-05-17 07:56:05 +00:00
Andriy Gapon	afe674f089	do not destroy 'snapdir' when it becomes inactive That was just wrong. In fact, we can safely keep this static entry when it's inactive. Now the destructive action is moved to the reclaim method and the function is renamed from zfsctl_snapdir_inactive(0 to zfsctl_snapdir_reclaim(). Also, we can use gfs_vop_reclaim() instead of gfs_dir_inactive() + kmem_free(). Lastly, we can just assert that the node does not any children when it is reclaimed, even on the force unmount. That's because zfs_umount() does an extra vflush() pass which should destroy all snapshot-mountpoint vnodes that are the snapdir's children. MFC after: 5 weeks	2016-05-16 15:48:56 +00:00
Andriy Gapon	9c3e205296	try to recycle "snap" vnodes as soon as possible Those vnodes should not linger. "Stale" nodes may get out of synchronization with actual snapshots. For example if we destroy a snapshot and create a new one with the same name. Or when we rename a snapshot. While there fix the argument type for zfsctl_snapshot_reclaim(). Also, its original argument can be passed to gfs_vop_reclaim() directly. Bug 209093 could be related although I have not specifically verified that. Referencing just in case. PR: 209093 MFC after: 5 weeks	2016-05-16 15:37:41 +00:00
Andriy Gapon	0ab1aa90fa	fix locking in zfsctl_root_lookup Dropping the root vnode's lock after VFS_ROOT() didn't really help the fact that we acquired the lock while holding its child's, .zfs, lock while performing the operaiton. So, directly use zfs_zget() to get the root vnode. While there simplify the code in zfsctl_freebsd_root_lookup. We know that .zfs is always exclusively locked. We know that there is already a reference on *vpp, so no need for an extra one. Account for the fact that .. lookup may ask for a different lock type, not necessarily LK_EXCLUSIVE. And handle a possible failure to acquire the lock given the lock flags. MFC after: 5 weeks	2016-05-16 15:28:39 +00:00
Andriy Gapon	705e6b8170	gfs_lookup_dot() does not have to acquire any locks In fact, that was dangerous. For example, zfsctl_snapshot_reclaim() calls gfs_dir_lookup() on ".." path and that ends up calling gfs_lookup_dot() which violated locking order by acquiring the parent's directory vnode lock after the child's vnode lock. Also, the previous behavior was inconsistent as gfs_dir_lookup() returned a locked vnode for . and .. lookups, but not for any other. Now gfs_lookup_dot() just references a resulting vnode and the locking is done in its consumers, where necessary. Note that we do not enable shared locking support for any gfs / zfsctl vnodes. This commit partially reverts r273641. MFC after: 5 weeks	2016-05-16 15:13:16 +00:00
Andriy Gapon	7223645bd1	avoid deadlock between zfsctl_snapdir_lookup and zfsctl_snapshot_reclaim The former acquired a snap vnode lock while holding sd_lock while the latter does the opposite. The solution is drop sd_lock before acquiring the vnode lock. That should be okay as we are still holding a lock on the 'snapshot' directory in the exclusive mode. That lock ensures that there are no concurrent lookups in the directory and thus no concurrent mount attempts. But now we have to account for the possibility that the snap vnode might get reclaim after we drop sd_lock and before we can get the node lock. So, check for that case and retry. MFC after: 5 weeks	2016-05-16 15:03:52 +00:00
Andriy Gapon	c6cd01d924	fix a vnode reference leak caused by illumos compat traverse() This commit partially reverts r273641 which introduced the leak. It did so to accomodate for some consumers of traverse() that expected the starting vnode to stay as-is. But that introduced the leak in the case when a mounted filesystem was found and its root vnode was returned. r299914 removed the troublesome consumers and now there is no reason to keep the starting vnode. So, now the new rules are: - if there is no mounted filesystem, then nothing is changed - otherwise the starting vnode is always released - the root vnode of the mounted filesystem is returned locked and referenced in the case of success MFC after: 5 weeks X-MFC after: r299914	2016-05-16 12:15:19 +00:00
Andriy Gapon	20ec8b0f9b	fix up r299902: mount_snapshot requires that the covered vnode is locked Previously that was not strictly enforced. MFC after: 4 weeks X-MFC with: r299902	2016-05-16 11:48:43 +00:00
Andriy Gapon	cf7aa80bbd	zfsctl_ops_snapshot: remove methods should never be called We pretend that snapshots mounted under .zfs are part of the original filesystem and we try very hard to hide vnodes on top of which the snapshots are mounted. Given that I believe that the removed operations should never be called. They might have been called previously because of issues fixed in r299906, r299908 and r299913. MFC after: 5 weeks	2016-05-16 07:24:30 +00:00
Andriy Gapon	cb68fd3513	zfsctl_snapdir_lookup: always clear VV_ROOT flag of snapshot's root VV_ROOT Previosuly we did that only if the snapshot was mounted earlier, its root vnode got recycled and then we accessed it again. We never cleared the flag for a freshly mounted snapshot. That was very inconsistent and probably a source of some bugs. Or maybe that painted over some bugs which might get revealed now. We should consistently clear the flag because we try very hard to pretend that snapshots auto-mounted under .zfs are part of their original filesystem. In other words, we try to hide the fact that they are different filesystems / mountpoints. MFC after: 5 weeks	2016-05-16 06:49:09 +00:00
Andriy Gapon	4df590b5b6	add zfs_vptocnp with special handling for snapshots under .zfs The logic is similar to that already present in zfs_dirlook() to handle a dot-dot lookup on a root vnode of a snapshot mounted under .zfs/snapshot/. illumos does not have an equivalent of vop_vptocnp, so there only the lookup had to be patched up. MFC after: 4 weeks	2016-05-16 06:40:51 +00:00
Andriy Gapon	3055925d42	zfsctl: fix several problems with reference counts * Remove excessive references on a snapshot mountpoint vnode. zfsctl_snapdir_lookup() called VN_HOLD() on a vnode returned from zfsctl_snapshot_mknode() and the latter also had a call to VN_HOLD() on the same vnode. On top of that gfs_dir_create() already returns the vnode with the use count of 1 (set in getnewvnode). So there was 3 references on the vnode. * mount_snapshot() should keep a reference to a covered vnode. That reference is owned by the mountpoint (mounted snapshot filesystem). * Remove cryptic manipulations of a covered vnode in zfs_umount(). FreeBSD dounmount() already does the right thing and releases the covered vnode. PR: 207464 Reported by: dustinwenz@ebureau.com Tested by: Howard Powell <hpowell@lighthouseinstruments.com> MFC after: 3 weeks	2016-05-16 06:24:04 +00:00
Alexander Motin	c59a902fa3	MFV r299453: 6765 zfs_zaccess_delete() comments do not accurately reflect delete permissions for ACLs Reviewed by: Gordon Ross <gwr@nexenta.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Author: Kevin Crowe <kevin.crowe@nexenta.com> openzfs/openzfs@a40149b935	2016-05-11 13:53:29 +00:00
Alexander Motin	0eb65a5367	MFV r299451: 6764 zfs issues with inheritance flags during chmod(2) with aclmode=passthrough Reviewed by: Gordon Ross <gwr@nexenta.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Author: Albert Lee <trisk@nexenta.com> openzfs/openzfs@1bcf0d240b	2016-05-11 13:50:34 +00:00
Alexander Motin	85a69dbf66	MFV r299449: 6763 aclinherit=restricted masks inherited permissions by group perms (groupmask) Reviewed by: Gordon Ross <gwr@nexenta.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Author: Albert Lee <trisk@nexenta.com> openzfs/openzfs@eebb483d0c	2016-05-11 13:48:15 +00:00
Alexander Motin	2a219f349e	MFV r299442: 6762 POSIX write should imply DELETE_CHILD on directories - and some additional considerations Reviewed by: Gordon Ross <gwr@nexenta.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Author: Kevin Crowe <kevin.crowe@nexenta.com> openzfs/openzfs@d316fffc9c	2016-05-11 13:43:20 +00:00
Alexander Motin	42a54f9745	MFV r299440: 6736 ZFS per-vdev ZAPs Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Don Brady <don.brady@intel.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Joe Stein <joe.stein@delphix.com> openzfs/openzfs@215198a6ad	2016-05-11 12:54:00 +00:00
Alexander Motin	7d54dbae83	MFV r299438: 6842 Fix empty xattr dir causing lockup Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Chunwei Chen <tuxoko@gmail.com> openzfs/openzfs@02525cd08f	2016-05-11 12:46:07 +00:00
Alexander Motin	d7ff478705	MFV r299436: 6843 Make xattr dir truncate and remove in one tx Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Chunwei Chen <tuxoko@gmail.com> openzfs/openzfs@399cc7d5d9	2016-05-11 12:43:54 +00:00
Alexander Motin	0b99ac761e	MFV r299434: 6841 Undirty freed spill blocks Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Tim Chase <tim@chase2k.com> openzfs/openzfs@445e67805d	2016-05-11 12:38:07 +00:00
Ruslan Bukin	d7dc6bae03	Implement FBT provider (MD part) for DTrace on MIPS. Tested on MIPS64. Sponsored by: DARPA, AFRL Sponsored by: HEIF5	2016-05-05 13:54:50 +00:00
Alan Somers	c9a807447d	Fix a use-after-free when "zpool import" fails clear vd->vdev_tsd in vdev_geom_close_locked instead of vdev_geom_detach. In the latter function, it would fail to happen in certain circumstances where cp->private was unset. Ideally, the latter should never happen, but it can happen when vdev open fails, or where spares are involved. MFC after: 4 weeks X-MFC-With: 298786 Sponsored by: Spectra Logic Corp	2016-04-29 21:29:37 +00:00
Alan Somers	663f649ff6	Refactor vdev_geom_attach and friends to reduce code duplication sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c Move checks for provider's sectorsize and mediasize into a single location in vdev_geom_attach. Remove the zfs::vdev::taste class; it's ok to use the regular vdev class for tasting. Consolidate guid checks into a single location in vdev_attach_ok. Consolidate some error handling code from vdev_geom_attach into vdev_geom_detach, closing a resource leak of geom consumers in the process. Reviewed by: avg MFC after: 4 weeks Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D5974	2016-04-29 15:23:51 +00:00
Mark Johnston	676a03fa6a	Increase DTRACE_FUNCNAMELEN from 128 to 192. This allows for the long function components encountered in www/firefox. This constant is part of DTrace's userland ABI, so this change may not be MFC'ed. PR: 207735	2016-04-25 18:44:11 +00:00
Mark Johnston	328d8adb9b	Allow DOF sections with excessively long probe function components. Without this change, DTrace will refuse to load a DOF section if the function component of any of its probes exceeds DTRACE_FUNCNAMELEN (128). Probes in C++ programs can have very long function components. Rather than rejecting all probes if a single probe exceeds the limit, simply skip the invalid probe and emit a warning. This ensures that valid probes are instantiated. PR: 207735 MFC after: 2 weeks	2016-04-25 18:40:57 +00:00
Andriy Gapon	dbbcddb426	MFV r298471: 6052 decouple lzc_create() from the implementation details illumos/illumos-gate@26455f9efc `26455f9efc` https://www.illumos.org/issues/6052 At the moment type parameter of lzc_create() is of dmu_objset_type_t type. That exposes an implementation detail and requires sys/fs/zfs.h to be included in libzfs_core.h creating unnecessary coupling between libzfs_core interface and ZFS internals. I think that dmu_objset_type_t should be replaced with a libzfs_core enumeration of supported dataset types. For ABI reasons the new enumeration could be bit-compatible with dmu_objset_type_t. For example: typedef enum { LZC_DST_ZFS = 2, LZC_DST_ZVOL } lzc_dataset_type_t; Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Andriy Gapon <andriy.gapon@clusterhq.com> MFC after: 2 weeks Sponsored by: ClusterHQ	2016-04-22 13:00:27 +00:00
Mark Johnston	6c2806594b	Make the second argument of dtrace_invop() a trapframe pointer. Currently this argument is a pointer into the stack which is used by FBT to fetch the first five probe arguments. On all non-x86 architectures it's simply the trapframe address, so this change has no functional impact. On amd64 it's a pointer into the trapframe such that stack[1 .. 5] gives the first five argument registers, which are deliberately grouped together in the amd64 trapframe definition. A trapframe argument simplifies the invop handlers on !x86 and makes the x86 FBT invop handler easier to understand. Moreover, it allows for invop handlers that may want to modify the register set of the interrupted thread.	2016-04-17 23:08:47 +00:00
Andriy Gapon	e01dd79f9a	zfs_rezget: z_vnode can not be NULL if zp is valid MFC after: 3 weeks	2016-04-16 07:41:56 +00:00
Andriy Gapon	c2d36fc5cd	zfs: enable vn_io_fault support Note that now we have to account for possible partial writes in dmu_write_uio_dbuf(). It seems that on illumos either all or none of the data are expected to be written. But the partial writes are quite expected when vn_io_fault support is enabled. Reviewed by: kib MFC after: 7 weeks Differential Revision: https://reviews.freebsd.org/D2790	2016-04-16 07:35:53 +00:00
Alan Somers	739f4ae3b1	Don't corrupt ZFS label's physpath attribute when booting while a disk is missing Prior to this change, vdev_geom_open_by_path would call vdev_geom_attach prior to verifying the device's GUIDs. vdev_geom_attach calls vdev_geom_attrchange to set the physpath in the vdev object. The result is that if the disk could not be found, then the labels for other disks in the same TLD would overwrite the missing disk's physpath with the physpath of whichever disk currently has the same devname as the missing one used to have. MFC after: 4 weeks Sponsored by: Spectra Logic Corp	2016-04-15 16:36:17 +00:00
Alan Somers	c29088b5c7	Add more debugging statements in vdev_geom.c Log a debugging message whenever geom functions fail in vdev_geom_attach. Printing these messages is controlled by vfs.zfs.debug MFC after: 4 weeks Sponsored by: Spectra Logic Corp	2016-04-14 23:14:41 +00:00
Alan Somers	f0ac053088	Update a debugging message in vdev_geom_open_by_guids for consistency with similar messages elsewhere in the file. MFC after: 4 weeks Sponsored by: Spectra Logic Corp	2016-04-14 19:20:31 +00:00
Alan Somers	4e3ab010a2	Fix rare double free in vdev_geom_attrchanged sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c Don't drop the g_topology_lock before freeing old_physpath. That opens up a race where one thread can call vdev_geom_attrchanged, set old_physpath, drop the g_topology_lock, then block trying to acquire the SCL_STATE lock. Then another thread can come into vdev_geom_attrchanged, set old_physpath to the same value, and proceed to free it. When the first thread resumes, it will free the same location. It turns out that the SCL_STATE lock isn't needed. It was originally added by gibbs to protect vd->vdev_physpath while updating the same. However, the update process subsequently was switched to an atomic operation (a pointer swap). Now, there is no need for the SCL_STATE lock, and hence no need to drop the g_topology_lock. Reviewed by: delphij MFC after: 4 weeks Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D5413	2016-04-12 19:11:14 +00:00
Andriy Gapon	c3249989ef	l2arc: make sure that all writes honor ashift of a cache device Previously uncompressed buffers did not obey that rule. Type of b_asize is changed to uint64_t for consistency, given that this is a zeta-byte filesystem. l2arc_compress_buf is renamed to l2arc_transform_buf to better reflect its new utility. Now not only we ensure that a compressed buffer has a size aligned to ashift, but we also allocate a properly sized temporary buffer if the original buffer is not compressed and it has an odd size. This ensures that all I/O to the cache device is always ashift-aligned, in terms of both a request offset and a request size. If the aligned data is larger than the original data, then we have to use a temporary buffer when reading it as well. Also, enhance physical zio alignment checks using vdev_logical_ashift. On FreeBSD we have this information, so we can make stricter assertions. Reviewed by: smh, mav MFC after: 1 month Sponsored by: ClusterHQ Differential Revision: https://reviews.freebsd.org/D2789	2016-04-12 06:56:35 +00:00
Andriy Gapon	6a50036052	Revert r297396 Modify "4958 zdb trips assert on pools with ashift >= 0xe" A better fix is following.	2016-04-12 06:54:18 +00:00
Alexander Motin	78aec5c610	MFV r297831: 6322 ZFS indirect block predictive prefetch Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Author: Alexander Motin <mav@FreeBSD.org> Improve speculative prefetch of indirect blocks. Scalability of many operations on wide ZFS pool can be limited by requirement to prefetch indirect blocks first. Recently added asynchronous indirect block read partially helped, but did not solve the problem completely. This patch extends existing prefetcher functionality to explicitly work with indirect blocks. Before this change prefetcher issued reads for up to 8MB of data in advance. With this change it also issues indirect block reads for up to 64MB of data in advance, so that when it will be time to actually read those data, it can be done immediately. Alike effect can be achieved by just increasing maximal data prefetch distance, but at higher memory cost. Also this change introduces indirect block prefetch for rewrite operations, that was never done before. Previously ARC miss for Indirect blocks regularly blocked rewrites, converting perfectly aligned asynchronous operations into synchronous read-write pairs, significantly reducing maximal rewrite speed. While being there this issue was also fixed: - prefetch was done always, even if caching for the dataset was completely disabled. Testing on FreeBSD with zvol on top of 6x striped 2x mirrored pool of 12 assorted HDDs shown me such performance numbers: ------- BEFORE -------- Write 491363677 bytes/sec Read 312430631 bytes/sec Rewrite 97680464 bytes/sec -------- AFTER -------- Write 493524146 bytes/sec Read 438598079 bytes/sec Rewrite 277506044 bytes/sec Closes #65 Closes #80 openzfs/openzfs@792fd28ac0	2016-04-11 21:09:15 +00:00
Steven Hartland	2dcee04b3a	Only include sysctl in kernel build Only include sysctl in kernel builds fixing warning about implicit declaration of function 'sysctl_handle_int'. PR: 204140 MFC after: 1 week X-MFC-With: r297813 Sponsored by: Multiplay	2016-04-11 13:17:11 +00:00
Steven Hartland	7bc47b4ea3	Only include sysctl in kernel build Only include sysctl in kernel builds fixing warning about implicit declaration of function 'sysctl_handle_int'. Sponsored by: Multiplay	2016-04-11 08:57:54 +00:00
Andriy Gapon	1da2e1e353	zio: align use of "no dump" flag between use_uma and !use_uma cases At the moment no ZFS buffers are included into a crash dump unless ZFS_DEBUG (or INVARIANTS) kernel option is enabled. That's not very helpful for debugging of ZFS problems, because important information often resides in metadata buffers. This change switches the dumping behavior when UMA is used from the illumos behavior to a more useful behavior that we have on FreeBSD when ZFS buffers are allocated via malloc. Reviewed by: smh, mav MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D5892	2016-04-11 07:11:20 +00:00
Mark Johnston	b529028676	Implement support for boot-time DTrace. This allows one to enable DTrace probes relatively early during boot, during SI_SUB_DTRACE_ANON, before dtrace(1) can invoked. The desired enabling is created using dtrace -A, which writes a /boot/dtrace.dof file and uses nextboot(8) to ensure that DTrace kernel modules are loaded and that the DOF file describing the enabling is loaded by loader(8) during the subsequent boot. The trace output can then be fetched with dtrace -a. With this commit, boot-time DTrace is only functional on i386 and amd64: on other architectures, the high-resolution timer frequency is initialized during SI_SUB_CLOCKS and is thus not available when the anonymous tracing state is initialized. On x86, the TSC is used and is thus available earlier. MFC after: 1 month Relnotes: yes	2016-04-10 01:25:48 +00:00
Alexander Motin	eaee150e3f	MFV r297760: 6418 zpool should have a label clearing command Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Author: Will Andrews <will@firepipe.net> Closes #83 Closes #32 openzfs/openzfs@9663688425 FreeBSD already had `zpool labelclear` functionality, so this is mostly just a diff reduction. MFC after: 1 month	2016-04-09 20:30:50 +00:00
Andriy Gapon	c8ff459286	zio write issue threads should have lower (numerically greater) priority This is because they might do data compression which is quite CPU expensive. The original code is correct for illumos, because there a higher priority corresponds to a greater number. MFC after: 2 weeks	2016-04-08 11:58:24 +00:00
Alexander Motin	309b1c7ade	Alike to r293708 relax pool check in vdev_geom_open_by_path(). This made impossible spare disk open by known path, which kind of worked only because the same fix was applied to vdev_geom_attach_by_guids() in r293708. MFC after: 1 week	2016-04-07 12:54:44 +00:00
Edward Tomasz Napierala	ae34b6ff96	Add four new RCTL resources - readbps, readiops, writebps and writeiops, for limiting disk (actually filesystem) IO. Note that in some cases these limits are not quite precise. It's ok, as long as it's within some reasonable bounds. Testing - and review of the code, in particular the VFS and VM parts - is very welcome. MFC after: 1 month Relnotes: yes Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D5080	2016-04-07 04:23:25 +00:00
Andriy Gapon	e881d8757c	remove emulation of VFS_HOLD and VFS_RELE from opensolaris compat On FreeBSD VFS_HOLD/VN_RELE were mapped to MNT_REF/MNT_REL that manipulate mnt_ref. But the job of properly maintaining the reference count is already automatically performed by insmntque(9) and delmntque(9). So, in effect all ZFS vnodes referenced the corresponding mountpoint twice. That was completely harmless, but we want to be very explicit about what FreeBSD VFS APIs are used, because illumos VFS_HOLD and FreeBSD MNT_REF provide quite different guarantees with respect to the held vfs_t / mountpoint. On illumos VFS_HOLD is sufficient to guarantee that vfs_t.vfs_data stays valid. On the other hand, on FreeBSD MNT_REF does not provide the same guarantee about mnt_data. We have to use vfs_busy() to get that guarantee. Thus, the calls to VFS_HOLD/VFS_RELE on vnode init and fini are removed. VFS_HOLD calls are replaced with vfs_busy in the ioctl handlers. And because vfs_busy has a richer interface that can not be dumbed down in all cases it's better to explicitly use it rather than trying to mask it behind VFS_HOLD. This change fixes a panic that could result from a race between zfs_umount() and zfs_ioc_rollback(). We observed a case where zfsvfs_free() tried to destroy data that zfsvfs_teardown() was still using. That happened because there was nothing to prevent unmounting of a ZFS filesystem that was in between zfs_suspend_fs() and zfs_resume_fs(). Reviewed by: kib, smh MFC after: 3 weeks Sponsored by: ClusterHQ Differential Revision: https://reviews.freebsd.org/D2794	2016-04-02 16:25:46 +00:00
Alexander Motin	baf8ceac9e	MFV r297506: 6738 zfs send stream padding needs documentation Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Eli Rosenthal <eli.rosenthal@delphix.com> illumos/illumos-gate@c20404ff77	2016-04-02 08:36:24 +00:00
Alexander Motin	4cf6fde5e8	MFV r297504: 6681 zfs list burning lots of time in dodefault() via dsl_prop_* Reviewed by: Patrick Mooney <patrick.mooney@joyent.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Matthew Ahrens <mahrens@delphix.com> Author: Alex Wilson <alex.wilson@joyent.com> illumos/illumos-gate@d09e4475f6	2016-04-02 08:28:46 +00:00
Gleb Smirnoff	a8b2b39cce	Fix an error in r292373. Use proper count to update "pages in" counter. Noticed by: pfg via Coverity	2016-03-31 21:15:00 +00:00
Alexander Motin	30a0e024ee	Plug open count leak on zvol rename. MFC after: 2 weeks	2016-03-30 16:54:18 +00:00
Alexander Motin	b39dea9308	Switch from using make_dev_p() to make_dev_s() to close races.	2016-03-30 16:48:57 +00:00
Alexander Motin	86b0daa373	Modify "4958 zdb trips assert on pools with ashift >= 0xe". Unlike Illumos FreeBSD has concept of logical ashift, that specifies really minimal vdev block size that can be accessed. This knowledge allows properly pad physical I/O and correctly assert its alignment. This change fixes L2ARC write errors when device has logical sector size above 512 bytes. MFC after: 1 month	2016-03-29 19:18:34 +00:00
Alexander Motin	78b127f2fc	Pass through error code from make_dev_p(). ENAMETOOLONG is much more informative in logs then ENXIO. MFC after: 1 week	2016-03-28 08:12:29 +00:00
Alexander Motin	52c9b0b539	Unify ignoring EEXIST from zvol_create_minor(). This fixes creation of zvol devices for snapshots during zfs receive, that previously failed with "ZFS WARNING: Unable to create ZVOL" message. This solution is not perfect, but IMHO better then it was before. MFC after: 2 weeks	2016-03-24 10:10:41 +00:00
Mark Johnston	48cc2d5e22	Remove unused variables dtrace_in_probe and dtrace_in_probe_addr.	2016-03-17 18:55:54 +00:00
Alexander Motin	5db0866658	Make ZFS ignore stripe sizes above SPA_MAXASHIFT (8KB). If device has stripe size bigger then maximal sector size supported by ZFS, there is nothing can be done to avoid read-modify-write cycles. Taking that stripe size into account will only reduce space efficiency and pointlessly bother user with warnings that can not be fixed. Discussed with: smh	2016-03-10 16:39:46 +00:00
Alexander Motin	eef192d85c	Make ZFS more picky to GEOM stripe sizes and offsets. Use of misaligned or non-power-of-2 stripes is not really useful for ZFS, since increased ashift won't help to avoid read-modify-write cycles, and only reduce pool space efficiency and compression rates.	2016-03-10 14:18:14 +00:00
Alexander Motin	a151f3a7ef	MFV r296609: 6370 ZFS send fails to transmit some holes Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Chris Williamson <chris.williamson@delphix.com> Reviewed by: Stefan Ring <stefanrin@gmail.com> Reviewed by: Steven Burgess <sburgess@datto.com> Reviewed by: Arne Jansen <sensille@gmx.net> Approved by: Robert Mustacchi <rm@joyent.com> Author: Paul Dagnelie <pcd@delphix.com> In certain circumstances, "zfs send -i" (incremental send) can produce a stream which will result in incorrect sparse file contents on the target. The problem manifests as regions of the received file that should be sparse (and read a zero-filled) actually contain data from a file that was deleted (and which happened to share this file's object ID). Note: this can happen only with filesystems (not zvols, because they do not free (and thus can not reuse) object IDs). Note: This can happen only if, since the incremental source (FromSnap), a file was deleted and then another file was created, and the new file is sparse (i.e. has areas that were never written to and should be implicitly zero-filled). We suspect that this was introduced by 4370 (applies only if hole_birth feature is enabled), and made worse by 5243 (applies if hole_birth feature is disabled, and we never send any holes). The bug is caused by the hole birth feature. When an object is deleted and replaced, all the holes in the object have birth time zero. However, zfs send cannot tell that the holes are new since the file was replaced, so it doesn't send them in an incremental. As a result, you can end up with invalid data when you receive incremental send streams. As a short-term fix, we can always send holes with birth time 0 (unless it's a zvol or a dataset where we can guarantee that no objects have been reused). Closes #37 openzfs/openzfs@adef853162	2016-03-10 09:01:19 +00:00
Alexander Motin	7370229e8d	Add new IOCTL compat shims for ABI breakage caused by r296510: MFV r296505: 6531 Provide mechanism to artificially limit disk performance	2016-03-09 11:16:15 +00:00
Alexander Motin	8d0e2eb06b	MFV r296529: 6672 arc_reclaim_thread() should use gethrtime() instead of ddi_get_lbolt() 6673 want a macro to convert seconds to nanoseconds and vice-versa Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Reviewed by: Robert Mustacchi <rm@joyent.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Eli Rosenthal <eli.rosenthal@delphix.com> illumos/illumos-gate@a8f6344fa0	2016-03-08 18:28:24 +00:00
Alexander Motin	468bca03ef	MFV r296527: 6659 nvlist_free(NULL) is a no-op Reviewed by: Toomas Soome <tsoome@me.com> Reviewed by: Marcel Telka <marcel@telka.sk> Approved by: Robert Mustacchi <rm@joyent.com> Author: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> illumos/illumos-gate@aab83bb83b	2016-03-08 18:11:38 +00:00
Alexander Motin	26802705d3	MFV r296522: 6541 Pool feature-flag check defeated if "verify" is included in the dedup property value Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Richard Laager <rlaager@wiktel.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: ilovezfs <ilovezfs@icloud.com> illumos/illumos-gate@971640e6aa	2016-03-08 17:58:02 +00:00
Alexander Motin	178f2c2b8e	MFV r296520: 6562 Refquota on receive doesn't account for overage Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Toomas Soome <tsoome@me.com> Approved by: Gordon Ross <gwr@nexenta.com> Author: Dan McDonald <danmcd@omniti.com> illumos/illumos-gate@5f7a8e6d75	2016-03-08 17:53:42 +00:00
Alexander Motin	7a90077752	MFV r296518: 5027 zfs large block support (add copyright) Author: Matthew Ahrens <matt@mahrens.org> illumos/illumos-gate@c3d26abc9e	2016-03-08 17:51:09 +00:00
Alexander Motin	c892984b84	MFV r296515: 6536 zfs send: want a way to disable setting of DRR_FLAG_FREERECORDS Reviewed by: Anil Vijarnia <avijarnia@racktopsystems.com> Reviewed by: Kim Shrier <kshrier@racktopsystems.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Andrew Stormont <astormont@racktopsystems.com> illumos/illumos-gate@880094b606	2016-03-08 17:43:21 +00:00
Alexander Motin	253159febf	MFV r296513: 6450 scrub/resilver unnecessarily traverses snapshots created after the scrub started Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Matthew Ahrens <mahrens@delphix.com> illumos/illumos-gate@38d6103674	2016-03-08 17:34:58 +00:00
Alexander Motin	4427252c14	MFV r296511: 6537 Panic on zpool scrub with DEBUG kernel Reviewed by: Steve Gonczi <gonczi@comcast.net> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Matthew Ahrens <mahrens@delphix.com> Author: Gary Mills <gary_mills@fastmail.fm> illumos/illumos-gate@8c04a1fa3f	2016-03-08 17:32:24 +00:00
Alexander Motin	1b63fd68f4	MFV r296505: 6531 Provide mechanism to artificially limit disk performance Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Prakash Surya <prakash.surya@delphix.com> illumos/illumos-gate@97e8130957	2016-03-08 17:27:13 +00:00
Mark Johnston	380344a7af	Fix fasttrap tracepoint locking. Upstream, tracepoints are protected by per-CPU mutexes. An unlinked tracepoint may be freed once all the tracepoint mutexes have been acquired and released - this is done in fasttrap_mod_barrier(). This mechanism was not properly ported: in some places, the proc lock is used in place of a tracepoint lock, and in others the locking is omitted entirely. This change implements tracepoint locking with an rmlock, where the read lock is used in fasttrap probe context. As a side effect, this fixes a recursion on the proc lock when the raise action is used from a userland probe. MFC after: 1 month	2016-03-08 00:43:03 +00:00
Mark Johnston	6b1bddce00	Remove the fasttrap implementation for sparc. Other machine-dependent code required for DTrace on sparc is not present in the tree, so there's no point to keeping the fasttrap code.	2016-03-08 00:18:46 +00:00
Mark Johnston	acaa855f6e	MFV r296306: 6604 harden DIF bounds checking Reviewed by: Alex Wilson <alex.wilson@joyent.com> Reviewed by: Patrick Mooney <patrick.mooney@joyent.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Bryan Cantrill <bryan@joyent.com> illumos/illumos-gate@1c0cef67db MFC after: 2 weeks	2016-03-08 00:14:14 +00:00
Steven Hartland	e283644b87	Removed unused label and fix mutex_exit order Remove unused done label from zfs_setacl fixing PVS-Studio V729. Fix mutex_exit order to mirror the mutex_enter order. MFC after: 1 week Sponsored by: Multiplay	2016-02-25 03:01:24 +00:00
Warner Losh	c55f57071a	Create an API to reset a struct bio (g_reset_bio). This is mandatory for all struct bio you get back from g_{new,alloc}_bio. Temporary bios that you create on the stack or elsewhere should use this before first use of the bio, and between uses of the bio. At the moment, it is nothing more than a wrapper around bzero, but that may change in the future. The wrapper also removes one place where we encode the size of struct bio in the KBI.	2016-02-17 17:16:02 +00:00
Andriy Gapon	984777c43f	MFV r294821: 6529 Properly handle updates of variably-sized SA entries. Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Ned Bass <bass6@llnl.gov> Reviewed by: Tim Chase <tim@chase2k.com> Approved by: Gordon Ross <gwr@nexenta.com> Author: Andriy Gapon <avg@icyb.net.ua> illumos/illumos-gate@e7e978b1f7 During the update process in sa_modify_attrs(), the sizes of existing variably-sized SA entries are obtained from sa_lengths[]. The case where a variably-sized SA was being replaced neglected to increment the index into sa_lengths[], so subsequent variable-length SAs would be rewritten with the wrong length. This patch adds the missing increment operation so all variably-sized SA entries are stored with their correct lengths. Another problem was that index into attr_desc[] was increased even when an attribute was removed. If that attribute was not the last attribute, then the last attribute was lost.	2016-02-01 15:40:40 +00:00
Alan Somers	d4b9233a96	Add a sysctl to allow ZFS pools backed by zvols Change 294329 removed the ability to build ZFS pools that are backed by zvols, because having that ability (even if it's not used) leads to deadlocks. By popular demand, I'm adding an off-by-default sysctl to reenable that ability. Reviewed by: lidl, delphij MFC after: Never Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D4998	2016-01-29 17:08:26 +00:00
Ruslan Bukin	28029b68c0	Welcome the RISC-V 64-bit kernel. This is the final step required allowing to compile and to run RISC-V kernel and userland from HEAD. RISC-V is a completely open ISA that is freely available to academia and industry. Thanks to all the people involved! Special thanks to Andrew Turner, David Chisnall, Ed Maste, Konstantin Belousov, John Baldwin and Arun Thomas for their help. Thanks to Robert Watson for organizing this project. This project sponsored by UK Higher Education Innovation Fund (HEIF5) and DARPA CTSRD project at the University of Cambridge Computer Laboratory. FreeBSD/RISC-V project home: https://wiki.freebsd.org/riscv Reviewed by: andrew, emaste, kib Relnotes: Yes Sponsored by: DARPA, AFRL Sponsored by: HEIF5 Differential Revision: https://reviews.freebsd.org/D4982	2016-01-29 15:12:31 +00:00
Alexander Motin	5a97f48082	MFV r294819: 6495 Fix mutex leak in dmu_objset_find_dp Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Albert Lee <trisk@omniti.com> Author: Steven Hartland <steven.hartland@multiplay.co.uk> illumos/illumos-gate@2bad22584d	2016-01-26 13:45:41 +00:00
Alexander Motin	1cb4625f18	MFV r294816: 4986 receiving replication stream fails if any snapshot exceeds refquota Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Gordon Ross <gordon.ross@nexenta.com> Author: Dan McDonald <danmcd@omniti.com> illumos/illumos-gate@5878fad70d	2016-01-26 13:37:30 +00:00
Alexander Motin	75b810aee6	MFV r294814: 6393 zfs receive a full send as a clone Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Paul Dagnelie <pcd@delphix.com> illumos/illumos-gate@68ecb2ec93 This allows to do a full (non-incremental send) and receive it as a clone of an existing dataset. It can leverage nopwrite to share blocks with the origin. This can be used to change the relationship of datasets on the target. For example, maybe on the source you have: A ---- B ---- C And you have sent to the target a full of B, and the incremental B->C: B ---- C You later realize that you want to have A on the target. You will have to do a full send of A, but nopwrite can save you space on the target if you receive it as a clone of B, assuming that A and B have some blocks inxi common: B ---- C \ A	2016-01-26 13:14:39 +00:00
Alexander Motin	d2385b31f5	MFV r294812: 6434 sa_find_sizes() may compute wrong SA header size Reviewed-by: Ned Bass <bass6@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Andriy Gapon <avg@freebsd.org> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: James Pan <jiaming.pan@yahoo.com> illumos/illumos-gate@3502ed6e7c	2016-01-26 13:03:01 +00:00
Alexander Motin	49b7f6ef02	MFV r294810: 6414 vdev_config_sync could be simpler Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Will Andrews <will@firepipe.net> illumos/illumos-gate@eb5bb58421	2016-01-26 12:58:58 +00:00
Alexander Motin	70c71b4722	MFV r294808: 6421 Add missing multilist_destroy calls to arc_fini Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Jorgen Lundman <lundman@lundman.net> Approved by: Robert Mustacchi <rm@joyent.com> Author: Prakash Surya <prakash.surya@delphix.com> illumos/illumos-gate@57deb23282	2016-01-26 12:54:03 +00:00
Alexander Motin	6c941579b9	MFV r294806: 6388 Failure of userland copy should return EFAULT Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Richard Yao <ryao@gentoo.org> illumos/illumos-gate@c71c00bbe8	2016-01-26 12:52:16 +00:00
Alexander Motin	8ad8374efe	MFV r294804: 6386 Fix function call with uninitialized value in vdev_inuse Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Richard Yao <ryao@gentoo.org> illumos/illumos-gate@5bdd995ddb	2016-01-26 12:50:14 +00:00
Alexander Motin	02404a5ad2	MFV r294802: 6334 Cannot unlink files when over quota Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Toomas Soome <tsoome@me.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Simon Klinkert <simon.klinkert@gmail.com> illumos/illumos-gate@6575bca013	2016-01-26 12:48:10 +00:00
Alexander Motin	2360c716f9	MFV r294800: 6385 Fix unlocking order in zfs_zget Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Andriy Gapon <avg@freebsd.org> Approved by: Robert Mustacchi <rm@joyent.com> Author: Richard Yao <ryao@gentoo.org> illumos/illumos-gate@eaef6a96de	2016-01-26 12:44:49 +00:00
Alexander Motin	81754f9788	MFV r294798: 6292 exporting a pool while an async destroy is running can leave entries in the deferred tree Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Fabian Keil <fk@fabiankeil.de> Approved by: Gordon Ross <gordon.ross@nexenta.com> illumos/illumos-gate@a443cc80c7	2016-01-26 12:37:23 +00:00
Alexander Motin	82abccb272	MFV r294796: 6319 assertion failed in zio_ddt_write: bp->blk_birth == txg Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> illumos/illumos-gate@b39b744be7 This is revert of 5693.	2016-01-26 12:33:58 +00:00
Alexander Motin	27a8d05bd7	MFV r294793: 6367 spa_config_tryenter incorrectly handles the multiple-lock case Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com> Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Reviewed by: Prashanth Sreenivasa <prashksp@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Steven Hartland <steven.hartland@multiplay.co.uk> Approved by: Matthew Ahrens <mahrens@delphix.com> illumos/illumos-gate@e495b6e673	2016-01-26 12:28:53 +00:00
Edward Tomasz Napierala	aa9b057c08	Fix ru_oublocks accounting for ZFS. There are two code paths that can be called from zfs_write() - one of them, through dmu_write(), was handled correctly; the other wasn't. Reviewed by: avg@ MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D4923	2016-01-23 12:13:09 +00:00
Alan Somers	34a484f353	Quell harmless CID about unchecked return value in nvlist_get_guids. The return value doesn't need to be checked, because nvlist_get_guid's callers check the returned values of the guids. Coverity CID: 1341869 MFC after: 1 week X-MFC-With: 292066 Sponsored by: Spectra Logic Corp	2016-01-19 23:16:24 +00:00
Alan Somers	f7b60097b5	Disallow zvol-backed ZFS pools Using zvols as backing devices for ZFS pools is fraught with panics and deadlocks. For example, attempting to online a missing device in the presence of a zvol can cause a panic when vdev_geom tastes the zvol. Better to completely disable vdev_geom from ever opening a zvol. The solution relies on setting a thread-local variable during vdev_geom_open, and returning EOPNOTSUPP during zvol_open if that thread-local variable is set. Remove the check for MUTEX_HELD(&zfsdev_state_lock) in zvol_open. Its intent was to prevent a recursive mutex acquisition panic. However, the new check for the thread-local variable also fixes that problem. Also, fix a panic in vdev_geom_taste_orphan. For an unknown reason, this function was set to panic. But it can occur that a device disappears during tasting, and it causes no problems to ignore this departure. Reviewed by: delphij MFC after: 1 week Relnotes: yes Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D4986	2016-01-19 17:00:25 +00:00
Dimitry Andric	9516209bf2	MFV r294101: 6527 Possible access beyond end of string in zpool comment Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Gordon Ross <gwr@nexenta.com> illumos/illumos-gate@2bd7a8d078 This fixes erroneous double increments of the 'check' variable in a loop in spa_prop_validate(). I ran into this in the clang380-import branch, where clang 3.8.0 warns about it. (It is already fixed there.) MFC after: 3 days	2016-01-15 21:45:53 +00:00
Alan Somers	cbedc01c9a	Fix race condition involving ZFS remove events When a ZFS drive disappears, ZFS sends a resource.fs.zfs.removed event to userland. A userland program like zfsd(8) can use that event, for example to activate a hotspare. The current code contains a race condition: vdev_geom will sent the sysevent _before_ spa.c would update the vdev's status, causing userland processes to see pool state that does not reflect the device removal. This change moves the sysevent to spa.c, closing the race. Reviewed by: delphij, Sean Eric Fagan MFC after: 4 weeks Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D4902	2016-01-14 18:19:05 +00:00
Alan Somers	53f6862723	Fix importing l2arc device by guid After r292066, vdev_geom verifies both the vdev and pool guids of device labels during open. However, spare and l2arc devices don't have pool guids, so opening them by guid will fail (opening by path, when the pathname is known, still succeeds). This change allows a vdev to be opened by guid if the label contains no pool_guid, which is the case for inactive spares and l2arc devices. PR: 292066 Reported by: delphij Reviewed by: delphij, smh MFC after: 2 weeks Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D4861	2016-01-11 22:15:46 +00:00
Alan Somers	4e7787a9e9	Record physical path information in ZFS Vdevs sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c: If available, record the physical path of a vdev in ZFS meta-data. Do this both when opening the vdev, and when receiving an attribute change notification from GEOM. Make vdev_geom_close() synchronous instead of deferring its work to a GEOM event handler. There is no benefit to deferring the work and this prevents a future open call from referencing a consumer that is scheduled for destruction. The close followed by an immediate open will occur during a vdev reprobe triggered by any type of I/O error. Consolidate vdev_geom_close() and vdev_geom_detach() into vdev_geom_close() and vdev_geom_close_locked(). This also moves the cross linking operations between vdev and GEOM consumer into a single place (linking in vdev_geom_attach() and unlinking in vdev_geom_close_locked()). Submitted by: gibbs, asomers MFC after: 4 weeks Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D4524	2016-01-11 17:57:26 +00:00
Allan Jude	7a3f5d11fb	Replace sys/crypto/sha2/sha2.c with lib/libmd/sha512c.c cperciva's libmd implementation is 5-30% faster The same was done for SHA256 previously in r263218 cperciva's implementation was lacking SHA-384 which I implemented, validated against OpenSSL and the NIST documentation Extend sbin/md5 to create sha384(1) Chase dependancies on sys/crypto/sha2/sha2.{c,h} and replace them with sha512{c.c,.h} Reviewed by: cperciva, des, delphij Approved by: secteam, bapt (mentor) MFC after: 2 weeks Sponsored by: ScaleEngine Inc. Differential Revision: https://reviews.freebsd.org/D3929	2015-12-27 17:33:59 +00:00
Gleb Smirnoff	f17f88d3e0	Fix breakage caused by r292373 in ZFS/FUSE/NFS/SMBFS. With the new VOP_GETPAGES() KPI the "count" argument counts pages already, and doesn't need to be translated from bytes to pages. While here make it consistent that rbehind and rahead are updated only if we doesn't return error. Pointy hat to: glebius	2015-12-16 23:48:50 +00:00
Gleb Smirnoff	b0cd20172d	A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES(). o With new KPI consumers can request contiguous ranges of pages, and unlike before, all pages will be kept busied on return, like it was done before with the 'reqpage' only. Now the reqpage goes away. With new interface it is easier to implement code protected from race conditions. Such arrayed requests for now should be preceeded by a call to vm_pager_haspage() to make sure that request is possible. This could be improved later, making vm_pager_haspage() obsolete. Strenghtening the promises on the business of the array of pages allows us to remove such hacks as swp_pager_free_nrpage() and vm_pager_free_nonreq(). o New KPI accepts two integer pointers that may optionally point at values for read ahead and read behind, that a pager may do, if it can. These pages are completely owned by pager, and not controlled by the caller. This shifts the UFS-specific readahead logic from vm_fault.c, which should be file system agnostic, into vnode_pager.c. It also removes one VOP_BMAP() request per hard fault. Discussed with: kib, alc, jeff, scottl Sponsored by: Nginx, Inc. Sponsored by: Netflix	2015-12-16 21:30:45 +00:00
Alan Somers	670ffd5e5c	Change an important error message from ZFS_LOG to printf Submitted by: gibbs MFC after: 4 weeks Sponsored by: Spectra Logic Corp	2015-12-11 00:04:13 +00:00
Alan Somers	62ac7dd2bf	During vdev_geom_open, require that the vdev guids match the device's label except during split, add, or create operations. This fixes a bug where the wrong disk could be returned, and higher layers of ZFS would immediately eject it again. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c: o When opening by GUID, require both the pool and vdev GUIDs to match. While it is highly unlikely for two vdevs to have the same vdev GUIDs, the ZFS storage pool allocator only guarantees they are unique within a pool. o Modify the open behavior to: - If we are opening a vdev that hasn't previously been opened, open by path without checking GUIDs. - Otherwise, open by path and verify GUIDs. - If that fails, search all geom providers for a device with matching GUIDs. - If that fails, return ENOENT. Submitted by: gibbs, asomers Reviewed by: smh MFC after: 4 weeks Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D4486	2015-12-10 21:46:21 +00:00
Mark Johnston	1639290749	MFV r289003: 6271 dtrace caused excessive fork time Author: Bryan Cantrill <bryan@joyent.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Approved by: Gordon Ross <gwr@nexenta.com> illumos/illumos-gate@7bd3c1d12d	2015-12-07 21:49:32 +00:00
Mark Johnston	6e0f204c3f	Modify DTRACEHIOC_ADDDOF to copy the DOF section from the target process. r281257 added support for lazyload mode by allowing dtrace(1) to register a DOF section on behalf of a traced process. This was implemented by having libdtrace copy the DOF section into a heap-allocated buffer and passing its address to the ioctl handler. However, DTrace uses the DOF section address as a lookup key in certain cases, so the ioctl handler should be given the target process' DOF section address instead. This change modifies the ADDDOF handler to copy the DOF section in from the target process, rather than from dtrace(1).	2015-12-07 21:44:05 +00:00
Mark Johnston	711fbd17ec	Add helper functions proc_readmem() and proc_writemem(). These helper functions can be used to read in or write a buffer from or to an arbitrary process' address space. Without them, this can only be done using proc_rwmem(), which requires the caller to fill out a uio. This is onerous and results in code duplication; the new functions provide a simpler interface which is sufficient for most existing callers of proc_rwmem(). This change also adds a manual page for proc_rwmem() and the new functions. Reviewed by: jhb, kib Differential Revision: https://reviews.freebsd.org/D4245	2015-12-07 21:33:15 +00:00
Stanislav Sedov	314eeef290	Make the number of fasttrap probes and the size of the trace points hash table tunable via sysctl or kernel tunables. Illumos allows this parameters to be changed via the fasttrap.conf configuration file, but FreeBSD code hardcoded the parameters. Expose them under the kern.dtrace.fasttrap sysctl tree. MFC after: 2 weeks	2015-12-01 00:24:54 +00:00
Steven Hartland	465fed1c17	Switch zfs_panic_recover to panic for bad DVA As reported by Coverity a null pointer de-reference panic would be triggered when zfs_recover was set so switch to straight panic as it can never be recovered. Reported by: Coverity Scan MFC after: 1 X-MFC-With: r290401 Sponsored by: Multiplay	2015-11-06 20:45:19 +00:00
Steven Hartland	d0d400133f	Provide information about bad DVA Provide information about which vdev has an issue with a bad DVA. MFC after: 1 week Sponsored by: Multiplay	2015-11-05 17:12:41 +00:00
Steven Hartland	ab66c9067a	Allow zfs_recover to be changed at runtime MFC after: 1 week Sponsored by: Multiplay	2015-11-05 17:00:42 +00:00
Andrew Turner	e5ca5f2abd	Fix the open solaris atomic functions on arm64. Without this we may use the wrong value in the comparison, leading to incorrectly setting the new value. This has been observed in the ZFS code. Without this we can lose track of the reference count in a zrlock object. We should move to use the generic atomic functions, however as this has been observed I would prefer to have this working, then move to the generic functions. PR: 204037 Sponsored by: ABT Systems Ltd	2015-11-05 16:55:27 +00:00
Andriy Gapon	c34d46ff59	zfs: allow the lookup of extended attributes of an unlinked file That's required for extattr_get_fd(2) and the like to work properly. PR: 203201 MFC after: 17 days	2015-11-02 10:07:21 +00:00
Andriy Gapon	abc37121c4	l2arc: do not call trim_map_free() for blocks with zero b_asize b_asize can be zero if the block is compressed into an empty block (ZIO_COMPRESS_EMPTY) and the trim code asserts that meaningless zero-sized trimming is not attempted. The logic for calling trim_map_free() is extracted into a new function l2arc_trim() to minimize code duplication. PR: 203473 Reported by: Willem Jan Withagen <wjw@digiware.nl> Tested by: Willem Jan Withagen <wjw@digiware.nl> MFC after: 11 days	2015-10-30 12:00:34 +00:00
Alexander Motin	6b513e2853	MFV r289561: 6328 Fix cstyle errors in zfs codebase Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed by: Jorgen Lundman <lundman@lundman.net> Approved by: Robert Mustacchi <rm@joyent.com> Author: Paul Dagnelie <pcd@delphix.com> illumos/illumos-gate@9a686fbc18	2015-10-19 08:25:37 +00:00
Alexander Motin	62ed65eb78	MFV r289526: 5561 support root pools on EFI/GPT partitioned disks 5125 update zpool/libzfs to manage bootable whole disk pools (EFI/GPT labeled disks) Reviewed by: Jean McCormack <jean.mccormack@nexenta.com> Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Hans Rosenfeld <hans.rosenfeld@nexenta.com> illumos/illumos-gate@1a902ef862 This is NOP changes for FreeBSD.	2015-10-18 18:08:33 +00:00
Alexander Motin	ab866a3d61	Fix ZFS ABI compat shims for `zfs receive` after r289362. Difference appeared much less drammatic then seemed originally.	2015-10-17 07:32:46 +00:00
Alexander Motin	43f774f296	MFV r289310: 4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Approved by: Garrett D'Amore <garrett@damore.org> Author: Matthew Ahrens <mahrens@delphix.com> illumos/illumos-gate@45818ee124 This is only a partial merge of respective ZFS infrastructure changes. At this moment FreeBSD kernel has no those crypto algorithms, so the parts of the code to enable them are commented out. When they are implemented, it will be trivial to plug them in.	2015-10-16 14:45:21 +00:00
Alexander Motin	c70e61feed	MFV r289312: 2605 want to resume interrupted zfs send Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed by: Xin Li <delphij@freebsd.org> Reviewed by: Arne Jansen <sensille@gmx.net> Approved by: Dan McDonald <danmcd@omniti.com> Author: Matthew Ahrens <mahrens@delphix.com> illumos/illumos-gate@9c3fd1216f For more info, see: - slides http://www.slideshare.net/MatthewAhrens/openzfs-send-and-receive - video https://www.youtube.com/watch?v=iY44jPMvxog - manpage changes (for zfs resume -s and zfs send -t) - upcoming talk at the OpenZFS Developer Summit The TL;DR is: Use "zfs receive -s" to save the partially received state on failure. On failure, get the receive token with "zfs get receive_resume_token <fs>" Resume the send with "zfs send -t <token_value>" Relnotes: yes	2015-10-15 08:47:32 +00:00
Alexander Motin	3dbe12b067	MFV r289308: 6267 dn_bonus evicted too early Reviewed by: Richard Yao <ryao@gentoo.org> Reviewed by: Xin LI <delphij@freebsd.org> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Justin T. Gibbs <gibbs@FreeBSD.org> illumos/illumos-gate@d2058105c6	2015-10-14 10:38:05 +00:00
Alexander Motin	fc6f8dee4c	MFV r289306: 6295 metaslab_condense's dbgmsg should include vdev id Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andriy Gapon <avg@freebsd.org> Reviewed by: Xin Li <delphij@freebsd.org> Reviewed by: Justin Gibbs <gibbs@scsiguy.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Joe Stein <joe.stein@delphix.com> illumos/illumos-gate@daec38ecb4	2015-10-14 10:31:50 +00:00
Alexander Motin	422891c28a	MFV r289304: 6293 ztest failure: error == 28 (0xc == 0x1c) in ztest_tx_assign() Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Matthew Ahrens <mahrens@delphix.com> illumos/illumos-gate@8fe00bfb87	2015-10-14 10:28:29 +00:00
Alexander Motin	95e20e65d3	MFV r289298: 6286 ZFS internal error when set large block on bootfs Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com> illumos/illumos-gate@6de9bb5603	2015-10-14 07:50:08 +00:00
Alexander Motin	a4256278bf	MFV r289296: 6288 dmu_buf_will_dirty could be faster Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Justin Gibbs <gibbs@scsiguy.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com> illumos/illumos-gate@0f2e7d03b8	2015-10-14 07:45:44 +00:00
Alexander Motin	0f45d37812	MFV r289294: 5219 l2arc_write_buffers() may write beyond target_sz Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Saso Kiselkov <skiselkov@gmail.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Steven Hartland <steven.hartland@multiplay.co.uk> Reviewed by: Justin Gibbs <gibbs@FreeBSD.org> Approved by: Matthew Ahrens <mahrens@delphix.com> Author: Andriy Gapon <avg@freebsd.org> illumos/illumos-gate@d7d9a6d919	2015-10-14 07:37:02 +00:00
Alexander Motin	2269f420b2	FreeBSD-specific addition to r289191.	2015-10-12 18:15:25 +00:00
Alexander Motin	fea4f2108f	MFV r289188: 6281 prefetching should apply to 1MB reads Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Alexander Motin <mav@freebsd.org> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Justin Gibbs <gibbs@scsiguy.com> Reviewed by: Xin Li <delphij@freebsd.org> Approved by: Gordon Ross <gordon.ross@nexenta.com> Author: George Wilson <george.wilson@delphix.com> illumos/illumos-gate@632802744e	2015-10-12 15:48:45 +00:00
Alexander Motin	558dcd4e42	MFV r289187: 6251 add tunable to disable free_bpobj processing Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Simon Klinkert <simon.klinkert@gmail.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Reviewed by: Albert Lee <trisk@omniti.com> Reviewed by: Xin Li <delphij@freebsd.org> Approved by: Garrett D'Amore <garrett@damore.org> Author: George Wilson <george.wilson@delphix.com> illumos/illumos-gate@139510fb6e	2015-10-12 15:44:44 +00:00
Alexander Motin	72b6ad9bb5	MFV r289185: 6250 zvol_dump_init() can hold txg open Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Albert Lee <trisk@omniti.com> Reviewed by: Xin Li <delphij@freebsd.org> Approved by: Garrett D'Amore <garrett@damore.org> Author: George Wilson <george.wilson@delphix.com> illumos/illumos-gate@b10bba7246	2015-10-12 15:39:03 +00:00
Alexander Motin	ec5a8cf7c0	Restore original array_rd_sz semantics. Before r278702 prefetch was blocked for I/Os > 1MB, after -- >= 1MB. 1MB I/Os are used for bulk operations in CTL (XCOPY, VERIFY), and disabling prefetch for them reduced the performance. This is temporary local patch, that should be replaced when upstreamed. Discussed with: mahrens MFC after: 3 days	2015-10-03 11:05:58 +00:00
Mark Johnston	f7c3db2537	MFV r288408: 6266 harden dtrace_difo_chunksize() with respect to malicious DIF illumos/illumos-gate@395c7a3dcf Reviewed by: Alex Wilson <alex.wilson@joyent.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Garrett D'Amore <garrett@damore.org> Author: Bryan Cantrill <bryan@joyent.com> MFC after: 1 week	2015-09-30 05:24:22 +00:00
Andriy Gapon	9b977fcea2	define aok in libnvpair which is linked to all zfs libraries that need aok This removes the circular dependency of libnvpair on libzfs / libzpool. PR: 199811 Obtained from: bapt MFC after: 23 days	2015-09-28 15:25:36 +00:00
Xin LI	8012d6910c	MFV r288063: make dataset property de-registration operation O(1) A change to a property on a dataset must be propagated to its descendants in case that property is inherited. For datasets whose information is not currently loaded into memory (e.g. a snapshot that isn't currently mounted), there is nothing to do; the property change will take effect the next time that dataset is loaded. To handle updates to datasets that are in-core, ZFS registers a callback entry for each property of each loaded dataset with the dsl directory that holds that dataset. There is a dsl directory associated with each live dataset that references both the live dataset and any snapshots of the live dataset. A property change is effected by doing a traversal of the tree of dsl directories for a pool, starting at the directory sourcing the change, and invoking these callbacks. The current implementation both registers and de-registers properties individually for each loaded dataset. While registration for a property is O(1) (insert into a list), de-registration is O(n) (search list and then remove). The 'n' for de-registration, however, is not limited to the size (number of snapshots + 1) of the dsl directory. The eviction portion of the life cycle for the in core state of datasets is asynchronous, which allows multiple copies of the dataset information to be in-core at once. Only one of these copies is active at any time with the rest going through tear down processing, but all copies contribute to the cost of performing a dsl_prop_unregister(). One way to create multiple, in-flight copies of dataset information is by performing "zfs list" operations from multiple threads concurrently. In-core dataset information is loaded on demand and then evicted when reference counts drops to zero. For datasets that are not mounted, there is no persistent reference count to keep them resident. So, a list operation will load them, compute the information required to do the list operation, and then evict them. When performing this operation from multiple threads it is possible that some of the in-core dataset information will be reused, but also possible to lose the race and load the dataset again, even while the same information is being torn down. Compounding the performance issue further is a change made for illumos issue 5056 which made dataset eviction single threaded. In environments using automation to manage ZFS datasets, it is now possible to create enough of a backlog of dataset evictions to consume excessive amounts of kernel memory and to bog down the system. The fix employed here is to make property de-registration O(1). With this change in place, it is hoped that a single thread is more than sufficient to handle eviction processing. If it isn't, the problem can be solved by increasing the number of threads devoted to the eviction taskq. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dataset.c sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dir.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_prop.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_dataset.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_dir.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_prop.h: Associate dsl property callback records with both the dsl directory and the dsl dataset that is registering the callback. Both connections are protected by the dsl directory's "dd_lock". When linking callbacks into a dsl directory, group them by the property type. This helps reduce the space penalty for the double association (the property name pointer is stored once per dsl_dir instead of in each record) and reduces the number of strcmp() calls required to do callback processing when updating a single property. Property types are stored in a linked list since currently ZFS registers a maximum of 10 property types for each dataset. Note that the property buckets/records associated with a dsl directory are created on demand, but only freed when the dsl directory is freed. Given the static nature of property types and their small number, there is no benefit to freeing the few bytes of memory used to represent the property record earlier. When a property record becomes empty, the dsl directory is either going to become unreferenced a little later in this thread of execution, or there is a high chance that another dataset is going to be loaded that would recreate the bucket anyway. Replace dsl_prop_unregister() with dsl_prop_unregister_all(). All callers of dsl_prop_unregister() are trying to remove all property registrations for a given dsl dataset anyway. By changing the API, we can avoid doing any lookups of callbacks by property type and just traverse the list of all callbacks for the dataset and free each one. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c: Replace use of dsl_prop_unregister() with the new dsl_prop_unregister_all() API. illumos/illumos-gate@03bad06fbb Author: Justin Gibbs <gibbs@scsiguy.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Illumos issue: 6171 dsl_prop_unregister() slows down dataset eviction https://www.illumos.org/issues/6171 MFC after: 2 weeks	2015-09-25 01:05:44 +00:00
Andriy Gapon	e88445a48b	MFV r287817: 6220 memleak in l2arc on debug build `c546f36aa8` https://www.illumos.org/issues/6220 5408 introduced a memleak in l2arc, namely the member b_thawed gets leaked when an arc_hdr is realloced from full to l2only. Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: Simon Klinkert <simon.klinkert@gmail.com> Reviewed by: George Wilson <george@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Arne Jansen <sensille@gmx.net>	2015-09-21 12:23:01 +00:00
Xin LI	b4f6099b9f	MFV r287623: 5997 FRU field not set during pool creation and never updated ZFS already supports storing the vdev FRU in a vdev property. There is code in libzfs to work with this property, and there is code in the zfs-retire FMA module that looks for that information. But there is no code actually setting or updating the FRU. To address this, ZFS is changed to send a handful of new events whenever a vdev is added, attached, cleared, or onlined, as well as when a pool is created or imported. Note that syseventd is not currently available on FreeBSD and thus some work is needed to actually support the new ZFS events (e.g. in zfsd) to actually use this capability, this changeset is mostly a diff reduction from upstream. illumos/illumos-gate@1437283407 Illumos issues: 5997 FRU field not set during pool creation and never updated https://www.illumos.org/issues/5997	2015-09-13 07:15:14 +00:00
Xin LI	011ecb128f	Note r286552 as merged and reduce diff against upstream.	2015-09-13 06:49:42 +00:00
Xin LI	653809335f	MFV r287699: 6214 zpools going south In r286570 (MFV of r277426) an unprotected write to b_flags to set the compression mode was introduced. This would open a race window where data is partially decompressed, modified, checksummed and written to the pool, resulting in pool corruption due to the partial decompression. Prevent this by reintroducing b_compress illumos/illumos-gate@d4cd038c92 Illumos issues: 6214 zpools going south https://www.illumos.org/issues/6214	2015-09-12 09:56:23 +00:00
Xin LI	3e691a57db	MFV r287684: 6091 avl_add doesn't assert on non-debug builds Use assfail() from libuutil instead of ASSERT() in userland AVL avl_add. illumos/illumos-gate@faa2b6be2f Illumos issues: 6091 avl_add doesn't assert on non-debug builds https://www.illumos.org/issues/6091	2015-09-12 08:50:43 +00:00
Xin LI	8c4f41ff34	MFV r287624: 5987 zfs prefetch code needs work Rewrite the ZFS prefetch code to detect only forward, sequential streams. The following kstats have been added: kstat.zfs.misc.arcstats.sync_wait_for_async How many sync reads have waited for async read to complete. (less is better) kstat.zfs.misc.arcstats.demand_hit_predictive_prefetch How many demand read didn't have to wait for I/O because of predictive prefetch. (more is better) zfetch kstats have been similified to hits, misses, and max_streams, with max_streams representing times when we were not able to create new stream because we already have the maximum number of sequences for a file. The sysctl variable/loader tunable vfs.zfs.zfetch.block_cap have been replaced by vfs.zfs.zfetch.max_distance, which controls maximum bytes to prefetch per stream. illumos/illumos-gate@cf6106c8a0 Illumos ZFS issues: 5987 zfs prefetch code needs work https://www.illumos.org/issues/5987	2015-09-12 08:35:51 +00:00
Mark Johnston	e8baaa998c	MFV r283513: 5930 fasttrap_pid_enable() panics when prfind() fails in forking process Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Gordon Ross <gordon.ross@nexenta.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Bryan Cantrill <bryan@joyent.com> illumos/illumos-gate@9df7e4e12e	2015-09-11 03:06:34 +00:00
Mark Johnston	2275da185c	MFV r283512: 3599 dtrace_dynvar tail calls can blow stack Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Gordon Ross <gordon.ross@nexenta.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Bryan Cantrill <bryan@joyent.com> illumos/illumos-gate@d47448f09a	2015-09-11 03:04:24 +00:00
Xin LI	28ffe927c2	Expose an interface to determine if an ACE is inherited. Submitted by: sef Reviewed by: trasz MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D3540	2015-09-04 00:14:20 +00:00
Allan Jude	de84a5132c	Apply the noline attribute to vdev_queue_max_async_writes This makes it possible to analyze the performance of the new ZFS write throttle with dtrace PR: 200316 Submitted by: Lacey Powers <lacey.leanne@gmail.com> Reviewed by: avg, smh, delphij (no objection) Approved by: bapt (mentor) MFC after: 1 month Sponsored by: ScaleEngine Inc. Differential Revision: https://reviews.freebsd.org/D3472	2015-08-31 23:10:42 +00:00
Xin LI	9053fe148b	Fix a buffer overrun which may lead to data corruption, introduced in r286951 by reinstating changes in r274628. In l2arc_compress_buf(), we allocate a buffer to stash away the compressed data in 'cdata', allocated of l2hdr->b_asize bytes. We then ask zio_compress_data() to compress the buffer, b_l1hdr.b_tmp_cdata, which is of l2hdr->b_asize bytes, and have the compressed size (or original size, if compress didn't gain enough) stored in csize. To pad the buffer to fit the optimal write size, we round up the compressed size to L2 device's vdev_ashift. Illumos code rounds up the size by at most SPA_MINBLOCKSIZE. Because we know csize <= b_asize, and b_asize is integer multiple of SPA_MINBLOCKSIZE, we are guaranteed that the rounded up csize would be <= b_asize. However, this is not necessarily true when we round up to 1 << vdev_ashift, because it could be larger than SPA_MINBLOCKSIZE. So, in the worst case scenario, we are overwriting at most (1 << vdev_ashift - SPA_MINBLOCKSIZE) bytes of memory next to the compressed data buffer. Andriy's original change in r274628 reorganized the code a little bit, by moving the padding to after we determined that the compression was beneficial. At which point, we would check rounded size against the allocated buffer size, and the buffer overrun would not be possible.	2015-08-29 09:22:32 +00:00
Xin LI	253d699d3c	In r286705 (Illumos 5960/a2cdcdd), a separate thread is created with curproc as parent. In the case of a send or receive, the curproc would be the userland application that issues the ioctl. This would trigger an assertion failure introduced in Solaris compatibility shims in r196458 when kernel is compiled with INVARIANTS. Fix this by using p0 (proc0 or kernel) as the parent thread when creating the kernel threads.	2015-08-29 08:16:57 +00:00
Andriy Gapon	9f2d1b28df	MFV (partial) r286889: 5692 expose the number of hole blocks in a file FreeBSD porting notes: - only kernel-side changes are merged - the new ioctl is not actually implemented yet - thus, the goal is to synchronize DMU code illumos/illumos-gate@2bcf0248e9 https://www.illumos.org/issues/5692 we would like to expose the number of hole (sparse) blocks in a file. this can be useful to for example if you want to fill in the holes with some data; knowing the number of holes in advances allows you to report progress on hole filling. We could use SEEK_HOLE to do that but it would be O(n) where n is the number of holes present in the file. Author: Max Grossman <max.grossman@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Boris Protopopov <bprotopopov@hotmail.com> Approved by: Richard Lowe <richlowe@richlowe.net>	2015-08-24 09:48:50 +00:00
Andriy Gapon	21fc429242	spa_import_rootpool: prevent lock and resource leak The lock leak could lead to a deadlock later. PR: 198563 Submitted by: Fabian Keil <fk@fabiankeil.de> MFC after: 1 week	2015-08-24 08:44:44 +00:00
Andriy Gapon	082fcc9ed2	account for ashift when gathering buffers to be written to l2arc device The change that introduced the L2ARC compression support also introduced a bug where the on-disk size of the selected buffers could end up larger than the target size if the ashift is greater than 9. This was because the buffer selection could did not take into account the fact that on-disk size could be larger than the in-memory buffer size due to the alignment requirements. At the moment b_asize is a misnomer as it does not always represent the allocated size: if a buffer is compressed, then the compressed size is properly rounded (on FreeBSD), but if the compression fails or it is not applied, then the original size is kept and it could be smaller than what ashift requires. For the same reasons arcstat_l2_asize and the reported used space on the cache device could be smaller than the actual allocated size if ashift > 9. That problem is not fixed by this change. This change only ensures that l2ad_hand is not advanced by more than target_sz. Otherwise we would overwrite active (unevicted) L2ARC buffers. That problem is manifested as growing l2_cksum_bad and l2_io_error counters. This change also changes 'p' prefix to 'a' prefix in a few places where variables represent allocated rather than physical size. The resolved problem could also result in the reported allocated size being greater than the cache device's capacity, because of the overwritten buffers (more than one buffer claiming the same disk space). This change is already in ZFS-on-Linux: zfsonlinux/zfs@ef56b0780c PR: 198242 PR: 195746 (possibly related) Reviewed by: mahrens (https://reviews.csiden.org/r/229/) Tested by: gkontos@aicom.gr (most recently) MFC after: 15 days X-MFC note: patch does not apply as is at the moment Relnotes: yes Sponsored by: ClusterHQ Differential Revision: https://reviews.freebsd.org/D2764 Reviewed by: noone (@FreeBSD.org)	2015-08-24 08:10:52 +00:00
Andriy Gapon	243f5e3085	try to fix lor between z_teardown_lock and spa_namespace_lock The lock order reversal and a resulting deadlock were introduced in r285021 / D2865. The problem is that zfs_register_callbacks() calls dsl_prop_get_integer() that has to acquire spa_namespace_lock. At the same time, spa_config_sync() is called with spa_namespace_lock held and then it performs ZFS vnode operations that acquire z_teardown_lock in the reader mode. So, fix the problem by using dsl_prop_get_int_ds() instead of dsl_prop_get_integer(). The former does not need to look up the pool and the dataset by name. Reported by: many Reviewed by: delphij Tested by: delphij, Jens Schweikhardt <schweikh@schweikhardt.net> MFC after: 5 days X-MFC with: r285021	2015-08-21 08:17:44 +00:00
Andriy Gapon	b985dac5ff	fix a mismerge in r286539 (MFV 286538: 5562 ZFS sa_handle's violate...) PR: 202358 X-MFC with: r286539 X-MFC attn: mav	2015-08-21 08:04:56 +00:00
Alexander Motin	602015fd15	Restore part of r274628, reverted at r286776. Submitted by: avg	2015-08-20 07:41:33 +00:00
Alexander Motin	d8928f479b	Remove some random accumulated diff from Illumos. Submitted by: avg (partially)	2015-08-14 13:43:12 +00:00
Alexander Motin	6cb8dbf791	2618 arc.c mistypes in the comments Reviewed by: Jason King <jason.brian.king@gmail.com> Reviewed by: Josef Sipek <jeffpc@josefsipek.net> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Bart Coddens <bart.coddens@gmail.com> illumos/illumos-gate@fc98fea58e	2015-08-14 13:10:30 +00:00
Alexander Motin	997d864ce0	Fix r286766 build with debug.	2015-08-14 11:47:53 +00:00
Alexander Motin	ab4d08c3d3	Fix minor mismerge sometimes earlier.	2015-08-14 09:48:23 +00:00
Alexander Motin	5ba12a280a	MFV r286765: 5817 change type of arcs_size from uint64_t to refcount_t Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Approved by: Garrett D'Amore <garrett@damore.org> Author: Prakash Surya <prakash.surya@delphix.com> illumos/illumos-gate@2fd872a734 As a way to make it more difficult to introduce bugs into the ARC, and to make it easier to diagnose issues when bugs do creep in, it would be beneficial to change the type of the arc_state_t's arcs_size field to be a refcount_t instead of a uint64_t. This would allow us to make stricter checks when incrementing and decrementing the value with debugging enabled, but still fallback to simple, fast atomic operations when debugging is disabled.	2015-08-14 09:39:23 +00:00
Alexander Motin	ab4930d98c	MFV r285025: 6033 arc_adjust() should search MFU lists for oldest buffer when adjusting MFU size. illumos/illumos-gate@31c46cf23c https://www.illumos.org/issues/6033 When we're looking for the list containing oldest buffer we never actually look at the MFU lists even when we try to evict from MFU. looks like a copy paste error, the fix is here: Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com> Reviewed by: Xin Li <delphij@delphij.net> Reviewed by: Prakash Surya <me@prakashsurya.com> Approved by: Matthew Ahrens <mahrens@delphix.com> Author: Alek Pinchuk <alek@nexenta.com> Obtained from: illumos	2015-08-14 09:33:46 +00:00
Alexander Motin	e0360e14d2	MFV r277431: 5497 lock contention on arcs_mtx Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Richard Elling <richard.elling@richardelling.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Prakash Surya <prakash.surya@delphix.com> illumos/illumos-gate@244781f10d This patch attempts to reduce lock contention on the current arc_state_t mutexes. These mutexes are used liberally to protect the number of LRU lists within the ARC (e.g. ARC_mru, ARC_mfu, etc). The granularity at which these locks are acquired has been shown to greatly affect the performance of highly concurrent, cached workloads.	2015-08-14 09:31:07 +00:00
Alexander Motin	267b62ec43	Revert part of r205231, introducing multiple ARC state locks. This local implementation will be replaced by one from Illumos to reduce code divergence and make further merges easier.	2015-08-14 09:25:54 +00:00
Alexander Motin	49114ce463	MFV 286711: 6096 ZFS_SMB_ACL_RENAME needs to cleanup better Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Gordon Ross <gordon.w.ross@gmail.com> Reviewed by: George Wilson <gwilson@zfsmail.com> Approved by: Robert Mustacchi <rm@joyent.com> illumos/illumos-gate@8f5190a540	2015-08-13 00:13:55 +00:00
Alexander Motin	3b1f51e911	MFV 286709: 6093 zfsctl_shares_lookup should only VN_RELE() on zfs_zget() success Reviewed by: Gordon Ross <gwr@nexenta.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Dan McDonald <danmcd@omniti.com> illumos/illumos-gate@0f92170f1e	2015-08-13 00:10:36 +00:00
Alexander Motin	0d0def87fe	MFV 286707: 5959 clean up per-dataset feature count code Reviewed by: Toomas Soome <tsoome@me.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Matthew Ahrens <mahrens@delphix.com> illumos/illumos-gate@ca0cc3918a A ZFS feature flags (large blocks) tracks its refcounts as the number of datasets that have ever used the feature. Several features of this type are planned to be added (new checksum functions). This code should be made common infrastructure rather than duplicating the code for each feature.	2015-08-12 23:59:17 +00:00
Alexander Motin	b696497df0	MFV r286704: 5960 zfs recv should prefetch indirect blocks 5925 zfs receive -o origin= Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Author: Paul Dagnelie <pcd@delphix.com> While running 'zfs recv' we noticed that every 128th 8K block required a read. We were seeing that restore_write() was calling dmu_tx_hold_write() and the indirect block was not cached. We should prefetch upcoming indirect blocks to avoid having to go to disk and blocking the restore_write(). Allow an incremental send stream to be received as a clone, even if the stream does not mark it as a clone.	2015-08-12 22:41:06 +00:00
Alexander Motin	d0687a01d7	MFV r284763: 5981 Deadlock in dmu_objset_find_dp illumos/illumos-gate@1d3f896f54 https://www.illumos.org/issues/5981 When dmu_objset_find_dp gets called with a read lock held, it fans out the work to the task queue. Each task in turn acquires its own read lock before calling the callback. If during this process anyone tries to a acquire a write lock, it will stall all read lock requests.Thus the tasks will never finish, the read lock of the caller will never get freed and the write lock never acquired. deadlock. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Arne Jansen <jansen@webgods.de>	2015-08-12 19:10:29 +00:00
Alexander Motin	101a6d4eac	MFV r284762: 5269 zpool import slow illumos/illumos-gate@12380e1e70 https://www.illumos.org/issues/5269 When importing a pool (at boot or with zpool import) with many filesystem, the process can take minutes. It doesn't matter whether the pool has been exported cleanly or uncleanly. The problem is that each dataset has its own log chain. On import, all datasets have to be checked if there are logs to replay. The idea is to speed up this process by paralellizing it. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Arne Jansen <jansen@webgods.de>	2015-08-12 18:47:30 +00:00
Alexander Motin	ebf527de10	MFV r286682: 5765 add support for estimating send stream size with lzc_send_space when source is a bookmark Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com> Approved by: Albert Lee <trisk@nexenta.com> Author: Max Grossman <max.grossman@delphix.com> illumos/illumos-gate@643da460c8	2015-08-12 18:23:08 +00:00
Alexander Motin	2d41b1006f	MFV r286224: 5695 dmu_sync'ed holes do not retain birth time illumos/illumos-gate@70163ac57e https://www.illumos.org/issues/5695 In dmu_sync_ready(), a hole block pointer will have it's logical size explicitly set as it's necessary for replay purposes. To "undo" this, dmu_sync_done() will zero out any hole that it finds. This becomes a problem when using the "hole_birth" feature, as this will also wipe out any birth time that might have happened to be set on the hole. ... As a fix, the logic to zero out a hole is only applied to old style holes with a birth time of zero. Holes created with the "hole_birth" feature enabled will have a non-zero birth time, and will be skipped (thus preserving the ltime, type, and level information as well). In addition, zdb was updated to also print the ltime, type, and level information for these new style holes. Previously, only the logical birth time would be printed. Author: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com>	2015-08-12 17:21:41 +00:00
Alexander Motin	ef00c83db1	Fix set of sign extension bugs in r286625.	2015-08-12 08:36:58 +00:00
Alexander Motin	b3fc966389	Fix assertion panic caused by combination of r286598 and TRIM.	2015-08-11 19:15:55 +00:00
Alexander Motin	3caed89878	Fix r286625 build on i386.	2015-08-11 12:38:01 +00:00
Alexander Motin	a3b3a9752c	Fix minor mismerge in r286574.	2015-08-11 12:22:16 +00:00
Alexander Motin	c350858a50	MFV r277425: 5376 arc_kmem_reap_now() should not result in clearing arc_no_grow Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Reviewed by: Richard Elling <richard.elling@richardelling.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Matthew Ahrens <mahrens@delphix.com> illumos/illumos-gate@2ec99e3e98	2015-08-11 10:39:19 +00:00
Alexander Motin	6be7d38913	Remove extra lock, that IMO only creates potential problems now.	2015-08-11 09:18:51 +00:00
Alexander Motin	1af86496cb	MFV 286604: 5812 assertion failed in zrl_tryenter(): zr_owner==NULL Reviewed by: George Wilson <george@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Will Andrews <will@freebsd.org> Approved by: Gordon Ross <gwr@nexenta.com> Author: Matthew Ahrens <mahrens@delphix.com> illumos/illumos-gate@8df173054c	2015-08-10 21:36:51 +00:00
Alexander Motin	799f47828d	MFV 286602: 5810 zdb should print details of bpobj Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Will Andrews <will@freebsd.org> Reviewed by: Simon Klinkert <simon.klinkert@gmail.com> Approved by: Gordon Ross <gwr@nexenta.com> Author: Matthew Ahrens <mahrens@delphix.com> illumos/illumos-gate@732885fca0	2015-08-10 21:32:40 +00:00
Alexander Motin	c70c15ffa9	MFV 286599: 5808 spa_check_logs is not necessary on readonly pools Reviewed by: George Wilson <george@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Simon Klinkert <simon.klinkert@gmail.com> Reviewed by: Will Andrews <will@freebsd.org> Approved by: Gordon Ross <gwr@nexenta.com> Author: Matthew Ahrens <mahrens@delphix.com> illumos/illumos-gate@23367a2f2c	2015-08-10 21:19:42 +00:00
Alexander Motin	f7bf11ab59	MFV 286597: 5701 zpool list reports incorrect "alloc" value for cache devices Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Prakash Surya <prakash.surya@delphix.com> illumos/illumos-gate@a52fc310ba	2015-08-10 21:13:59 +00:00
Alexander Motin	2e92f38b63	Local addition and mismerge fix for r286579.	2015-08-10 20:34:46 +00:00
Alexander Motin	de8b7ceff1	MFV 286588: 5820 verify failed in zio_done(): BP_EQUAL(bp, io_bp_orig) Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Approved by: Garrett D'Amore <garrett@damore.org> Author: Matthew Ahrens <mahrens@delphix.com> illumod/illumos-gate@34e8acef00	2015-08-10 19:38:07 +00:00
Alexander Motin	57f7c5acf5	MFV 286586: 5746 more checksumming in zfs send Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com> Approved by: Albert Lee <trisk@omniti.com> Author: Matthew Ahrens <mahrens@delphix.com> illumos/illumos-gate@98110f08fa	2015-08-10 19:32:58 +00:00
Alexander Motin	9b4b955150	MFV r277430: 5313 Allow I/Os to be aggregated across ZIO priority classes Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Will Andrews <willa@SpectraLogic.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Justin T. Gibbs <justing@spectralogic.com> illumos/illumos-gate@fe319232d2	2015-08-10 12:39:10 +00:00
Alexander Motin	0702ce1a52	Fix r286570 build with debug.	2015-08-10 11:52:54 +00:00
Alexander Motin	83a6947e11	MFV r277428: 5056 ZFS deadlock on db_mtx and dn_holds Reviewed by: Will Andrews <willa@spectralogic.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Justin Gibbs <justing@spectralogic.com> illumos/illumos-gate@bc9014e6a8	2015-08-10 11:30:07 +00:00
Alexander Motin	f13e9e1470	MFV r277427: 5445 Add more visibility via arcstats; specifically arc_state_t stats and differentiate between "data" and "metadata" Reviewed by: Basil Crow <basil.crow@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Bayard Bell <bayard.bell@nexenta.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Prakash Surya <prakash.surya@delphix.com> illumos/illumos-gate@4076b1bf41	2015-08-10 10:59:58 +00:00
Alexander Motin	c908dc6f4b	MFV r277426: 5408 managing ZFS cache devices requires lots of RAM Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Don Brady <dev.fs.zfs@gmail.com> Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Approved by: Garrett D'Amore <garrett@damore.org> Author: Chris Williamson <Chris.Williamson@delphix.com> illumos/illumos-gate@89c86e3229 Currently, every buffer cached in the L2ARC is accompanied by a 240-byte header in memory, leading to very high memory consumption when using very large cache devices. These changes significantly reduce this overhead. Currently: L1-only header = 176 bytes L1 + L2 or L2-only header = 176 bytes + 32 byte checksum + 32 byte l2hdr = 240 bytes Memory-optimized: L1-only header = 176 bytes L1 + L2 header = 176 bytes + 32 byte checksum = 208 bytes L2-only header = 96 bytes + 32 byte checksum = 128 bytes So overall: Trunk Optimized +-----------------+ L1-only \| 176 B \| 176 B \| (same) +-----------------+ L1 & L2 \| 240 B \| 208 B \| (saved 32 bytes) +-----------------+ L2-only \| 240 B \| 128 B \| (saved 116 bytes) +-----------------+ For an average blocksize of 8KB, this means that for the L2ARC, the ratio of metadata to data has gone down from about 2.92% to 1.56%. For a 'storage optimized' EC2 instance with 1600GB of SSD and 60GB of RAM, this means that we expect a completely full L2ARC to use (1600 GB * 0.0156) / 60GB = 41% of the available memory, down from 78%.	2015-08-10 10:34:23 +00:00
Alexander Motin	1838fc920f	MFV 286555: Avoid 128K kmem allocations in mzap_upgrade() Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Steven Hartland <killing@multiplay.co.uk> Approved by: Rich Lowe <richlowe@richlowe.net> illumos/illumos-gate@be3e2ab906	2015-08-09 20:41:44 +00:00
Alexander Motin	475e2caa3d	MFV 286553: 5769 Cast 'zfs bad bloc' to ULL for x86 Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Richard PALO <richard@NetBSD.org> Approved by: Dan McDonald <danmcd@omniti.com> illumos/illumos-gate@8c76e0763b	2015-08-09 20:32:30 +00:00
Alexander Motin	a5a2e782be	MFV 286550: 5694 traverse_prefetcher does not prefetch enough Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com> Approved by: Garrett D'Amore <garrett@damore.org> Author: George Wilson <george.wilson@delphix.com> illumos/illumos-gate@34d7ce052c	2015-08-09 20:23:35 +00:00
Alexander Motin	927211755b	MFV 286548: 5693 ztest fails in dbuf_verify: buf[i] == 0, due to dedup and bp_override Reviewed by: George Wilson <george@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Matthew Ahrens <mahrens@delphix.com> illumos/illumos-gate@7f7ace3700	2015-08-09 20:08:38 +00:00
Alexander Motin	4ff9527edc	MFV 286546: 5661 ZFS: "compression = on" should use lz4 if feature is enabled Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Reviewed by: Xin LI <delphij@freebsd.org> Approved by: Robert Mustacchi <rm@joyent.com> Author: Justin T. Gibbs <justing@spectralogic.com> illumos/illumos-gate@db1741f555	2015-08-09 20:02:16 +00:00
Alexander Motin	9834135947	MFV 286544: 5630 stale bonus buffer in recycled dnode_t leads to data corruption Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Will Andrews <will@freebsd.org> Approved by: Robert Mustacchi <rm@joyent.com> Author: Justin T. Gibbs <justing@spectralogic.com>	2015-08-09 19:35:39 +00:00
Alexander Motin	07470009be	MFV 286542: 5592 NULL pointer dereference in dsl_prop_notify_all_cb() Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Will Andrews <will@freebsd.org> Approved by: Robert Mustacchi <rm@joyent.com> illumos/illumos-gate@9d47dec048	2015-08-09 19:29:10 +00:00
Alexander Motin	1d367fea94	MFV 286540: 5531 NULL pointer dereference in dsl_prop_get_ds() Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Justin T. Gibbs <justing@spectralogic.com> illumos/illumos-gate@e57a022b8f	2015-08-09 19:26:21 +00:00
Alexander Motin	c44a37d0b7	MFV 286538: 5562 ZFS sa_handle's violate kmem invariants, debug kernels panic on boot Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Robert Mustacchi <rm@fingolfin.org> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Rich Lowe <richlowe@richlowe.net> Approved by: Dan McDonald <danmcd@omniti.com> Author: Justin T. Gibbs <justing@spectralogic.com> illumos/illumos-gate@0fda3cc5c1	2015-08-09 19:17:59 +00:00
Steven Hartland	ebbc56ecd6	Fix KSTACK_PAGES check in ZFS module The check introduced by r285946 failed to add the dependency on opt_kstack_pages.h which meant the default value for the platform instead of the customised options KSTACK_PAGES=X was being tested. Also wrap in #ifdef __FreeBSD__ for portability. MFC after: 3 days Sponsored by: Multiplay	2015-08-03 09:34:09 +00:00
Mark Johnston	48fcd357c4	Avoid dereferencing curthread->td_proc->p_cred in DTrace probe context. When a process is exiting, there is a narrow window where p_cred may be NULL while its threads are still executing. Specifically, the last thread to exit a process sets the process state to PRS_ZOMBIE with the proc spinlock held and then calls thread_exit(). thread_exit() drops the spin lock, permitting the process to be reaped and thus causing its cred struct to be released. However, the exiting thread may still cause DTrace probes to fire by calling sched_throw(), resulting in a double fault if such a probe enabling attempts to access the GID or UID DIF variables. The thread's cred reference is not susceptible to this race since it is not released until after the thread has exited. MFC after: 1 week	2015-08-02 00:11:56 +00:00
Steven Hartland	1635369e99	Add warning about low KSTACK_PAGES for ZFS use As ZFS requires a more kernel stack pages than is the default on some architectures e.g. i386, warn if KSTACK_PAGES is less than ZFS_MIN_KSTACK_PAGES (which is 4 at the time of writing). MFC after: 3 days Sponsored by: Multiplay	2015-07-28 11:19:38 +00:00
Mateusz Guzik	752fc07d33	vfs: implement v_holdcnt/v_usecount manipulation using atomic ops Transitions 0->1 and 1->0 (which decide e.g. on putting the vnode on the free list) of either counter are still guarded with vnode interlock. Reviewed by: kib (earlier version) Tested by: pho	2015-07-16 13:57:05 +00:00
Mariusz Zaborski	306a82f8f4	Rename zfs nvpair files to not colidate with our nvlist. PR: 201356 Approved by: pjd (mentor)	2015-07-09 21:53:40 +00:00
Andriy Gapon	74f75cb1bd	zfs_mount(MS_REMOUNT): protect zfs_(un)register_callbacks calls We now take z_teardown_lock as a writer to ensure that there is no I/O while the filesystem state is in a flux. Also, zfs_suspend_fs() -> zfsvfs_teardown() call zfs_unregister_callbacks() and zfs_resume_fs() -> zfsvfs_setup() call zfs_unregister_callbacks(). Previously there was no synchronization between those calls and the calls in the re-mounting case. That could lead to concurrent execution and a crash. PR: 180060 Differential Revision: https://reviews.freebsd.org/D2865 Suggested by: mahrens Reviewed by: delphij, pho, mahrens, will MFC after: 13 days Sponsored by: ClusterHQ	2015-07-02 08:32:02 +00:00
Ruslan Bukin	b78ee15e9f	First cut of DTrace for AArch64. Reviewed by: andrew, emaste Sponsored by: ARM Limited Differential Revision: https://reviews.freebsd.org/D2738	2015-07-01 15:51:11 +00:00
Andriy Gapon	bc97daa07e	MFV r284412: 5911 ZFS "hangs" while deleting file Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com> Reviewed by: Alek Pinchuk <alek@nexenta.com> Reviewed by: Simon Klinkert <simon.klinkert@gmail.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Matthew Ahrens <mahrens@delphix.com> illumos/illumos-gate@46e1baa6cf https://www.illumos.org/issues/5911 Sometimes ZFS appears to hang while deleting a file. It is actually making slow progress at the file deletion, but other operations (administrative and writes via the data path) "hang" until the file removal completes, which can take a long time if the file has many blocks. The deletion (or most of it) happens in a single txg, and the sync thread spends most of its time reading indirect blocks via this stack trace: swtch+0x141() cv_wait+0x70() zio_wait+0x5b() dbuf_read+0x2c0() free_children+0x50() free_children+0x12a() free_children+0x12a() free_children+0x12a() dnode_sync_free_range_impl+0xdf() dnode_sync_free_range+0x52() range_tree_vacate+0x65() dnode_sync+0x1d8() dmu_objset_sync_dnodes+0x77() dmu_objset_sync+0x19f() dsl_dataset_sync+0x51() dsl_pool_sync+0x9a() spa_sync+0x2ff() txg_sync_thread+0x21f() thread_start+8() One way to reproduce the problem is if we are over the arc_meta_limit, e.g. because lots of indirect blocks are pinned because we have L0 dbufs under them. It could be that most of the L1 indirects are cached, in which case when dmu_free_long_range_impl() calls dmu_tx_hold_free(), it will complete very quickly. This allows dmu_free_long_range_impl() to put many (perhaps all of its) transactions in the same TXG. However, dmu_free_long_range_impl() calls dnode_evict_dbufs (and dnode_free_range()), which removes the L0 dbufs, thus reducing the hold count on the L1 indirect blocks above it, allowing them to be evicted. Because we are over the arc_meta_limit(), these L1 blocks will be evicted ASAP. Thus when we get to syncing context, the L1 indirects are no longer cached and must be read in. Obtained from: illumos MFC after: 15 days	2015-06-19 06:58:05 +00:00
Andriy Gapon	ab50c99d40	illums compat: use flsl/flsll for highbit/highbit64 Do that only when when fast inline versions are available. At the moment that can be the case only in the kernel and not for all platforms. The original code uses the binary search and that's kept as a fallback. This is a micro optimization. Differential Revision: https://reviews.freebsd.org/D2839 Reviewed by: delphij, mahrens, mav MFC after: 17 days	2015-06-19 06:41:53 +00:00
Gleb Smirnoff	093ebe1d28	o Un-inline vm_pager_get_pages(), vm_pager_get_pages_async(). o Provide an extensive set of assertions for input array of pages. o Remove now duplicate assertions from different pagers. Sponsored by: Nginx, Inc. Sponsored by: Netflix	2015-06-17 22:44:27 +00:00
Andriy Gapon	783379a942	Revert r284511 because it caused build failures on many platforms The problem is that when inline versions of flsl and flsll are not available, then libkern.h must be included for their declarations in kernel sources. The fix would be trivial, but I would like to figure out first if it even makes sense to use the libkern provided implementations. Reported by: bz Pointyhat to: avg	2015-06-17 17:16:06 +00:00
Andriy Gapon	6470c31911	l2arc: pass correct size to trim requests b_size is a logical size of a buffer in memory, b_asize is its physical size that accounts for possible compression. Currently the latter is the best approximation for the allocated, on-disk size. L2ARC TRIM support was committed a few weeks before L2ARC compression was imported, so originally the code was correct, because b_size was the size. Further thoughts. Given that the cache device is being overwritten in a circular fashion it is not clear if a TRIM per each evicted L2ARC buffer has any benefits. Maybe it would be sufficient to issue a single trim request for the whole device when it is loaded, e.g. after a bootup, or when it is unloaded, e.g. before a shutdown. At least as long as L2ARC is not persistent across reboots. Discussed with: smh MFC after: 19 says	2015-06-17 12:28:13 +00:00
Andriy Gapon	1fa1d4a651	illumos compat: use flsl/flsll for highbit/highbit64 This is a micro optimization. The upstream code uses the binary search. Differential Revision: https://reviews.freebsd.org/D2839 Reviewed by: delphij, mav MFC after: 15 days	2015-06-17 12:05:04 +00:00
Andriy Gapon	bab89d0897	MFV r284036: 5961 Fix stack overflow in zfs_create_fs illumos/illumos-gate@c701fde691 Author: glebius MFC after: 11 days	2015-06-12 11:10:49 +00:00
Andriy Gapon	ff7e06fbf4	MFV r284030: 5818 zfs {ref}compressratio is incorrect with 4k sector size illumos/illumos-gate@81cd5c555f Author: Matthew Ahrens <mahrens@delphix.com> MFC after: 17 days	2015-06-12 10:57:05 +00:00
Andriy Gapon	8e9f0d5803	MFV r283534: 5515 dataset user hold doesn't reject empty tags illumos/illumos-gate@752fd8dabc Author: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> MFC after: 10 days	2015-06-12 10:52:53 +00:00
Andriy Gapon	dde4126314	MFV r284040: check that datasets are snapshots 5946 zfs_ioc_space_snaps must check that firstsnap and lastsnap refer to snapshots 5945 zfs_ioc_send_space must ensure that fromsnap refers to a snapshot Reviewed by: Steven Hartland <killing@multiplay.co.uk> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Gordon Ross <gordon.ross@nexenta.com> illumos/illumos-gate@24218bebb4 Note that the upstream commit is modified during MFV: in the upstream the check is done by inspecting ds_is_snapshot field while in FreeBSD we call dsl_dataset_is_snapshot(). This is because illumos/illumos-gate@bc9014e6a8 (r277428 in vendor-sys/illumos) is not MFV-ed yet. MFC after: 10 days	2015-06-12 10:41:24 +00:00
Konstantin Belousov	63261dad32	Add missed {}. Noted by: Morten Rodal <morten@rodal.no> MFC after: 2 weeks	2015-05-27 19:28:14 +00:00
Konstantin Belousov	780dca1b1e	Right now, dounmount() is called with unreferenced mount point. Nothing stops a parallel unmount to suceed before the given call to dounmount() checks and locks the covered vnode. Prevent dounmount() from acting on the freed (although type-stable) memory by changing the interface to require the mount point to be referenced. dounmount() consumes the reference on return, regardless of the sucessfull or erronous result. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-05-27 09:22:50 +00:00
Andriy Gapon	4b040d9513	zfs: fixes for a full stream received into an existing dataset - this should fail early unless the force flag is set - if the force flag is set then any local modifications including snapshots should be undone See: https://www.illumos.org/issues/5912 See: https://reviews.csiden.org/r/220/ Reviewed by: mahrens, Paul Dagnelie <pcd@delphix.com> MFC after: 15 days Sponsored by: ClusterHQ	2015-05-25 11:56:57 +00:00
Andriy Gapon	e80d8b4b7c	dsl_dataset_promote_check: ensure that shared snaps do not become too long ... after they are transfered from the old origin to the new one. See: https://www.illumos.org/issues/5909 See: https://reviews.csiden.org/r/219/ Reviewed by: mahrens MFC after: 10 days Sponsored by: ClusterHQ	2015-05-25 11:48:15 +00:00
Konstantin Belousov	e61d4e626e	Remove excess Giant acquisition around the dounmount() call. Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-05-25 09:08:19 +00:00
Jung-uk Kim	fd90e2ed54	CALLOUT_MPSAFE has lost its meaning since r141428, i.e., for more than ten years for head. However, it is continuously misused as the mpsafe argument for callout_init(9). Deprecate the flag and clean up callout_init() calls to make them more consistent. Differential Revision: https://reviews.freebsd.org/D2613 Reviewed by: jhb MFC after: 2 weeks	2015-05-22 17:05:21 +00:00
Steven Hartland	c017a87e08	Add copyright info missing from r282205 Add the copyright info missing from ZoL origin version. MFC after: 2 days Sponsored by: Multiplay	2015-05-14 08:13:01 +00:00
Andriy Gapon	defce67748	zfs ioctls: use fget_write / fget_read instead of getf wrapper for fget This allows to ensure that we do not write to a file that was opened for reading only or vice versa. Also, use the correct capability in in zfs_ioc_send_new(). Differential Revision: https://reviews.freebsd.org/D2382 Reviewed by: delphij MFC after: 17 days Sponsored by: ClusterHQ	2015-05-11 10:07:31 +00:00
Justin Hibbits	45bf6d59de	Fix a couple bugs in 64-bit powerpc fasttrap argument retrieval. Found by code inspection.	2015-05-10 04:33:01 +00:00
Andriy Gapon	96b60db0d7	MFV r282630: 5809 Blowaway full receive in v1 pool causes kernel panic MFC after: 5 days	2015-05-08 14:03:14 +00:00
Andriy Gapon	24dd1a8242	zfs: do not hold an extra reference on a root vnode while a filesystem is mounted At present zfs_domount() acquires a reference on the filesystem's root vnode and that reference is kept until zfs_umount. The latter calls vflush(rootrefs = 1) to dispose of the extra reference. There is no explanation of why that reference is kept - what problem it solves or what behavior it improves. Also, that logic is FreeBSD specific. There is one real problem with that reference, though. zfs recv -F may receive a full, non-incremental stream to a mounted filesystem. In that case the received root object is likely to have a different z_gen attribute value. Because of that, zfs_rezget will leave the previous root znode and vnode disassociated from the actual object (z_sa_hdl == NULL). Thus, future calls to VFS_ROOT() -> zfs_root() will produce a new vnode-znode pair, while the old one will be kept alive by the outstanding reference. So, the outstanding reference will not actually be for the new root vnode (or, more precisely, vnodes - because a root vnode may be recycled and a newer one can be created). As a result, when vflush(rootrefs = 1) s called there will be two problems: - a leaked reference on the old root vnode preventing a graceful unmount - insufficient references on the actual root vnode leading to a crash upon access to the vnode after it is destroyed by vgone() + vdrop() The second issue will actually override the first one. Differential Revision: https://reviews.freebsd.org/D2353 Reviewed by: delphij, kib, smh MFC after: 17 days	2015-05-05 11:01:06 +00:00
Andriy Gapon	ce0023d851	dmu_recv_end_check: don't leak hold if dsl_destroy_snapshot_check_impl fails The leak may happen if !drc_newfs && drc_force and there is an error iterating through snapshots or any of snapshot checks fails. See https://www.illumos.org/issues/5870 See https://reviews.csiden.org/r/206/ Reviewed by: mahrens (as mahrens@delphix.com) MFC after: 15 days Sponsored by: ClusterHQ	2015-05-05 10:56:16 +00:00
Steven Hartland	aeb9d4dad9	Fix misuse of input argument in traverse_visitbp In traverse_visitbp(), the input argument dnp is modified in the middle to point to a temporary buffer. Originally this doesn't matter, because no user of TRAVERSE_POST dereferences it. However, in `fbeddd6` a piece of code is added dereferencing dnp after the modification, creating a possible bug. We fix this by creating a new local variable cdnp for the DMU_OT_DNODE case, so we don't modify the input argument. Also we introduce different local variables in the DMU_OT_OBJSET case to prevent confusion between the input argument. Obtained from: zfsonlinux (a585f2f844ed3d4270221fed88f5e494eb55d932) MFC after: 2 weeks Sponsored by: Multiplay	2015-04-28 22:46:58 +00:00
Andriy Gapon	9bc3222765	replace a comment about zfs recv -F corner case with a longer, more detailed one The old comment in zfs_rezget explains what situation the code handles, the new comment also describes how the situation can arise. Also, re-join a line that became sufficiently shorti some time ago. Differential Revision: https://reviews.freebsd.org/D2352 Reviewed by: delphij, smh MFC after: 12 days	2015-04-28 09:19:40 +00:00
Andriy Gapon	1af760ce1b	zfs_onexit_fd_hold: return EBADF even if devfs_get_cdevpriv gave ENOENT /dev/zfs always has per-open data, so when it is missing the file descriptor is for some other file. Returning ENOENT in this case is confusing as a variety of other conditions (like a missing dataset) may result in the same error. It's better to consistently return EBADF for any problems with the file descriptor. Note that zfs_onexit_fd_hold() is used with 'automatic cleanup fd' - when that fd is closed, typically because a process is terminated, some cleanup action is taken by ZFS driver. E.g. a temporary snapshot hold is released. Perhaps, it would even be worthwhile changing devfs_get_cdevpriv() to return EBADF if there is no associated data. Differential Revision: https://reviews.freebsd.org/D2370 Reviewed by: delphij, smh MFC after: 12 days	2015-04-28 09:11:47 +00:00
Andriy Gapon	37a9b4136e	dsl_dir_rename_check: return EXDEV on cross-pool rename attempt Obtained from: zfsonlinux/zfs@9063f65476 Obtained from: Boris Protopopov <boris.protopopov@actifio.com> MFC after: 10 days	2015-04-28 08:04:16 +00:00
Andriy Gapon	99d058c8a7	MFV r282123: 5610 zfs clone from different source and target pools produces coredump MFC after: 10 days	2015-04-28 07:42:28 +00:00
Andriy Gapon	28d15239af	MFV r282124: 5393 spurious failures from dsl_dataset_hold_obj() The actual bugfix was pro-actively committed in r275515. This MFV is cosmetic, it just aligns code style with the upstream. MFC after: 10 days	2015-04-28 07:37:38 +00:00
Andriy Gapon	39b6f1d6c1	nvpair_type_is_array: DATA_TYPE_INT8_ARRAY was not recognized To do: upstream (https://www.illumos.org/issues/5778) MFC after: 10 days	2015-04-28 06:34:55 +00:00
Mark Johnston	8241ee3b2c	Fix DTrace's panic() action. It would previously call into some unfinished Solaris compatibility code and return without actually calling panic(9). The compatibility code is unneeded, however, so just remove it and have dtrace_panic() call vpanic(9) directly. Differential Revision: https://reviews.freebsd.org/D2349 Reviewed by: avg MFC after: 2 weeks Sponsored by: EMC / Isilon Storage Division	2015-04-24 03:19:30 +00:00
Xin LI	384f656a1a	Remove vfs.zfs.snapshot_list_prefetch, the corresponding code was gone in r248571 already. MFC after: 1 week	2015-04-17 21:21:11 +00:00
Mark Johnston	67cf27b70f	libdtrace: add support for lazyload mode. Passing "-x lazyload" to dtrace -G during compilation causes dtrace(1) to not link drti.o into the output object file, so the USDT probes are not created during process startup. Instead, dtrace(1) will automatically discover and create probes on the process' behalf when attaching. Differential Revision: https://reviews.freebsd.org/D2203 Reviewed by: rpaulo MFC after: 1 month	2015-04-08 02:36:37 +00:00
Alexander Motin	91b9f63738	Add DTrace probe to the new ARC reclaim cause added in r281026. MFC after: 1 month	2015-04-05 14:45:52 +00:00
Alexander Motin	2e9ccb32a1	Make ZFS ARC track both KVA usage and fragmentation. Even on Illumos, with its much larger KVA, ZFS ARC steps back if KVA usage reaches certain threshold (3/4 on i386 or 16/17 otherwise). FreeBSD has even less KVA, but had no such limit on archs with direct map as amd64. As result, on machines with a lot of RAM, during load with very small user- space memory pressure, such as `zfs send`, it was possible to reach state, when there is enough both physical RAM and KVA (I've seen up to 25-30%), but no continuous KVA range to allocate even single 128KB I/O request. Address this situation from two sides: - restore KVA usage limitations in a way the most close to Illumos; - introduce new requirement for KVA fragmentation, specifying that we should have at least one sequential KVA range of zfs_max_recordsize bytes. Experiments show that first limitation done alone is not sufficient. On machine with 64GB of RAM it is sometimes needed to drop up to half of ARC size to get at leats one 1MB KVA chunk. Statically limiting ARC to half of KVA/RAM is too strict, so second limitation makes it to work in cycles: accumulate trash up to certain critical mass, do massive spring-cleaning, and then start littering again. :) MFC after: 1 month	2015-04-03 14:45:48 +00:00
Andrew Turner	7572a8c8f1	Add the arm64 defines for cddl code. Differential Revision: https://reviews.freebsd.org/D2186 Reviewed by: emaste Sponsored by: The FreeBSD Foundation	2015-04-01 08:31:56 +00:00
Alexander Motin	e5dcb72f45	Some cosmetic polishing. No functional change. MFC after: 1 week	2015-03-29 20:28:18 +00:00
Mark Johnston	97f2f66479	Remove unused upstream DTrace provider implementations that are duplicates of providers under sys/cddl/dev/. Also remove sdt_subr.c, which isn't used in FreeBSD's SDT implementation. Suggested by: rwatson	2015-03-16 01:15:08 +00:00
Steven Hartland	208264283d	Allow zvol_geom_worker to process BIO_DELETE's If zvol_geom_start is called with a BIO_DELETE from a thread which can sleep it queues it for later processing by the zvol_geom_worker. The zvol_geom_worker didn't have a delete case so would simply loose the bio hence preventing the original caller from every completing. In addition an other unknown types would suffer the same fate. Allow zvol_geom_worker to process BIO_DELETE's via zvol_strategy and return unsupported for all unknown bio types. MFC after: 2 weeks Sponsored by: Multiplay	2015-03-14 17:35:04 +00:00
Alexander Motin	0d45c37cb6	Make DIOCGATTR in device mode handle "GEOM::candelete". MFC after: 3 days	2015-03-12 16:19:18 +00:00
Andrew Turner	4a8169d97b	Add the MD parts of dtrace needed to use fbt on ARM. For this we need to emulate the instructions used in function entry and exit. For function entry ARM will use a push instruction to push up to 16 registers to the stack. While we don't expect all 16 to be used we need to handle any combination the compiler may generate, even if it doesn't make sense (e.g. pushing the program counter). On function return we will either have a pop or branch instruction. The former is similar to the push instruction, but with care to make sure we update the stack pointer and program counter correctly in the cases they are either in the list of registers or not. For branch we need to take the 24-bit offset, sign-extend it, and add that number of 4-byte words to the program counter. Care needs to be taken as, due to historical reasons, the address the branch is relative to is not the current instruction, but 8 bytes later. This allows us to use the following probes on ARM boards: dtrace -n 'fbt::malloc:entry { stack() }' and dtrace -n 'fbt:🆓return { stack() }' Differential Revision: https://reviews.freebsd.org/D2007 Reviewed by: gnn, rpaulo Sponsored by: ABT Systems Ltd	2015-03-05 17:55:31 +00:00
George V. Neville-Neil	fcb5606706	Initial version of DTrace on ARM32. Submitted by: Howard Su based on work by Oleksandr Tymoshenko Reviewed by: ian, andrew, rpaulo, markj	2015-02-10 19:41:30 +00:00
Mark Johnston	3277b9a257	Fix a typo in r278137: make sure to free provider state. X-MFC-With: r278136	2015-02-08 03:55:12 +00:00
Pedro F. Giffuni	3ccccdc17d	MFV r266995: 4767 dtrace_probe() always has the timestamp Reference: https://illumos.org/issues/4767 Obtained from: Illumos MFC after: 2 weeks	2015-02-03 20:06:30 +00:00
Pedro F. Giffuni	eadcd0fadf	MFV r266993: 4469 DTrace helper tracing should be dynamic Reference: https://illumos.org/issues/4469 Obtained from: Illumos Phabric: D1551 Reviewed by: markj MFC after: 2 weeks	2015-02-03 19:39:53 +00:00
Mark Johnston	c36bd253fa	Continue to handle the case where state is NULL, though this currently cannot happen on FreeBSD. r278136 overlooked the fact that a destructor registered with devfs_set_cdevpriv(9) is invoked even in the case of an error. X-MFC-With: r278136	2015-02-03 06:04:16 +00:00
Mark Johnston	ac21b651bf	Diff reduction with illumos, in preparation for merging r266993 from the vendor branch. No functional change. MFC after: 1 week	2015-02-03 05:38:52 +00:00
Steven Hartland	370a13bfff	Prevent inlining txg_quiesce This allows dtrace to monitor the calls to txg_quiesce which can be really helpful. Also standardise __noinline order for arc_kmem_reap_now. Sponsored by: Multiplay	2015-02-02 00:17:36 +00:00
Mark Johnston	a70a59ea73	Don't attempt to disable enabled fasttrap probes in an exiting process. There's no need to do so, and we can't hold an exiting process, so this race can result in panics. MFC after: 1 week	2015-01-30 05:03:23 +00:00
Mark Johnston	1eb8ad64ea	In fasttrap_sigtrap(), use tdsendsignal() rather than tdksignal() to send SIGTRAP. The latter requires that its thread argument be non-NULL, but fasttrap_sigtrap() does not. PR: 193593 MFC after: 1 week Reported by: danilo	2015-01-30 04:51:59 +00:00
Xin LI	63cffd61d1	MFV r255258: Diff reduction with upstream. The actual change was merged in r272483 already. MFC after: 2 weeks	2015-01-28 08:56:48 +00:00
Will Andrews	b4e360d239	When creating or updating a node, use vfs_timestamp() for "now" instead of gethrestime(), to allow the administrator to decide the appropriate timestamp precision instead of always using nanosecond precision.	2015-01-24 00:43:02 +00:00
Will Andrews	bd3a7c08c4	Remove commented log messages.	2015-01-21 19:30:01 +00:00
Will Andrews	35b540bfb2	Ignore sync requests from the system syncher, i.e. VFS_SYNC(waitfor=MNT_LAZY). ZFS already commits outstanding data every zfs_txg_timeout seconds, so these syncs are unnecessarily intrusive. Submitted by: gibbs Sponsored by: Spectra Logic MFSpectraBSD: `1105759` on 2014/12/11	2015-01-21 19:25:57 +00:00
Will Andrews	2a2c1d424a	Eliminate an #ifdef illumos for zfs_ioc_rename(). Since allow_mounted is a FreeBSD-specific change, default to B_TRUE, then locally check for the magic bit. Unconditionally check allow_mounted below. Convert the setting of allow_mounted to an explicit boolean. MFC after: 1 week Sponsored by: Spectra Logic MFSpectraBSD: 672578 (in part) on 2013/07/19	2015-01-21 19:20:36 +00:00
Will Andrews	55ddf051d8	Add vfs.zfs.reference_tracking_enable sysctl/tunable. This is primarily for developer/debugging use; it enables built-in tagged tracking of refcounts inside ZFS. It can only be enabled from the loader, since it modifies how in-core state is managed. Default remains disabled. MFC after: 1 week Sponsored by: Spectra Logic	2015-01-21 17:03:11 +00:00
Will Andrews	798cbb7523	Fix arc__shrink DTrace probe's to_free argument. Remove the unnecessary #ifdef _KERNEL, which did not differ in the true or false cases. Actually set the value of to_free before using it. MFC after: 1 week Sponsored by: Spectra Logic	2015-01-20 22:39:10 +00:00
Will Andrews	fe20fb9fb0	Use the "zfs_gfs" tag for GFS vnodes to make them easier to identify. MFC after: 1 week Sponsored by: Spectra Logic	2015-01-20 22:31:26 +00:00
Alexander Motin	d6245e3d44	Allow skipping dmu_buf_will_dirty() call in dsl_dir_transfer_space(). dsl_dir_transfer_space() is mostly called after dsl_dir_diduse_space(), which already calls dmu_buf_will_dirty() for the same dbuf and tx, so its duplicate call in those cases will change nothing, only spend time. Skipping this call by four times reduces time spent in dbuf_write_done() and descendants, updating dataset statistics with several congested lock acquisitions. When rewriting 8K zvol blocks at 1GB/s rate, this reduces CPU time spent inside dbuf_write_done(), according to profiling, from 45% of 683K samples to 18% of 422K. MFC after: 2 weeks	2015-01-20 13:09:12 +00:00
Steven Hartland	5eab7e5406	Clean ZFS spa config before syncing A number of entries that can be present in the spa config shouldn't be saved to disk so add a method to ensure this is case. Without this if the last caller to vdev_config_generate requested stats then we can end up in the cache file. Also only skip a none writable pool in the cache file generation if its active. This prevents unavailable pools incorrectly getting removed from cache file. Tested by: delphij MFC after: 2 weeks Sponsored by: Multiplay	2015-01-18 23:15:49 +00:00
Steven Hartland	bc96366c86	Mechanically convert cddl sun #ifdef's to illumos Since the upstream for cddl code is now illumos not sun, mechanically convert all sun #ifdef's to illumos #ifdef's which have been used in all newer code for some time. Also do a manual pass to correct the use if #ifdef comments as per style(9) as well as few uses of #if defined(__FreeBSD__) vs #ifndef illumos. MFC after: 1 month Sponsored by: Multiplay	2015-01-17 14:44:59 +00:00
Alexander Motin	38feff972b	Fix overflow bug from r248577, turning 30s TRIM timeout into ~4s. MFC after: 2 weeks	2015-01-14 16:22:00 +00:00
Alexander Motin	d4f46a775d	Reimplement TRIM throttling added in r248577. Previous throttling implementation approached problem from the wrong side. It significantly limited useful delaying of TRIM requests and aggregation potential, while not so much controlled TRIM burstiness under heavy load. With this change random 4K write benchmarks (probably the worst case for TRIM) show me IOPS increase by 20%, average latency reduction by 30%, peak TRIM bursts reduction by 3 times and same peak TRIM map size (memory usage). Also the new logic does not force map size down so heavily, really allowing to keep deleted data for 32 TXG or 30 seconds under moderate load. It was practically impossible with old throttling logic, which pushed map down to only 64 segments. Reviewed by: smh MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2015-01-14 09:39:57 +00:00
Alexander Motin	5b3a65823d	Skip extra bcopy() when scrubbing vdev without redundancy. According to profiler, this bcopy() can use about 10% of CPU time. MFC after: 2 weeks	2015-01-12 22:38:55 +00:00
Alexander Motin	f5b85f6551	When aggregating TRIM segments, move the new one to the list end. New segment at the list head may block all TRIM requests until txg of that segment can be processed. On my random I/O tests this change reduce peak TRIM list length from 650 to 450 segments. Hopefully it should reduce TRIM burstiness when list processing is unblocked. MFC after: 2 weeks	2015-01-11 16:36:39 +00:00
Alexander Motin	2de874ed23	Add LBA as secondary sort key for synchronous I/O requests. On FreeBSD gethrtime() implemented via getnanouptime(), that has 1ms (1/hz) precision. It makes primary sort key (timestamp) collision very possible. In such situations sorting by secondary key of LBA is much more reasonable then by totally meaningless zio pointer value. With this change on multi-threaded synchronous ZVOL read I've measured 10% throughput increase and average latency reduction. MFC after: 2 weeks	2015-01-11 00:26:18 +00:00
Alexander Motin	13ea8106d9	Use new optimized dmu_read_uio_dbuf() for ZVOLs in device mode. This slightly reduces overhead by avoiding dnode_hold()/dnode_rele() calls. MFC after: 2 weeks	2015-01-10 18:28:58 +00:00
Steven Hartland	8de799ea3a	Correct zpool list displaying invalid EXPANDSZ for unavailable pool vdevs When pools are unavailable their vdevs are also unavailable which means that vdev_max_asize remains at the default zero. This default was being used to calculate vs_esize resulting in a negative number as vdev_asize > vdev_max_asize, which caused zpool list -v to display 16.0E for EXPANDSZ of these vdevs.	2014-12-31 04:54:48 +00:00
Steven Hartland	51f529b50b	Always sync the global ZFS config cache to reflect the new mosconfig This fixes out of date zpool.cache for root pools, which can cause issues such as confusion of zdb etc. MFC after: 1 month	2014-12-23 09:31:24 +00:00
Steven Hartland	6831cf6aa6	Fix panic when resizing ZFS zvol's Resizing a ZFS ZVOL with debug enabled would result in a panic due to recursion on dp_config_rwlock. The upstream change "3464 zfs synctask code needs restructuring" changed zvol_set_volsize to avoid the recursion on dp_config_rwlock, but this was missed when originally merged in by r248571 due to significant differences in our codebases in this area. These changes also relied on bring in changes from upstream: 3557 dumpvp_size is not updated correctly when a dump zvol's size is changed, which where also not present. In order to help prevent future issues in this area a direct comparison and diff minimisation from current upstream version (b515258) of zvol.c. Differential Revision: https://reviews.freebsd.org/D1302 MFC after: 1 month X-MFC-With: r276063 & r276066 Sponsored by: Multiplay	2014-12-22 18:39:38 +00:00
Steven Hartland	08c0f91ecc	Refactor zvol locking to minimise diff with upstream Use #define zfsdev_state_lock spa_namespace_lock instead of replacing all zfsdev_state_lock with spa_namespace_lock to minimise changes from upstream. Differential Revision: D1302 MFC after: 1 month X-MFC-With r276063 Sponsored by: Multiplay	2014-12-22 17:04:51 +00:00
Steven Hartland	39c478eac4	Standardise on illumos for #ifdef's in zvol.c Also correct as per style(9) on the use of #ifdef comments. This is a no-op change as pre-cursor to a full cleanup and merge with upstream zvol changes. Sponsored by: Multiplay	2014-12-22 16:38:29 +00:00
Konstantin Belousov	789bdfdbc6	Handle MAKEENTRY cnp flag in the VOP_CREATE(). Curiously, some fs, e.g. smbfs, already did it. Tested by: pho (previous version) Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-12-21 13:29:33 +00:00
Xin LI	62722f7d5b	Add missing continue: we can't proceed further if the kernel does not panic with zfs_panic_recover. Illumos issue: 5438 zfs_blkptr_verify should continue after zfs_panic_recover Reported by: Coverity CID: 1232014	2014-12-19 00:20:29 +00:00
Xin LI	7d073f8411	MFV r275914: As of r270383, the dbuf_compare comparator compares the dbuf attributes in the following order: db_level (indirect level) db_blkid (block number) db_state (current state) the address of the element Because db_state is being considered before the element's state, changing of db_state would affect balancedness of the AVL tree, even when the address of element compares differently. For instance, in dbuf_create, db_state may be altered after the node is inserted into the AVL tree and may break AVL tree balancedness. Instead of using db_state as a comparision critera (introduced in r270383), consider it only when we are doing a lookup, that is one of the two dbuf pointers contains DB_SEARCH. Illumos issue: 5422 preserve AVL invariants in dn_dbufs MFC after: 2 weeks	2014-12-18 23:45:26 +00:00
Konstantin Belousov	6c21f6edb8	The VOP_LOOKUP() implementations for CREATE op do not put the name into namecache, to avoid cache trashing when doing large operations. E.g., tar archive extraction is not usually followed by access to many of the files created. Right now, each VOP_LOOKUP() implementation explicitely knowns about this quirk and tests for both MAKEENTRY flag presence and op != CREATE to make the call to cache_enter(). Centralize the handling of the quirk into VFS, by deciding to cache only by MAKEENTRY flag in VOP. VFS now sets NOCACHE flag for CREATE namei() calls. Note that the change in semantic is backward-compatible and could be merged to the stable branch, and is compatible with non-changed third-party filesystems which correctly handle MAKEENTRY. Suggested by: Chris Torek <torek@pi-coral.com> Reviewed by: mckusick Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-12-18 10:01:12 +00:00
Xin LI	a771fba68c	MFV r275783: Convert ARC flags to use enum. Previously, public flags are defined in arc.h and private flags are defined in arc.c which can lead to confusion and programming errors. Consistently use 'hdr' (when referencing arc_buf_hdr_t) instead of 'buf' or 'ab' because arc_buf_t are often named 'buf' as well. Illumos issue: 5369 arc flags should be an enum 5370 consistent arc_buf_hdr_t naming scheme MFC after: 2 weeks	2014-12-15 18:22:45 +00:00
Xin LI	7728b3d28c	MFV r275551: Remove "dbuf phys" db->db_data pointer aliases. Use function accessors that cast db->db_data to the appropriate "phys" type, removing the need for clients of the dmu buf user API to keep properly typed pointer aliases to db->db_data in order to conveniently access their data. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap_leaf.c: In zap_leaf() and zap_leaf_byteswap, now that the pointer alias field l_phys has been removed, use the db_data field in an on stack dmu_buf_t to point to the leaf's phys data. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c: Remove the db_user_data_ptr_ptr field from dbuf and all logic to maintain it. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dbuf.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dmu.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dataset.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dir.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sa.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap_micro.c: Modify the DMU buf user API to remove the ability to specify a db_data aliasing pointer (db_user_data_ptr_ptr). cddl/contrib/opensolaris/cmd/zdb/zdb.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_diff.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_send.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_traverse.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_tx.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_bookmark.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dataset.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_deadlist.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_deleg.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_destroy.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dir.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_pool.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_prop.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_scan.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_synctask.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_userhold.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sa.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_history.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap_leaf.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap_micro.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_dataset.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_dir.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zap_impl.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zap_leaf.h: Create and use the new "phys data" accessor functions dsl_dir_phys(), dsl_dataset_phys(), zap_m_phys(), zap_f_phys(), and zap_leaf_phys(). sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_dataset.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_dir.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zap_impl.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zap_leaf.h: Remove now unused "phys pointer" aliases to db->db_data from clients of the DMU buf user API. Illumos issue: 5314 Remove "dbuf phys" db->db_data pointer aliases in ZFS MFC after: 2 weeks	2014-12-15 07:52:23 +00:00
Xin LI	a676454a98	MFV r275550: In addition to r273158, make the code in spa_sync() that checks if the current TXG is a no-op TXG less fragile. Illumos issue: 5347 idle pool may run itself out of space MFC after: 2 weeks	2014-12-15 05:10:55 +00:00
Xin LI	51e527396b	MFV r275549: Add a loader tunable, vfs.zfs.arc_meta_min, which controls how much metadata ZFS should keep in ARC at minimum. In arc_evict(), when doing recycle, take more factors into account by applying the following policy: 1. If no evictable data, evict metadata; 2. If no evictable metadata, evict data; 3. If we hit arc_meta_limit, evict metadata; 4. If we haven't hit arc_meta_min, evict data; 5* (Illumos only, not present in new FreeBSD code, yet) evict the oldest cached element from data and metadata. (FreeBSD) evict the data type specified by caller, which is the existing behavior. Note that because of our splitted locks (implemented in r205231 to improve scalability by reducing lock contention), implementing the fifth Illumos behavior will not be cheap, so for now just implement the 1-4 and fall back to current behavior for 5. Illumos issue: 5368 ARC should cache more metadata MFC after: 2 months (assuming we didn't found better solution)	2014-12-15 04:51:36 +00:00
Xin LI	5571af1d8f	MFV r247174: Expose arc_meta_limit, et al via kstats. Note that as a result, vfs.zfs.arc_meta_used is removed. The existing vfs.zfs.arc_meta_limit sysctl/tunable is retained with a SYSCTL_PROC wrapper. Illumos ZFS issues: 3561 arc_meta_limit should be exposed via kstats Relnotes: yes MFC after: 2 weeks	2014-12-13 19:17:28 +00:00
Xin LI	d25a5fd878	MFV r275548: Verify that the block pointer is structurally valid, before attempting to read it in. It can only be invalid in the case of a ZFS bug, but this change will help identify such bugs in a more transparent way, by panic'ing with a relevant message, rather than indexing off the end of an array or something. Illumos issue: 5349 verify that block pointer is plausible before reading MFC after: 2 weeks	2014-12-13 02:08:18 +00:00
Xin LI	ca7354fa07	MFV r275546: Reduce scrub activities when system there is enough dirty data, namely when dirty data is more than zfs_vdev_async_write_active_min_dirty_percent (once we start to increase the number of concurrent async writes). While there also correct rounding error which would make scrub end up pausing for (zfs_txg_timeout + 1) seconds instead of the desired zfs_txg_timeout seconds. Illumos issue: 5351 scrub goes for an extra second each txg 5352 scrub should pause when there is some dirty data MFC after: 2 weeks	2014-12-13 01:39:24 +00:00
Xin LI	8b2dbf8af3	MFV r275545: If zio_checksum_error() returns other than ECKSUM (e.g. EINVAL), it does not fill in the "zio_bad_cksum_t *info" parameter. Caller should not attempt to use it in this case. Illumos issue: 5348 zio_checksum_error() only fills in info if ECKSUM MFC after: 2 weeks	2014-12-13 01:26:06 +00:00
Xin LI	08fda6cd66	MFV r275544: Clean up some duplicated code in dnode_sync() around freeing spill blocks. Illumos issue: 5350 clean up code in dnode_sync() MFC after: 2 weeks	2014-12-13 01:18:23 +00:00
Xin LI	48438ce4c4	MFV r275543: Remove always true tests for ds->ds_phys' presence. Clean up assertions in dsl_dataset_disown. Remove unreachable code in dsl_dataset_disown(). Illumos issue: 5310 Remove always true tests for non-NULL ds->ds_phys MFC after: 2 weeks	2014-12-13 01:14:59 +00:00
Xin LI	c90d7474a8	MFV r275542: If a dnode has a spill block and there is an error while accessing a data block then traverse_dnode() loses information about that error and returns a status of visiting the spill block. This issue is discovered by Spectra Logic. Illumos issue: 5311 traverse_dnode may report success when it should not Original author: gibbs MFC after: 2 weeks	2014-12-13 01:10:17 +00:00
Xin LI	c2161091ad	MFV r275540: When importing a pool, don't assume that the passed pool configuration at vdev_load is always vaild. It's possible that a stale configuration that comes with extra vdevs, where metaslab_init() would fail because of lower layer returns error. Change the code to make metaslab_init() handle and return errors from lower layer and pass it back to upper layer and handle it there. Illumos issue: 5213 panic in metaslab_init due to space_map_open returning ENXIO MFC after: 2 weeks	2014-12-08 06:04:42 +00:00
Andriy Gapon	036a8c5dac	remove opensolaris cyclic code, replace with high-precision callouts In the old days callout(9) had 1 tick precision and that was inadequate for some uses, e.g. DTrace profile module, so we had to emulate cyclic API and behavior. Now we can directly use callout(9) in the very few places where cyclic was used. Differential Revision: https://reviews.freebsd.org/D1161 Reviewed by: gnn, jhb, markj MFC after: 2 weeks	2014-12-07 11:21:41 +00:00
Andrew Turner	f9307cced7	Apply the same fix in r274697 to the ARM case.	2014-12-06 12:03:09 +00:00
Xin LI	54f76dcb4a	MFV r275535: Unexpand ISP2() and MSEC2NSEC(). Illumos issue: 5255 uts shouldn't open-code ISP2 MFC after: 2 weeks	2014-12-06 09:38:28 +00:00
Xin LI	6054c38913	MFV r275534: Sync with Illumos. This have no effect to FreeBSD. Illumos issue: 5285 pass in cpu_pause_func via pause_cpus MFC after: 2 weeks	2014-12-06 09:14:46 +00:00
Xin LI	d4548c2e8e	MFC r275533: Sync with Illumos. This have no effect to FreeBSD. Illumos issue: 5100 sparc build failed after 5004 MFC after: 2 weeks	2014-12-06 09:11:13 +00:00
Xin LI	4603a0aeb2	Use %d instead of %u for error number. This way we see ERESTART as -1 not 4294967295 when doing DTrace. MFC after: 2 weeks	2014-12-05 22:56:10 +00:00
Xin LI	26f96d922b	Fix a regression introduced in r274337 (large block support) In dsl_dataset_hold_obj() we used zap_contains(.., DS_FIELD_LARGE_BLOCKS) to determine whether the extensible (zapifyed) dataset have large blocks. The code expects the result be either 0 (found) or ENOENT (not found), however reused the variable 'err' which later code expects to be 0. Fix this by adopting similar code construct that is used later for DS_FIELD_BOOKMARK_NAMES, which uses a temporary variable zaperr to catch errors from zap_* rountines. Reported by: Peter J. Creath (on FreeNAS; FreeNAS bug #6848) Illumos issue: 5393 spurious failures from dsl_dataset_hold_obj() Reviewed by: mahrens Sponsored by: iXsystems, Inc. X-MFC with: r274337	2014-12-05 18:29:01 +00:00
Alexander Motin	ef8daf3fed	Add GET LBA STATUS command support to CTL. It is implemented for LUNs backed by ZVOLs in "dev" mode and files. GEOM has no such API, so for LUNs backed by raw devices all LBAs will be reported as mapped/unknown. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2014-12-04 11:34:19 +00:00
Andriy Gapon	782c06dfc8	zfs_putpages: actually update mtime and ctime Reported by: Paul Koch <paul.koch@akips.com> Tested by: Paul Koch <paul.koch@akips.com> MFC after: 2 weeks	2014-12-02 11:44:56 +00:00
Xin LI	d10e52d627	Revert r273060 per discussion with avg@ as we need to make L2ARC aware of 4K devices and this one is not the right fix anyway.	2014-11-26 02:20:25 +00:00
Dimitry Andric	c577699e4b	Fix the following -Werror warning from clang 3.5.0, while building cddl/lib/libctf: In file included from cddl/contrib/opensolaris/common/ctf/ctf_create.c:31: In file included from sys/cddl/contrib/opensolaris/uts/common/sys/sysmacros.h:34: sys/cddl/contrib/opensolaris/uts/common/sys/isa_defs.h:334:9: warning: '_ILP32' macro redefined [-Wmacro-redefined] #define _ILP32 ^ <built-in>:26:9: note: previous definition is here #define _ILP32 1 ^ 1 warning generated. This is because clang 3.5.0 started predefining _ILP32 and __ILP32__ for the i386 arch. (Earlier versions already predefined _LP64 and __LP64__ for the x86_64 arch.) Reviewed by: emaste, avg, smh, delphij, markj Differential Revision: https://reviews.freebsd.org/D1187	2014-11-19 07:44:21 +00:00
Xin LI	8c3d6a4ab2	Make vfs.zfs.max_recordsize read-write at runtime. MFC after: 2 weeks	2014-11-18 22:35:19 +00:00
Xin LI	8efcd876ca	Add a tunable for spa_slop_shift which controls how much space we would reserve by default. Tuning is not recommended. MFC after: 2 weeks	2014-11-18 18:52:38 +00:00
Xin LI	18144ab1a3	Allow tuning zfs_max_recordsize via loader tunable. Tuning is NOT recommended. Requested by: Slawa Olhovchenkov <slw zxy spb ru> MFC after: 2 weeks	2014-11-18 18:40:01 +00:00
Andriy Gapon	2c51c83bc8	l2arc: restore correct rounding up of asize of compressed data This rounding up was lost in a mismerge of illumos code. See r268075 MFV r267565. After that commit zio_compress_data() no longer performs any compressed size adjustment, so it needs to be done externally. On FreeBSD we round up the size using vdev_ashift rather than SPA_MINBLOCKSIZE so that 4KB devices are properly supported. Additionally, zero out the buffer tail only if compression succeeds. The compression is considered successful if the size of compressed data after rounding up to account for the vdev ashift is less than the original data size. It does not make sense to have the data compressed if all the savings are lost to rounding up. With the new zio_compress_data() it could have been possible that the rounded compressed size would be greater than the original size and thus we could zero beyond the allocated buffer if the zeroing code was kept at the original place. Discussed with: delphij, gibbs MFC after: 2 weeks X-MFC with: r274627	2014-11-17 14:45:42 +00:00
Andriy Gapon	0908b20b7e	Revert r269093 which introduced physical zio alignment transform Size of physical ZIOs must never be implicitly adjusted, it's a responsibility of a caller to make sure that such a ZIO has proper offset and size. Discussed with: delphij, gibbs MFC after: 2 weeks	2014-11-17 14:16:02 +00:00
Steven Hartland	a559adfbce	Disable TRIM on file backed ZFS vdevs and fix TRIM on init After r265152 TRIM requests are ZIO_TYPE_FREE instead of ZIO_TYPE_IOCTL this meant file backed vdevs to attempted to process the ZIO as a write causing a panic. We now disable TRIM on file backed vdevs and ASSERT the ZIO types supported by each vdev type to ensure we explicity support the ZIO type being processed. Also ensure that TRIM on init is not procesed for devices which declare they didn't support TRIM via vdev_notrim. PR: 195061, 194976, 191573 Sponsored by: Multiplay	2014-11-17 11:32:10 +00:00
Xin LI	8bcd603968	MFV r274273: ZFS large block support. Please note that booting from datasets that have recordsize greater than 128KB is not supported (but it's Okay to enable the feature on the pool). This may remain unchanged because of memory constraint. Limited safety belt is provided for mounted root filesystem but use caution is advised. Illumos issue: 5027 zfs large block support MFC after: 1 month	2014-11-10 08:20:21 +00:00
Xin LI	42350b6bde	MFV r274272 and diff reduction with upstream. Illumos issue: 5244 zio pipeline callers should explicitly invoke next stage Tested with: ztest plus ZFS over GELI configuration MFC after: 1 month	2014-11-09 07:37:00 +00:00
Xin LI	81f1255e58	MFV r274271: Improve zdb -b performance: - Reduce gethrtime() call to 1/100th of blkptr's; - Skip manipulating the size-ordered tree; - Issue more (10, previously 3) async reads; - Use lighter weight testing in traverse_visitbp(); Illumos issue: 5243 zdb -b could be much faster MFC after: 2 weeks	2014-11-08 07:30:40 +00:00
Andriy Gapon	2fd3cc0cb2	fix l2arc compression buffers leak We have observed that arc_release() can be called concurrently with a l2arc in-flight write. Also, we have observed that arc_hdr_destroy() can be called from arc_write_done() for a zio with ZIO_FLAG_IO_REWRITE flag in similar circumstances. Previously the l2arc headers would be freed while leaking their associated compression buffers. Now the buffers are placed on l2arc_free_on_write list for delayed freeing. This is similar to what was already done to arc buffers that were supposed to be freed concurrently with in-flight writes of those buffers. In addition to fixing the discovered leaks this change also adds some protective code to assert that a compression buffer associated with a l2arc header is never leaked. A new kstat l2_cdata_free_on_write is added. It keeps a count of delayed compression buffer frees which previously would have been leaks. Tested by: Vitalij Satanivskij <satan@ukr.net> et al Requested by: many MFC after: 2 weeks Sponsored by: HybridCluster / ClusterHQ	2014-11-06 11:08:02 +00:00
Alexander Motin	c3e7ba3e6d	Add to CTL support for logical block provisioning threshold notifications. For ZVOL-backed LUNs this allows to inform initiators if storage's used or available spaces get above/below the configured thresholds. MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2014-11-06 00:48:36 +00:00
Josh Paetzel	14127f5b21	This change addresses 4 bugs in ZFS exposed by Richard Kojedzinszky's crash.sh script attached to FreeNAS bug 4109: https://bugs.freenas.org/issues/4109 Three are in the snapshot layer: a) AVG explains in his notes: https://wiki.freebsd.org/AvgVfsSolarisVsFreeBSD "VOP_INACTIVE must not do any destructive actions to a vnode and its filesystem node, nor invalidate them in any way." gfs_vop_inactive and zfsctl_snapshot_inactive did just that. In OpenSolaris VOP_INACTIVE is much closer to FreeBSD's VOP_RECLAIM. Rename & move them to gfs_vop_reclaim and zfsctl_snapshot_reclaim and merge in the requisite vnode_destroy from zfsctl_common_reclaim. b) gfs_lookup_dot and various zfsctl functions do not honor the FreeBSD VFS convention of only locking from the root downward. When looking up ".." the convention is to drop the current leaf vnode lock before acquiring the directory vnode and then subsequently re-acquiring the lock on the leaf vnode. This fixes that in all the places that our exercised by crash.sh. c) The snapshot may already be unmounted when the directory vnode is reclaimed. Check for this case and return. One in the common layer: d) Callers of traverse expect the reference to the vnode passed in to be maintained. Don't release it. This last one may be an unclear contract. There may in fact be some callers that do expect the reference to be dropped on success in addition to callers that expect it to be released. In this case a further audit of the callers is needed and a consensus on the correct behavior. PR: 184677 Submitted by: kmacy Reviewed by: delphij, will, avg MFC after: 2 weeks Sponsored by: iXsystems	2014-10-25 17:42:44 +00:00
Hans Petter Selasky	f0188618f2	Fix multiple incorrect SYSCTL arguments in the kernel: - Wrong integer type was specified. - Wrong or missing "access" specifier. The "access" specifier sometimes included the SYSCTL type, which it should not, except for procedural SYSCTL nodes. - Logical OR where binary OR was expected. - Properly assert the "access" argument passed to all SYSCTL macros, using the CTASSERT macro. This applies to both static- and dynamically created SYSCTLs. - Properly assert the the data type for both static and dynamic SYSCTLs. In the case of static SYSCTLs we only assert that the data pointed to by the SYSCTL data pointer has the correct size, hence there is no easy way to assert types in the C language outside a C-function. - Rewrote some code which doesn't pass a constant "access" specifier when creating dynamic SYSCTL nodes, which is now a requirement. - Updated "EXAMPLES" section in SYSCTL manual page. MFC after: 3 days Sponsored by: Mellanox Technologies	2014-10-21 07:31:21 +00:00
Xin LI	78701de4b7	Add tunable vfs.zfs.space_map_blksz for space map's maximum block size. MFC after: 2 weeks	2014-10-18 22:11:10 +00:00
Davide Italiano	2be111bf7d	Follow up to r225617. In order to maximize the re-usability of kernel code in userland rename in-kernel getenv()/setenv() to kern_setenv()/kern_getenv(). This fixes a namespace collision with libc symbols. Submitted by: kmacy Tested by: make universe	2014-10-16 18:04:43 +00:00
Steven Hartland	ca6505b818	Prevent ZFS leaking pool free space When processing async destroys ZFS would leak space every txg timeout (5 seconds by default), if no writes occurred, until the pool is totally full. At this point it would be unfixable without a pool recreation. In addition if the machine was rebooted with the pool in this situation would fail to import on boot, hanging indefinitely, as the import process requires the ability to write data to the pool. Any attempts to query the pool status during the hung import would not return as the import holds the pool lock. The only way to import such a pool would be to specify -o readonly=on to the zpool import. zdb -bb <pool> can be used to check for "deferred free" size which is where this lost space will be counted. MFC after: 3 days Sponsored by: Multiplay	2014-10-16 02:23:27 +00:00
Xin LI	ba6e85e0cf	Use write_psize instead of write_asize when doing vdev_space_update. Without this change the accounting of L2ARC usage would be wrong and give 16EB free space because the number became negative and overflows. Obtained from: FreeNAS (issue #6239) MFC after: 2 weeks	2014-10-13 20:39:51 +00:00
Xin LI	a4f5b8db9f	Add a tunable for arc_shrink_shift (vfs.zfs.arc_shrink_shift) that controls how much fraction, 1/2^arc_shrink_shift, should be reclaimed when there is memory pressure. Submitted by: Richard Kojedzinszky <krichy at tvnetwork.hu> MFC after: 2 weeks	2014-10-13 05:34:10 +00:00
Xin LI	eba15cf463	MFV r272804: Refactor the code and stop restore_object from creating two transactions. Illumos issue: 3693 restore_object uses at least two transactions to restore an object MFC after: 2 weeks	2014-10-09 07:52:51 +00:00
Xin LI	ce44f14b41	MFV r272803: Illumos issue: 5175 implement dmu_read_uio_dbuf() to improve cached read performance MFC after: 2 weeks	2014-10-09 07:18:40 +00:00
Andriy Gapon	c3d1d2e104	l2arc_write_buffers: reduce headroom value FreeBSD has ARC_BUFC_NUMMETADATALISTS metadata lists and ARC_BUFC_NUMDATALISTS data lists (currently both are 16) while illumos has just a single list of each kind. headroom determines how much data is scanned on a single list during each run of the l2arc feed thread. Because FreeBSD has more lists we proportionally decrease the limit. Reviewed by: Brendan Gregg (earlier version) MFC after: 2 weeks Sponsored by: HybridCluster	2014-10-07 16:08:21 +00:00
Andriy Gapon	9f96723ec5	revert r272702: wrong (earlier) change was committed	2014-10-07 16:06:10 +00:00
Andriy Gapon	4c3b02bfce	reduce L2ARC_WRITE_SIZE on FreeBSD FreeBSD has ARC_BUFC_NUMMETADATALISTS metadata lists and ARC_BUFC_NUMDATALISTS data lists (currently both are 16) while illumos has just a single list of each kind. L2ARC_WRITE_SIZE determines the default value of l2arc_write_max which defines limits on how much data is scanned and written to a cache device during each run of the l2arc feed thread. The limits are applied on the per buffer list basis. Because FreeBSD has more lists we proportionally reduce the limits. Reviewed by: Brendan Gregg (earlier version) MFC after: 2 weeks Sponsored by: HybridCluster	2014-10-07 14:30:24 +00:00
Xin LI	1b5bcb8425	MFV r272591: Use loaned ARC buffer for zfs receive to avoid copy. Illumos issue: 5162 zfs recv should use loaned arc buffer to avoid copy MFC after: 2 weeks	2014-10-06 07:29:17 +00:00
Xin LI	8fb26f5aef	MFV r272585: Split the godfather zio into CPU number's to reduce lock contention. Illumos issue: 5176 lock contention on godfather zio MFC after: 2 weeks	2014-10-06 07:03:17 +00:00
Xin LI	dcb20006f0	MFV r272501: Illumos issue: 5177 remove dead code from dsl_scan.c MFC after: 2 weeks	2014-10-06 05:46:51 +00:00
Xin LI	00769ce74d	MFV r272500: Don't inherit flags other than DS_FLAG_CI_DATASET and DS_FLAG_INCONSISTENT when cloning. This prevents DS_FLAG_DEFER_DESTROY being inherited from a clone that is marked for deferred destroy, which causes snapshots of the clone being destroyed when getting a hold or clone. Illumos issue: 5150 zfs clone of a defer_destroy snapshot causes strangeness MFC after: 1 week	2014-10-06 05:42:20 +00:00
Xin LI	4bb264ae15	Don't make nested definition for range_seg_cache. Reported by: ian MFC after: 1 week X-MFC-With: r272506	2014-10-04 15:42:52 +00:00
Xin LI	4750c382a9	MFV r272499: Illumos issue: 5174 add sdt probe for blocked read in dbuf_read() MFC after: 2 weeks	2014-10-04 08:55:08 +00:00
Xin LI	eb0b70068c	Add a new sysctl, vfs.zfs.vol.unmap_enabled, which allows the system administrator to toggle whether ZFS should ignore UNMAP requests. Illumos issue: 5149 zvols need a way to ignore DKIOCFREE MFC after: 2 weeks	2014-10-04 08:51:57 +00:00
Xin LI	2d36d67c72	Diff reduction with upstream. The code change is not really applicable to FreeBSD. Illumos issue: 5148 zvol's DKIOCFREE holds zfsdev_state_lock too long MFC after: 1 month	2014-10-04 08:41:23 +00:00
Xin LI	523b4c7fdf	MFV r272496: Add tunable for number of metaslabs per vdev (vfs.zfs.vdev.metaslabs_per_vdev). The default remains at 200. Illumos issue: 5161 add tunable for number of metaslabs per vdev MFC after: 2 weeks	2014-10-04 08:29:48 +00:00
Xin LI	a8d7512709	MFV r272495: In arc_kmem_reap_now(), reap range_seg_cache too to reclaim memory in response of memory pressure. Illumos issue: 5163 arc should reap range_seg_cache MFC after: 1 week	2014-10-04 08:14:10 +00:00
Xin LI	8c20e2ff11	MFV r272494: Make space_map_truncate() always do space_map_reallocate(). Without this, setting space_map_max_blksz would cause panic for existing pool, as dmu_objset_set_blocksize would fail if the object have multiple blocks. Illumos issues: 5164 space_map_max_blksz causes panic, does not work 5165 zdb fails assertion when run on pool with recently-enabled spacemap_histogram feature MFC after: 2 weeks	2014-10-04 08:05:39 +00:00
Steven Hartland	14a0d74ea8	Refactor ZFS ARC reclaim checks and limits Remove previously added kmem methods in favour of defines which allow diff minimisation between upstream code base. Rebalance ARC free target to be vm_pageout_wakeup_thresh by default which eliminates issue where ARC gets minimised instead of balancing with VM pageout. The restores the target point prior to r270759. Bring in missing upstream only changes which move unused code to further eliminate code differences. Add additional DTRACE probe to aid monitoring of ARC behaviour. Enable upstream i386 code paths on platforms which don't define UMA_MD_SMALL_ALLOC. Fix mixture of byte an page values in arc_memory_throttle i386 code path value assignment of available_memory. PR: 187594 Review: D702 Reviewed by: avg MFC after: 1 week X-MFC-With: r270759 & r270861 Sponsored by: Multiplay	2014-10-03 20:34:55 +00:00
Steven Hartland	99140218aa	Fix various issues with zvols When performing snapshot renames we could deadlock due to the locking in zvol_rename_minors. In order to avoid this use the same workaround as zvol_open in zvol_rename_minors. Add missing zvol_rename_minors to dsl_dataset_promote_sync. Protect against invalid index into zv_name in zvol_remove_minors. Replace zvol_remove_minor calls with zvol_remove_minors to ensure any potential children are also renamed. Don't fail zvol_create_minors if zvol_create_minor returns EEXIST. Restore the valid pool check in zfs_ioc_destroy_snaps to ensure we don't call zvol_remove_minors when zfs_unmount_snap fails. PR: 193803 MFC after: 1 week Sponsored by: Multiplay	2014-10-03 14:49:48 +00:00
Marcelo Araujo	d8a5961f88	Fix failures and warnings reported by newpynfs20090424 test tool. This fix addresses only issues with the pynfs reports, none of these issues are know to create problems for extant real clients. Submitted by: Bart Hsiao <bart.hsiao@gmail.com> Reworked by: myself Reviewed by: rmacklem Approved by: rmacklem Sponsored by: QNAP Systems Inc.	2014-10-03 02:24:41 +00:00
Will Andrews	fbce0221eb	zfsvfs_create(): Refuse to mount datasets whose names are too long. This is checked for in the zfs_snapshot_004_neg STF/ATF test (currently still in projects/zfsd rather than head). sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c: - zfsvfs_create(): Check whether the objset name fits into statfs.f_mntfromname, and return ENAMETOOLONG if not. Although the filesystem can be unmounted via the umount(8) command, any interface that relies on iterating on statfs (e.g. libzfs) will fail to find the filesystem by its objset name, and thus assume it's not mounted. This causes "zfs unmount", "zfs destroy", etc. to fail on these filesystems, whether or not -f is passed. MFC after: 1 month Sponsored by: Spectra Logic MFSpectraBSD: 974872 on 2013/08/09	2014-10-01 14:12:02 +00:00
Xin LI	0b66c7c514	Fix a mismerge in r260183 which prevents snapshot zvol devices being removed and re-instate the fix in r242862. Reported by: Leon Dang <ldang nahannisys com>, smh MFC after: 3 days	2014-09-30 18:50:45 +00:00
Will Andrews	91dda985cc	Remove debug.zfs_flags in favor of the new vfs.zfs.debug_flags. Replace TUNABLE_INT with CTLFLAG_RWTUN. Submitted by: avg (debug.zfs_flags removal), smh (TUNABLE_INT replacement)	2014-09-18 18:46:38 +00:00
Will Andrews	f8c2f66a6c	Enable ZFS debug flags to be modified via vfs.zfs.debug_flags. This is primarily only of interest to ZFS developers, but it makes it easier to get additional debugging. Submitted by: gibbs MFC after: 1 month Sponsored by: Spectra Logic MFSpectraBSD: 517074 on 2011/12/15 (by will), 662343 on 2013/03/20 (by gibbs)	2014-09-18 16:55:41 +00:00
Will Andrews	cf0a1157d7	Reorder sysctls for spa.c global tunables; add sysctl for ccw_retry_interval. MFC after: 1 month Sponsored by: Spectra Logic	2014-09-18 16:38:03 +00:00
Will Andrews	cf7a096e72	bpobj_iterate_impl(): Close a refcount leak iterating on a sublist. If bpobj_space() returned non-zero here, the sublist would have been left open, along with the bonus buffer hold it requires. This call does not invoke any calls to bpobj_close() itself. This bug doesn't have any known vector, but was found on inspection. MFC after: 1 week Sponsored by: Spectra Logic Affects: All ZFS versions starting 21 May 2010 (illumos cde58dbc) MFSpectraBSD: r1050998 on 2014/03/26	2014-09-18 15:37:53 +00:00
Steven Hartland	d1d469e22b	Remove unused ZFS ARC functions * arc_data_buf_alloc * arc_data_buf_free MFC after: 1 week Sponsored by: Multiplay	2014-09-18 10:46:51 +00:00
Steven Hartland	a889b18c52	Added missing ZFS sysctls * vfs.zfs.vdev.async_write_active_min_dirty_percent * vfs.zfs.vdev.async_write_active_max_dirty_percent Added validation of min / max for ZFS sysctl * vfs.zfs.dirty_data_max_percent MFC after: 3 days	2014-09-14 12:23:00 +00:00
Xin LI	f9290bc2c9	MFV r271518: Correctly report hole at end of file. When asked to find a hole, the DMU sees that there are no holes in the object, and returns ESRCH. The ZPL interprets this as "no holes before the end of the file", and therefore inserts the "virtual hole" at the end of the file. Because DMU and ZPL have different ideas of where the end of an object/file is, we will end up returning the end of file, which is generally larger, instead of returning the end of object. The fix is to handle the "virtual hole" in the DMU. If no hole is found, the DMU will return a hole at the end of the file, rather than an error. Illumos issue: 5139 SEEK_HOLE failed to report a hole at end of file MFC after: 1 week	2014-09-13 17:48:44 +00:00
Xin LI	dc147754b7	MFV r271517: In zil_claim, don't issue warning if we get EBUSY (inconsistent) when opening an objset, instead, ignore it silently. Illumos issue: 5140 message about "%recv could not be opened" is printed when booting after crash MFC after: 1 week	2014-09-13 17:36:34 +00:00
Xin LI	be1b14a063	MFV r271515: Add a new tunable/sysctl, vfs.zfs.free_max_blocks, which can be used to limit how many blocks can be free'ed before a new transaction group is created. The default is no limit (infinite), but we should probably have a lower default, e.g. 100,000. With this limit, we can guard against the case where ZFS could run out of memory when destroying large numbers of blocks in a single transaction group, as the entire DDT needs to be brought into memory. Illumos issue: 5138 add tunable for maximum number of blocks freed in one txg MFC after: 2 weeks	2014-09-13 17:24:56 +00:00
Xin LI	ff0fc48bde	MFV r271512: Illumos issue: 5136 fix write throttle comment in dsl_pool.c MFC after: 2 weeks	2014-09-13 16:51:23 +00:00
Xin LI	263f396e2b	MFV r271510: Enforce 4K as smallest indirect block size (previously the smallest indirect block size was 1K but that was never used). This makes some space estimates more accurate and uses less memory for some data structures. Illumos issue: 5141 zfs minimum indirect block size is 4K MFC after: 2 weeks	2014-09-13 16:26:14 +00:00
Steven Hartland	3cdd9138c3	Persist vdev_resilver_txg changes to avoid panic caused by validation vs a vdev_resilver_txg value from a previous resilver. MFC after: 1 week	2014-09-11 16:21:51 +00:00
Gleb Smirnoff	27ad26d8c7	Remove unused arguments for VOP_GETPAGES(), VOP_PUTPAGES().	2014-09-10 12:36:41 +00:00
Alexander Motin	ee9534ed96	Make ZVOL writes in device mode support IO_SYNC flag. MFC after: 1 month	2014-09-09 11:29:55 +00:00
Xin LI	817d804595	MFV r271223: In dnode_sync(), do dnode_increase_indirection() before processing the dn_next_nblkptr. Illumos issue: 5117 space map reallocation can cause corruption MFC after: 3 days	2014-09-07 13:13:42 +00:00
Peter Wemm	d903c21a64	Move the restored #ifdef i386 test back inside the #ifdef _KERNEL block where it originally was.	2014-08-31 09:05:02 +00:00
Steven Hartland	92ac3eb59f	Ensure that ZFS ARC free memory checks include cached pages Also restore kmem_used() check for i386 as it has KVA limits that the raw page counts above don't consider PR: 187594 Reviewed by: peter X-MFC-With: r270759 Review: D700 Sponsored by: Multiplay	2014-08-30 21:44:32 +00:00
Mateusz Guzik	6662ce5aab	Add missing proctree locking to fill_kinfo_proc consumers. This fixes r270444. Pointy hat: mjg Reported by: many MFC after: 1 week	2014-08-30 03:10:55 +00:00
Steven Hartland	4d19f4ad1f	Refactor ZFS ARC reclaim logic to be more VM cooperative Prior to this change we triggered ARC reclaim when kmem usage passed 3/4 of the total available, as indicated by vmem_size(kmem_arena, VMEM_ALLOC). This could lead large amounts of unused RAM e.g. on a 192GB machine with ARC the only major RAM consumer, 40GB of RAM would remain unused. The old method has also been seen to result in extreme RAM usage under certain loads, causing poor performance and stalls. We now trigger ARC reclaim when the number of free pages drops below the value defined by the new sysctl vfs.zfs.arc_free_target, which defaults to the value of vm.v_free_target. Credit to Karl Denninger for the original patch on which this update was based. PR: 191510 and 187594 Tested by: dteske MFC after: 1 week Relnotes: yes Sponsored by: Multiplay	2014-08-28 19:50:08 +00:00
Xin LI	ec1b564650	Instead of using timestamp in the AVL, use the memory address when comparing. Illumos issue: 5095 panic when adding a duplicate dbuf to dn_dbufs MFC after: 3 days	2014-08-22 23:13:53 +00:00
Xin LI	fa4484104c	MFV r270197: Illumos issue: 5066 remove support for non-ANSI compilation 5068 Remove SCCSID() macro from <macros.h> MFC after: 2 weeks	2014-08-22 22:13:36 +00:00
Xin LI	7c1db36b28	MFV r270196: Illumos issue: 5047 don't use atomic_*_nv if you discard the return value MFC after: 2 weeks	2014-08-20 22:39:26 +00:00
Xin LI	249ddb42f6	MFC r270195: Illumos issue: 5045 use atomic_{inc,dec}_* instead of atomic_add_* MFC after: 2 weeks	2014-08-20 21:44:48 +00:00
Xin LI	2bcc37f99c	MFV r270193: Illumos issues: 5042 stop using deprecated atomic functions MFC after: 2 weeks	2014-08-20 18:29:18 +00:00
Xin LI	60723bfe21	MFV r269542: In vdev_get_stats, check that the vdev is not a hole before computing the fragmentation. This fixes a panic when removing log device. Illumos issue: 5049 panic when removing log device Author: Alex Reece <alex@delphix.com> MFC after: 2 weeks	2014-08-05 00:07:21 +00:00
Mark Johnston	2661328745	Return 0 for the PPID of threads in process 0, as process 0 doesn't have a parent process. MFC after: 2 weeks	2014-08-04 19:02:30 +00:00
Xin LI	cd741a5e1d	Revert r269404 and use cpu_ticks() for dbuf allocation. Encode CPU's number by XOR'ing the CPU ID against the 64-bit cpu_ticks(). Reviewed by: mav, gibbs Differential Revision: https://phabric.freebsd.org/D521 MFC after: 2 weeks	2014-08-03 09:47:51 +00:00
Xin LI	1dcef10eac	MFV r269427: In dnode_children_t, use C99's "[]" idiom for declaring the variable sized array dnc_children at the end of the structure. This prevents the compiler from mistakenly optimizing away accesses beyond the array's defined size. Illumos issue: 5038 Remove "old-style" flexible array usage in ZFS. Author: Justin T. Gibbs <justing@spectralogic.com> MFC after: 2 weeks	2014-08-02 08:34:22 +00:00
Steven Hartland	6a369c018c	Don't return ZIO_PIPELINE_CONTINUE from vdev_op_io_start methods This prevents recursion of vdev_queue_io_done as per r265321 but using a different method as recommended on the openzfs list. We now use zio_interrupt(zio) and return ZIO_PIPELINE_STOP instead of returning ZIO_PIPELINE_CONTINUE from vdev_*_io_start methods. zio_vdev_io_start now ASSERTS the that vdev_op_io_start returns ZIO_PIPELINE_STOP to ensure future changes don't reintroduce ZIO_PIPELINE_CONTINUE returns. Cleanup flow in vdev_geom_io_start while I'm here. Also fix some cases not using SET_ERROR(..) MFC after: 2 weeks X-MFC-With: r265321	2014-08-01 23:16:48 +00:00
Rui Paulo	d18aa577d5	Copy strtolctype.h to sys/cddl/contrib/opensolaris/common/util to keep the kernel self-contained. Requested by: jhb	2014-07-31 08:07:23 +00:00
Xin LI	9b046b421f	MFV r269224: Increase default ARC buf_hash_table size. When typical block size is small, the hash table could be too small, which would lead to long hash chains and limit performance for cached reads. A new loader tunable, vfs.zfs.arc_average_blocksize, have been added which allows users to override the default assumption of average (typical) block size. Old default was 65536 (64 KiB) and new default is 8192 (8 KiB). Illumos issue: 5034 ARC's buf_hash_table is too small MFC after: 2 weeks	2014-07-29 09:36:48 +00:00
Xin LI	a3cbca537e	MFV r269223: Change dn->dn_dbufs from linked list to AVL tree. Illumos issues: 4873 zvol unmap calls can take a very long time for larger datasets MFC after: 2 weeks	2014-07-29 08:42:22 +00:00
Xin LI	343c95a24e	Reschedule the 'deadman' callout after handling, this makes our code behave more like it is on Solaris. Reported by: avg Reviewed by: avg, mav (but bugs are mine) Differential Revision: https://phabric.freebsd.org/D457	2014-07-29 06:57:13 +00:00
Konstantin Belousov	fe0e9a63e0	Initialize zfs vnode v_hash when the vnode is allocated, instead of postponing it to zfs_vget(). zfs_root() returned vnode with the default value of v_hash, which caused inconsistent v_hash value when root vnode was obtained from zfs_vget(). Nullfs allocated two upper vnodes for the root zfs vnode due to different hashes, causing consistency problems. Reported and tested by: Harald Schmalzbauer <h.schmalzbauer@omnilan.de> Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-07-28 14:24:18 +00:00
Xin LI	50b74c6ef1	Add two sysctls for newly added tunables. MFC after: 2 weeks	2014-07-26 19:07:08 +00:00
Xin LI	7e37b1e609	MFV r269010: Import Illumos changes to address the following Illumos issues: 4976 zfs should only avoid writing to a failing non-redundant top-level vdev 4978 ztest fails in get_metaslab_refcount() 4979 extend free space histogram to device and pool 4980 metaslabs should have a fragmentation metric 4981 remove fragmented ops vector from block allocator 4982 space_map object should proactively upgrade when feature is enabled 4984 device selection should use fragmentation metric MFC after: 2 weeks	2014-07-26 10:20:48 +00:00
Alexander Motin	1bc04f6a8c	Make sysctls under vfs.zfs.zfetch writeable. I don't see any reason for them to be read-only, while tuning them without reboot is much more convenient for experiments. MFC after: 2 weeks	2014-07-26 09:09:14 +00:00
Xin LI	0aa4ce9b7d	Transform the I/O when vdev_physical_ashift is greater than SPA_MINBLOCKSHIFT. MFC after: 2 weeks	2014-07-25 18:41:56 +00:00
Xin LI	883d80c104	As of r268075, the responsibility of rounding up buffer to optimal size have been transferred from zio_compress_data to its caller. Therefore, passing the 'minblocksize' down will be a no-op. Eliminate the parameter to reduce diff against upstream. MFC after: 2 weeks	2014-07-25 06:53:20 +00:00
Xin LI	3d4d6b0883	Correct typo introduced with r268855. MFC after: 10 days X-MFC with: r268855	2014-07-22 08:37:01 +00:00
Xin LI	b4bb49887b	Reduce lock contention on the z_teardown_lock under heavily cached read workload by splitting the single teardown rrw lock into RRM_NUM_LOCKS (17) of them. Read acquisitions are randomly distributed among these locks based on curthread pointer. Write acquisitions are going to all the locks, which for the usage of this type of lock should be rare. Illumos issue: 5008 lock contention (rrw_exit) while running a read only load MFC after: 2 weeks	2014-07-19 00:26:03 +00:00
Xin LI	82599d31fe	MFV r268851: When a sync task is waiting for a txg to complete, we should hurry it along by increasing the number of outstanding async writes (i.e. make vdev_queue_max_async_writes() return a larger number). Illumos issue: 4753 increase number of outstanding async writes when sync task is waiting MFC after: 2 weeks	2014-07-18 22:34:01 +00:00
Xin LI	f886b6e3bc	MFV r268850: Change the interaction between the DMU and ARC so that when the DMU is shutting down an objset, we do not evict the data from the ARC. Instead we simply coordinate the destruction of the DMU's data with the ARC. The only case where we actually need to explicitly evict from the ARC is when dbuf_rele_and_unlock() determines that the administrator has requested that it not be kept in memory, via the primarycache/secondarycache properties. In this case, we evict the data from the ARC by its blkptr_t, the same way as when a block is freed we explicitly evict it from the ARC. Illumos issue: 4631 zvol_get_stats triggering too many reads MFC after: 2 weeks	2014-07-18 22:04:21 +00:00
Xin LI	7882b61f60	MFV r268848: Instead of asserting all zio's be properly aligned, only assert on the logical ones. Cap uberblocks at 8k, otherwise with ashift=17, there would be only one uberblock. This fixes a problem that zdb would trip assert on pools with ashift >= 0xe (8k). While there, also change the code so it only attempt to condense space map unless the uncondensed size consumes greater than zfs_metaslab_condense_block_threshold blocks. Illumos issue: 4958 zdb trips assert on pools with ashift >= 0xe MFC after: 2 weeks	2014-07-18 20:41:40 +00:00
Xin LI	7079d5877c	MFV r268714: Improve extreme rewind import. When doing an "extreme rewind" import ("zpool import -XF"), we attempt to verify all data in the pool, essentially scrubbing the entire pool. The problem is that spa_load_verify_cb() issues an unbounded number of concurrent scrub i/os. This can lead to all of memory being used for these zio's, wedging the system. Like normal scrub, we need to put a cap on the number of outstanding i/os, and have the traverse thread block when we reach this cap. For this purpose the cap can be very large (10,000) to optimize the elevator algorithm. Three kernel tunables have been added: vfs.zfs.spa_load_verify_maxinflight vfs.zfs.spa_load_verify_metadata vfs.zfs.spa_load_verify_data The latter two tunables controls whether metadata and/or user data when doing extreme rewind. Make 'zpool import -T' imply scrub. Make zpool import -T <txg> accept hexadecimal values for the txg when prefixed with 0x. Skip txg's for which there is no uberblock when doing extreme rewind. Skip reading all user data twice by skipping prefetches when doing extreme rewinds as we do not access via the ARC. Illumos issues: 4970 need controls on i/o issued by zpool import -XF 4971 zpool import -T should accept hex values 4972 zpool import -T implies extreme rewind, and thus a scrub 4973 spa_load_retry retries the same txg 4974 spa_load_verify() reads all data twice MFC after: 2 weeks	2014-07-15 22:44:04 +00:00
Xin LI	eb75155228	MFV r268702: Add missing *_destroy() calls in various places with ZFS. Illumos issue: 4975 missing mutex_destroy() calls in zfs MFC after: 2 weeks	2014-07-15 20:32:23 +00:00
Xin LI	1b174fa1eb	MFV r268455: Use reserved space for ZFS administrative commands. We reserve 1/2^spa_slop_shift = 1/32 or 3.125% of pool space (or 32MB at least) for system use. Most ZPL operations, e.g. write(2), creat(2), will fail with ENOSPC if we fall below this. Certain operations, e.g. file removal and most administrative actions, still permitted until half of the slop space is used. This would allow users to use these operations to free up space in the pool when pool is close to full but half of slop space is still free. A very restricted set of operations that frees up space or change quota are always permitted, regardless of the amount of free space. MFC after: 2 weeks	2014-07-09 23:14:59 +00:00
Xin LI	fdc0ee2cf5	MFV r268452: Explicitly mark file removal transactions as "presumed to result in a net free of space" so they will not fail with ENOSPC. Illumos issue: 4950 files sometimes can't be removed from a full filesystem MFC after: 2 weeks	2014-07-09 18:32:40 +00:00
Alexander Motin	e327a057a7	Remove IO_SYNC flag when writing extended file attributes on ZFS. While it is possible to create and write file, modify its permissions, etc. without ever doing sync, it looks odd that it is required for setting extended file attributes on ZFS. UFS does not do sync there too. Samba uses those extended attributes to store some its data, and doing it synchronously by many times reduces file creation performance for systems without SLOG device. Reviewed by: delphij, jpaetzel, silence on fs@ MFC after: 2 weeks Sponsored by: iXsystems, Inc.	2014-07-08 17:26:08 +00:00
Marcel Moolenaar	e7d939bda2	Remove ia64. This includes: o All directories named ia64 o All files named ia64 o All ia64-specific code guarded by __ia64__ o All ia64-specific makefile logic o Mention of ia64 in comments and documentation This excludes: o Everything under contrib/ o Everything under crypto/ o sys/xen/interface o sys/sys/elf_common.h Discussed at: BSDcan	2014-07-07 00:27:09 +00:00
Pedro F. Giffuni	5f40879138	Merge from OpenSolaris (24-Jul-2010): 6679140 asymmetric alloc/dealloc activity can induce dynamic variable drops 6679193 dtrace_dynvar walker produces flood of dtrace_dynhash_sink This finishes a set of merges from the older OpenSolaris releases. Still the FreeBSD port has many differences that are difficult to account for but that seems normal given that the kernels are different. MFC after: 1 week	2014-07-05 15:36:17 +00:00
Pedro F. Giffuni	99d8c6efbd	Merge from OpenSolaris (30-Jun-2009): 6851093 system drops to kmdb with anonymous dtrace probes + kmdb This has no effect on FreeBSD (code is ifdef'ed) but is useful as reference for future merges. MFC after: 1 week	2014-07-03 19:25:24 +00:00
Pedro F. Giffuni	87e109c3e0	Merge from OpenSolaris (22-Apr-2008): 6823388 DTrace ioctl handlers must validate all structure members MFC after: 1 week	2014-07-03 19:07:37 +00:00
Pedro F. Giffuni	e099b3a948	Merge from OpenSolaris (20-Apr-2008): 6822482 DOF validation needs to handle loadable sections flagged as unloadable MFC after: 1 week	2014-07-03 17:36:59 +00:00
Alexander Motin	5a178afd41	Fix bug in sync control in new "dev" mode of ZVOL (r265678). Don't check ZVOL_WCE flag, used in Solaris to control device "write cache". It is not applicable on FreeBSD and by default set to "disable". MFC after: 3 days	2014-07-02 21:25:32 +00:00
Pedro F. Giffuni	0b8f286e83	Merge from OpenSolaris (15-Sep-2008): 6735480 race between probe enabling and provider registration MFC after: 1 week	2014-07-01 23:37:24 +00:00
Xin LI	30324e945a	MFV r268122: 4929 want prevsnap property illumos/illumos-gate@b461c7460e MFC after: 2 weeks	2014-07-01 22:42:53 +00:00
Xin LI	9cc8a15b2e	MFV r268121: 4924 LZ4 Compression for metadata illumos/illumos-gate@b8289d24d8 MFC after: 2 weeks	2014-07-01 22:31:09 +00:00
Pedro F. Giffuni	f384ec379c	Small merges from OpenSolaris: These have no effect on FreeBSD, in fact they are ifdef'ed, but make easier future merges: 6699767 panic in spec_open() 6718877 crgetzoneid() use can cause problems when forking processes with USDT providers in a non global zone MFC after: 3 days	2014-07-01 22:16:44 +00:00
Xin LI	aa882b9048	MFV r268119: 4914 zfs on-disk bookmark structure should be named *_phys_t illumos/illumos-gate@7802d7bf98 MFC after: 2 weeks	2014-07-01 21:51:30 +00:00
Xin LI	55f6421982	- Fix handling of "new" style of ioctl in compatiblity mode [1]; - Reorganize code and reduce diff from upstream; - Improve forward compatibility shims for previous kernel; Reported by: sbruno [1] X-MFC-With: r268075	2014-07-01 20:57:39 +00:00
Pedro F. Giffuni	c6d712caf3	Revert r268007, and re-adapt MFV r260708: 4427 pid provider rejects probes with valid UTF-8 names Use of u8_textprep.c required -Wno-cast-qual for powerpc. MFC after: 2 weeks	2014-07-01 15:36:05 +00:00
Xin LI	be78a8db97	MFV r267570: 4756 metaslab_group_preload() could deadlock illumos/illumos-gate@30beaff42d MFC after: 2 weeks	2014-07-01 08:36:56 +00:00
Xin LI	3a0f8ff95e	MFV r267569: 4897 Space accounting mismatch in L2ARC/zpool illumos/illumos-dist@3038a2b421 MFC after: 2 weeks	2014-07-01 08:28:49 +00:00
Xin LI	93b8d53c09	MFV r267567: 4881 zfs send performance degradation when embedded block pointers are encountered illumos/illumos-gate@06315b795c MFC after: 2 weeks	2014-07-01 07:56:07 +00:00
Xin LI	71eaf0fda7	MFV r267566: 4390 i/o errors when deleting filesystem/zvol can lead to space map corruption MFC after: 2 weeks	2014-07-01 07:29:42 +00:00
Xin LI	29441ba3fa	MFV r267565: 4757 ZFS embedded-data block pointers ("zero block compression") 4913 zfs release should not be subject to space checks MFC after: 2 weeks	2014-07-01 06:43:15 +00:00
Pedro F. Giffuni	0135aadfc3	Reduce some warnings in the Solaris unicode support. Clean some warnings from parenthesis and minor style issues. MFC after: 3 days	2014-06-29 02:28:05 +00:00
Pedro F. Giffuni	f34dd28f7d	Revert r267869: MFV r260708 4427 pid provider rejects probes with valid UTF-8 names Use of u8_textprep.c broke the build on powerpc. Reported by: bz, rpaulo and tinderbox. Pointyhat: me	2014-06-28 19:59:12 +00:00
Hans Petter Selasky	af3b2549c4	Pull in r267961 and r267973 again. Fix for issues reported will follow.	2014-06-28 03:56:17 +00:00
Glen Barber	37a107a407	Revert r267961, r267973: These changes prevent sysctl(8) from returning proper output, such as: 1) no output from sysctl(8) 2) erroneously returning ENOMEM with tools like truss(1) or uname(1) truss: can not get etype: Cannot allocate memory	2014-06-27 22:05:21 +00:00
Hans Petter Selasky	3da1cf1e88	Extend the meaning of the CTLFLAG_TUN flag to automatically check if there is an environment variable which shall initialize the SYSCTL during early boot. This works for all SYSCTL types both statically and dynamically created ones, except for the SYSCTL NODE type and SYSCTLs which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to be used in the case a tunable sysctl has a custom initialisation function allowing the sysctl to still be marked as a tunable. The kernel SYSCTL API is mostly the same, with a few exceptions for some special operations like iterating childrens of a static/extern SYSCTL node. This operation should probably be made into a factored out common macro, hence some device drivers use this. The reason for changing the SYSCTL API was the need for a SYSCTL parent OID pointer and not only the SYSCTL parent OID list pointer in order to quickly generate the sysctl path. The motivation behind this patch is to avoid parameter loading cludges inside the OFED driver subsystem. Instead of adding special code to the OFED driver subsystem to post-load tunables into dynamically created sysctls, we generalize this in the kernel. Other changes: - Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask" to "hw.pcic.intr_mask". - Removed redundant TUNABLE statements throughout the kernel. - Some minor code rewrites in connection to removing not needed TUNABLE statements. - Added a missing SYSCTL_DECL(). - Wrapped two very long lines. - Avoid malloc()/free() inside sysctl string handling, in case it is called to initialize a sysctl from a tunable, hence malloc()/free() is not ready when sysctls from the sysctl dataset are registered. - Bumped FreeBSD version to indicate SYSCTL API change. MFC after: 2 weeks Sponsored by: Mellanox Technologies	2014-06-27 16:33:43 +00:00
Rui Paulo	a43f0be9fe	MFV illumos 4471 DTrace count() with histogram 4472 DTrace full width distribution histograms 4473 DTrace frequency trails MFC after: 2 weeks	2014-06-26 23:24:59 +00:00
Rui Paulo	8e648814b0	MFV illumos 4474 DTrace Userland CTF Support 4475 DTrace userland Keyword 4476 DTrace tests should be better citizens 4479 pid provider types 4480 dof emulation is missing checks MFC after: 2 weeks	2014-06-26 23:21:11 +00:00
Rui Paulo	b1f9167f94	MFV illumos 4477 DTrace should speak JSON MFC after: 2 weeks	2014-06-26 21:45:49 +00:00
Rui Paulo	0c2b601953	MFV illumos r266986: 2915 DTrace in a zone should see "cpu", "curpsinfo", et al 2916 DTrace in a zone should be able to access fds[] 2917 DTrace in a zone should have limited provider access MFC after: 2 weeks	2014-06-26 19:38:16 +00:00
Rui Paulo	dd9b2abed8	Revert r267898.	2014-06-26 17:34:42 +00:00
Rui Paulo	d8e37c5f72	Bring the following change from the illumos-joyent repository: commit 78e24ab6803bbe11ba37642624e1498ede5b239d Author: Bryan Cantrill <bryan@joyent.com> Date: Thu Oct 31 01:20:54 2013 OS-1688 DTrace count() with histogram OS-2360 DTrace full width distribution histograms OS-2361 DTrace frequency trails MFC after: 2 weeks	2014-06-26 07:06:43 +00:00
Pedro F. Giffuni	af8bd6e468	MFV r260708 4427 pid provider rejects probes with valid UTF-8 names This make use of Solaris' u8_validate() which we happen to use since r185029 for ZFS. Illumos Revision: 1444d846b126463eb1059a572ff114d51f7562e5 Reference: https://www.illumos.org/issues/4427 Obtained from: Illumos MFC after: 2 weeks	2014-06-25 14:23:30 +00:00
Davide Italiano	a99098e2ba	Continue the crusade towards a dev_clone()-free kernel, removing its usage from dtrace. The dtrace code already uses cdevpriv(9) since FreeBSD 8, so this change should be quite harmless. Reviewed by: markj Approved by: markj MFC after: never	2014-06-25 03:54:02 +00:00
Mark Johnston	efa1aff675	Fix some bugs when fetching probe arguments in i386. Firstly ensure that the 4 byte-aligned dtrace_invop_callsite can be found and that it immediately follows the call to dtrace_invop(). Secondly, fix some pointer arithmetic to account for differences between struct i386_frame and illumos' struct frame. Finally, ensure that dtrace_getarg() isn't inlined. It works by following a fixed number of frame pointers to the probe site, so inlining breaks it. MFC after: 3 weeks	2014-06-23 02:00:14 +00:00
Steven Hartland	74ddec2b18	Removed stale comment about multi-vdev root pool config not working MFC after: 1 week	2014-06-09 13:04:58 +00:00
Bryan Drewery	f3a7518361	- Naively fix build by partially reverting r267029 to still use gethrtime() when building libzpool. X-MFC-With: 267029	2014-06-04 05:04:15 +00:00
Alexander Motin	4220ebcf71	Replace gethrtime() with cpu_ticks(), as source of random for the taskqueue selection. gethrtime() in our port updated with HZ rate, so unusable for this specific purpose, completely draining benefit of multiple taskqueues. MFC after: 2 weeks	2014-06-03 21:06:03 +00:00
Xin LI	f4c7dd6dd0	MFV 266913+266914: 3897 zfs filesystem and snapshot limits (fix leak) 4901 zfs filesystem/snapshot limit leaks MFC after: 3 days	2014-05-31 01:00:22 +00:00
Xin LI	2bdf7f79bc	MFV r266766: Add a new zfs property, "redundant_metadata" which can have values "all" or "most". The default will be "all", which is the current behavior. When set to all, ZFS stores an extra copy of all metadata. If a single on-disk block is corrupt, at worst a single block of user data (which is recordsize bytes long) can be lost. Setting to "most" will cause us to only store 1 copy of level-1 indirect blocks of user data files. This can improve performance of random writes, because less metadata has to be written. In practice, at worst about 100 blocks (of recordsize bytes each) of user data can be lost if a single on-disk block is corrupt. The exact behavior of which metadata blocks are stored redundantly may change in future releases. Illumos issue: 3835 zfs need not store 2 copies of all metadata MFC after: 2 weeks	2014-05-27 19:46:11 +00:00
Allan Jude	ecd9567c1a	Improve sysctl descriptions for new ZFS sysctls: vfs.zfs.dirty_data_max vfs.zfs.dirty_data_max_max vfs.zfs.dirty_data_sync Reviewed by: smh Approved by: wblock (mentor)	2014-05-22 05:30:38 +00:00
Steven Hartland	df23182a62	Added sysctls / tunables for ZFS dirty data tuning Added the following new sysctls / tunables: * vfs.zfs.dirty_data_max * vfs.zfs.dirty_data_max_max * vfs.zfs.dirty_data_max_percent * vfs.zfs.dirty_data_sync * vfs.zfs.delay_min_dirty_percent * vfs.zfs.delay_scale PR: kern/189865 MFC after: 2 weeks	2014-05-21 13:36:04 +00:00
Xin LI	b8cdcb8ad8	Import George Wilson's change for Illumos #4730 : 4730 metaslab group taskq should be destroyed in metaslab_group_destroy() Reviewed by: Alex Reece <alex.reece@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Original author: George Wilson MFC after: 3 days	2014-05-06 19:03:04 +00:00
Steven Hartland	4f64781818	Use a zio flag to prevent recursion of vdev_queue_io_done which can cause stack overflow for IO's which return ZIO_PIPELINE_CONTINUE from the zio_vdev_io_start stage and hence don't suspend and complete in a different thread. This prevents double fault panic on slow machines running ZFS on GELI volumes which return EOPNOTSUPP directly to BIO_DELETE requests. MFC after: 1 month X-MFC-With: r265152	2014-05-04 14:05:14 +00:00
Steven Hartland	573621a6d6	Don't treat TRIM requests returning ENOTSUP as an unexpected error. MFC after: 1 month X-MFC-With: r265152	2014-05-03 02:30:01 +00:00
Steven Hartland	10138166cf	Removed pointless / duplicated call to trim_map_first. MFC after: 1 month X-MFC-With: r265152	2014-05-02 09:31:21 +00:00
Steven Hartland	82ce008538	Reintroduce priority for the TRIM ZIOs instead of using the "NOW" priority The changes how TRIM requests are generated to use ZIO_TYPE_FREE + a priority instead of ZIO_TYPE_IOCTL, until processed by vdev_geom; only then is it translated the required geom values. This reduces the amount of changes required for FREE requests to be supported by the new IO scheduler. This also eliminates the need for a specific DKIOCTRIM. Also fixed FREE vdev child IO's from running ZIO_STAGE_VDEV_IO_DONE as part of their schedule. As the new IO scheduler can result in a request to execute one type of IO to actually run a different type of IO it requires that zio_trim requests are processed without holding the trim map lock (tm->tm_lock), as the free request execute call may result in write request running hence triggering a trim_map_write_start call, which takes the trim map lock and hence would result in recused on no-recursive sx lock. This is based off avg's original work, so credit to him. MFC after: 1 month	2014-04-30 17:46:29 +00:00
Steven Hartland	101dfa0ed4	Fix ZIO reordering done by vdev_queue_io causing panics when zio_vdev_io_start returns ZIO_PIPELINE_CONTINUE from vdev_op_io_start to zio_execute resulting in the wrong ZIO continuing its pipeline. This is a serious issue which could cause data loss / corruption but appears to be limited to error handling such as when vdev_readable(vd) returns false. MFC after: 2 days	2014-04-28 09:00:00 +00:00
Steven Hartland	c2b2c5fc76	Eliminate duplicate checks in vdev_geom_io_intr error handling MFC after: 1 month	2014-04-24 15:36:00 +00:00
Steven Hartland	5b245b8ae0	Add the ability to set a minimum ashift size for ZFS pool creation or root level vdev addition. Change max_auto_ashift sysctl to error when an invalid value is requested instead of silently limiting it.	2014-04-24 01:06:03 +00:00
Xin LI	754180f4ae	MFV r264830: 4745 fix AVL code misspellings MFC after: 2 weeks	2014-04-23 20:32:39 +00:00
Xin LI	f8587167e4	MFV r264829: 3897 zfs filesystem and snapshot limits MFC after: 2 weeks	2014-04-23 20:29:46 +00:00
Xin LI	18ab4bd8d9	MFV r264668: 4754 io issued to near-full luns even after setting noalloc threshold 4755 mg_alloc_failures is no longer needed illumos/illumos@b6240e830b MFC after: 2 weeks	2014-04-18 22:04:58 +00:00
Xin LI	d301d390a7	MFV r264667: 4752 fan out read zio taskqs illumos/illumos-gate@1b497ab83e	2014-04-18 21:35:23 +00:00
Xin LI	613074ec08	MFV r264666: 4374 dn_free_ranges should use range_tree_t illumos/illumos-gate@bf16b11e8d MFC after: 2 weeks	2014-04-18 21:15:12 +00:00
Mark Johnston	0626f3e435	DTrace's pid provider works by inserting breakpoint instructions at probe sites and installing a hook at the kernel's trap handler. The fasttrap code will emulate the overwritten instruction in some common cases, but otherwise copies it out into some scratch space in the traced process' address space and ensures that it's executed after returning from the trap. In Solaris and illumos, this (per-thread) scratch space comes from some reserved space in TLS, accessible via the fs segment register. This approach is somewhat unappealing on FreeBSD since it would require some modifications to rtld and jemalloc (for static TLS) to ensure that TLS is executable, and would thus introduce dependencies on their implementation details. I think it would also be impossible to safely trace static binaries compiled without these modifications. This change implements the functionality in a different way, by having fasttrap map pages into the target process' address space on demand. Each page is divided into 64-byte chunks for use by individual threads, and fasttrap's process descriptor struct has been extended to keep track of any scratch space allocated for the corresponding process. With this change it's possible to trace all libc functions in a program, e.g. with pid$target:libc.so.*::entry {@[probefunc] = count();} Previously this would generally cause the victim process to crash, as tracing memcpy on amd64 requires the functionality described above. Tested by: Prashanth Kumar <pra_udupi@yahoo.co.in> (earlier version) MFC after: 6 weeks	2014-04-14 00:22:42 +00:00
Davide Italiano	2f9e29745c	Fix a panic in zfs_rename(). this is due to a wrong dereference of a vnode when it's not locked and can be (potentially) recycled. 'sdvp' cannot be locked on zfs_rename() entry point because the VFS can't be sure that this scenario is LOR-free (it might violate the parent->child lock acquisition rule). Dereference 'tdvp' instead, which is already locked on entry, and access 'sdvp' fields only when it's safe, i.e. under ZFS_ENTER scope. While at it, remove the usage of VOP_REALVP, as long as this is a NOP on FreeBSD. Discussed with: avg Reviewed by: pjd	2014-04-13 01:15:37 +00:00
Alexander Motin	f6e1dc83c3	Create zvol devices on zfs clone. While big and shiny patch is not ready, it is better to have something. PR: kern/178999 MFC after: 1 week	2014-04-11 11:56:16 +00:00
Alexander Motin	a96fefe042	In addition to r264077, tell GEOM that we do support BIO_DELETE now.	2014-04-06 16:31:28 +00:00
Alexander Motin	537650f54d	Add property and sysctl to control how ZVOLs are exposed to OS. New ZFS property volmode and sysctl vfs.zfs.vol.mode allow switching ZVOL between three modes: geom -- existing fully functional behavior (default); dev -- exposing volumes only as raw disk device file in devfs; none -- not exposing volumes outside ZFS. The "dev" mode is less functional (can't be partitioned, mounted, etc), but it is faster, and in some scenarios with untrusted consumers safer. It can be useful for NAS, VM block storages, etc. The "none" mode may be convenient for backup servers, etc. that don't need direct data access. Due to the way ZVOL is integrated with main ZFS code, those property and sysctl are checked only during pool import and volume creation. MFC after: 1 month Sponsored by: iXsystems, Inc.	2014-04-05 13:01:44 +00:00
Alexander Motin	89e84aead6	MFV r258922: 3580 Want zvols to return volblocksize when queried for physical block size illumos/illumos-gate@a0b60564df It is irrelevant for FreeBSD, just reducing diff.	2014-04-03 20:18:55 +00:00
Alexander Motin	4a03e8b64d	Add BIO_DELETE support to ZVOL. It is an adapted merge from the vendor branch of: 701 UNMAP support for COMSTAR (in part related to ZFS) 2130 zvol DKIOCFREE uses nested DMU transactions	2014-04-03 15:04:32 +00:00
Pedro F. Giffuni	23e4da439c	MFV r258379; 4248 dtrace(1M) should never create DOF with empty probes section 4249 Only probes from the first DTrace object file will be included Illumos Revision: 4a20ab41aadcb81c53e72fc65886e964e9add59 Reference: https://www.illumos.org/issues/4248 https://www.illumos.org/issues/4249 Obtained from: Illumos MFC after: 1 month	2014-04-02 15:32:44 +00:00
Bryan Drewery	44f1c91610	Rename global cnt to vm_cnt to avoid shadowing. To reduce the diff struct pcu.cnt field was not renamed, so PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in kvm(3) and vmstat(8). The goal was to not affect externally used KPI. Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the the global cnt variable. Exp-run revealed no ports using it directly. No objection from: arch@ Sponsored by: EMC / Isilon Storage Division	2014-03-22 10:26:09 +00:00
Alexander Motin	68d17718e0	Report ZVOL block size as GEOM stripesize. MFC after: 2 weeks	2014-03-13 19:26:26 +00:00

... 8 9 10 11 12 ...

1730 Commits