freebsd-dev

Author	SHA1	Message	Date
Steven Hartland	101dfa0ed4	Fix ZIO reordering done by vdev_queue_io causing panics when zio_vdev_io_start returns ZIO_PIPELINE_CONTINUE from vdev_op_io_start to zio_execute resulting in the wrong ZIO continuing its pipeline. This is a serious issue which could cause data loss / corruption but appears to be limited to error handling such as when vdev_readable(vd) returns false. MFC after: 2 days	2014-04-28 09:00:00 +00:00
Steven Hartland	c2b2c5fc76	Eliminate duplicate checks in vdev_geom_io_intr error handling MFC after: 1 month	2014-04-24 15:36:00 +00:00
Steven Hartland	5b245b8ae0	Add the ability to set a minimum ashift size for ZFS pool creation or root level vdev addition. Change max_auto_ashift sysctl to error when an invalid value is requested instead of silently limiting it.	2014-04-24 01:06:03 +00:00
Xin LI	f8587167e4	MFV r264829: 3897 zfs filesystem and snapshot limits MFC after: 2 weeks	2014-04-23 20:29:46 +00:00
Xin LI	18ab4bd8d9	MFV r264668: 4754 io issued to near-full luns even after setting noalloc threshold 4755 mg_alloc_failures is no longer needed illumos/illumos@b6240e830b MFC after: 2 weeks	2014-04-18 22:04:58 +00:00
Xin LI	d301d390a7	MFV r264667: 4752 fan out read zio taskqs illumos/illumos-gate@1b497ab83e	2014-04-18 21:35:23 +00:00
Xin LI	613074ec08	MFV r264666: 4374 dn_free_ranges should use range_tree_t illumos/illumos-gate@bf16b11e8d MFC after: 2 weeks	2014-04-18 21:15:12 +00:00
Davide Italiano	2f9e29745c	Fix a panic in zfs_rename(). This is due to a wrong dereference of a vnode when it's not locked and can be (potentially) recycled. 'sdvp' cannot be locked on zfs_rename() entry point because the VFS can't be sure that this scenario is LOR-free (it might violate the parent->child lock acquisition rule). Dereference 'tdvp' instead, which is already locked on entry, and access 'sdvp' fields only when it's safe, i.e. under ZFS_ENTER scope. While at it, remove the usage of VOP_REALVP, since this is a NOP on FreeBSD. Discussed with: avg Reviewed by: pjd	2014-04-13 01:15:37 +00:00
Alexander Motin	f6e1dc83c3	Create zvol devices on zfs clone. While big and shiny patch is not ready, it is better to have something. PR: kern/178999 MFC after: 1 week	2014-04-11 11:56:16 +00:00
Alexander Motin	a96fefe042	In addition to r264077, tell GEOM that we do support BIO_DELETE now.	2014-04-06 16:31:28 +00:00
Alexander Motin	537650f54d	Add property and sysctl to control how ZVOLs are exposed to OS. New ZFS property volmode and sysctl vfs.zfs.vol.mode allow switching ZVOL between three modes: geom -- existing fully functional behavior (default); dev -- exposing volumes only as raw disk device file in devfs; none -- not exposing volumes outside ZFS. The "dev" mode is less functional (can't be partitioned, mounted, etc), but it is faster, and in some scenarios with untrusted consumers safer. It can be useful for NAS, VM block storages, etc. The "none" mode may be convenient for backup servers, etc. that don't need direct data access. Due to the way ZVOL is integrated with main ZFS code, those property and sysctl are checked only during pool import and volume creation. MFC after: 1 month Sponsored by: iXsystems, Inc.	2014-04-05 13:01:44 +00:00
Alexander Motin	89e84aead6	MFV r258922: 3580 Want zvols to return volblocksize when queried for physical block size illumos/illumos-gate@a0b60564df It is irrelevant for FreeBSD, just reducing diff.	2014-04-03 20:18:55 +00:00
Alexander Motin	4a03e8b64d	Add BIO_DELETE support to ZVOL. It is an adapted merge from the vendor branch of: 701 UNMAP support for COMSTAR (in part related to ZFS) 2130 zvol DKIOCFREE uses nested DMU transactions	2014-04-03 15:04:32 +00:00
Bryan Drewery	44f1c91610	Rename global cnt to vm_cnt to avoid shadowing. To reduce the diff struct pcpu.cnt field was not renamed, so PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in kvm(3) and vmstat(8). The goal was to not affect externally used KPI. Bump __FreeBSD_version in case some out-of-tree module/code relies on the global cnt variable. Exp-run revealed no ports using it directly. No objection from: arch@ Sponsored by: EMC / Isilon Storage Division	2014-03-22 10:26:09 +00:00
Alexander Motin	68d17718e0	Report ZVOL block size as GEOM stripesize. MFC after: 2 weeks	2014-03-13 19:26:26 +00:00
Xin LI	8e41e26f65	MFV r262983: 4638 Panic in ZFS via rfs3_setattr()/rfs3_write(): dirtying snapshot! illumos/illumos-gate@2144b121c0	2014-03-11 00:23:50 +00:00
Xin LI	ba680558a0	All callers of static method load_nvlist() in spa.c handle the error case, so there is no reason to assert that we won't hit an error. Instead, just return that error to caller and have the upper layer handle it. Obtained from: FreeNAS Reported by: rodrigc Reviewed by: Matthew Ahrens MFC after: 2 weeks	2014-03-02 02:41:33 +00:00
Xin LI	5f62f8cdcb	MFV r261619: 4574 get_clones_stat does not call zap_count in non-debug kernel zap_count(...) is never called in non-DEBUG kernel. As result "count" variable is always 0, and "goto fail" is always reached. This means get_clones_stat function never makes up list of clones for "clones" properties. MFC after: 2 weeks	2014-02-08 05:35:36 +00:00
Xin LI	bea6313e6b	MFV r260834: Fix memory leak of compressed buffers in l2arc_write_done (Illumos #3995).	2014-01-18 01:45:39 +00:00
Andriy Gapon	6d03ca5789	traverse_visitbp: visit DMU_GROUPUSED_OBJECT before DMU_USERUSED_OBJECT This is done to ensure that visited object IDs are always increasing. Also, pass correct object ID to prefetch_dnode_metadata for os_groupused_dnode. Without this change we would hit an assert if traversal was paused on a GROUPUSED object, which is unlikely but possible. Apparently the same change was independently developed by Delphix. Reviewed by: Matthew Ahrens <mahrens@delphix.com> MFC after: 10 days Sponsored by: HybridCluster	2014-01-17 10:23:46 +00:00
Andriy Gapon	fec721bc43	fix a build problem with INVARIANTS enabled introduced in r260704 Reported by: glebius MFC after: 5 days X-MFC with: r260704	2014-01-16 13:44:37 +00:00
Andriy Gapon	876fa2c17b	fix a bug in ZFS mirror code for handling multiple DVAs The bug was introduced in r256956 "Improve ZFS N-way mirror read performance". The code in vdev_mirror_dva_select erroneously considers already tried DVAs for the next attempt. Thus, it is possible that a failing DVA would be retried forever. As a secondary effect, if the attempts fail with checksum error, then checksum error reports are accumulated until the original request ultimately fails or succeeds. But because retrying is going on indefinitely the checksum reports accumulation will effectively be a memory leak. Reviewed by: gibbs MFC after: 13 days Sponsored by: HybridCluster	2014-01-16 13:24:10 +00:00
Andriy Gapon	00126789e6	Revert r260705: wrong patch committed by accident An earlier, less efficient version was committed by accident.	2014-01-16 13:20:20 +00:00
Andriy Gapon	19f5e9076b	zfs_deleteextattr: name buffer from namei is needed by zfs_rename If we prematurely free the name buffer and it gets quickly recycled, then zfs_rename may see data from another lookup or even unmapped memory via cn_nameptr. MFC after: 6 days Sponsored by: HybridCluster	2014-01-16 12:31:27 +00:00
Andriy Gapon	2f9a31944f	fix a bug in ZFS mirror code for handling multiple DVAs The bug was introduced in r256956 "Improve ZFS N-way mirror read performance". The code in vdev_mirror_dva_select erroneously considers already tried DVAs for the next attempt. Thus, it is possible that a failing DVA would be retried forever. As a secondary effect, if the attempts fail with checksum error, then checksum error reports are accumulated until the original request ultimately fails or succeeds. But because retrying is going on indefinitely the checksum reports accumulation will effectively be a memory leak. Reviewed by: gibbs MFC after: 13 days Sponsored by: HybridCluster	2014-01-16 12:26:54 +00:00
Andriy Gapon	b8ca4667ed	zfs: getnewvnode_reserve must be called outside of a zfs transaction Otherwise we could run into the following deadlock. A thread has a transaction open and assigned to a transaction group. That would prevent the transaction group from being quiesced and synced. The thread is blocked in getnewvnode_reserve waiting for a vnode to be reclaimed. vnlru thread is blocked trying to enter ZFS VOP because a filesystem is suspended by an ongoing rollback or receive operation. In its turn the operation is waiting for the current transaction group to be synced. zfs_zget is always used outside of active transactions, but zfs_mknode is always used in a transaction context. Thus, we hoist getnewvnode_reserve from zfs_mknode to its callers. While there, assert that ZFS always calls getnewvnode while having a vnode reserved. Reported by: adrian Tested by: adrian MFC after: 17 days Sponsored by: HybridCluster	2014-01-16 12:22:46 +00:00
Alexander Motin	ce05e707c4	In dmu_zfetch_stream_reclaim() replace division with multiplication and move it out of the loop and lock.	2014-01-03 18:44:37 +00:00
Xin LI	7c88e58f46	MFV r260155: When we encounter an I/O error on a piece of metadata while deleting a file system or zvol, we don't update the bptree_entry_phys_t's bookmark. This would lead to double free of bp's which will lead to space map corruption. Instead of tolerating and allowing the corruption, panic immediately. See Illumos #4390 for more details. 4391 panic system rather than corrupting pool if we hit bug 4390 Illumos/illumos-gate@8b36997aa2 MFC after: 2 weeks	2014-01-02 08:10:35 +00:00
Xin LI	ab0b9f6b30	MFV r260154 + 260182: 4369 implement zfs bookmarks 4368 zfs send filesystems from readonly pools Illumos/illumos-gate@78f1710053 MFC after: 2 weeks	2014-01-02 07:34:36 +00:00
Xin LI	6f2791f53a	Fix build on platforms where atomic_swap_64 is not available.	2014-01-02 03:24:44 +00:00
Xin LI	647795d181	MFV r260153: 4121 vdev_label_init should treat request as succeeded when pool is read only Illumos/illumos-gate@973c78e94b MFC after: 2 weeks	2014-01-01 01:26:39 +00:00
Xin LI	f4c8ba8370	MFV r259170: 4370 avoid transmitting holes during zfs send 4371 DMU code clean up illumos/illumos-gate@43466aae47 NOTE: Make sure the boot code is updated if a zpool upgrade is done on boot zpool. MFC after: 2 weeks	2014-01-01 00:45:28 +00:00
Xin LI	cca1e7c623	MFV r258385: (Note: this change is not applicable to FreeBSD and the file is not included in build. It's integrated for completeness). 4128 disks in zpools never go away when pulled illumos/illumos-gate@39cddb10a3 MFC after: 2 weeks	2013-12-31 21:24:00 +00:00
Xin LI	db2aff5f8b	MFV r242733: 3306 zdb should be able to issue reads in parallel 3321 'zpool reopen' command should be documented in the man page and help message illumos/illumos-gate@31d7e8fa33 FreeBSD porting notes: the kernel part of this changeset depends on Solaris buf(9S) interfaces and is not really applicable for our use. vdev_disk.c is patched as-is to reduce divergence from upstream, but vdev_file.c is left intact. MFC after: 2 weeks	2013-12-31 19:39:15 +00:00
Xin LI	1aaa945f67	MFV r258374: 4171 clean up spa_feature_*() interfaces 4172 implement extensible_dataset feature for use by other zpool features illumos/illumos-gate@2acef22db7 MFC after: 2 weeks	2013-12-24 07:14:25 +00:00
Xin LI	ec097c1634	MFV r258373: 4168 ztest assertion failure in dbuf_undirty 4169 verbatim import causes zdb to segfa 4170 zhack leaves pool in ACTIVE state illumos/illumos-gate@7fdd916c47 MFC after: 2 weeks	2013-12-24 06:56:17 +00:00
Pawel Jakub Dawidek	4106732882	MFV r258923: 4188 assertion failed in dmu_tx_hold_free(): dn_datablkshift != 0 illumos/illumos-gate@bb411a08b0 MFC after: 3 days	2013-12-18 21:45:46 +00:00
Alan Somers	cd730bd6b2	sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c When a da or ada device disappears, outstanding IOs fail with ENXIO, not EIO. The check for EIO was probably copied from Illumos, where that is indeed the correct errno. Without this change, pulling a busy drive from a zpool would usually turn it into UNAVAIL, even though pulling an idle drive would turn it into REMOVED. With this change, it is REMOVED every time. Also, vdev_geom_io_intr shouldn't do zfs_post_remove, because that results in devd getting two resource.fs.zfs.removed events. The comment said that the event had to be sent directly instead of through the async removal thread because "the DE engine is using this information to discard previous I/O errors". However, the fact that vdev_geom_io_intr was never actually sending the events until now, and that vdev_geom_orphan never sent them at all, and that vdev_geom_orphan usually gets called about 2 seconds after the actual removal, means that FreeBSD's userland can cope with a late event just fine. Approved by: ken (mentor) Sponsored by: Spectra Logic Corporation MFC after: 4 weeks	2013-12-12 00:27:22 +00:00
Alexander Motin	f192c4873d	Don't even try to read vdev labels from devices smaller than SPA_MINDEVSIZE (64MB). Even if we would find one somehow, ZFS kernel code rejects such devices. It is funny to look at attempts to read 4 256K vdev labels from 1.44MB floppy, though it is not very practical and quite slow.	2013-12-10 12:36:44 +00:00
Xin LI	9b11826d3d	Expose spa_asize_inflation. X-MFC-With: r258632	2013-12-06 23:49:16 +00:00
Andriy Gapon	f77ffe1b22	zfs: add zfs_freebsd_putpages this should be more optimal than writing pages one-by-one via zfs_write -> update_pages in the case of multi-page putpages call MFC after: 16 days	2013-11-29 15:39:39 +00:00
Andriy Gapon	6c5b7fffce	zfs: add dmu_write_pages variant for freebsd The freebsd variant of dmu_write_pages is hidden under _KERNEL to avoid needlessly pulling in vm_page_t declaration. Besides, this function seems to be useless for ZFS userland counterpart. MFC after: 15 days	2013-11-29 15:34:43 +00:00
Andriy Gapon	fdbcc95a47	zfs: make zfs_map_page / zfs_unmap_page public MFC after: 15 days	2013-11-29 15:33:40 +00:00
Andriy Gapon	ac79eedf85	zfs mappedread_sf: assert that a page is never partially valid ZFS never partially validates or invalidates a page. The higher level VM should not do that either. mappedread_sf correct operation depends on a page being either fully valid or invalid. MFC after: 7 days	2013-11-29 12:19:52 +00:00
Andriy Gapon	be3d0087dc	MFV r258665: 4347 ZPL can use dmu_tx_assign(TXG_WAIT) illumos/illumos-gate@e722410c49 MFC after: 9 days X-MFC after: r258632	2013-11-28 19:44:36 +00:00
Andriy Gapon	456a87bb3b	MFV r258371,r258372: 4101 metaslab_debug should allow for fine-grained control 4101 metaslab_debug should allow for fine-grained control 4102 space_maps should store more information about themselves 4103 space map object blocksize should be increased 4104 ::spa_space no longer works 4105 removing a mirrored log device results in a leaked object 4106 asynchronously load metaslab illumos/illumos-gate@0713e232b7 Note that some tunables have been removed and some new tunables have been added. Of particular note, FreeBSD-only knob vfs.zfs.space_map_last_hope is removed as it was a nop for some time now (after one of the previous merges from upstream). MFC after: 11 days Sponsored by: HybridCluster [merge]	2013-11-28 19:37:22 +00:00
Andriy Gapon	7bc07f0575	fix a serious bug in r258632: offset parameter must be set in zio In illumos all ioctl zio-s are "global" at the moment. That is they act on a whole disk, e.g. a cache flush command, and thus do not need either offset or size parameters. FreeBSD, on the other hand, has support for TRIM command and that command requires proper offset and size parameters. Without this fix all TRIM commands act on the start of any disk or partition used by ZFS destroying any data there. Pointyhat to: avg Tested by: sbruno MFC after: 3 days X-MFC with: r258632 Sponsored by: HybridCluster	2013-11-28 08:48:49 +00:00
Andriy Gapon	2ac1eeec44	fix debug.zfs_flags sysctl description in r258638 Pointyhat to: avg MFC after: 3 days	2013-11-26 10:57:09 +00:00
Andriy Gapon	78affb8591	expose zfs_flags as debug.zfs_flags r/w tunable and sysctl This knob is purposefully hidden under debug. MFC after: 5 days Sponsored by: HybridCluster	2013-11-26 10:46:43 +00:00
Andriy Gapon	3761ac95f7	MFV r258376: 3964 L2ARC should always compress metadata buffers illumos/illumos-gate@e4be62a2b7 MFC after: 10 days Sponsored by: HybridCluster [merge]	2013-11-26 10:14:23 +00:00
Andriy Gapon	fd51e905e2	MFV r255256: 3954 metaslabs continue to load even after hitting zfs_mg_alloc_failure limit 4080 zpool clear fails to clear pool 4081 need zfs_mg_noalloc_threshold illumos/illumos-gate@22e30981d8 MFC after: 10 days Sponsored by: HybridCluster [merge]	2013-11-26 10:02:02 +00:00
Andriy Gapon	2a4704ab01	MFV r255255: 4045 zfs write throttle & i/o scheduler performance work illumos/illumos-gate@69962b5647 Please note the following changes: - zio_ioctl has lost its priority parameter and now TRIM is executed with 'now' priority - some knobs are gone and some new knobs are added; not all of them are exposed as tunables / sysctls yet MFC after: 10 days Sponsored by: HybridCluster [merge]	2013-11-26 09:57:14 +00:00
Andriy Gapon	fb8171c240	MFV r247578: 3581 spa_zio_taskq[ZIO_TYPE_FREE][ZIO_TASKQ_ISSUE]->tq_lock is piping hot illumos/illumos-gate@ec94d32216 MFC after: 9 days Sponsored by: HybridCluster [merge]	2013-11-26 09:45:48 +00:00
Andriy Gapon	34140e78ab	734 taskq_dispatch_prealloc() desired 943 zio_interrupt ends up calling taskq_dispatch with TQ_SLEEP illumos/illumos-gate@5aeb94743e Essentially FreeBSD taskqueues already operate in a mode that was added to Illumos with taskq_dispatch_ent change. We even exposed the superior FreeBSD interface as taskq_dispatch_safe. Now we just rename taskq_dispatch_safe to taskq_dispatch_ent and struct ostask to taskq_ent_t, so that code differences will be minimal. After this change sys/cddl/compat/opensolaris/sys/taskq.h header is no longer needed. Note that this commit is not an MFV because the upstream change was not individually committed to the vendor area. MFC after: 8 days	2013-11-26 09:26:18 +00:00
Pawel Jakub Dawidek	1cef014007	When append-only, immutable or read-only flag is set don't allow for hard links creation. This matches UFS behaviour. Reported by: Oleg Ginzburg <olevole@olevole.ru> MFC after: 1 month	2013-11-25 21:17:14 +00:00
Andriy Gapon	a7236350c3	MFV r258378: 4089 NULL pointer dereference in arc_read() illumos/illumos-gate@57815f6b95 Tested by: adrian MFC after: 4 days	2013-11-20 11:52:32 +00:00
Andriy Gapon	c5f4a0a2eb	MFV r258377: 4088 use after free in arc_release() illumos/illumos-gate@ccc22e1304 MFC after: 5 days	2013-11-20 11:47:50 +00:00
Andriy Gapon	3fd7f7bef7	zfs page_busy: fix the boundaries of the cleared range This is a fix for a regression introduced in r246293. vm_page_clear_dirty expects the range to have DEV_BSIZE aligned boundaries, otherwise it extends them. Thus it can happen that the whole page is marked clean while actually having some small dirty region(s). This commit makes the range properly aligned and ensures that only the clean data is marked as such. It would be interesting to evaluate how much benefit clearing with DEV_BSIZE granularity produces. Perhaps instead we should clear the whole page when it is completely overwritten and not bother clearing any bits if only a portion of a page is written. Reported by: George Hartzell <hartzell@alerce.com>, Richard Todd <rmtodd@servalan.servalan.com> Tested by: George Hartzell <hartzell@alerce.com> Reviewed by: kib MFC after: 5 days	2013-11-19 18:43:47 +00:00
Alexander Motin	c5068af559	Reenable vfs.zfs.zio.use_uma for amd64, disabled at r209261. On machines with several CPUs and enough RAM this can easily double ZFS performance or halve CPU usage. It was disabled three years ago due to memory and KVA exhaustion reports, but our VM subsystem got improved a lot since that time, hopefully enough to make another try.	2013-11-19 11:19:07 +00:00
Steven Hartland	8dfd07b976	Fix ZFS deadlock when sending a snapshot which is mounted. MFC after: 1 week Sponsored by: Multiplay	2013-11-18 11:28:19 +00:00
Alexander Motin	e5056f9882	Introduce allocation cache to store LZ4 compression contexts without kicking VM subsystem twice for every written record. Tests on 24-core system show a twofold reduction of CPU time spent on copying a single large well-compressed file. This patch is not really needed on illumos (though it does no harm either) since their memory allocator by default uses caching for all requests up to 128K. Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>	2013-11-14 15:54:54 +00:00
Steven Hartland	c28078e903	Improve ZFS N-way mirror read performance by using load and locality information. The existing algorithm selects a preferred leaf vdev based on offset of the zio request modulo the number of members in the mirror. It assumes the devices are of equal performance and that spreading the requests randomly over both drives will be sufficient to saturate them. In practice this results in the leaf vdevs being underutilized. The new algorithm takes into account the following additional factors: * Load of the vdevs (number of outstanding I/O requests) * The locality of last queued I/O vs the new I/O request. Within the locality calculation additional knowledge about the underlying vdev is considered, such as whether the device backing the vdev is a rotating media device. This results in performance increases across the board as well as significant increases for predominantly streaming loads and for configurations which don't have evenly performing devices. The following are results from a setup with 3 Way Mirror with 2 x HD's and 1 x SSD from a basic test running multiple parallel dd's. 
With pre-fetch disabled (vfs.zfs.prefetch_disable=1): == Stripe Balanced (default) == Read 15360MB using bs: 1048576, readers: 3, took 161 seconds @ 95 MB/s == Load Balanced (zfslinux) == Read 15360MB using bs: 1048576, readers: 3, took 297 seconds @ 51 MB/s == Load Balanced (locality freebsd) == Read 15360MB using bs: 1048576, readers: 3, took 54 seconds @ 284 MB/s With pre-fetch enabled (vfs.zfs.prefetch_disable=0): == Stripe Balanced (default) == Read 15360MB using bs: 1048576, readers: 3, took 91 seconds @ 168 MB/s == Load Balanced (zfslinux) == Read 15360MB using bs: 1048576, readers: 3, took 108 seconds @ 142 MB/s == Load Balanced (locality freebsd) == Read 15360MB using bs: 1048576, readers: 3, took 48 seconds @ 320 MB/s In addition to the performance changes the code was also restructured, with the help of Justin Gibbs, to provide a more logical flow which also ensures vdevs loads are only calculated from the set of valid candidates. The following additional sysctls were added to allow the administrator to tune the behaviour of the load algorithm: * vfs.zfs.vdev.mirror.rotating_inc * vfs.zfs.vdev.mirror.rotating_seek_inc * vfs.zfs.vdev.mirror.rotating_seek_offset * vfs.zfs.vdev.mirror.non_rotating_inc * vfs.zfs.vdev.mirror.non_rotating_seek_inc These changes were based on work started by the zfsonlinux developers: https://github.com/zfsonlinux/zfs/pull/1487 Reviewed by: gibbs, mav, will MFC after: 2 weeks Sponsored by: Multiplay	2013-10-23 09:54:58 +00:00
Steven Hartland	70c3432663	Use the vdev's ashift to calculate the supported min block size passed to zio_compress_data(..) when compressing l2arc buffers. This eliminates l2arc I/O errors, which resulted in very poor performance on vdev's configured with block size greater than 512b due to compression assuming a smaller min block size than the vdev supports. MFC after: 2 days	2013-10-22 13:31:36 +00:00
Alexander Motin	40ea77a036	Merge GEOM direct dispatch changes from the projects/camlock branch. When safety requirements are met, it allows to avoid passing I/O requests to GEOM g_up/g_down thread, executing them directly in the caller context. That allows to avoid CPU bottlenecks in g_up/g_down threads, plus avoid several context switches per I/O. The safety requirements defined now are: - caller should not hold any locks and should be reenterable; - callee should not depend on GEOM dual-threaded concurrency semantics; - on the way down, if request is unmapped while callee doesn't support it, the context should be sleepable; - kernel thread stack usage should be below 50%. To keep compatibility with GEOM classes not meeting above requirements new provider and consumer flags added: - G_CF_DIRECT_SEND -- consumer code meets caller requirements (request); - G_CF_DIRECT_RECEIVE -- consumer code meets callee requirements (done); - G_PF_DIRECT_SEND -- provider code meets caller requirements (done); - G_PF_DIRECT_RECEIVE -- provider code meets callee requirements (request). Capable GEOM class can set them, allowing direct dispatch in cases where it is safe. If any of requirements are not met, request is queued to g_up or g_down thread same as before. Such GEOM classes were reviewed and updated to support direct dispatch: CONCAT, DEV, DISK, GATE, MD, MIRROR, MULTIPATH, NOP, PART, RAID, STRIPE, VFS, ZERO, ZFS::VDEV, ZFS::ZVOL, all classes based on g_slice KPI (LABEL, MAP, FLASHMAP, etc). To declare direct completion capability disk(9) KPI got new flag equivalent to G_PF_DIRECT_SEND -- DISKFLAG_DIRECT_COMPLETION. da(4) and ada(4) disk drivers got it set now thanks to earlier CAM locking work. This change more than doubles peak block storage performance on systems with many CPUs, together with earlier CAM locking changes reaching more than 1 million IOPS (512 byte raw reads from 16 SATA SSDs on 4 HBAs to 256 user-level threads). Sponsored by: iXsystems, Inc. 
MFC after: 2 months	2013-10-22 08:22:19 +00:00
Andriy Gapon	5d8fac897e	MFV r255257: 4082 zfs receive gets EFBIG from dmu_tx_hold_free() illumos change 14172:be36a38bac3d: illumos ZFS issues: 4082 zfs receive gets EFBIG from dmu_tx_hold_free() Please note that this change is slightly different from r255257, because it is merged out of order with other (larger) upstream changes. PR: kern/182570 Reported by: Keith White <kwhite@site.uottawa.ca> Tested by: Keith White <kwhite@site.uottawa.ca> Approved by: re (glebius) MFC after: 1 week X-MFC after: r254753	2013-10-10 09:53:46 +00:00
Xin LI	6eb151f212	Improve lzjb decompress performance by reorganizing the code to tighten the copy loop. Submitted by: Denis Ahrens <denis h3q com> MFC after: 2 weeks Approved by: re (gjb)	2013-10-08 01:38:24 +00:00
Justin T. Gibbs	69d1b777e8	Optimize the block size used on ZFS cache devices as is already done for data and log devices. Reported by: Dmitryy Makarov Submitted by: smh Reviewed by: gibbs Approved by: re (delphij) MFC after: 2 weeks	2013-09-21 03:52:08 +00:00
Xin LI	253aa02fc3	MFV r254750: Add support of Illumos dumps on zvol over RAID-Z. Note that this only adds the features. FreeBSD would still need more work to support dumping on zvols. Illumos ZFS issues: 2932 support crash dumps to raidz, etc. pools MFC after: 1 month Approved by: re (ZFS blanket)	2013-09-21 00:17:26 +00:00
Davide Italiano	a25a7e386a	Fixup cross-device rename checks in ZFS. Add a check for the case where 'fdvp' is a directory, 'tvp' is an already existing directory and they have different mount points. Reported by: avg, pjd Reviewed by: pjd Approved by: re (rodrigc)	2013-09-20 23:22:00 +00:00
Xin LI	e8de677c74	MFV r247844 (illumos-gate 13975:ef6409bc370f) Illumos ZFS issues: 3582 zfs_delay() should support a variable resolution 3584 DTrace sdt probes for ZFS txg states Provide a compatibility shim for Solaris's cv_timedwait_hires to help aid future porting. Approved by: re (ZFS blanket)	2013-09-10 01:46:47 +00:00
Pawel Jakub Dawidek	7e473ea146	Add sysctl/tunables for various metaslab variables.	2013-09-05 00:53:01 +00:00
Pawel Jakub Dawidek	7008be5bd7	Change the cap_rights_t type from uint64_t to a structure that we can extend in the future in a backward compatible (API and ABI) way. The cap_rights_t represents capability rights. We used to use one bit to represent one right, but we are running out of spare bits. Currently the new structure provides place for 114 rights (so 50 more than the previous cap_rights_t), but it is possible to grow the structure to hold at least 285 rights, although we can make it even larger if 285 rights won't be enough. The structure definition looks like this: struct cap_rights { uint64_t cr_rights[CAP_RIGHTS_VERSION + 2]; }; The initial CAP_RIGHTS_VERSION is 0. The top two bits in the first element of the cr_rights[] array contain total number of elements in the array - 2. This means if those two bits are equal to 0, we have 2 array elements. The top two bits in all remaining array elements should be 0. The next five bits in all array elements contain array index. Only one bit is used and bit position in this five-bits range defines array index. This means there can be at most five array elements in the future. To define new right the CAPRIGHT() macro must be used. The macro takes two arguments - an array index and a bit to set, eg. 
#define CAP_PDKILL CAPRIGHT(1, 0x0000000000000800ULL) We still support aliases that combine a few rights, but the rights have to belong to the same array element, eg: #define CAP_LOOKUP CAPRIGHT(0, 0x0000000000000400ULL) #define CAP_FCHMOD CAPRIGHT(0, 0x0000000000002000ULL) #define CAP_FCHMODAT (CAP_FCHMOD | CAP_LOOKUP) There is new API to manage the new cap_rights_t structure: cap_rights_t *cap_rights_init(cap_rights_t *rights, ...); void cap_rights_set(cap_rights_t *rights, ...); void cap_rights_clear(cap_rights_t *rights, ...); bool cap_rights_is_set(const cap_rights_t *rights, ...); bool cap_rights_is_valid(const cap_rights_t *rights); void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src); void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src); bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little); Capability rights to the cap_rights_init(), cap_rights_set(), cap_rights_clear() and cap_rights_is_set() functions are provided by separating them with commas, eg: cap_rights_t rights; cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT); There is no need to terminate the list of rights, as those functions are actually macros that take care of the termination, eg: #define cap_rights_set(rights, ...) \ __cap_rights_set((rights), __VA_ARGS__, 0ULL) void __cap_rights_set(cap_rights_t *rights, ...); Thanks to using one bit as an array index we can assert in those functions that there are no two rights belonging to different array elements provided together. For example this is illegal and will be detected, because CAP_LOOKUP belongs to element 0 and CAP_PDKILL to element 1: cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL); Providing several rights that belong to the same array element this way is correct, but is not advised. It should only be used for alias definitions. This commit also breaks compatibility with some existing Capsicum system calls, but I see no other way to do that. 
This should be fine as Capsicum is still experimental and this change is not going to 9.x. Sponsored by: The FreeBSD Foundation	2013-09-05 00:09:56 +00:00
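The index encoding described in the cap_rights_t entry above can be sketched as follows. This is only an illustration under stated assumptions, not the committed code: all DEMO_* and demo_* names are hypothetical, and the sketch assumes that the five index bits sit directly below the top two bits (bit positions 57 to 61), with position 57 + i marking array element i.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch of the layout described in the commit message:
 * two top bits, then five one-hot index bits (assumed at bits 57..61).
 */
#define DEMO_IDX_SHIFT	57
#define DEMO_IDX_MASK	0x1fULL
#define DEMO_CAPRIGHT(idx, bit)	((1ULL << (DEMO_IDX_SHIFT + (idx))) | (bit))

/* Right values copied from the commit message, under demo names. */
#define DEMO_CAP_LOOKUP	DEMO_CAPRIGHT(0, 0x0000000000000400ULL)
#define DEMO_CAP_FCHMOD	DEMO_CAPRIGHT(0, 0x0000000000002000ULL)
#define DEMO_CAP_PDKILL	DEMO_CAPRIGHT(1, 0x0000000000000800ULL)

/*
 * Return the array element a right belongs to, or -1 when the index
 * field does not have exactly one bit set (rights from different
 * elements were OR-ed together, or no index bit is present).
 */
static int
demo_right_to_index(uint64_t right)
{
	uint64_t field = (right >> DEMO_IDX_SHIFT) & DEMO_IDX_MASK;
	int idx = 0;

	if (field == 0 || (field & (field - 1)) != 0)
		return (-1);
	while ((field & 1) == 0) {
		field >>= 1;
		idx++;
	}
	return (idx);
}

/* Small self-check exercising the cases named in the commit message. */
static void
demo_self_check(void)
{
	/* An alias built from rights of the same element stays valid. */
	assert(demo_right_to_index(DEMO_CAP_FCHMOD | DEMO_CAP_LOOKUP) == 0);
	/* Mixing element 0 and element 1 is detectable. */
	assert(demo_right_to_index(DEMO_CAP_LOOKUP | DEMO_CAP_PDKILL) == -1);
}
```

Because the index is one-hot rather than a packed number, OR-ing rights from two different elements leaves two index bits set, which is what makes the mix-up detectable by a simple power-of-two test.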
Xin LI	1c1075ed93	Previously, both zfs_rename and zfs_link did a check on whether the passed vnode belongs to the same mount point (v_vfsp or also known as v_mount in FreeBSD). This check prevents the code from proceeding further on vnodes that do not belong to ZFS, for instance, on UFS or NULLFS. The recent change (merged as r254585) on upstream changes the check of v_vfsp to instead check the znode's z_zfsvfs. On Illumos this would work because when the vnode comes from lofs, the VOP_REALVP() would give the right vnode. This is not true on FreeBSD, where our VOP_REALVP is a no-op, and as such tdvp is not guaranteed to be a ZFS vnode, and will later trigger a failed assertion when verifying the vnode. This changeset modifies our local shims (zfs_freebsd_rename and zfs_freebsd_link) to check if v_mount matches before proceeding further. Reported by: many Diagnostic work by: avg	2013-08-28 00:39:47 +00:00
Xin LI	439024135c	MFV r254749: Don't hold dd_lock for long by breaking it when not doing dsl_dir accounting. It is not necessary to hold the lock while manipulating the parent's accounting, because there is no interface for userland to see a consistent picture of both parent and child at the same time anyway. Illumos ZFS issues: 4046 dsl_dataset_t ds_dir->dd_lock is highly contended	2013-08-24 00:42:37 +00:00
Xin LI	00e37ef129	MFV r254747: Fix a panic from dbuf_free_range() from dmu_free_object() while doing zfs receive. This is a regression from FreeBSD r253821. Illumos ZFS issues: 4047 panic from dbuf_free_range() from dmu_free_object() while doing zfs receive	2013-08-24 00:19:26 +00:00
Andriy Gapon	2073a41a42	zfs: do not reject any operations on a pool just because it's a boot pool Unlike upstream, FreeBSD supports booting from all kinds of pools. Requested by: many Tested by: sbruno MFC after: 12 days	2013-08-23 14:43:32 +00:00
Andriy Gapon	05869c0ea7	zfs: inline and remove zfs_vnode_lock It didn't serve any useful purpose, but obscured file and line information useful for debugging. MFC after: 5 days X-MFC with: r254445	2013-08-23 14:40:09 +00:00
Konstantin Belousov	5944de8ecd	Remove the deprecated VM_ALLOC_RETRY flag for the vm_page_grab(9). The flag was mandatory since r209792, where vm_page_grab(9) was changed to only support the alloc retry semantic. Suggested and reviewed by: alc Sponsored by: The FreeBSD Foundation	2013-08-22 07:39:53 +00:00
Kenneth D. Merry	7da1a731c6	Expand the use of stat(2) flags to allow storing some Windows/DOS and CIFS file attributes as BSD stat(2) flags. This work is intended to be compatible with ZFS, the Solaris CIFS server's interaction with ZFS, somewhat compatible with MacOS X, and of course compatible with Windows. The Windows attributes that are implemented were chosen based on the attributes that ZFS already supports. The summary of the flags is as follows: UF_SYSTEM: Command line name: "system" or "usystem" ZFS name: XAT_SYSTEM, ZFS_SYSTEM Windows: FILE_ATTRIBUTE_SYSTEM This flag means that the file is used by the operating system. FreeBSD does not enforce any special handling when this flag is set. UF_SPARSE: Command line name: "sparse" or "usparse" ZFS name: XAT_SPARSE, ZFS_SPARSE Windows: FILE_ATTRIBUTE_SPARSE_FILE This flag means that the file is sparse. Although ZFS may modify this in some situations, there is not generally any special handling for this flag. UF_OFFLINE: Command line name: "offline" or "uoffline" ZFS name: XAT_OFFLINE, ZFS_OFFLINE Windows: FILE_ATTRIBUTE_OFFLINE This flag means that the file has been moved to offline storage. FreeBSD does not have any special handling for this flag. UF_REPARSE: Command line name: "reparse" or "ureparse" ZFS name: XAT_REPARSE, ZFS_REPARSE Windows: FILE_ATTRIBUTE_REPARSE_POINT This flag means that the file is a Windows reparse point. ZFS has special handling code for reparse points, but we don't currently have the other supporting infrastructure for them. UF_HIDDEN: Command line name: "hidden" or "uhidden" ZFS name: XAT_HIDDEN, ZFS_HIDDEN Windows: FILE_ATTRIBUTE_HIDDEN This flag means that the file may be excluded from a directory listing if the application honors it. FreeBSD has no special handling for this flag. The name and bit definition for UF_HIDDEN are identical to the definition in MacOS X. 
UF_READONLY: Command line name: "urdonly", "rdonly", "readonly" ZFS name: XAT_READONLY, ZFS_READONLY Windows: FILE_ATTRIBUTE_READONLY This flag means that the file may not be written or appended, but its attributes may be changed. ZFS currently enforces this flag, but Illumos developers have discussed disabling enforcement. The behavior of this flag is different than MacOS X. MacOS X uses UF_IMMUTABLE to represent the DOS readonly permission, but that flag has a stronger meaning than the semantics of DOS readonly permissions. UF_ARCHIVE: Command line name: "uarch", "uarchive" ZFS name: XAT_ARCHIVE, ZFS_ARCHIVE Windows name: FILE_ATTRIBUTE_ARCHIVE The UF_ARCHIVE flag means that the file has changed and needs to be archived. The meaning is the same as the Windows FILE_ATTRIBUTE_ARCHIVE attribute, and the ZFS XAT_ARCHIVE and ZFS_ARCHIVE attribute. msdosfs and ZFS have special handling for this flag, i.e. they will set it when the file changes. sys/param.h: Bump __FreeBSD_version to 1000047 for the addition of new stat(2) flags. chflags.1: Document the new command line flag names (e.g. "system", "hidden") available to the user. ls.1: Reference chflags(1) for a list of file flags and their meanings. strtofflags.c: Implement the mapping between the new command line flag names and new stat(2) flags. chflags.2: Document all of the new stat(2) flags, and explain the intended behavior in a little more detail. Explain how they map to Windows file attributes. Different filesystems behave differently with respect to flags, so warn the application developer to take care when using them. zfs_vnops.c: Add support for getting and setting the UF_ARCHIVE, UF_READONLY, UF_SYSTEM, UF_HIDDEN, UF_REPARSE, UF_OFFLINE, and UF_SPARSE flags. All of these flags are implemented using attributes that ZFS already supports, so the on-disk format has not changed. ZFS currently doesn't allow setting the UF_REPARSE flag, and we don't really have the other infrastructure to support reparse points.
msdosfs_denode.c, msdosfs_vnops.c: Add support for getting and setting UF_HIDDEN, UF_SYSTEM and UF_READONLY in MSDOSFS. It supported SF_ARCHIVED, but this has been changed to be UF_ARCHIVE, which has the same semantics as the DOS archive attribute instead of inverse semantics like SF_ARCHIVED. After discussion with Bruce Evans, change several things in the msdosfs behavior: Use UF_READONLY to indicate whether a file is writeable instead of file permissions, but don't actually enforce it. Refuse to change attributes on the root directory, because it is special in FAT filesystems, but allow most other attribute changes on directories. Don't set the archive attribute on a directory when its modification time is updated. Windows and DOS don't set the archive attribute in that scenario, so we are now bug-for-bug compatible. smbfs_node.c, smbfs_vnops.c: Add support for UF_HIDDEN, UF_SYSTEM, UF_READONLY and UF_ARCHIVE in SMBFS. This is similar to changes that Apple has made in their version of SMBFS (as of smb-583.8, posted on opensource.apple.com), but not quite the same. We map SMB_FA_READONLY to UF_READONLY, because UF_READONLY is intended to match the semantics of the DOS readonly flag. The MacOS X code maps both UF_IMMUTABLE and SF_IMMUTABLE to SMB_FA_READONLY, but the immutable flags have stronger meaning than the DOS readonly bit. stat.h: Add definitions for UF_SYSTEM, UF_SPARSE, UF_OFFLINE, UF_REPARSE, UF_ARCHIVE, UF_READONLY and UF_HIDDEN. The definition of UF_HIDDEN is the same as the MacOS X definition. Add commented-out definitions of UF_COMPRESSED and UF_TRACKED. They are defined in MacOS X (as of 10.8.2), but we do not implement them (yet). ufs_vnops.c: Add support for getting and setting UF_ARCHIVE, UF_HIDDEN, UF_OFFLINE, UF_READONLY, UF_REPARSE, UF_SPARSE, and UF_SYSTEM in UFS. Alphabetize the flags that are supported. These new flags are only stored, UFS does not take any action if the flag is set. 
Sponsored by: Spectra Logic Reviewed by: bde (earlier version)	2013-08-21 23:04:48 +00:00
Justin T. Gibbs	5119608387	Add kstat entries for ZFS compression statistics. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio_compress.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio_compress.c: Add module lifetime functions to allocate and tear down state data. Report: - Compression attempts. - Buffers found to be empty. - Compression calls that are skipped because the data length is already less than or equal to the minimum block length. - Compression attempts that fail to yield a 12.5% compression ratio. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c: Add calls to the zio_compress.c module's init and fini functions. Sponsored by: Spectra Logic Corporation MFC after: 2 weeks	2013-08-21 19:40:43 +00:00
Justin T. Gibbs	439d30d121	Enhance the ZFS vdev layer to maintain both a logical and a physical minimum allocation size for devices. Use this information to automatically increase ZFS's minimum allocation size for new top-level vdevs to a value that more closely matches the optimum device allocation size. Use GEOM's stripesize attribute, if set, as the physical sector size of the GEOM. Calculate the minimum blocksize of each metaslab class. Use the calculated value instead of SPA_MINBLOCKSIZE (512b) when determining the likelihood of compression yielding a reduction in physical space usage. Report devices with sub-optimal block size configuration in "zpool status". Also properly fail attempts to attach devices with a logical block size greater than 8kB, since this will cause corruption to ZFS's label area. Sponsored by: Spectra Logic Corporation MFC after: 2 weeks Background ========== Many modern devices use physical allocation units that are much larger than the minimum logical allocation size accessible by external commands. Two prevalent examples of this are 512e disk drives (512b logical sector, 4K physical sector) and flash devices (512b logical sector, 4K or larger allocation block size, and 128k or larger erase block size). Operations that modify less than the physical sector size result in a costly read-modify-write or garbage collection sequence on these devices. Simply exporting the true physical sector of the device to ZFS would yield optimal performance, but has two serious drawbacks: 1) Existing pools created with devices that have different logical and physical block sizes, but were configured to use the logical block size (e.g. because the OS version used for pool construction reported the logical block size instead of the physical block size) will suddenly find that the vdev allocation size has increased.
This can be easily tolerated for active members of the array, but ZFS would prevent replacement of a vdev with another identical device because it now appears that the smaller allocation size required by the pool is not supported by the new device. 2) The device's physical block size may be too large to be supported by ZFS. The optimal allocation size for the vdev may be quite large. For example, a RAID controller may export a vdev that requires read-modify-write cycles unless accessed using 64k aligned/sized requests. ZFS currently has an 8k minimum block size limit. Reporting both the logical and physical allocation sizes for vdevs solves these problems. A device may be used so long as the logical block size is compatible with the configuration. By comparing the logical and physical block sizes, new configurations can be optimized and administrators can be notified of any existing pools that are sub-optimal. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa.h: Add the SPA_ASHIFT constant. ZFS currently has a hard upper limit of 13 (8k) for ashift and this constant is used to both document and enforce this limit. sys/cddl/contrib/opensolaris/uts/common/sys/fs/zfs.h: Add the VDEV_AUX_ASHIFT_TOO_BIG error code. Add fields for exporting the configured, logical, and physical ashift to the vdev_stat_t structure. Add VDEV_STAT_VALID() macro which can be used to verify the presence of required vdev_stat_t fields in nvlist data. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c: Provide a SYSCTL_PROC handler for "max_auto_ashift". Since the limit is only referenced long after boot when a create operation occurs, there's no compelling need for it to be a boot time configurable tunable. This also allows the validation code for the max_auto_ashift value to be contained within the sysctl handler. Populate the new fields in the vdev_stat_t structure. Fail vdev opens if the vdev reports an ashift larger than SPA_MAXASHIFT. 
Propagate vdev_logical_ashift and vdev_physical_ashift between child and parent vdevs as is done for vdev_ashift. In vdev_open(), restore code that fails opens for devices where vdev_ashift grows. This can only happen now if the device's logical ashift grows, which means it really isn't safe to use the device. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_file.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_mirror.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_missing.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_root.c: Update the vdev_open() API so that both logical (what was just ashift before) and physical ashift are reported. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h: Add two new fields, vdev_physical_ashift and vdev_logical_ashift, to vdev_t. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_config.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c: Add vdev_ashift_optimize(). Call it anytime a new top-level vdev is allocated. cddl/contrib/opensolaris/cmd/zpool/zpool_main.c: Add text for the VDEV_AUX_ASHIFT_TOO_BIG error. For each sub-optimally configured leaf vdev, report configured and native block sizes. cddl/contrib/opensolaris/cmd/zpool/zpool_main.c: cddl/contrib/opensolaris/lib/libzfs/common/libzfs.h: cddl/contrib/opensolaris/lib/libzfs/common/libzfs_status.c: Introduce a new zpool status: ZPOOL_STATUS_NON_NATIVE_ASHIFT. This status is reported on healthy pools containing vdevs configured to use a block size smaller than their reported physical block size.
cddl/contrib/opensolaris/lib/libzfs/common/libzfs_status.c: Update find_vdev_problem() and supporting functions to provide the full vdev_stat_t structure to problem checking routines, and to allow descent into replacing vdevs. Add a vdev_non_native_ashift() validator which is used on the full vdev tree to check for ZPOOL_STATUS_NON_NATIVE_ASHIFT. cddl/contrib/opensolaris/lib/libzpool/common/kernel.c: cddl/contrib/opensolaris/lib/libzpool/common/sys/zfs_context.h: Enhance sysctl userland stubs now that a SYSCTL_PROC handler is used in vdev.c. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab_impl.h: When the group membership of a metaslab class changes (i.e. when a vdev is added or removed from a pool), walk the group list to determine the smallest block size currently available and record this in the metaslab class. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c: Add the metaslab_class_get_minblocksize() accessor. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio_compress.h: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio_compress.c: sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c: In zio_compress_data(), take the minimum blocksize as an input parameter instead of assuming SPA_MINBLOCKSIZE. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c: In l2arc_compress_buf(), pass SPA_MINBLOCKSIZE as the minimum blocksize of the device. The l2arc code has its own logic for deciding if compression is worthwhile, so this effectively prevents zio_compress_data() from second guessing the original decision. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c: In zio_write_bp_init(), use the minimum blocksize of the normal metaslab class when compressing data.	2013-08-21 04:10:24 +00:00
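The relationship between block sizes and ashift values in the commit message above can be illustrated with a small C sketch. It assumes ashift is the base-2 logarithm of the block size (512 bytes gives 9, 4K gives 12, 8K gives 13, matching the limit the message quotes). The `toy_` functions are hypothetical, and the selection rule is one plausible reading of "increase the allocation size toward the physical size, capped by max_auto_ashift, but never below the logical size"; the real vdev_ashift_optimize() may differ.

```c
#include <assert.h>
#include <stdint.h>

/* log2 of a power-of-two block size, e.g. 512 -> 9, 4096 -> 12. */
static unsigned
toy_ashift_of(uint64_t size)
{
	unsigned shift = 0;

	assert(size != 0 && (size & (size - 1)) == 0);
	while (size > 1) {
		size >>= 1;
		shift++;
	}
	return (shift);
}

/*
 * Hypothetical selection rule for a new top-level vdev: prefer the
 * physical ashift when it is larger than the logical one, cap it at
 * max_auto, and never go below the logical ashift the device requires.
 */
static unsigned
toy_pick_ashift(unsigned logical, unsigned physical, unsigned max_auto)
{
	unsigned capped;

	if (physical <= logical)
		return (logical);
	capped = physical < max_auto ? physical : max_auto;
	return (capped > logical ? capped : logical);
}
```

For a 512e drive, `toy_pick_ashift(toy_ashift_of(512), toy_ashift_of(4096), 13)` yields 12, while a device reporting a 64k physical block would be capped at the assumed limit of 13.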
Xin LI	2640fb93f5	MFV r254421: Illumos ZFS issues: 3996 want a libzfs_core API to rollback to latest snapshot	2013-08-21 00:04:31 +00:00
Xin LI	c21d9cfe3d	MFV r254220: Illumos ZFS issues: 4039 zfs_rename()/zfs_link() needs stronger test for XDEV	2013-08-20 22:31:13 +00:00
Pawel Jakub Dawidek	2c40899ecc	Remove redundant variable.	2013-08-17 14:09:46 +00:00
Attilio Rao	c7aebda8a1	The soft and hard busy mechanisms rely on the vm object lock to work. Unify the two concepts into a real, minimal, sxlock where the shared acquisition represents the soft busy and the exclusive acquisition represents the hard busy. The old VPO_WANTED mechanism becomes the hard-path for this new lock and it becomes per-page rather than per-object. The vm_object lock becomes an interlock for this functionality: it can be held in either read or write mode. However, if the vm_object lock is held in read mode while acquiring or releasing the busy state, the thread owner cannot make any assumption on the busy state unless it is also busying it. Also: - Add a new flag to directly shared busy pages while vm_page_alloc and vm_page_grab are being executed. This will be very helpful once these functions happen under a read object lock. - Move the swapping sleep into its own per-object flag The KPI is heavily changed; this is why the version is bumped. It is very likely that some VM ports users will need to change their own code. Sponsored by: EMC / Isilon storage division Discussed with: alc Reviewed by: jeff, kib Tested by: gavin, bapt (older version) Tested by: pho, scottl	2013-08-09 11:11:11 +00:00
Xin LI	43667c1f68	MFV r254079: Illumos ZFS issues: 3957 ztest should update the cachefile before killing itself 3958 multiple scans can lead to partial resilvering 3959 ddt entries are not always resilvered 3960 dsl_scan can skip over dedup-ed blocks if physical birth != logical birth 3961 freed gang blocks are not resilvered and can cause pool to suspend 3962 ztest should print out zfs debug buffer before exiting	2013-08-08 23:38:31 +00:00
Xin LI	9d2f243aa6	MFV r254071: Fix a regression introduced by fix for Illumos bug #3834. Quote from Matthew Ahrens on the Illumos issue: ztest fails this assertion because ztest_dmu_read_write() does dmu_tx_hold_free(tx, bigobj, bigoff, bigsize); and then dmu_object_set_checksum(os, bigobj, (enum zio_checksum)ztest_random_dsl_prop(ZFS_PROP_CHECKSUM), tx); If the region to free is past the end of the file, the DMU assumes that there will be nothing to do for this object. However, ztest does set_checksum(), which must modify the dnode. The fix is for ztest to also call dmu_tx_hold_bonus(tx, bigobj); so we can account for the dirty data associated with setting the checksum Illumos ZFS issues: 3955 ztest failure: assertion refcount_count(&tx->tx_space_written) + delta <= tx->tx_space_towrite	2013-08-07 22:21:00 +00:00
Xin LI	4f7b34578b	MFV r254070: Merge vendor bugfix for ZFS test suite that triggers false positives. Illumos ZFS issues: 3949 ztest fault injection should avoid resilvering devices 3950 ztest: deadman fires when we're doing a scan 3951 ztest hang when running dedup test 3952 ztest: ztest_reguid test and ztest_fault_inject don't place nice together	2013-08-07 21:16:14 +00:00
Xin LI	c668ff330e	MFV r254011: This change has no effect on FreeBSD but is integrated for completeness. Illumos ZFS issues: 348 ZFS should handle DKIOCGMEDIAINFOEXT failure	2013-08-06 21:36:01 +00:00
Alexander Motin	d9aca4ed74	Block reporting of ZFS features for suspended pools. Before executing any subcommand, the zpool tool fetches the pool configuration from the kernel. Before features support was added, the kernel was regenerating that configuration based on data always present in memory. Unfortunately, the pool features list and activity counters are not such. They are stored in ZAP, which normally resides in ARC, but under heavy memory pressure may be swapped out. If the pool is suspended at this point, there is no way to recover it since any zpool command will get stuck. This change has one predictable flaw: `zpool upgrade` always wishes to upgrade suspended pools, but fortunately it can't do it due to the suspension.	2013-08-06 14:41:41 +00:00
Alexander Motin	f8dcf872c4	Disable r252840 when ZFS TRIM is enabled (vfs.zfs.trim.enabled=1) and really disable TRIM otherwise. r252840 (illumos bug 3836) is based on assumption that zio_free_sync() has no lock dependencies and should complete immediately. Unfortunately, with our TRIM implementation that is not true due to ZIO_STAGE_VDEV_IO_START added to the ZIO_FREE_PIPELINE, which, while not really accessing devices, still acquires SCL_ZIO lock for read to be sure devices won't disappear. When TRIM is disabled, this patch enables direct free execution from r252840 and removes ZIO_STAGE_VDEV_IO_START and ZIO_STAGE_VDEV_IO_ASSESS stages from the pipeline to avoid lock acquisition. Otherwise it queues free request as it was before r252840.	2013-08-06 14:30:28 +00:00
Alexander Motin	526bb4af8a	Make `zpool clear` also reopen reconnected cache and spare devices. Since `zpool status` reports such kinds of errors, it is strange that they are not cleared by `zpool clear`.	2013-08-06 14:23:33 +00:00
Alexander Motin	ad727e8d64	Make ZFS use a separate thread to handle SPA_ASYNC_REMOVE async events. The existing async thread runs only on successful spa_sync() completion, which is impossible when the pool loses its required (last) disk(s). That indefinite delay of SPA_ASYNC_REMOVE processing made ZFS not close the lost disks, preventing GEOM/CAM from destroying devices and reusing names on later disk reattach. In an earlier version of the patch I tried to just run the existing thread immediately, unrelated to spa_sync() completion, but that exposed a number of situations where it could get stuck due to locks held by a stuck spa_sync() that are required for other kinds of async events. Experiments with an OpenIndiana snapshot confirmed that they also have this issue with lost disk reattach.	2013-08-06 14:20:41 +00:00
Attilio Rao	be99683637	Revert r253939: We cannot busy a page before doing pagefaults. In fact, it can deadlock against the vnode lock, as it tries to vget(). Other functions, right now, have an opposite lock ordering, like vm_object_sync(), which acquires the vnode lock first and then sleeps on the busy mechanism. Before this patch is reinserted we need to break this ordering. Sponsored by: EMC / Isilon storage division Reported by: kib	2013-08-05 08:55:35 +00:00
Attilio Rao	3b6714cacb	The page hold mechanism is fast but it has a couple of drawbacks: - It does not let pages respect the LRU policy - It bloats the active/inactive queues of few pages Try to avoid it as much as possible with the long-term target to completely remove it. Use the soft-busy mechanism to protect page content accesses during short-term operations (like uiomove_fromphys()). After this change only vm_fault_quick_hold_pages() is still using the hold mechanism for page content access. There is an additional complexity there as the quick path cannot immediately access the page object to busy the page, and the slow path cannot busy more than one page at a time (to avoid deadlocks). Fixing this primitive can lead to complete removal of the page hold mechanism. Sponsored by: EMC / Isilon storage division Discussed with: alc Reviewed by: jeff Tested by: pho	2013-08-04 21:07:24 +00:00
Steven Hartland	e44e975c1b	zfs_ioc_rename should not leave the value of zc_name passed in via zc altered on return. MFC after: 1 week	2013-08-04 11:38:08 +00:00
Xin LI	bd3d1456a5	MFV r253783: Skip eviction step of processing free records when doing ZFS receive to avoid the expensive search operation of non-existent dbufs in dn_dbufs. Illumos ZFS issues: 3834 incremental replication of 'holey' file systems is slow MFC after: 2 weeks	2013-07-30 21:35:02 +00:00
Xin LI	1c4ead73c6	MFV r253782: To quote Illumos issue #3888: When 'zfs recv -F' is used with an incremental recv it rolls back any changes made since the last snapshot in case new changes were made to the file system while the recv is in progress (without -F the recv would fail when it does it's final check to commit the recv-ed data as the recv-ed data conflicts with the newly written data). However, if there is a snapshot taken after the recv began rolling back to the 'latest' snapshot will not help and the recv will still fail. 'zfs recv -F' should be extended to destroy any snapshots created since the source snapshot when finishing the recv (effectively rolling back through all snapshots, instead of just to the latest snapshot). Illumos ZFS issues: 3888 zfs recv -F should destroy any snapshots created since the incremental source MFC after: 2 weeks	2013-07-30 21:20:12 +00:00
Xin LI	d637247e1f	MFV r253781 + r253871: Illumos ZFS issues: 3894 zfs should not allow snapshot of inconsistent dataset MFC after: 2 weeks	2013-07-30 21:02:09 +00:00
Xin LI	44e362e207	MFV r253780: To quote Illumos #3875: The problem here is that if we ever end up in the error path, we drop the locks protecting access to the zfsvfs_t prior to forcibly unmounting the filesystem. Because z_os is NULL, any thread that had already picked up the zfsvfs_t and was sitting in ZFS_ENTER() when we dropped our locks in zfs_resume_fs() will now acquire the lock, attempt to use z_os, and panic. Illumos ZFS issues: 3875 panic in zfs_root() after failed rollback MFC after: 2 weeks	2013-07-30 20:37:32 +00:00


734 Commits