freebsd-nq

Author	SHA1	Message	Date
Andriy Gapon	17055fcda7	fix unsafe modification of zfs_vnodeops when DIAGNOSTIC is enabled The idea was to avoid a false assertion in zfs_lock, but it was implemented very dangerously and incorrectly. Reported by: pho Tested by: pho MFC after: 1 week	2016-11-20 14:00:50 +00:00
Andriy Gapon	2ec31e84cc	zfs: fix up after the removal of PG_CACHED pages in r308691 PR: 214629 Reported by: mshirk@daemon-security.com Reviewed by: alc Tested by: Shawn Webb <shawn.webb@hardenedbsd.org> X-MFC with: 308691	2016-11-19 08:12:57 +00:00
Mark Johnston	188011dbf2	Support fetching RFLAGS in fasttrap_getreg(). MFC after: 1 week	2016-11-18 03:11:11 +00:00
Alexander Motin	14b5719f6a	After some ZIL changes 6 years ago zil_slog_limit got partially broken due to zl_itx_list_sz not updated when async itx'es upgraded to sync. Actually because of other changes about that time zl_itx_list_sz is not really required to implement the functionality, so this patch removes some unneeded broken code and variables. Original idea of zil_slog_limit was to reduce chance of SLOG abuse by single heavy logger, that increased latency for other (more latency critical) loggers, by pushing heavy log out into the main pool instead of SLOG. Beside huge latency increase for heavy writers, this implementation caused double write of all data, since the log records were explicitly prepared for SLOG. Since we now have I/O scheduler, I've found it can be much more efficient to reduce priority of heavy logger SLOG writes from ZIO_PRIORITY_SYNC_WRITE to ZIO_PRIORITY_ASYNC_WRITE, while still leave them on SLOG. Existing ZIL implementation had problem with space efficiency when it has to write large chunks of data into log blocks of limited size. In some cases efficiency stopped to almost as low as 50%. In case of ZIL stored on spinning rust, that also reduced log write speed in half, since head had to uselessly fly over allocated but not written areas. This change improves the situation by offloading problematic operations from z_log_write() to zil_lwb_commit(), which knows real situation of log blocks allocation and can split large requests into pieces much more efficiently. Also as side effect it removes one of two data copy operations done by ZIL code WR_COPIED case. While there, untangle and unify code of z_log_write() functions. Also zfs_log_write() alike to zvol_log_write() can now handle writes crossing block boundary, that may also improve efficiency if ZPL is made to do that. Sponsored by: iXsystems, Inc.	2016-11-17 21:01:27 +00:00
Alexander Motin	eb9bfc257d	Revert r307392: I've found a way to avoid big allocations completely.	2016-11-17 20:44:51 +00:00
Alan Cox	7667839a7e	Remove most of the code for implementing PG_CACHED pages. (This change does not remove user-space visible fields from vm_cnt or all of the references to cached pages from comments. Those changes will come later.) Reviewed by: kib, markj Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8497	2016-11-15 18:22:50 +00:00
Mark Johnston	375c8b20dc	Remove the DTrace printt and typeref actions. These are FreeBSD-specific and were added in r178576 to provide the ability to pretty-print instances of compound types. However, the print action has long since been augmented to provide this functionality with a simpler interface. Discussed with: gnn Differential Revision: https://reviews.freebsd.org/D8478	2016-11-12 19:26:12 +00:00
Bryan Drewery	28323add09	Fix improper use of "its". Sponsored by: Dell EMC Isilon	2016-11-08 23:59:41 +00:00
Oleksandr Tymoshenko	d30e308465	Fix include order as required post r308415	2016-11-07 20:02:18 +00:00
Alexander Motin	8acf168aab	Fix ZIL records ordering when ZVOL opened both with and without FSYNC. Before this an earlier writes to a ZVOL opened without FSYNC could get to ZIL after later writes to the same ZVOL opened with FSYNC. Fix this by replicating functionality of ZPL (zv_sync_cnt equivalent to z_sync_cnt), marking all log records sync if anybody opened the ZVOL with FSYNC. MFC after: 2 weeks	2016-11-01 16:03:31 +00:00
Alexander Motin	2d1d8f4c8f	Pass to zvol_log_truncate() same sync values as to zvol_log_write(). Surplus marking of TX_TRUNCATE records as sync could result in putting them into ZIL before previous writes if ones were async. MFC after: 2 weeks	2016-11-01 12:47:19 +00:00
Alexander Motin	74a148f46f	Add sysctls for zfs_immediate_write_sz and zvol_immediate_write_sz.	2016-10-29 23:25:12 +00:00
Andriy Gapon	97371ba2a9	zfsbootcfg: a simple tool to set next boot (one time) options for zfsboot (gpt)zfsboot will read one-time boot directives from a special ZFS pool area. The area was previously described as "Boot Block Header", but currently it is know as Pad2, marked as reserved and is zeroed out on pool creation. The new code interprets data in this area, if any, using the same format as boot.config. The area is immediately wiped out. Failure to parse the directives results in a reboot right after the cleanup. Otherwise the boot sequence proceeds as usual. zfsbootcfg writes zfsboot arguments specified on its command line to the Pad2 area of a disk identified by vfs.zfs.boot.primary_pool and vfs.zfs.boot.primary_vdev kenv variables that are set by loader during boot. Please see the manual page for more. Thanks to all who reviewed, contributed and made suggestions! There are many potential improvements to the feature, please see the review for details. Reviewed by: wblock (docs) Discussed with: jhb, tsoome MFC after: 3 weeks Relnotes: yes Differential Revision: https://reviews.freebsd.org/D7612	2016-10-29 14:09:32 +00:00
Alexander Motin	471cf6ce7d	Add vdev_reopening support to vdev_geom. It allows to avoid extra GEOM providers flapping without significant need. Since GEOM got resize support, we don't need to reopen provider to get new size. If provider was orphaned and no longer valid, ZFS should already know that, and in such case reopen should be done in full as expected. MFC after: 2 weeks	2016-10-28 17:05:14 +00:00
Alexander Motin	f106f43aa2	Matching GUIDs, handle possible race on vdev detach. In case of vdev detach, causing top level mirror vdev destruction, leaf vdev changes its GUID to one of the destroyed mirror, that creates race condition when GUID in vdev label may not match one in the pool config. This change replicates logic nuance of vdev_validate() by adding special exception, matching the vdev GUID against the top level vdev GUID. Since this exception is not completely reliable (may give false positives if we fail to erase label on detached vdev), use it only as last resort. Quick way to reproduce this scenario now is detach vdev from a pool with enabled autoextend. During vdev detach autoextend logic tries to reopen remaining vdev, that always fails now since in-memory configuration is already updated, while on-disk labels are not yet. MFC after: 2 weeks	2016-10-28 16:21:31 +00:00
Alexander Motin	4be4cba048	Improve few debugging log messages.	2016-10-28 15:30:10 +00:00
Andriy Gapon	539fc86f2e	3746 ZRLs are racy illumos/illumos-gate@260af64db7 `260af64db7` https://www.illumos.org/issues/3746 From the original change log: It was possible for a reference to be added even with the lock held, and for references added just after a lock release to be lost. This bug was also independently found and reported in wesunsolve.net issues 6985013 6995524. In zrl_add(), always use an atomic operation to update the refcount. The mutex in the ZRL only guarantees that wakeups occur for waiters on the lock. It offers no protection against concurrent updates of the refcount. The only refcount transition that is safe to perform without an atomic operation is from ZRL_LOCKED back to 0, since this can only be performed by the thread which has the ZRL locked. Authored by: Will Andrews <will@freebsd.org> Reviewed by: Boris Protopopov <bprotopopov@hotmail.com> Reviewed by: Pavel Zakharov <pavel.zakha@gmail.com> Reviewed by: Yuri Pankov <yuri.pankov@gmail.com> Reviewed by: Justin T. Gibbs <gibbs@scsiguy.com> Approved by: Matt Ahrens <mahrens@delphix.com> Author: Youzhong Yang <yyang@mathworks.com> PR: 204037 MFC after: 1 week	2016-10-27 07:38:07 +00:00
Alexander Motin	f0cbbdecbc	Fix panic after ZVOL renamed to name invalid for DEVFS. MFC after: 2 weeks	2016-10-24 12:24:24 +00:00
Alexander Motin	9be66df1e1	Add vfs.zfs.zil_log_limit sysctl. It is at least partially broken now, but that is another question.	2016-10-16 18:49:15 +00:00
Alexander Motin	a059d8ccbc	Optimize ZIL itx memory allocation on FreeBSD. These allocations can reach up to 128KB, while FreeBSD kernel allocator can cache allocations only up to 64KB. To avoid expensive allocations for each large ZIL write use caching zio_buf_alloc() allocator instead. To make it possible de-inline few instances of zil_itx_destroy().	2016-10-16 10:43:12 +00:00
Alexander Motin	1899e205d1	MFV r307314: 6988 spa_sync() spends half its time in dmu_objset_do_userquota_updates Using a benchmark which creates 2 million files in one TXG, I observe that the thread running spa_sync() is on CPU almost the entire time we are syncing, and therefore can be a performance bottleneck. About 50% of the time in spa_sync() is in dmu_objset_do_userquota_updates(). The problem is that dmu_objset_do_userquota_updates() calls zap_increment_int(DMU_USERUSED_OBJECT) once for every file that was modified (or created). In this benchmark, all the files are owned by the same user/group, so all 2 million calls to zap_increment_int() are modifying the same entry in the zap. The same issue exists for the DMU_GROUPUSED_OBJECT. We should keep an in-memory map from user to space delta while we are syncing, and when we finish, iterate over the in-memory map and modify the ZAP once per entry. This reduces the number of calls to zap_increment_int() from "number of objects modified" to "number of owners/groups of modified files". This reduced the time spent in spa_sync() in the file create benchmark by ~33%, from 11 seconds to 7 seconds. Closes #107 Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: Ned Bass <bass6@llnl.gov> Reviewed by: Jinshan Xiong <jinshan.xiong@intel.com> Author: Matthew Ahrens <mahrens@delphix.com> openzfs/openzfs@5fc46359c5	2016-10-14 12:03:04 +00:00
Alexander Motin	b3a8b04807	MFV r307313: 5120 zfs should allow large block/gzip/raidz boot pool (loader project) Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Andrew Stormont <andyjstormont@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Toomas Soome <tsoome@me.com> openzfs/openzfs@c8811bd3e2 FreeBSD still does not support booting from gzip-compressed datasets, so keep one chunk of this commit out.	2016-10-14 12:01:33 +00:00
Konstantin Belousov	5975e53d40	Fix a race in vm_page_busy_sleep(9). Suppose that we have an exclusively busy page, and a thread which can accept shared-busy page. In this case, typical code waiting for the page xbusy state to pass is again: VM_OBJECT_WLOCK(object); ... if (vm_page_xbusied(m)) { vm_page_lock(m); VM_OBJECT_WUNLOCK(object); <---1 vm_page_busy_sleep(p, "vmopax"); goto again; } Suppose that the xbusy state owner locked the object, unbusied the page and unlocked the object after we are at the line [1], but before we executed the load of the busy_lock word in vm_page_busy_sleep(). If it happens that there is still no waiters recorded for the busy state, the xbusy owner did not acquired the page lock, so it proceeded. More, suppose that some other thread happen to share-busy the page after xbusy state was relinquished but before the m->busy_lock is read in vm_page_busy_sleep(). Again, that thread only needs vm_object lock to proceed. Then, vm_page_busy_sleep() reads busy_lock value equal to the VPB_SHARERS_WORD(1). In this case, all tests in vm_page_busy_sleep(9) pass and we are going to sleep, despite the page being share-busied. Update check for m->busy_lock == VPB_UNBUSIED in vm_page_busy_sleep(9) to also accept shared-busy state if we only wait for the xbusy state to pass. Merge sequential if()s with the same 'then' clause in vm_page_busy_sleep(). Note that the current code does not share-busy pages from parallel threads, the only way to have more that one sbusy owner is right now is to recurse. Reported and tested by: pho (previous version) Reviewed by: alc, markj Sponsored by: The FreeBSD Foundation MFC after: 1 week Differential revision: https://reviews.freebsd.org/D8196	2016-10-13 14:41:05 +00:00
Konstantin Belousov	f71d08566c	Limit scope of the optimization in r306608 to dounmount() caller only. Other uses of cache_purgevfs() do rely on the cache purge for correct operations, when paths are invalidated without unmount. Reported and tested by: jkim Discussed with: mjg Sponsored by: The FreeBSD Foundation	2016-10-07 11:38:28 +00:00
Andriy Gapon	6f98c83306	implement zfs_vptocnp() using z_parent property This should allow vn_fullpath() to work even when vfs name cache is disabled for zfs, which is the case when zfs properties like casesensitivity and normalization are set non-default values. The new code should be 100% reliable for directories and "mostly" reliable for files, that is, when hardlinks across directories are not used. Reported by: Frederic Chardon <chardon.frederic@gmail.com> Reviewed by: kib (vfs contract) MFC after: 2 weeks Differential Revision: https://reviews.freebsd.org/D8146	2016-10-07 06:29:24 +00:00
Andriy Gapon	9ba3abc30e	zfs: fix a wrong assertion for extended attributes For the extended attributes the order between z_teardown_lock and the vnode lock is different. The bug was triggered only with DIAGNOSTIC turned on. This fix is developed in cooperation with avos. PR: 213112 Reported by: avos Tested by: avos MFC after: 1 week	2016-10-04 08:09:25 +00:00
Mark Johnston	4538cee5bf	Allow tracing of functions prefixed by "__". This restriction was inherited from upstream but is not relevant on FreeBSD. Furthermore, it hindered the tracing of locking primitive subroutines. MFC after: 1 week	2016-10-02 00:35:00 +00:00
Alexander Motin	863ef2ca62	Add #ifdef _KERNEL around send_holes_without_birth_time sysctl. Reported by: avg@	2016-09-29 17:48:53 +00:00
Alexander Motin	226a11f81e	MFV r306423: 7402 Create tunable to ignore hole_birth feature Until we can resolve the numerous hole_birth bugs that have cropped up recently, and come up with a way going forwards to protect users from corruption, we should disable the hole_birth feature. Using a tunable allows those who are confident that their data is correct to continue to take advantage of the feature. Closes #188 Reviewed by: Matthew Ahrens <mahrens@delphix.com> Author: Paul Dagnelie <pcd@delphix.com>	2016-09-29 00:00:37 +00:00
Alexander Motin	bb97118138	MFV r306422: 7254 ztest failed assertion in ztest_dataset_dirobj_verify: dirobjs + 1 == usedobjs dsl_dataset_space is looking at the ds_bp's fill count while dmu_objset_write_ready() is concurrently modifying it. This fix adds an rrwlock to protect the ds_bp. Closes #180 Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Author: Paul Dagnelie <pcd@delphix.com>	2016-09-28 23:54:47 +00:00
Mark Johnston	9e579a58c3	Move implementations of uread() and uwrite() to the illumos compat layer. MFC after: 1 week	2016-09-24 21:40:14 +00:00
Andriy Gapon	d26312a4e4	fix vnode lock assertion for extended attributes directory Background. In ZFS a file with extended attributes has a special directory associated with it where each extended attribute is a file. The attribute's name is a file name and its value is a file content. When the ownership of a file with extended attributes is changed, ZFS also changes ownership of the special directory. This is where the bug was hit. The bug was introduced in r209158. Nota bene. ZFS vnode locks are typically acquired before z_teardown_lock (i.e., before ZFS_ENTER). But this is not the case for the vnodes that represent the extended attribute directory and files. Those are always locked after ZFS_ENTER. This is confusing and fragile. PR: 212702 Reported by: Christian Fuss to FreeNAS Tested by: mav MFC after: 1 week	2016-09-24 08:13:15 +00:00
Mark Johnston	36f5d07745	Re-check the systrace probe ID before calling dtrace_probe(). Otherwise there exists a narrow window during which a syscall probe can be disabled and cause a concurrently-running thread to call dtrace_probe() with an invalid probe ID. Reported by: ngie MFC after: 1 week Sponsored by: Dell EMC Isilon	2016-09-22 23:22:53 +00:00
Allan Jude	c2b475d0ee	MFV r268120: 4936 lz4 could theoretically overflow a pointer with a certain input illumos/illumos-gate@58d0718061 Reviewed by: delphij MFC after: 2 weeks Sponsored by: ScaleEngine Inc. Differential Revision: https://reviews.freebsd.org/D7850	2016-09-11 17:48:06 +00:00
Alexander Motin	20e45e033c	Switch random_get_pseudo_bytes() shim to arc4rand(). Our shim for Solaris random_get_bytes() uses read_random(), that looks reasonable, since it guaranties reliably seeded random data. On the other side Solaris random_get_pseudo_bytes() does not provide this guarantie, and its original Solaris implementation is equivalent to our arc4rand(), using software crypto without stressing slower hardware RNG.	2016-09-10 09:37:41 +00:00
Alexander Motin	4605bf63c4	MFV r305562: 7259 DS_FIELD_LARGE_BLOCKS is unused The DS_FIELD_LARGE_BLOCKS macro has been unused since the integration of this patch: commit ca0cc3918a1789fa839194af2a9245f801a06b1a Author: Matthew Ahrens <mahrens@delphix.com> Date: Fri Jul 24 09:53:55 2015 -0700 5959 clean up per-dataset feature count code Reviewed by: Toomas Soome <tsoome@me.com> Reviewed by: George Wilson <george@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> This patch simply removes this macro from dsl_dataset.h. Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Author: Matthew Ahrens <mahrens@delphix.com>	2016-09-07 20:09:24 +00:00
Alexander Motin	de1fdddeda	MFV r305560: 7278 tuning zfs_arc_max does not impact arc_c_min When changing zfs_arc_max (e.g. as zdb does), it may be set to less than the default arc_c_min. arc_c_min should decrease to not be more than arc_c_max, but it doesn't; therefore tuning of arc_c_max is ineffective. Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Author: Matthew Ahrens <mahrens@delphix.com> openzfs/openzfs@608764bead	2016-09-07 20:05:10 +00:00
Andriy Gapon	1a82707cd7	fix zfs pool creation accidentally broken by r305331 The upstream change introduced a new load state, SPA_LOAD_CREATE, and vdev_geom code needs to be aware of it. Tested by: cy MFC after: 1 week X-MFC with: r305331	2016-09-06 06:09:12 +00:00
Alexander Motin	9b9258a12a	Missed FreeBSD-specific piece of r305338.	2016-09-03 11:17:33 +00:00
Alexander Motin	d7e781bda3	MFC r305337: 7004 dmu_tx_hold_zap() does dnode_hold() 7x on same object Using a benchmark which has 32 threads creating 2 million files in the same directory, on a machine with 16 CPU cores, I observed poor performance. I noticed that dmu_tx_hold_zap() was using about 30% of all CPU, and doing dnode_hold() 7 times on the same object (the ZAP object that is being held). dmu_tx_hold_zap() keeps a hold on the dnode_t the entire time it is running, in dmu_tx_hold_t:txh_dnode, so it would be nice to use the dnode_t that we already have in hand, rather than repeatedly calling dnode_hold(). To do this, we need to pass the dnode_t down through all the intermediate calls that dmu_tx_hold_zap() makes, making these routines take the dnode_t* rather than an objset_t* and a uint64_t object number. In particular, the following routines will need to have analogous *_by_dnode() variants created: dmu_buf_hold_noread() dmu_buf_hold() zap_lookup() zap_lookup_norm() zap_count_write() zap_lockdir() zap_count_write() This can improve performance on the benchmark described above by 100%, from 30,000 file creations per second to 60,000. (This improvement is on top of that provided by working around the object allocation issue. Peak performance of ~90,000 creations per second was observed with 8 CPUs; adding CPUs past that decreased performance due to lock contention.) The CPU used by dmu_tx_hold_zap() was reduced by 88%, from 340 CPU-seconds to 40 CPU-seconds. Sponsored by: Intel Corp. Closes #109 Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Ned Bass <bass6@llnl.gov> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Author: Matthew Ahrens <mahrens@delphix.com> openzfs/openzfs@d3e523d489	2016-09-03 11:00:29 +00:00
Alexander Motin	4ad4b70e77	MFV r305336: 7247 zfs receive of deduplicated stream fails This resolves two 'zfs recv' issues. First, when receiving into an existing filesystem, a snapshot created during the receive process is not added to the guid->dataset map for the stream, resulting in failed lookups for deduped streams when a WRITE_BYREF record refers to a snapshot received earlier in the stream. Second, the newly created snapshot was also not set properly, referencing the snapshot before the new receiving dataset rather than the existing filesystem. Closes #159 Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Author: Chris Williamson <chris.williamson@delphix.com> openzfs/openzfs@b09697c8c1	2016-09-03 10:59:05 +00:00
Alexander Motin	070da3f779	MFV r305335: 7003 zap_lockdir() should tag hold zap_lockdir() / zap_unlockdir() should take a "void *tag" argument which tags the hold on the zap. This will help diagnose programming errors which misuse the hold on the ZAP. Sponsored by: Intel Corp. Closes #108 Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com> Reviewed by: Steve Gonczi <steve.gonczi@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Author: Matthew Ahrens <mahrens@delphix.com> openzfs/openzfs@0780b3eab5	2016-09-03 10:58:14 +00:00
Alexander Motin	d3ec2cdb4a	MFV r304157: 7230 add assertions to dmu_send_impl() to verify that stream includes BEGIN and END records illumos/illumos-gate@12b90ee2d3 https://github.com/illumos/illumos-gate/commit/12b90ee2d3b10689fc45f4930d2392f5f e1d9cfa https://www.illumos.org/issues/7230 A test failure occurred where a send stream had only a BEGIN record. This should not be possible if the send returns without error. Prevented this from happening in the future by adding an assertion to dmu_send_impl() to verify that if the function returns 0 (success) both a BEGIN and END record are present. Did this by adding flags to dmu_sendarg_t (indicating whether BEGIN o r END records sent), having dump_record() set flags appropriately, adding VERIFY statement to dmu_send_impl(). Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matt Krantz <matt.krantz@delphix.com>	2016-09-03 10:10:58 +00:00
Alexander Motin	7aafc9d4c8	MFV r304156: 7235 remove unused func dsl_dataset_set_blkptr illumos/illumos-gate@bd56f80007 https://github.com/illumos/illumos-gate/commit/bd56f80007857b960e0981ed0797ad8ec 844a96b https://www.illumos.org/issues/7235 The function dsl_dataset_set_blkptr() is unused. We should remove it. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com>	2016-09-03 10:09:23 +00:00
Alexander Motin	c9fa25c110	MFV r304155: 7090 zfs should improve allocation order and throttle allocations illumos/illumos-gate@0f7643c737 https://github.com/illumos/illumos-gate/commit/0f7643c7376dd69a08acbfc9d1d7d548b 10c846a https://www.illumos.org/issues/7090 When write I/Os are issued, they are issued in block order but the ZIO pipelin e will drive them asynchronously through the allocation stage which can result i n blocks being allocated out-of-order. It would be nice to preserve as much of the logical order as possible. In addition, the allocations are equally scattered across all top-level VDEVs but not all top-level VDEVs are created equally. The pipeline should be able t o detect devices that are more capable of handling allocations and should allocate more blocks to those devices. This allows for dynamic allocation distribution when devices are imbalanced as fuller devices will tend to be slower than empty devices. The change includes a new pool-wide allocation queue which would throttle and order allocations in the ZIO pipeline. The queue would be ordered by issued time and offset and would provide an initial amount of allocation of work to each top-level vdev. The allocation logic utilizes a reservation system to reserve allocations that will be performed by the allocator. Once an allocatio n is successfully completed it's scheduled on a given top-level vdev. Each top- level vdev maintains a maximum number of allocations that it can handle (mg_alloc_queue_depth). The pool-wide reserved allocations (top-levels * mg_alloc_queue_depth) are distributed across the top-level vdevs metaslab groups and round robin across all eligible metaslab groups to distribute the work. As top-levels complete their work, they receive additional work from the pool-wide allocation queue until the allocation queue is emptied. Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Reviewed by: Christopher Siden <christopher.siden@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Sebastien Roy <sebastien.roy@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: George Wilson <george.wilson@delphix.com>	2016-09-03 10:04:37 +00:00
Alexander Motin	0b51a59fc7	MFV r303078: 7086 ztest attempts dva_get_dsize_sync on an embedded blockpointer illumos/illumos-gate@926549256b https://github.com/illumos/illumos-gate/commit/926549256b71acd595f69b236779ff6b7 8fa08ef https://www.illumos.org/issues/7086 In dbuf_dirty(), we need to grab the dn_struct_rwlock before looking at the db_blkptr, to prevent it from being changed by syncing context. Otherwise we may see that ztest got a segfault from this stack: libzpool.so.1`dva_get_dsize_sync+0x98(872f000, `b32b240`, fed7811b, 0, b4cda20, 0) libzpool.so.1`bp_get_dsize+0x60(872f000, `b32b240`, 0, 97cb780, 9d4c1a8, 0) libzpool.so.1`dbuf_dirty+0x9b3(ce0a100, 97cb780, 9, fecd2530) libzpool.so.1`dmu_buf_will_dirty+0xc3(ce0a100, 97cb780, ea293d6c, 1) libzpool.so.1`zap_lockdir+0x1a0(8aaa3c0, 1, 0, 97cb780, 1, 1) libzpool.so.1`zap_remove_norm+0x30(8aaa3c0, 1, 0, 8728b10, 0, 97cb780) libzpool.so.1`zap_remove+0x29(8aaa3c0, 1, 0, 8728b10, 97cb780, a) ztest_replay_remove+0x225(ea294588, 8728ae8, 0, 38010000, 0, 0) ztest_remove+0x9f(ea294588, ea293f50, 4, 3) ztest_object_init+0x78(ea294588, ea293f50, 4e0, 1) ztest_dmu_object_alloc_free+0x71(ea294588, 13) ztest_dmu_objset_create_destroy+0x224(80cef08, 13, 0, 805d36c, 9017ad44, 0) ztest_execute+0x89(a, 807c720, 13, 0) ztest_thread+0xea(13, 0, 0, 0) libc.so.1`_thrp_setup+0x88(f0983240) libc.so.1`_lwp_start(f0983240, 0, 0, 0, 0, 0) Looking into it a bit, we see that this is an embedded blockpointer, so BP_GET_NDVAS should have returned 0: b32b240::blkptr EMBEDDED [L0 ZAP_OTHER] et=0 LZ4 size=200L/4aP birth=80L Instead, it looks like another thread is modifying this blockpointer: b32b240::ugrep \| ::whatis f47a0e0c is in [ stack tid=0x19f ] ebd6ec40 is in [ stack tid=0x226 ] ea293bd0 is in [ stack tid=0x244 ] ea293be4 is in [ stack tid=0x244 ] Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com>	2016-09-03 08:43:43 +00:00
Alexander Motin	84c3781ac9	MFV r303077: 7072 zfs fails to expand if lun added when os is in shutdown state illumos/illumos-gate@c39a2aae1e `c39a2aae1e` https://www.illumos.org/issues/7072 upstream: 38733 zfs fails to expand if lun added when os is in shutdown state DLPX-36910 spares and caches should not display expandable space DLPX-39262 vdev_disk_open spam zfs_dbgmsg buffer Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Alex Reece <alex@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: George Wilson <george.wilson@delphix.com>	2016-09-03 08:42:12 +00:00
Alexander Motin	efa0867fb0	MFV r302991: 6950 ARC should cache compressed data illumos/illumos-gate@dcbf3bd6a1 `dcbf3bd6a1` https://www.illumos.org/issues/6950 When reading compressed data from disk, the ARC should keep the compressed block cached and only decompress it when consumers access the block. The uncompressed data should be short-lived allowing the ARC to cache a much larger amount of data. The DMU would also maintain a smaller cache of uncompressed blocks to minimize the impact of decompressing frequently accessed blocks. Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matt Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Don Brady <don.brady@intel.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: George Wilson <george.wilson@delphix.com>	2016-09-03 08:30:51 +00:00
Alexander Motin	c543b519be	MFV r304158: 7136 ESC_VDEV_REMOVE_AUX ought to always include vdev information 7115 6922 generates ESC_ZFS_VDEV_REMOVE_AUX a bit too often illumos/illumos-gate@b72b6bb10a https://github.com/illumos/illumos-gate/commit/b72b6bb10ad55121a1b352c6f68ebdc8e 20c9086 https://www.illumos.org/issues/7136 6922 added ESC_ZFS_VDEV_REMOVE_AUX and ESC_ZFS_VDEV_REMOVE_DEV sysevents whenever an aux device gets removed from a pool. However, those sysevents will be created without the vdev_guid and vdev_path fields. It would be better to always populate those fields. https://www.illumos.org/issues/7115 The addition of spa_event_notify in vdev removal code (see #6922) causes event s to be generated even if the spare failed to be removed with EBUSY. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Approved by: Robert Mustacchi <rm@joyent.com> Author: Alan Somers <asomers@gmail.com>	2016-09-01 18:37:11 +00:00
Alexander Motin	25584d12e7	MFV r302993: 7104 increase indirect block size illumos/illumos-gate@4b5c8e93ca https://github.com/illumos/illumos-gate/commit/4b5c8e93cab28d3c65ba9d407fd8f46e3 be1db1c https://www.illumos.org/issues/7104 The current default indirect block size is 16KB. We can improve performance by increasing it to 128KB. This is especially helpful for any workload that needs to read most of the metadata, e.g. scrub/resilver, file deletion, filesystem deletion, and zfs send. We also need to fix a few space estimation errors to make the tests pass. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com>	2016-09-01 18:33:39 +00:00
Alexander Motin	dd7f7cb7ac	MFV r302992: 7071 lzc_snapshot does not fill in errlist on ENOENT illumos/illumos-gate@25f7d993ad https://github.com/illumos/illumos-gate/commit/25f7d993adbfb3452ac4625b379167074 6d35ae3 https://www.illumos.org/issues/7071 upstream DLPX-40482 lzc_snapshot does not fill in errlist on ENOENT Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com>	2016-09-01 18:25:49 +00:00
Alexander Motin	3d1e0e0830	MFV r302662: 6447 handful of nvpair cleanups illumos/illumos-gate@759e89be35 https://github.com/illumos/illumos-gate/commit/759e89be359f2af635e4122d147df56bc e948773 https://www.illumos.org/issues/6447 I got a patch from someone who uses nvpair code outside of illumos. It fixes a couple of gcc warnings/bugs for him. 1. silence uninitialized use warnings 2. add parentheses around assignment used as truth value 3. fix printf format specifier (ll is for integers only) 4. strstr, strspn, strcspn, and strcmp are declared in string.h, not strings.h. 5. avoid scanning integer into boolean variable Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Reviewed by: Andy Stormont <astormont@racktopsystems.com> Reviewed by: Garrett D'Amore <garrett@damore.org> Approved by: Robert Mustacchi <rm@joyent.com> Author: Steve Dougherty <sdougherty@barracuda.com>	2016-09-01 15:17:39 +00:00
Alexander Motin	3421688c2d	MFV r302661: 7082 bptree_iterate() passes wrong args to zfs_dbgmsg() illumos/illumos-gate@10e67aa0db https://github.com/illumos/illumos-gate/commit/10e67aa0db0823d5464aafdd681f3c966 155c68e https://www.illumos.org/issues/7082 upstream DLPX-40542 bptree_iterate() passes wrong args to zfs_dbgmsg() Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Matthew Ahrens <mahrens@delphix.com>	2016-09-01 15:10:40 +00:00
Alexander Motin	41b9077ef6	MFV r302660: 6314 buffer overflow in dsl_dataset_name illumos/illumos-gate@9adfa60d48 https://github.com/illumos/illumos-gate/commit/9adfa60d484ce2435f5af77cc99dcd4e6 92b6660 https://www.illumos.org/issues/6314 Callers of dsl_dataset_name pass a buffer of size ZFS_MAXNAMELEN, but dsl_dataset_name copies the datasets' name PLUS the snapshot name to it, resulting in a max of 2 * ZFS_MAXNAMELEN + '@'. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Matthew Ahrens <mahrens@delphix.com>	2016-09-01 15:08:27 +00:00
Alexander Motin	e12a269749	MFV r302659: 6931 lib/libzfs: cleanup gcc warnings illumos/illumos-gate@88f61dee20 `88f61dee20` https://www.illumos.org/issues/6931 need cleanup: CERRWARN += -_gcc=-Wno-switch CERRWARN += -_gcc=-Wno-parentheses CERRWARN += -_gcc=-Wno-unused-function Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Igor Kozhukhov <ikozhukhov@gmail.com>	2016-09-01 14:57:06 +00:00
Alexander Motin	a95a9fe945	MFV r302651: 7054 dmu_tx_hold_t should use refcount_t to track space illumos/illumos-gate@0c779ad424 https://github.com/illumos/illumos-gate/commit/0c779ad424a92a84d1e07d47cab7f8009 189202b https://www.illumos.org/issues/7054 upstream: ee0003de7d3e598499be7ac3fe6b61efcc47cb7f DLPX-40399 dmu_tx_hold_t should use refcount_t to track space Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Matthew Ahrens <mahrens@delphix.com>	2016-09-01 14:38:25 +00:00
Alexander Motin	96bf48b8cb	MFV r302648: 7019 zfsdev_ioctl skips secpolicy when FKIOCTL is set Note that the bulk of the upstream change is not applicable to FreeBSD and the affected files are not even in the vendor area. illumos/illumos-gate@45b1747515 `45b1747515` https://www.illumos.org/issues/7019 Currently zfsdev_ioctl, when confronted by a request with the FKIOCTL flag set, skips all processing of secpolicy functions. This means that ZFS is not doing any kind of verification of the credentials or access rights of the caller and assuming that (as it is an in-kernel client) all such checks have already been done. This turns out to be quite a dangerous assumption, especially with respect to sdev. In general I don't think it's particularly reasonable to offload this enforcement of access rights onto other kernel subsystems when ZFS has some particular local semantics in this area (delegated datasets etc) and does not provide any kind of API to allow other subsystems to avoid code duplication when doing it. ZFS should apply its normal access policy to requests from within the kernel, and callers should take care to give it the correct credentials and call it from the correct context in order to get the results they need. You can observe the currently unfortunate consequences of this bug in any non- global zone that has access to /dev/zvol or any subset of it via sdev profiles. In particular, a zone used to contain a KVM or similar which has a single zvol passed through to it using a <device match= block in its zone XML. Even though sdev makes something of an attempt to control for whether the caller should have access to nodes in /dev/zvol, it doesn't do this correctly, or really at all in the lookup call path. So, if we have a zone that's been given access to any part of /dev/zvol, it can simply look up the full path to any other zvol on the entire system, and the node will appear and be able to be used. Reviewed by: Robert Mustacchi <rm@joyent.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Alex Wilson <alex.wilson@joyent.com>	2016-09-01 14:24:54 +00:00
Alexander Motin	13876b47d7	MFV r302647: 6922 Emit ESC_ZFS_VDEV_REMOVE_AUX after removing an aux device illumos/illumos-gate@63364b0ee2 https://github.com/illumos/illumos-gate/commit/63364b0ee2604783e7a55f84258888677 68eafa4 https://www.illumos.org/issues/6922 ZFS does not do a config_sync after removing an aux (spare, log, or cache) device. AFAICT this isn't being done because it is slow and was deemed unnecessary. However, it should be such a rare operation that speed doesn't matter, and not doing it results in two problems: 1) It is theoretically possible to remove an aux device from one pool and attach it to another, then lose power. When power is restored, both pools woul d think that they own the aux device. 2) Removal of the aux device doesn't send any useful sysevents to userland. Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Alan Somers <asomers@gmail.com>	2016-09-01 14:17:30 +00:00
Alexander Motin	1c7d88abed	MFV r302646: 6980 6902 causes zfs send to break due to 32-bit/64-bit struct mismatch illumos/illumos-gate@ea4a67f462 https://github.com/illumos/illumos-gate/commit/ea4a67f462de0a39a9adea8197bcdef84 9de5371 https://www.illumos.org/issues/6980 doing zfs send -i snap1 snap2 >testfile results in internal error: Invalid argument Abort (core dumped) Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Matthew Ahrens <mahrens@delphix.com>	2016-09-01 14:06:30 +00:00
Alexander Motin	4536fd9bed	MFV r302643: 6902 speed up listing of snapshots if requesting name only and sorting by name This was our change from the beginning, so just reduce the upstream diff.	2016-09-01 13:29:53 +00:00
Alexander Motin	5fd28943d6	MFV r302642: 6876 Stack corruption after importing a pool with a too-long name illumos/illumos-gate@c971037baa `c971037baa` https://www.illumos.org/issues/6876 Calling dsl_dataset_name on a dataset with a 256 byte buffer is asking for trouble. We should check every dataset on import, using a 1024 byte buffer and checking each time to see if the dataset's new name is longer than 256 bytes. Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Paul Dagnelie <pcd@delphix.com>	2016-09-01 13:04:36 +00:00
Alexander Motin	9007a8679a	Fix kernel panic when inheriting properties without default. There are two writable hidden properties "iscsioptions" and "stmf_sbd_lu", that have no default string value. Attempt to unset them or replicate caused kernel panic. This simple bandaid seems fixes the problem nicely. MFC after: 2 weeks	2016-08-31 11:55:31 +00:00
Toomas Soome	f1624ed8c4	Bug 212114 - loader: zio_checksum_verify() must test spa for NULL pointer The issue was introduced with adding support for salted checksums, and was revealed by bhyve userboot.so. During pool discovery the loader is reading pool label from disks, and at that time the spa structure is not yet set up, so the NULL pointer is passed for spa. This condition must be checked to avoid the corruption of the memory and NULL pointer dereference. PR: 212114 Reported by: tsoome@freebsd.com Reviewed by: allanjude Approved by: allanjude (mentor) Differential Revision: https://reviews.freebsd.org/D7634	2016-08-24 16:30:15 +00:00
Toomas Soome	2c55d0903d	Add SHA512, skein, large blocks support for loader zfs. Updated sha512 from illumos. Using skein from freebsd crypto tree. Since loader itself is using 64MB memory for heap, updated zfsboot to use same, and this also allows to support zfs large blocks. Note, adding additional features does increate zfsboot code, therefore this update does increase zfsboot code to 128k, also I have ported gptldr.S update to zfsldr.S to support 64k+ code. With this update, boot1.efi has almost reached the current limit of the size set for it, so one of the future patches for boot1.efi will need to increase the limit. Currently known missing zfs features in boot loader are edonr and gzip support. Reviewed by: delphij, imp Approved by: imp (mentor) Obtained from: sha256.c update and skein_zfs.c stub from illumos. Differential Revision: https://reviews.freebsd.org/D7418	2016-08-18 00:37:07 +00:00
Mark Johnston	59ceeddecf	MFV r301526: 7035 string-related subroutines should validate input earlier Reviewed by: Alex Wilson <alex.wilson@joyent.com> Reviewed by: Bryan Cantrill <bryan@joyent.com> Approved by: Matthew Ahrens <mahrens@delphix.com> Author: Patrick Mooney <pmooney@pfmooney.com> illumos/illumos-gate@771e39c3b1 MFC after: 2 weeks	2016-08-16 02:25:19 +00:00
Mark Johnston	f66200ee22	MFV r301525: 7033 ustack helper should fault on bad return values Reviewed by: Patrick Mooney <patrick.mooney@joyent.com> Reviewed by: Bryan Cantrill <bryan@joyent.com> Approved by: Matthew Ahrens <mahrens@delphix.com> Author: Alex Wilson <alex.wilson@joyent.com> illumos/illumos-gate@a2f72b65eb MFC after: 2 weeks	2016-08-16 02:20:02 +00:00
Mark Johnston	4aea8f31b1	MFV r301524: 7034 negative record sizes should be rejected Reviewed by: Patrick Mooney <patrick.mooney@joyent.com> Reviewed by: Bryan Cantrill <bryan@joyent.com> Approved by: Matthew Ahrens <mahrens@delphix.com> Author: Alex Wilson <alex.wilson@joyent.com> illumos/illumos-gate@0b8049bfb0 MFC after: 2 weeks	2016-08-16 02:18:34 +00:00
Mark Johnston	b7125fa9cd	MFV r296989: 6734 dtrace_canstore_statvar() fails for some valid static variables Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Bryan Cantrill <bryan@joyent.com> illumos/illumos-gate@d65f2bb4e5 MFC after: 2 weeks	2016-08-16 02:16:54 +00:00
Andriy Gapon	96762fe314	fix a zfs cross-device rename crash introduced in r303763 The problem was that 'zfsvfs' variable was not initialized if the error was detected, but in the exit path the variable was dereferenced before the error code was checked. Reported by: np MFC after: 3 days X-MFC with: r303763	2016-08-09 06:11:24 +00:00
Justin Hibbits	161c415133	Two fixups for dtrace * Use the right incantation to get the next stack pointer. Since powerpc uses special frames for traps, dereferencing the stack pointer straight up won't get us the next stack pointer in every case. * Clear EE using the correct instruction sequence. The PowerISA states that 'andi.' ANDs the register with 0\|\|<imm>, instead of sign extending or filling out the unavailable bits with 1. Even if it did sign extend, PSL_EE is 0x8000, so ~PSL_EE is 0x7fff, and the upper bits would be cleared. Use rlwinm in the 32-bit case, and a two-rotate sequence in the 64-bit case, the latter chosen to follow the output generated by gcc. MFC after: 1 week	2016-08-06 15:06:19 +00:00
Andriy Gapon	4fb51b52ef	fix .zfs-related cases in zfs_lookup that were broken by r303763 The problem is that the special .zfs nodes are not represented by znodes but by special gfs-based nodes. r303763 changed interface of zfs_dirlook such that started operating on znodes rather than on vnodes and, thus, the function became unsuitable for handling .zfs entities. The solution is to move the handling of the special cases to zfs_lookup, the only consumer of zfs_dirlook. I already had this solution applied in D7421, but for different reasons. Reported by: asomers MFC after: 3 days X-MFC with: r303763	2016-08-06 11:02:07 +00:00
Andriy Gapon	f79bc17233	zfs: honour and make use of vfs vnode locking protocol ZFS POSIX Layer is originally written for Solaris VFS which is very different from FreeBSD VFS. Most importantly many things that FreeBSD VFS manages on behalf of all filesystems are implemented in ZPL in a different way. Thus, ZPL contains code that is redundant on FreeBSD or duplicates VFS functionality or, in the worst cases, badly interacts / interferes with VFS. The most prominent problem is a deadlock caused by the lock order reversal of vnode locks that may happen with concurrent zfs_rename() and lookup(). The deadlock is a result of zfs_rename() not observing the vnode locking contract expected by VFS. This commit removes all ZPL internal locking that protects parent-child relationships of filesystem nodes. These relationships are protected by vnode locks and the code is changed to take advantage of that fact and to properly interact with VFS. Removal of the internal locking allowed all ZPL dmu_tx_assign calls to use TXG_WAIT mode. Another victim, disputable perhaps, is ZFS support for filesystems with mixed case sensitivity. That support is not provided by the OS anyway, so in ZFS it was a buch of dead code. To do: - replace ZFS_ENTER mechanism with VFS managed / visible mechanism - replace zfs_zget with zfs_vget[f] as much as possible - get rid of not really useful now zfs_freebsd_* adapters - more cleanups of unneeded / unused code - fix / replace .zfs support PR: 209158 Reported by: many Tested by: many (thank you all!) MFC after: 5 days Sponsored by: HybridCluster / ClusterHQ Differential Revision: https://reviews.freebsd.org/D6533	2016-08-05 06:23:06 +00:00
Ruslan Bukin	98f50c44e3	Update RISC-V port to Privileged Architecture Version 1.9. Sponsored by: DARPA, AFRL Sponsored by: HEIF5	2016-08-02 14:50:14 +00:00
Allan Jude	4deb8929ea	Make boot code and loader check for unsupported ZFS feature flags OpenZFS uses feature flags instead of a zpool version number to track features since the split from Oracle. In addition to avoiding confusion on ZFS vs OpenZFS version numbers, this also allows features to be added to different operating systems that use OpenZFS in different order. The previous zfs boot code (gptzfsboot) and loader (zfsloader) blindly tries to read the pool, and if failed provided only a vague error message. With this change, both the boot code and loader check the MOS features list in the ZFS label and compare it against the list of features that the loader supports. If any unsupported feature is active, the pool is not considered as a candidate for booting, and a helpful diagnostic message is printed to the screen. Features that are merely enabled via zpool upgrade, but not in use, do not block booting from the pool. Submitted by: Toomas Soome <tsoome@me.com> Reviewed by: delphij, mav Relnotes: yes Differential Revision: https://reviews.freebsd.org/D6857	2016-08-01 19:37:43 +00:00
Enji Cooper	4fcae4df7e	Conditionalize code which defines sysctls per _KERNEL #ifdef guard This resolves several issues when compiling libzpool (userspace library), i.e. -Wimplicit-function-declaration and -Wmissing-declarations issues. MFC after: 2 weeks Reported by: clang Tested with: clang 3.8.1, gcc 4.2.1, gcc 5.3.0 Sponsored by: EMC / Isilon Storage Division	2016-07-31 06:34:49 +00:00
Mark Johnston	57185c52de	Restore an ifdef that should not have been removed in r303535. X-MFC-With: r303535	2016-07-30 07:05:32 +00:00
Mark Johnston	6d1ffb50fc	Include fasttrap handling for DATAMODEL_ILP32 when compiling for amd64. MFC after: 1 month	2016-07-30 03:11:53 +00:00
Ruslan Bukin	573d53050e	Remove unused variables.	2016-07-29 12:29:17 +00:00
Mark Johnston	e86e17af79	Merge {amd64,i386}/instr_size.c into x86_instr_size.c. Also reduce the diff between us and upstream: the input data model will always be DATAMODEL_NATIVE because of a bug (p_model is never set but is always initialized to 0), so we don't need to override the caller anyway. This change is also necessary to support the pid provider for 32-bit processes on amd64. MFC after: 2 weeks	2016-07-20 00:02:10 +00:00
Andriy Gapon	70e3da3892	MFV r302645: 6878 Add scrub completion info to "zpool history" illumos/illumos-gate@1825bc56e5 `1825bc56e5` https://www.illumos.org/issues/6878 Summary of changes: * Replace generic "scan done" message with "scan aborted, restarting", "scan cancelled", or "scan done" * Log number of errors using spa_get_errlog_size * Refactor scan restarting check into static function Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Nav Ravindranath <nav@delphix.com> MFC after: 2 weeks	2016-07-14 11:53:39 +00:00
Andriy Gapon	39a6b17491	MFV r302650: 6940 Cannot unlink directories when over quota illumos/illumos-gate@99189164df `99189164df` https://www.illumos.org/issues/6940 Similar to #6334, but this time with empty directories: $ zfs create tank/quota $ zfs set quota=10M tank/quota $ zfs snapshot tank/quota@snap1 $ zfs set mountpoint=/mnt/tank/quota tank/quota $ mkdir /mnt/tank/quota/dir # create an empty directory $ mkfile 11M /mnt/tank/quota/11M /mnt/tank/quota/11M: initialized 9830400 of 11534336 bytes: Disc quota exceeded $ rmdir /mnt/tank/quota/dir # now unlink the empty directory rmdir: directory "/mnt/tank/quota/dir": Disc quota exceeded From user perspective, I would expect that ZFS is always able to remove files and directories even when the quota is exceeded. Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Simon Klinkert <simon.klinkert@gmail.com> MFC after: 2 weeks	2016-07-14 11:51:01 +00:00
Andriy Gapon	fe0cc75230	MFV r302644: 6513 partially filled holes lose birth time illumos/illumos-gate@8df0bcf0df `8df0bcf0df` https://www.illumos.org/issues/6513 If a ZFS object contains a hole at level one, and then a data block is created at level 0 underneath that l1 block, l0 holes will be created. However, these l0 holes do not have the birth time property set; as a result, incremental sends will not send those holes. Fix is to modify the dbuf_read code to fill in birth time data. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Boris Protopopov <bprotopopov@hotmail.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Paul Dagnelie <pcd@delphix.com> MFC after: 3 weeks	2016-07-14 11:48:42 +00:00
Andriy Gapon	e7ed92bbbc	MFV r302641: 6844 dnode_next_offset can detect fictional holes illumos/illumos-gate@11ceac77ea `11ceac77ea` https://www.illumos.org/issues/6844 dnode_next_offset is used in a variety of places to iterate over the holes or allocated blocks in a dnode. It operates under the premise that it can iterate over the blockpointers of a dnode in open context while holding only the dn_struct_rwlock as reader. Unfortunately, this premise does not hold. When we create the zio for a dbuf, we pass in the actual block pointer in the indirect block above that dbuf. When we later zero the bp in zio_write_compress, we are directly modifying the bp. The state of the bp is now inconsistent from the perspective of dnode_next_offset: the bp will appear to be a hole until zio_dva_allocate finally finishes filling it in. In the meantime, dnode_next_offset can detect a hole in the dnode when none exists. I was able to experimentally demonstrate this behavior with the following setup: 1. Create a file with 1 million dbufs. 2. Create a thread that randomly dirties L2 blocks by writing to the first L0 block under them. 3. Observe dnode_next_offset, waiting for it to skip over a hole in the middle of a file. 4. Do dnode_next_offset in a loop until we skip over such a non-existent hole. The fix is to ensure that it is valid to iterate over the indirect blocks in a dnode while holding the dn_struct_rwlock by passing the zio a copy of the BP and updating the actual BP in dbuf_write_ready while holding the lock. Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Boris Protopopov <bprotopopov@hotmail.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Alex Reece <alex@delphix.com> MFC after: 3 weeks	2016-07-14 11:42:53 +00:00
Andriy Gapon	875e6e5b04	MFV r302640: 6874 rollback and receive need to reset ZPL state to what's on disk illumos/illumos-gate@1fdcbd00c9 `1fdcbd00c9` https://www.illumos.org/issues/6874 When we do a clone swap (caused by "zfs rollback" or "zfs receive"), the ZPL doesn't completely reload the state from the DMU; some values remain cached in the zfsvfs_t. steps to reproduce: ``` #!/bin/bash -x zfs destroy -R test/fs zfs destroy -R test/recvd zfs create test/fs zfs snapshot test/fs@a zfs set userquota@$USER=1m test/fs zfs snapshot test/fs@b zfs send test/fs@a \| zfs recv test/recvd zfs send -i @a test/fs@b \| zfs recv test/recvd zfs userspace test/recvd 1. should show 1m quota dd if=/dev/urandom of=/test/recvd/file bs=1k count=1024 sync dd if=/dev/urandom of=/test/recvd/file2 bs=1k count=1024 2. should fail with ENOSPC sync zfs unmount test/recvd zfs mount test/recvd zfs userspace test/recvd 3. if bug above, now shows 1m quota dd if=/dev/urandom of=/test/recvd/file3 bs=1k count=1024 4. if bug above, now fails with ENOSPC ``` Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Approved by: Garrett D'Amore <garrett@damore.org> Author: Matthew Ahrens <mahrens@delphix.com> MFC after: 3 weeks	2016-07-14 11:39:36 +00:00
Andriy Gapon	ac3623e090	re-apply r299908: zfsctl_snapdir_lookup: clear VV_ROOT of snapshot's root The change has been undone in r301275 on the assumption that it was no longer required. But that was incorrect, because in this case (and only in this case) the snapshot root vnode is looked up before z_parent is fixed up. MFC after: 5 days	2016-07-13 15:16:51 +00:00
Mark Johnston	ca1ef36cf4	Avoid truncating the return value of DTrace predicates. Predicates are DIF objects whose return value is compared with zero to determine whether the corresponding probe body is to be executed. The return value itself is the contents of a 64-bit DIF register, but it was being truncated to an int before the comparison. This meant that a predicate such as /0x100000000/ would evaluate to false. Reported by: rwatson MFC after: 3 days	2016-07-09 22:41:21 +00:00
Steven Hartland	ae8420ed72	Fix ZFS ARC min / max tunable Due to ARC initial configuration not being done and kmem information not being available we need to blindly set zfs_arc_max and zfs_arc_min when configured via the tunable. This fixes vfs.zfs.arc_(min\|max) configuration via loader.conf broken by r302265. Approved by: re(gjb) MFC after: 1 week	2016-07-06 23:49:19 +00:00
Nathan Whitehorn	96c85efb4b	Replace a number of conflations of mp_ncpus and mp_maxid with either mp_maxid or CPU_FOREACH() as appropriate. This fixes a number of places in the kernel that assumed CPU IDs are dense in [0, mp_ncpus) and would try, for example, to run tasks on CPUs that did not exist or to allocate too few buffers on systems with sparse CPU IDs in which there are holes in the range and mp_maxid > mp_ncpus. Such circumstances generally occur on systems with SMT, but on which SMT is disabled. This patch restores system operation at least on POWER8 systems configured in this way. There are a number of other places in the kernel with potential problems in these situations, but where sparse CPU IDs are not currently known to occur, mostly in the ARM machine-dependent code. These will be fixed in a follow-up commit after the stable/11 branch. PR: kern/210106 Reviewed by: jhb Approved by: re (glebius)	2016-07-06 14:09:49 +00:00
Alexander Motin	e36599916f	Revert r299454 and r299448. Those changes were found confusing FreeBSD libc ACL code, that doesn't differentiate ACL for directories and files, and report ACLs for all directories created after those patches as non-trivial. On the other side these changes were considered wrong from POSIX and NFSv4 points of view. Until further investigation done upstream, revert those changes locally in preparation for FreeBSD 11.0 release. Approved by: re (hrs)	2016-06-30 14:55:49 +00:00
Steven Hartland	f535b2d7d8	Allow ZFS ARC min / max to be tuned at runtime Prior to this change ZFS ARC min / max could only be changed using boot time tunables, this allows the values to be tuned at runtime using the sysctls: * vfs.zfs.arc_max * vfs.zfs.arc_min When adjusting ZFS ARC minimum the memory used will only reduce to the new minimum given memory pressure. Reviewed by: allanjude Approved by: re (gjb) MFC after: 2 weeks Relnotes: yes Sponsored by: Multiplay Differential Revision: https://reviews.freebsd.org/D5907	2016-06-29 07:55:45 +00:00
Andriy Gapon	0f7dcde977	fix deadlock-prone code in getzfsvfs() getzfsvfs() called vfs_busy() in the waiting mode while having a hold on a pool (via a call to dmu_objset_hold). In other words, dp_config_rwlock was held in the shared mode while a thread could be sleeping in vfs_busy(). The pool's txg sync thread needs to take dp_config_rwlock in the exclusive mode for some actions, e.g., for executing sync tasks. If the sync thread gets blocked, then any thread waiting for its sync task to get executed is also blocked. Which, in turn, could mean that vfs_busy() will keep waiting indefinitely. The solution is to use vfs_ref() in the locked section and to call vfs_busy() only after dropping other locks. Note that a reference on a struct mount object does not prevent an associated zfsvfs_t object from being destroyed. So, we have to be careful to operate only on the struct mount object until we successfully vfs_busy it. Approved by: re (gjb) MFC after: 2 weeks	2016-06-23 07:01:54 +00:00
Alan Somers	54edbcfb69	Fix uninitialized variable from r300881 sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c Initialize needs_update in vdev_geom_set_physpath PR: 210409 Reported by: kp Reviewed by: kp Approved by: re (hrs) MFC after: 4 weeks X-MFC-With: 300881 Sponsored by: Spectra Logic Corp	2016-06-21 15:27:16 +00:00
Konstantin Belousov	cacbedfc46	Fix gcc build. Reported andt tested by: swills Sponsored by: The FreeBSD Foundation Approved by: re (gjb)	2016-06-18 20:20:00 +00:00
Konstantin Belousov	4a0d95f810	Use vnlru_free(9) to implement dnlc_reduce_cache(). This apparently puts ARC back under the limits after the vnode pressure rework in r291244, in particular due to the kmem exhaustion. Based on patch by: mckusick Reviewed by: avg, mckusick Tested by: allanjude, madpilot Sponsored by: The FreeBSD Foundation Approved by: re (gjb)	2016-06-17 17:34:28 +00:00
Andriy Gapon	55be2f79e5	l2arc: reset b_tmp_cdata to NULL in the case of unset b_daddr The change is in arc_buf_l2_cdata_free(). Without this we can trip the assertion in arc_hdr_realloc() if INVARIANTS option is enabled. Approved by: re (kib) MFC after: 1 week	2016-06-13 18:39:13 +00:00
Andriy Gapon	aa14503e8e	zfs_vptocnp: check for an invalid znode ... which can arise after the receive or rollback and failed zfs_rezget(). Approved by: re (kib) MFC after: 1 week	2016-06-13 10:53:34 +00:00
Andriy Gapon	a68789426a	zfs: set VROOT / VV_ROOT consistently and in a single place This is a followup to r300131. A filesystem's root vnode can be reached not only through VSF_ROOT, but by other means as well. For example, via a dot-dot lookup. Also, a root vnode can get reclaimed and then re-created. For these reasons it was insufficient to clear VV_ROOT flag from a root vnode of a snapshot mounted under .zfs in zfsctl_snapdir_lookup(). So, now we set the flag in zfs_znode_sa_init() only if a vnode represent a root of a filesystem or a standalone snapshot. That is, the flag is not set for snapshots mounted under .zfs. MFC after: 2 weeks	2016-06-03 14:37:18 +00:00
Andriy Gapon	d1cf30f4f1	zfs_root: fix a potential root vnode reference leak It could happen in an unlikely case that we fail to lock the root vnode with requested flags (which appear to never include LK_NOWAIT). MFC after: 1 week	2016-06-03 14:22:12 +00:00
Alan Somers	b1a6b8dcd2	Improve the English in a comment sys/cddl/contrib/opensolaris/uts/common/sys/acl.h: Improve the english in a comment. No functional changes Submitted by: gibbs MFC after: 4 weeks Sponsored by: Spectra Logic Corp	2016-06-01 22:21:42 +00:00
Andrew Turner	1cb290d2f2	Set oldfp so the check for fp == oldfp works as expected. Obtained from: ABT Systems Ltd Sponsored by: The FreeBSD Foundation	2016-05-31 11:32:09 +00:00
Allan Jude	0144ad3e78	Connect the SHA-512t256 and Skein hashing algorithms to ZFS Support for the new hashing algorithms in ZFS was introduced in r289422 However it was disconnected because FreeBSD lacked implementations of SHA-512 (truncated to 256 bits), and Skein. These implementations were introduced in r300921 and r300966 respectively This commit connects them to ZFS and enabled these new checksum algorithms This new algorithms are not supported by the boot blocks, so do not use them on your root dataset if you boot from ZFS. Relnotes: yes Sponsored by: ScaleEngine Inc.	2016-05-31 04:12:14 +00:00
Bryan Drewery	fdd9048a63	Avoid more literal-suffix errors with C++11	2016-05-29 00:40:29 +00:00
Alan Somers	7a0c41d5d7	zfsd(8), the ZFS fault management daemon Add zfsd, which deals with hard drive faults in ZFS pools. It manages hotspares and replements in drive slots that publish physical paths. cddl/usr.sbin/zfsd Add zfsd(8) and its unit tests cddl/usr.sbin/Makefile Add zfsd to the build lib/libdevdctl A C++ library that helps devd clients process events lib/Makefile share/mk/bsd.libnames.mk share/mk/src.libnames.mk Add libdevdctl to the build. It's a private library, unusable by out-of-tree software. etc/defaults/rc.conf By default, set zfsd_enable to NO etc/mtree/BSD.include.dist Add a directory for libdevdctl's include files etc/mtree/BSD.tests.dist Add a directory for zfsd's unit tests etc/mtree/BSD.var.dist Add /var/db/zfsd/cases, where zfsd stores case files while it's shut down. etc/rc.d/Makefile etc/rc.d/zfsd Add zfsd's rc script sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c Fix the resource.fs.zfs.statechange message. It had a number of problems: It was only being emitted on a transition to the HEALTHY state. That made it impossible for zfsd to take actions based on drives getting sicker. It compared the new state to vdev_prevstate, which is the state that the vdev had the last time it was opened. That doesn't make sense, because a vdev can change state multiple times without being reopened. vdev_set_state contains logic that will change the device's new state based on various conditions. However, the statechange event was being posted _before_ that logic took effect. Now it's being posted after. Submitted by: gibbs, asomers, mav, allanjude Reviewed by: mav, delphij Relnotes: yes Sponsored by: Spectra Logic Corp, iX Systems Differential Revision: https://reviews.freebsd.org/D6564	2016-05-28 17:43:40 +00:00
Enji Cooper	dfdbdb0c82	Fix up r300870 The sys/types.h fix I proposed was only tested with zfs(4), not with libzpool, which is where the build failure actually existed Remove vm/vm_pageout.h from arc.c and zfs_vnops.c because they're both unneeded MFC after: 1 week X-MFC with: r300865, r300870 In collaboration with: kib Submitted by: alc Sponsored by: EMC / Isilon Storage Division	2016-05-27 22:56:00 +00:00
Alan Somers	151746b244	Avoid issuing spa config updates for physical path when not necessary ZFS's configuration needs to be updated whenever the physical path for a device changes, but not when a new device is introduced. This is because new devices necessarily cause config updates, but only if they are actually accepted into the pool. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c Split vdev_geom_set_physpath out of vdev_geom_attrchanged. When setting the vdev's physical path, only request a config update if the physical path has changed. Don't request it when opening a device for the first time, because the config sync will happen anyway upstack. sys/geom/geom_dev.c Split g_dev_set_physpath and g_dev_set_media out of g_dev_attrchanged Submitted by: will, asomers MFC after: 4 weeks Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D6428	2016-05-27 22:32:44 +00:00
Enji Cooper	765daefd68	Unbreak the zfs(4) build vm/vm_pageout.h grew a dependency on the bool typedef in r300865 arc.c didn't include sys/types.h, which included the definition for the typedef Other items (ofed, drm2) might need to be chased for this commit. X-MFC with: r300865 MFC after: 1 week Pointyhat to: alc Sponsored by: EMC / Isilon Storage Division	2016-05-27 20:33:38 +00:00
Ruslan Bukin	bdca9b1d52	Correct the implementation of dtrace_interrupt_disable/enable. Pointed out by: andrew Sponsored by: DARPA, AFRL Sponsored by: HEIF5	2016-05-27 17:58:10 +00:00
Andrew Turner	12aac6b55b	Fix dtrace_interrupt_disable and dtrace_interrupt_enable by having the former return the current status for the latter to use. Without this we could enable interrupts when they shouldn't be. It's still not quite right as it should only update the bits we care about, bit should be good enough until the correct fix can be tested. PR: 204270 Obtained from: ABT Systems Ltd Sponsored by: The FreeBSD Foundation	2016-05-27 12:02:12 +00:00
Bryan Drewery	74c5734766	Use netinet/in.h to avoid include/arpa dependency for DIRDEPS_BUILD. Sponsored by: EMC / Isilon Storage Division	2016-05-26 23:20:17 +00:00
Bjoern A. Zeeb	9c759b587f	Try to unbreak the build after r300611 by including the header defining VM_MIN_KERNEL_ADDRESS. Sponsored by: DARPA/AFRL	2016-05-24 17:38:27 +00:00
Ruslan Bukin	fed1ca4b71	Add initial DTrace support for RISC-V. Sponsored by: DARPA, AFRL Sponsored by: HEIF5	2016-05-24 16:41:37 +00:00
Andrew Turner	0d0da76911	Mark all memory before the kernel as toxic to DTrace. Obtained from: ABT Systems Ltd Sponsored by: The FreeBSD Foundation	2016-05-24 13:57:23 +00:00
Andriy Gapon	fabe7e4ecc	add vop_print methods to vnode operatios of various zfsctl node types This should help with diagnostics of zfsctl problems. MFC after: 2 weeks	2016-05-18 13:21:29 +00:00
Andriy Gapon	e34c8d727b	move zfsctl_freebsd_root_lookup right next to zfsctl_root_lookup That makes it easier to reason about the code. MFC after: 5 weeks	2016-05-18 08:29:39 +00:00
Andriy Gapon	a4bbed22d2	zfsctl_common_fid: remove redundant assignment "Reinterpret cast" to zfid_short_t and assignment of zf_len do the job already. MFC after: 1 week	2016-05-18 08:26:09 +00:00
Andriy Gapon	e6d4eefe2a	zfsctl: tighten an assertion and remove an unused definition There are only two entries under .zfs and 'shares' has an ID of a special persistent object in its filesystem. MFC after: 1 week	2016-05-18 08:23:39 +00:00
Andriy Gapon	439e9b6804	zfs_root: no need to set the root flag here That was both redundant as zfs_znode_sa_init() already does the job and insufficient as the root vnode can be reached via other means. MFC after: 1 weeks	2016-05-18 08:19:41 +00:00
Andriy Gapon	74a3df2b1f	zfsctl_freebsd_root_lookup: gfs_vop_lookup may return a doomed vnode gfs code is (almsot) completely agnostic of FreeBSD VFS locking, so it does not handle doomed but not yet dead vnodes and may return them. Check for those vnodes here and retry a lookup. Note that ZFS and gfs have additional protections that ensure that a parent vnode of the current vnode is never doomed. The fixed problem is an occasional failure to lookup a 'snapshot' or 'shares' directories under .zfs. Note that for the above reason all uses of zfsctl_root_lookup() are better be replaced with VOP_LOOKUP. MFC after: 5 weeks	2016-05-18 08:02:49 +00:00
Alan Somers	5f7b3969e9	Speed up vdev_geom_open_by_guids Speedup is hard to measure because the only time vdev_geom_open_by_guids gets called on many drives at the same time is during boot. But with vdev_geom_open hacked to always call vdev_geom_open_by_guids, operations like "zpool create" speed up by 65%. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c * Read all of a vdev's labels in parallel instead of sequentially. * In vdev_geom_read_config, don't read the entire label, including the uberblock. That's a waste of RAM. Just read the vdev config nvlist. Reduces the IO and RAM involved with tasting from 1MB to 448KB. Reviewed by: avg MFC after: 4 weeks Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D6153	2016-05-17 15:17:23 +00:00
Andriy Gapon	857a214d03	zfs_ioc_rename: fix a reversed condition FreeBSD zfs_ioc_rename() has an option, not present upstream, that allows to rename snapshots without unmounting them first. I am not sure what is a rationale for that option, but its actual behavior was the opposite of the intended behavior. That is, by default the snapshots were not unmounted. The option was introduced as part of a large update from upstream in r248498. One of the consequences was a havoc under .zfs/snapshot after the rename. The snapshots got new names but were mounted on top of directories with old names, so readdir would list the new names, but lookup would still find the old mounts. PR: 209093 Reported by: Frédéric VANNIÈRE <f.vanniere@planet-work.com> MFC after: 5 days	2016-05-17 07:56:05 +00:00
Andriy Gapon	afe674f089	do not destroy 'snapdir' when it becomes inactive That was just wrong. In fact, we can safely keep this static entry when it's inactive. Now the destructive action is moved to the reclaim method and the function is renamed from zfsctl_snapdir_inactive(0 to zfsctl_snapdir_reclaim(). Also, we can use gfs_vop_reclaim() instead of gfs_dir_inactive() + kmem_free(). Lastly, we can just assert that the node does not any children when it is reclaimed, even on the force unmount. That's because zfs_umount() does an extra vflush() pass which should destroy all snapshot-mountpoint vnodes that are the snapdir's children. MFC after: 5 weeks	2016-05-16 15:48:56 +00:00
Andriy Gapon	9c3e205296	try to recycle "snap" vnodes as soon as possible Those vnodes should not linger. "Stale" nodes may get out of synchronization with actual snapshots. For example if we destroy a snapshot and create a new one with the same name. Or when we rename a snapshot. While there fix the argument type for zfsctl_snapshot_reclaim(). Also, its original argument can be passed to gfs_vop_reclaim() directly. Bug 209093 could be related although I have not specifically verified that. Referencing just in case. PR: 209093 MFC after: 5 weeks	2016-05-16 15:37:41 +00:00
Andriy Gapon	0ab1aa90fa	fix locking in zfsctl_root_lookup Dropping the root vnode's lock after VFS_ROOT() didn't really help the fact that we acquired the lock while holding its child's, .zfs, lock while performing the operaiton. So, directly use zfs_zget() to get the root vnode. While there simplify the code in zfsctl_freebsd_root_lookup. We know that .zfs is always exclusively locked. We know that there is already a reference on *vpp, so no need for an extra one. Account for the fact that .. lookup may ask for a different lock type, not necessarily LK_EXCLUSIVE. And handle a possible failure to acquire the lock given the lock flags. MFC after: 5 weeks	2016-05-16 15:28:39 +00:00
Andriy Gapon	705e6b8170	gfs_lookup_dot() does not have to acquire any locks In fact, that was dangerous. For example, zfsctl_snapshot_reclaim() calls gfs_dir_lookup() on ".." path and that ends up calling gfs_lookup_dot() which violated locking order by acquiring the parent's directory vnode lock after the child's vnode lock. Also, the previous behavior was inconsistent as gfs_dir_lookup() returned a locked vnode for . and .. lookups, but not for any other. Now gfs_lookup_dot() just references a resulting vnode and the locking is done in its consumers, where necessary. Note that we do not enable shared locking support for any gfs / zfsctl vnodes. This commit partially reverts r273641. MFC after: 5 weeks	2016-05-16 15:13:16 +00:00
Andriy Gapon	7223645bd1	avoid deadlock between zfsctl_snapdir_lookup and zfsctl_snapshot_reclaim The former acquired a snap vnode lock while holding sd_lock while the latter does the opposite. The solution is drop sd_lock before acquiring the vnode lock. That should be okay as we are still holding a lock on the 'snapshot' directory in the exclusive mode. That lock ensures that there are no concurrent lookups in the directory and thus no concurrent mount attempts. But now we have to account for the possibility that the snap vnode might get reclaim after we drop sd_lock and before we can get the node lock. So, check for that case and retry. MFC after: 5 weeks	2016-05-16 15:03:52 +00:00
Andriy Gapon	c6cd01d924	fix a vnode reference leak caused by illumos compat traverse() This commit partially reverts r273641 which introduced the leak. It did so to accomodate for some consumers of traverse() that expected the starting vnode to stay as-is. But that introduced the leak in the case when a mounted filesystem was found and its root vnode was returned. r299914 removed the troublesome consumers and now there is no reason to keep the starting vnode. So, now the new rules are: - if there is no mounted filesystem, then nothing is changed - otherwise the starting vnode is always released - the root vnode of the mounted filesystem is returned locked and referenced in the case of success MFC after: 5 weeks X-MFC after: r299914	2016-05-16 12:15:19 +00:00
Andriy Gapon	20ec8b0f9b	fix up r299902: mount_snapshot requires that the covered vnode is locked Previously that was not strictly enforced. MFC after: 4 weeks X-MFC with: r299902	2016-05-16 11:48:43 +00:00
Andriy Gapon	cf7aa80bbd	zfsctl_ops_snapshot: remove methods should never be called We pretend that snapshots mounted under .zfs are part of the original filesystem and we try very hard to hide vnodes on top of which the snapshots are mounted. Given that I believe that the removed operations should never be called. They might have been called previously because of issues fixed in r299906, r299908 and r299913. MFC after: 5 weeks	2016-05-16 07:24:30 +00:00
Andriy Gapon	cb68fd3513	zfsctl_snapdir_lookup: always clear VV_ROOT flag of snapshot's root VV_ROOT Previosuly we did that only if the snapshot was mounted earlier, its root vnode got recycled and then we accessed it again. We never cleared the flag for a freshly mounted snapshot. That was very inconsistent and probably a source of some bugs. Or maybe that painted over some bugs which might get revealed now. We should consistently clear the flag because we try very hard to pretend that snapshots auto-mounted under .zfs are part of their original filesystem. In other words, we try to hide the fact that they are different filesystems / mountpoints. MFC after: 5 weeks	2016-05-16 06:49:09 +00:00
Andriy Gapon	4df590b5b6	add zfs_vptocnp with special handling for snapshots under .zfs The logic is similar to that already present in zfs_dirlook() to handle a dot-dot lookup on a root vnode of a snapshot mounted under .zfs/snapshot/. illumos does not have an equivalent of vop_vptocnp, so there only the lookup had to be patched up. MFC after: 4 weeks	2016-05-16 06:40:51 +00:00
Andriy Gapon	a03fb1cf6a	mount_snapshot: consolidate all error handling This makes sure that the original vnode is always unlocked and released if any error happens. MFC after: 4 weeks	2016-05-16 06:30:25 +00:00
Andriy Gapon	3055925d42	zfsctl: fix several problems with reference counts * Remove excessive references on a snapshot mountpoint vnode. zfsctl_snapdir_lookup() called VN_HOLD() on a vnode returned from zfsctl_snapshot_mknode() and the latter also had a call to VN_HOLD() on the same vnode. On top of that gfs_dir_create() already returns the vnode with the use count of 1 (set in getnewvnode). So there was 3 references on the vnode. * mount_snapshot() should keep a reference to a covered vnode. That reference is owned by the mountpoint (mounted snapshot filesystem). * Remove cryptic manipulations of a covered vnode in zfs_umount(). FreeBSD dounmount() already does the right thing and releases the covered vnode. PR: 207464 Reported by: dustinwenz@ebureau.com Tested by: Howard Powell <hpowell@lighthouseinstruments.com> MFC after: 3 weeks	2016-05-16 06:24:04 +00:00
John Baldwin	fdce57a042	Add an EARLY_AP_STARTUP option to start APs earlier during boot. Currently, Application Processors (non-boot CPUs) are started by MD code at SI_SUB_CPU, but they are kept waiting in a "pen" until SI_SUB_SMP at which point they are released to run kernel threads. SI_SUB_SMP is one of the last SYSINIT levels, so APs don't enter the scheduler and start running threads until fairly late in the boot. This change moves SI_SUB_SMP up to just before software interrupt threads are created allowing the APs to start executing kernel threads much sooner (before any devices are probed). This allows several initialization routines that need to perform initialization on all CPUs to now perform that initialization in one step rather than having to defer the AP initialization to a second SYSINIT run at SI_SUB_SMP. It also permits all CPUs to be available for handling interrupts before any devices are probed. This last feature fixes a problem on with interrupt vector exhaustion. Specifically, in the old model all device interrupts were routed onto the boot CPU during boot. Later after the APs were released at SI_SUB_SMP, interrupts were redistributed across all CPUs. However, several drivers for multiqueue hardware allocate N interrupts per CPU in the system. In a system with many CPUs, just a few drivers doing this could exhaust the available pool of interrupt vectors on the boot CPU as each driver was allocating N * mp_ncpu vectors on the boot CPU. Now, drivers will allocate interrupts on their desired CPUs during boot meaning that only N interrupts are allocated from the boot CPU instead of N * mp_ncpu. Some other bits of code can also be simplified as smp_started is now true much earlier and will now always be true for these bits of code. This removes the need to treat the single-CPU boot environment as a special case. As a transition aid, the new behavior is available under a new kernel option (EARLY_AP_STARTUP). This will allow the option to be turned off if need be during initial testing. I plan to enable this on x86 by default in a followup commit in the next few days and to have all platforms moved over before 11.0. Once the transition is complete, the option will be removed along with the !EARLY_AP_STARTUP code. These changes have only been tested on x86. Other platform maintainers are encouraged to port their architectures over as well. The main things to check for are any uses of smp_started in MD code that can be simplified and SI_SUB_SMP SYSINITs in MD code that can be removed in the EARLY_AP_STARTUP case (e.g. the interrupt shuffling). PR: kern/199321 Reviewed by: markj, gnn, kib Sponsored by: Netflix	2016-05-14 18:22:52 +00:00
Enji Cooper	622282b3b8	Include arpa/inet.h to get the htonl(3) definition MFC after: 2 weeks Reported by: clang Sponsored by: EMC / Isilon Storage Division	2016-05-13 11:15:33 +00:00
Conrad Meyer	3b56262303	compat/opensolaris: Don't redefined off64_t if already defined A follow-up to r299456. Reported by: gjb Sponsored by: EMC / Isilon Storage Division	2016-05-11 16:05:32 +00:00
Alexander Motin	c59a902fa3	MFV r299453: 6765 zfs_zaccess_delete() comments do not accurately reflect delete permissions for ACLs Reviewed by: Gordon Ross <gwr@nexenta.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Author: Kevin Crowe <kevin.crowe@nexenta.com> openzfs/openzfs@a40149b935	2016-05-11 13:53:29 +00:00
Alexander Motin	0eb65a5367	MFV r299451: 6764 zfs issues with inheritance flags during chmod(2) with aclmode=passthrough Reviewed by: Gordon Ross <gwr@nexenta.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Author: Albert Lee <trisk@nexenta.com> openzfs/openzfs@1bcf0d240b	2016-05-11 13:50:34 +00:00
Alexander Motin	85a69dbf66	MFV r299449: 6763 aclinherit=restricted masks inherited permissions by group perms (groupmask) Reviewed by: Gordon Ross <gwr@nexenta.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Author: Albert Lee <trisk@nexenta.com> openzfs/openzfs@eebb483d0c	2016-05-11 13:48:15 +00:00
Alexander Motin	2a219f349e	MFV r299442: 6762 POSIX write should imply DELETE_CHILD on directories - and some additional considerations Reviewed by: Gordon Ross <gwr@nexenta.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Author: Kevin Crowe <kevin.crowe@nexenta.com> openzfs/openzfs@d316fffc9c	2016-05-11 13:43:20 +00:00
Alexander Motin	42a54f9745	MFV r299440: 6736 ZFS per-vdev ZAPs Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Don Brady <don.brady@intel.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Joe Stein <joe.stein@delphix.com> openzfs/openzfs@215198a6ad	2016-05-11 12:54:00 +00:00
Alexander Motin	7d54dbae83	MFV r299438: 6842 Fix empty xattr dir causing lockup Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Chunwei Chen <tuxoko@gmail.com> openzfs/openzfs@02525cd08f	2016-05-11 12:46:07 +00:00
Alexander Motin	d7ff478705	MFV r299436: 6843 Make xattr dir truncate and remove in one tx Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Chunwei Chen <tuxoko@gmail.com> openzfs/openzfs@399cc7d5d9	2016-05-11 12:43:54 +00:00
Alexander Motin	0b99ac761e	MFV r299434: 6841 Undirty freed spill blocks Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Tim Chase <tim@chase2k.com> openzfs/openzfs@445e67805d	2016-05-11 12:38:07 +00:00
Ruslan Bukin	d7dc6bae03	Implement FBT provider (MD part) for DTrace on MIPS. Tested on MIPS64. Sponsored by: DARPA, AFRL Sponsored by: HEIF5	2016-05-05 13:54:50 +00:00
Alan Somers	c9a807447d	Fix a use-after-free when "zpool import" fails clear vd->vdev_tsd in vdev_geom_close_locked instead of vdev_geom_detach. In the latter function, it would fail to happen in certain circumstances where cp->private was unset. Ideally, the latter should never happen, but it can happen when vdev open fails, or where spares are involved. MFC after: 4 weeks X-MFC-With: 298786 Sponsored by: Spectra Logic Corp	2016-04-29 21:29:37 +00:00
Andriy Gapon	27b6c49726	add invpcid instruction to i386 dtrace disassembler tables MFC after: 2 weeks	2016-04-29 15:45:22 +00:00
Alan Somers	663f649ff6	Refactor vdev_geom_attach and friends to reduce code duplication sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c Move checks for provider's sectorsize and mediasize into a single location in vdev_geom_attach. Remove the zfs::vdev::taste class; it's ok to use the regular vdev class for tasting. Consolidate guid checks into a single location in vdev_attach_ok. Consolidate some error handling code from vdev_geom_attach into vdev_geom_detach, closing a resource leak of geom consumers in the process. Reviewed by: avg MFC after: 4 weeks Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D5974	2016-04-29 15:23:51 +00:00
Mark Johnston	676a03fa6a	Increase DTRACE_FUNCNAMELEN from 128 to 192. This allows for the long function components encountered in www/firefox. This constant is part of DTrace's userland ABI, so this change may not be MFC'ed. PR: 207735	2016-04-25 18:44:11 +00:00
Mark Johnston	328d8adb9b	Allow DOF sections with excessively long probe function components. Without this change, DTrace will refuse to load a DOF section if the function component of any of its probes exceeds DTRACE_FUNCNAMELEN (128). Probes in C++ programs can have very long function components. Rather than rejecting all probes if a single probe exceeds the limit, simply skip the invalid probe and emit a warning. This ensures that valid probes are instantiated. PR: 207735 MFC after: 2 weeks	2016-04-25 18:40:57 +00:00
Mark Johnston	cd8bbc382d	Add a kern.dtrace.err_verbose sysctl to control dtrace_err_verbose. When this flag is turned on, DOF and DIF validation errors are printed to the kernel message buffer. This is useful for debugging. Also remove the debug.dtrace.debug sysctl, which has no effect.	2016-04-25 18:09:36 +00:00
Andriy Gapon	2d69831b85	lahf/sahf are supported on some amd64 processors While the instructions were not included into the original instruction set, their support can be indicated by a special feature bit. For example: CPU: AMD Phenom(tm) II X4 955 Processor (3214.71-MHz K8-class CPU) ... AMD Features2=0x37ff<LAHF, ...> Clang 3.8 uses lahf/sahf as a faster alternative to pushf/popf where possible. MFC after: 2 weeks	2016-04-22 13:44:12 +00:00
Andriy Gapon	dbbcddb426	MFV r298471: 6052 decouple lzc_create() from the implementation details illumos/illumos-gate@26455f9efc `26455f9efc` https://www.illumos.org/issues/6052 At the moment type parameter of lzc_create() is of dmu_objset_type_t type. That exposes an implementation detail and requires sys/fs/zfs.h to be included in libzfs_core.h creating unnecessary coupling between libzfs_core interface and ZFS internals. I think that dmu_objset_type_t should be replaced with a libzfs_core enumeration of supported dataset types. For ABI reasons the new enumeration could be bit-compatible with dmu_objset_type_t. For example: typedef enum { LZC_DST_ZFS = 2, LZC_DST_ZVOL } lzc_dataset_type_t; Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Andriy Gapon <andriy.gapon@clusterhq.com> MFC after: 2 weeks Sponsored by: ClusterHQ	2016-04-22 13:00:27 +00:00
Mark Johnston	6c2806594b	Make the second argument of dtrace_invop() a trapframe pointer. Currently this argument is a pointer into the stack which is used by FBT to fetch the first five probe arguments. On all non-x86 architectures it's simply the trapframe address, so this change has no functional impact. On amd64 it's a pointer into the trapframe such that stack[1 .. 5] gives the first five argument registers, which are deliberately grouped together in the amd64 trapframe definition. A trapframe argument simplifies the invop handlers on !x86 and makes the x86 FBT invop handler easier to understand. Moreover, it allows for invop handlers that may want to modify the register set of the interrupted thread.	2016-04-17 23:08:47 +00:00
Andriy Gapon	e01dd79f9a	zfs_rezget: z_vnode can not be NULL if zp is valid MFC after: 3 weeks	2016-04-16 07:41:56 +00:00
Andriy Gapon	c2d36fc5cd	zfs: enable vn_io_fault support Note that now we have to account for possible partial writes in dmu_write_uio_dbuf(). It seems that on illumos either all or none of the data are expected to be written. But the partial writes are quite expected when vn_io_fault support is enabled. Reviewed by: kib MFC after: 7 weeks Differential Revision: https://reviews.freebsd.org/D2790	2016-04-16 07:35:53 +00:00
Alan Somers	739f4ae3b1	Don't corrupt ZFS label's physpath attribute when booting while a disk is missing Prior to this change, vdev_geom_open_by_path would call vdev_geom_attach prior to verifying the device's GUIDs. vdev_geom_attach calls vdev_geom_attrchange to set the physpath in the vdev object. The result is that if the disk could not be found, then the labels for other disks in the same TLD would overwrite the missing disk's physpath with the physpath of whichever disk currently has the same devname as the missing one used to have. MFC after: 4 weeks Sponsored by: Spectra Logic Corp	2016-04-15 16:36:17 +00:00
Alan Somers	c29088b5c7	Add more debugging statements in vdev_geom.c Log a debugging message whenever geom functions fail in vdev_geom_attach. Printing these messages is controlled by vfs.zfs.debug MFC after: 4 weeks Sponsored by: Spectra Logic Corp	2016-04-14 23:14:41 +00:00
Alan Somers	f0ac053088	Update a debugging message in vdev_geom_open_by_guids for consistency with similar messages elsewhere in the file. MFC after: 4 weeks Sponsored by: Spectra Logic Corp	2016-04-14 19:20:31 +00:00
Alan Somers	4e3ab010a2	Fix rare double free in vdev_geom_attrchanged sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c Don't drop the g_topology_lock before freeing old_physpath. That opens up a race where one thread can call vdev_geom_attrchanged, set old_physpath, drop the g_topology_lock, then block trying to acquire the SCL_STATE lock. Then another thread can come into vdev_geom_attrchanged, set old_physpath to the same value, and proceed to free it. When the first thread resumes, it will free the same location. It turns out that the SCL_STATE lock isn't needed. It was originally added by gibbs to protect vd->vdev_physpath while updating the same. However, the update process subsequently was switched to an atomic operation (a pointer swap). Now, there is no need for the SCL_STATE lock, and hence no need to drop the g_topology_lock. Reviewed by: delphij MFC after: 4 weeks Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D5413	2016-04-12 19:11:14 +00:00
Andriy Gapon	c3249989ef	l2arc: make sure that all writes honor ashift of a cache device Previously uncompressed buffers did not obey that rule. Type of b_asize is changed to uint64_t for consistency, given that this is a zeta-byte filesystem. l2arc_compress_buf is renamed to l2arc_transform_buf to better reflect its new utility. Now not only we ensure that a compressed buffer has a size aligned to ashift, but we also allocate a properly sized temporary buffer if the original buffer is not compressed and it has an odd size. This ensures that all I/O to the cache device is always ashift-aligned, in terms of both a request offset and a request size. If the aligned data is larger than the original data, then we have to use a temporary buffer when reading it as well. Also, enhance physical zio alignment checks using vdev_logical_ashift. On FreeBSD we have this information, so we can make stricter assertions. Reviewed by: smh, mav MFC after: 1 month Sponsored by: ClusterHQ Differential Revision: https://reviews.freebsd.org/D2789	2016-04-12 06:56:35 +00:00
Andriy Gapon	6a50036052	Revert r297396 Modify "4958 zdb trips assert on pools with ashift >= 0xe" A better fix is following.	2016-04-12 06:54:18 +00:00
Alexander Motin	78aec5c610	MFV r297831: 6322 ZFS indirect block predictive prefetch Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Author: Alexander Motin <mav@FreeBSD.org> Improve speculative prefetch of indirect blocks. Scalability of many operations on wide ZFS pool can be limited by requirement to prefetch indirect blocks first. Recently added asynchronous indirect block read partially helped, but did not solve the problem completely. This patch extends existing prefetcher functionality to explicitly work with indirect blocks. Before this change prefetcher issued reads for up to 8MB of data in advance. With this change it also issues indirect block reads for up to 64MB of data in advance, so that when it will be time to actually read those data, it can be done immediately. Alike effect can be achieved by just increasing maximal data prefetch distance, but at higher memory cost. Also this change introduces indirect block prefetch for rewrite operations, that was never done before. Previously ARC miss for Indirect blocks regularly blocked rewrites, converting perfectly aligned asynchronous operations into synchronous read-write pairs, significantly reducing maximal rewrite speed. While being there this issue was also fixed: - prefetch was done always, even if caching for the dataset was completely disabled. Testing on FreeBSD with zvol on top of 6x striped 2x mirrored pool of 12 assorted HDDs shown me such performance numbers: ------- BEFORE -------- Write 491363677 bytes/sec Read 312430631 bytes/sec Rewrite 97680464 bytes/sec -------- AFTER -------- Write 493524146 bytes/sec Read 438598079 bytes/sec Rewrite 277506044 bytes/sec Closes #65 Closes #80 openzfs/openzfs@792fd28ac0	2016-04-11 21:09:15 +00:00
Steven Hartland	2dcee04b3a	Only include sysctl in kernel build Only include sysctl in kernel builds fixing warning about implicit declaration of function 'sysctl_handle_int'. PR: 204140 MFC after: 1 week X-MFC-With: r297813 Sponsored by: Multiplay	2016-04-11 13:17:11 +00:00
Steven Hartland	7bc47b4ea3	Only include sysctl in kernel build Only include sysctl in kernel builds fixing warning about implicit declaration of function 'sysctl_handle_int'. Sponsored by: Multiplay	2016-04-11 08:57:54 +00:00
Andriy Gapon	1da2e1e353	zio: align use of "no dump" flag between use_uma and !use_uma cases At the moment no ZFS buffers are included into a crash dump unless ZFS_DEBUG (or INVARIANTS) kernel option is enabled. That's not very helpful for debugging of ZFS problems, because important information often resides in metadata buffers. This change switches the dumping behavior when UMA is used from the illumos behavior to a more useful behavior that we have on FreeBSD when ZFS buffers are allocated via malloc. Reviewed by: smh, mav MFC after: 3 weeks Differential Revision: https://reviews.freebsd.org/D5892	2016-04-11 07:11:20 +00:00
Mark Johnston	b529028676	Implement support for boot-time DTrace. This allows one to enable DTrace probes relatively early during boot, during SI_SUB_DTRACE_ANON, before dtrace(1) can invoked. The desired enabling is created using dtrace -A, which writes a /boot/dtrace.dof file and uses nextboot(8) to ensure that DTrace kernel modules are loaded and that the DOF file describing the enabling is loaded by loader(8) during the subsequent boot. The trace output can then be fetched with dtrace -a. With this commit, boot-time DTrace is only functional on i386 and amd64: on other architectures, the high-resolution timer frequency is initialized during SI_SUB_CLOCKS and is thus not available when the anonymous tracing state is initialized. On x86, the TSC is used and is thus available earlier. MFC after: 1 month Relnotes: yes	2016-04-10 01:25:48 +00:00
Mark Johnston	33b454938a	Initialize SDT probes during SI_SUB_DTRACE_PROVIDER. This is consistent with all other DTrace providers and ensures that SDT probes are available for boot-time tracing. MFC after: 2 weeks	2016-04-10 01:24:27 +00:00
Mark Johnston	e1e33ff912	Initialize DTrace hrtimer frequency during SI_SUB_CPU on i386 and amd64. This allows the hrtimer to be used earlier during boot. This is required for boot-time DTrace: anonymous enablings are created during SI_SUB_DTRACE_ANON, which runs before APs are started. In particular, the DTrace deadman timer requires that the hrtimer be functional. MFC after: 2 weeks	2016-04-10 01:23:39 +00:00
Alexander Motin	eaee150e3f	MFV r297760: 6418 zpool should have a label clearing command Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Author: Will Andrews <will@firepipe.net> Closes #83 Closes #32 openzfs/openzfs@9663688425 FreeBSD already had `zpool labelclear` functionality, so this is mostly just a diff reduction. MFC after: 1 month	2016-04-09 20:30:50 +00:00
Andriy Gapon	c8ff459286	zio write issue threads should have lower (numerically greater) priority This is because they might do data compression which is quite CPU expensive. The original code is correct for illumos, because there a higher priority corresponds to a greater number. MFC after: 2 weeks	2016-04-08 11:58:24 +00:00
Alexander Motin	309b1c7ade	Alike to r293708 relax pool check in vdev_geom_open_by_path(). This made impossible spare disk open by known path, which kind of worked only because the same fix was applied to vdev_geom_attach_by_guids() in r293708. MFC after: 1 week	2016-04-07 12:54:44 +00:00
Edward Tomasz Napierala	ae34b6ff96	Add four new RCTL resources - readbps, readiops, writebps and writeiops, for limiting disk (actually filesystem) IO. Note that in some cases these limits are not quite precise. It's ok, as long as it's within some reasonable bounds. Testing - and review of the code, in particular the VFS and VM parts - is very welcome. MFC after: 1 month Relnotes: yes Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D5080	2016-04-07 04:23:25 +00:00
Wojciech Macek	1c7c13aa0e	Implement dtrace_getupcstack in ARM64 Allow using DTRACE for performance analysis of userspace applications - the function call stack can be captured. This is almost an exact copy of AMD64 solution. Obtained from: Semihalf Sponsored by: Cavium Reviewed by: emaste, gnn, jhibbits Differential Revision: https://reviews.freebsd.org/D5779	2016-04-06 05:13:36 +00:00
Andriy Gapon	e881d8757c	remove emulation of VFS_HOLD and VFS_RELE from opensolaris compat On FreeBSD VFS_HOLD/VN_RELE were mapped to MNT_REF/MNT_REL that manipulate mnt_ref. But the job of properly maintaining the reference count is already automatically performed by insmntque(9) and delmntque(9). So, in effect all ZFS vnodes referenced the corresponding mountpoint twice. That was completely harmless, but we want to be very explicit about what FreeBSD VFS APIs are used, because illumos VFS_HOLD and FreeBSD MNT_REF provide quite different guarantees with respect to the held vfs_t / mountpoint. On illumos VFS_HOLD is sufficient to guarantee that vfs_t.vfs_data stays valid. On the other hand, on FreeBSD MNT_REF does not provide the same guarantee about mnt_data. We have to use vfs_busy() to get that guarantee. Thus, the calls to VFS_HOLD/VFS_RELE on vnode init and fini are removed. VFS_HOLD calls are replaced with vfs_busy in the ioctl handlers. And because vfs_busy has a richer interface that can not be dumbed down in all cases it's better to explicitly use it rather than trying to mask it behind VFS_HOLD. This change fixes a panic that could result from a race between zfs_umount() and zfs_ioc_rollback(). We observed a case where zfsvfs_free() tried to destroy data that zfsvfs_teardown() was still using. That happened because there was nothing to prevent unmounting of a ZFS filesystem that was in between zfs_suspend_fs() and zfs_resume_fs(). Reviewed by: kib, smh MFC after: 3 weeks Sponsored by: ClusterHQ Differential Revision: https://reviews.freebsd.org/D2794	2016-04-02 16:25:46 +00:00
Alexander Motin	baf8ceac9e	MFV r297506: 6738 zfs send stream padding needs documentation Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Eli Rosenthal <eli.rosenthal@delphix.com> illumos/illumos-gate@c20404ff77	2016-04-02 08:36:24 +00:00
Alexander Motin	4cf6fde5e8	MFV r297504: 6681 zfs list burning lots of time in dodefault() via dsl_prop_* Reviewed by: Patrick Mooney <patrick.mooney@joyent.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Matthew Ahrens <mahrens@delphix.com> Author: Alex Wilson <alex.wilson@joyent.com> illumos/illumos-gate@d09e4475f6	2016-04-02 08:28:46 +00:00
Gleb Smirnoff	a8b2b39cce	Fix an error in r292373. Use proper count to update "pages in" counter. Noticed by: pfg via Coverity	2016-03-31 21:15:00 +00:00
Alexander Motin	30a0e024ee	Plug open count leak on zvol rename. MFC after: 2 weeks	2016-03-30 16:54:18 +00:00
Alexander Motin	b39dea9308	Switch from using make_dev_p() to make_dev_s() to close races.	2016-03-30 16:48:57 +00:00
Alexander Motin	86b0daa373	Modify "4958 zdb trips assert on pools with ashift >= 0xe". Unlike Illumos FreeBSD has concept of logical ashift, that specifies really minimal vdev block size that can be accessed. This knowledge allows properly pad physical I/O and correctly assert its alignment. This change fixes L2ARC write errors when device has logical sector size above 512 bytes. MFC after: 1 month	2016-03-29 19:18:34 +00:00
Alexander Motin	78b127f2fc	Pass through error code from make_dev_p(). ENAMETOOLONG is much more informative in logs then ENXIO. MFC after: 1 week	2016-03-28 08:12:29 +00:00
Alexander Motin	52c9b0b539	Unify ignoring EEXIST from zvol_create_minor(). This fixes creation of zvol devices for snapshots during zfs receive, that previously failed with "ZFS WARNING: Unable to create ZVOL" message. This solution is not perfect, but IMHO better then it was before. MFC after: 2 weeks	2016-03-24 10:10:41 +00:00
Mark Johnston	48cc2d5e22	Remove unused variables dtrace_in_probe and dtrace_in_probe_addr.	2016-03-17 18:55:54 +00:00
Alexander Motin	fa9de58388	Fix small memory leak on attempt to access deleted snapshot. MFC after: 3 days	2016-03-15 21:21:28 +00:00
Alexander Motin	5db0866658	Make ZFS ignore stripe sizes above SPA_MAXASHIFT (8KB). If device has stripe size bigger then maximal sector size supported by ZFS, there is nothing can be done to avoid read-modify-write cycles. Taking that stripe size into account will only reduce space efficiency and pointlessly bother user with warnings that can not be fixed. Discussed with: smh	2016-03-10 16:39:46 +00:00
Alexander Motin	eef192d85c	Make ZFS more picky to GEOM stripe sizes and offsets. Use of misaligned or non-power-of-2 stripes is not really useful for ZFS, since increased ashift won't help to avoid read-modify-write cycles, and only reduce pool space efficiency and compression rates.	2016-03-10 14:18:14 +00:00
Alexander Motin	a151f3a7ef	MFV r296609: 6370 ZFS send fails to transmit some holes Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Chris Williamson <chris.williamson@delphix.com> Reviewed by: Stefan Ring <stefanrin@gmail.com> Reviewed by: Steven Burgess <sburgess@datto.com> Reviewed by: Arne Jansen <sensille@gmx.net> Approved by: Robert Mustacchi <rm@joyent.com> Author: Paul Dagnelie <pcd@delphix.com> In certain circumstances, "zfs send -i" (incremental send) can produce a stream which will result in incorrect sparse file contents on the target. The problem manifests as regions of the received file that should be sparse (and read a zero-filled) actually contain data from a file that was deleted (and which happened to share this file's object ID). Note: this can happen only with filesystems (not zvols, because they do not free (and thus can not reuse) object IDs). Note: This can happen only if, since the incremental source (FromSnap), a file was deleted and then another file was created, and the new file is sparse (i.e. has areas that were never written to and should be implicitly zero-filled). We suspect that this was introduced by 4370 (applies only if hole_birth feature is enabled), and made worse by 5243 (applies if hole_birth feature is disabled, and we never send any holes). The bug is caused by the hole birth feature. When an object is deleted and replaced, all the holes in the object have birth time zero. However, zfs send cannot tell that the holes are new since the file was replaced, so it doesn't send them in an incremental. As a result, you can end up with invalid data when you receive incremental send streams. As a short-term fix, we can always send holes with birth time 0 (unless it's a zvol or a dataset where we can guarantee that no objects have been reused). Closes #37 openzfs/openzfs@adef853162	2016-03-10 09:01:19 +00:00
Alexander Motin	7370229e8d	Add new IOCTL compat shims for ABI breakage caused by r296510: MFV r296505: 6531 Provide mechanism to artificially limit disk performance	2016-03-09 11:16:15 +00:00
Alexander Motin	8d0e2eb06b	MFV r296529: 6672 arc_reclaim_thread() should use gethrtime() instead of ddi_get_lbolt() 6673 want a macro to convert seconds to nanoseconds and vice-versa Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> Reviewed by: Robert Mustacchi <rm@joyent.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Eli Rosenthal <eli.rosenthal@delphix.com> illumos/illumos-gate@a8f6344fa0	2016-03-08 18:28:24 +00:00
Alexander Motin	468bca03ef	MFV r296527: 6659 nvlist_free(NULL) is a no-op Reviewed by: Toomas Soome <tsoome@me.com> Reviewed by: Marcel Telka <marcel@telka.sk> Approved by: Robert Mustacchi <rm@joyent.com> Author: Josef 'Jeff' Sipek <jeffpc@josefsipek.net> illumos/illumos-gate@aab83bb83b	2016-03-08 18:11:38 +00:00
Alexander Motin	26802705d3	MFV r296522: 6541 Pool feature-flag check defeated if "verify" is included in the dedup property value Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Richard Laager <rlaager@wiktel.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: ilovezfs <ilovezfs@icloud.com> illumos/illumos-gate@971640e6aa	2016-03-08 17:58:02 +00:00
Alexander Motin	178f2c2b8e	MFV r296520: 6562 Refquota on receive doesn't account for overage Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com> Reviewed by: Toomas Soome <tsoome@me.com> Approved by: Gordon Ross <gwr@nexenta.com> Author: Dan McDonald <danmcd@omniti.com> illumos/illumos-gate@5f7a8e6d75	2016-03-08 17:53:42 +00:00
Alexander Motin	7a90077752	MFV r296518: 5027 zfs large block support (add copyright) Author: Matthew Ahrens <matt@mahrens.org> illumos/illumos-gate@c3d26abc9e	2016-03-08 17:51:09 +00:00
Alexander Motin	c892984b84	MFV r296515: 6536 zfs send: want a way to disable setting of DRR_FLAG_FREERECORDS Reviewed by: Anil Vijarnia <avijarnia@racktopsystems.com> Reviewed by: Kim Shrier <kshrier@racktopsystems.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Andrew Stormont <astormont@racktopsystems.com> illumos/illumos-gate@880094b606	2016-03-08 17:43:21 +00:00
Alexander Motin	253159febf	MFV r296513: 6450 scrub/resilver unnecessarily traverses snapshots created after the scrub started Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Approved by: Richard Lowe <richlowe@richlowe.net> Author: Matthew Ahrens <mahrens@delphix.com> illumos/illumos-gate@38d6103674	2016-03-08 17:34:58 +00:00
Alexander Motin	4427252c14	MFV r296511: 6537 Panic on zpool scrub with DEBUG kernel Reviewed by: Steve Gonczi <gonczi@comcast.net> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Matthew Ahrens <mahrens@delphix.com> Author: Gary Mills <gary_mills@fastmail.fm> illumos/illumos-gate@8c04a1fa3f	2016-03-08 17:32:24 +00:00
Alexander Motin	1b63fd68f4	MFV r296505: 6531 Provide mechanism to artificially limit disk performance Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Prakash Surya <prakash.surya@delphix.com> illumos/illumos-gate@97e8130957	2016-03-08 17:27:13 +00:00
Mark Johnston	9610c89750	Fix a couple of silly mistakes in r291962. - Handle the case where no DOF helper is provided. This occurs with the currently-unused DTRACEHIOC_ADD ioctl. - Fix some checks that prevented the loading DOF in the (non-default) lazyload mode.	2016-03-08 00:46:03 +00:00
Mark Johnston	380344a7af	Fix fasttrap tracepoint locking. Upstream, tracepoints are protected by per-CPU mutexes. An unlinked tracepoint may be freed once all the tracepoint mutexes have been acquired and released - this is done in fasttrap_mod_barrier(). This mechanism was not properly ported: in some places, the proc lock is used in place of a tracepoint lock, and in others the locking is omitted entirely. This change implements tracepoint locking with an rmlock, where the read lock is used in fasttrap probe context. As a side effect, this fixes a recursion on the proc lock when the raise action is used from a userland probe. MFC after: 1 month	2016-03-08 00:43:03 +00:00
Mark Johnston	6b1bddce00	Remove the fasttrap implementation for sparc. Other machine-dependent code required for DTrace on sparc is not present in the tree, so there's no point to keeping the fasttrap code.	2016-03-08 00:18:46 +00:00
Mark Johnston	acaa855f6e	MFV r296306: 6604 harden DIF bounds checking Reviewed by: Alex Wilson <alex.wilson@joyent.com> Reviewed by: Patrick Mooney <patrick.mooney@joyent.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Bryan Cantrill <bryan@joyent.com> illumos/illumos-gate@1c0cef67db MFC after: 2 weeks	2016-03-08 00:14:14 +00:00
Steven Hartland	e283644b87	Removed unused label and fix mutex_exit order Remove unused done label from zfs_setacl fixing PVS-Studio V729. Fix mutex_exit order to mirror the mutex_enter order. MFC after: 1 week Sponsored by: Multiplay	2016-02-25 03:01:24 +00:00
Svatopluk Kraus	35a0bc1260	As <machine/vmparam.h> is included from <vm/vm_param.h>, there is no need to include it explicitly when <vm/vm_param.h> is already included. Suggested by: alc Reviewed by: alc Differential Revision: https://reviews.freebsd.org/D5379	2016-02-22 09:08:04 +00:00
Warner Losh	c55f57071a	Create an API to reset a struct bio (g_reset_bio). This is mandatory for all struct bio you get back from g_{new,alloc}_bio. Temporary bios that you create on the stack or elsewhere should use this before first use of the bio, and between uses of the bio. At the moment, it is nothing more than a wrapper around bzero, but that may change in the future. The wrapper also removes one place where we encode the size of struct bio in the KBI.	2016-02-17 17:16:02 +00:00
Michal Meloun	7a308c64b4	ARM: Rename remaining ARMv4 specific function in DTrace code. I missed it in r295319. Pointed by: tuexen	2016-02-06 11:16:15 +00:00
Andriy Gapon	984777c43f	MFV r294821: 6529 Properly handle updates of variably-sized SA entries. Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Ned Bass <bass6@llnl.gov> Reviewed by: Tim Chase <tim@chase2k.com> Approved by: Gordon Ross <gwr@nexenta.com> Author: Andriy Gapon <avg@icyb.net.ua> illumos/illumos-gate@e7e978b1f7 During the update process in sa_modify_attrs(), the sizes of existing variably-sized SA entries are obtained from sa_lengths[]. The case where a variably-sized SA was being replaced neglected to increment the index into sa_lengths[], so subsequent variable-length SAs would be rewritten with the wrong length. This patch adds the missing increment operation so all variably-sized SA entries are stored with their correct lengths. Another problem was that index into attr_desc[] was increased even when an attribute was removed. If that attribute was not the last attribute, then the last attribute was lost.	2016-02-01 15:40:40 +00:00
Alan Somers	d4b9233a96	Add a sysctl to allow ZFS pools backed by zvols Change 294329 removed the ability to build ZFS pools that are backed by zvols, because having that ability (even if it's not used) leads to deadlocks. By popular demand, I'm adding an off-by-default sysctl to reenable that ability. Reviewed by: lidl, delphij MFC after: Never Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D4998	2016-01-29 17:08:26 +00:00
Ruslan Bukin	28029b68c0	Welcome the RISC-V 64-bit kernel. This is the final step required allowing to compile and to run RISC-V kernel and userland from HEAD. RISC-V is a completely open ISA that is freely available to academia and industry. Thanks to all the people involved! Special thanks to Andrew Turner, David Chisnall, Ed Maste, Konstantin Belousov, John Baldwin and Arun Thomas for their help. Thanks to Robert Watson for organizing this project. This project sponsored by UK Higher Education Innovation Fund (HEIF5) and DARPA CTSRD project at the University of Cambridge Computer Laboratory. FreeBSD/RISC-V project home: https://wiki.freebsd.org/riscv Reviewed by: andrew, emaste, kib Relnotes: Yes Sponsored by: DARPA, AFRL Sponsored by: HEIF5 Differential Revision: https://reviews.freebsd.org/D4982	2016-01-29 15:12:31 +00:00
Alexander Motin	5a97f48082	MFV r294819: 6495 Fix mutex leak in dmu_objset_find_dp Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Albert Lee <trisk@omniti.com> Author: Steven Hartland <steven.hartland@multiplay.co.uk> illumos/illumos-gate@2bad22584d	2016-01-26 13:45:41 +00:00
Alexander Motin	1cb4625f18	MFV r294816: 4986 receiving replication stream fails if any snapshot exceeds refquota Reviewed by: John Kennedy <john.kennedy@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Gordon Ross <gordon.ross@nexenta.com> Author: Dan McDonald <danmcd@omniti.com> illumos/illumos-gate@5878fad70d	2016-01-26 13:37:30 +00:00
Alexander Motin	75b810aee6	MFV r294814: 6393 zfs receive a full send as a clone Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Prakash Surya <prakash.surya@delphix.com> Reviewed by: Richard Elling <Richard.Elling@RichardElling.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Paul Dagnelie <pcd@delphix.com> illumos/illumos-gate@68ecb2ec93 This allows to do a full (non-incremental send) and receive it as a clone of an existing dataset. It can leverage nopwrite to share blocks with the origin. This can be used to change the relationship of datasets on the target. For example, maybe on the source you have: A ---- B ---- C And you have sent to the target a full of B, and the incremental B->C: B ---- C You later realize that you want to have A on the target. You will have to do a full send of A, but nopwrite can save you space on the target if you receive it as a clone of B, assuming that A and B have some blocks inxi common: B ---- C \ A	2016-01-26 13:14:39 +00:00
Alexander Motin	d2385b31f5	MFV r294812: 6434 sa_find_sizes() may compute wrong SA header size Reviewed-by: Ned Bass <bass6@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Andriy Gapon <avg@freebsd.org> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: James Pan <jiaming.pan@yahoo.com> illumos/illumos-gate@3502ed6e7c	2016-01-26 13:03:01 +00:00
Alexander Motin	49b7f6ef02	MFV r294810: 6414 vdev_config_sync could be simpler Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Will Andrews <will@firepipe.net> illumos/illumos-gate@eb5bb58421	2016-01-26 12:58:58 +00:00
Alexander Motin	70c71b4722	MFV r294808: 6421 Add missing multilist_destroy calls to arc_fini Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Jorgen Lundman <lundman@lundman.net> Approved by: Robert Mustacchi <rm@joyent.com> Author: Prakash Surya <prakash.surya@delphix.com> illumos/illumos-gate@57deb23282	2016-01-26 12:54:03 +00:00
Alexander Motin	6c941579b9	MFV r294806: 6388 Failure of userland copy should return EFAULT Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Richard Yao <ryao@gentoo.org> illumos/illumos-gate@c71c00bbe8	2016-01-26 12:52:16 +00:00
Alexander Motin	8ad8374efe	MFV r294804: 6386 Fix function call with uninitialized value in vdev_inuse Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Approved by: Robert Mustacchi <rm@joyent.com> Author: Richard Yao <ryao@gentoo.org> illumos/illumos-gate@5bdd995ddb	2016-01-26 12:50:14 +00:00
Alexander Motin	02404a5ad2	MFV r294802: 6334 Cannot unlink files when over quota Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Toomas Soome <tsoome@me.com> Approved by: Dan McDonald <danmcd@omniti.com> Author: Simon Klinkert <simon.klinkert@gmail.com> illumos/illumos-gate@6575bca013	2016-01-26 12:48:10 +00:00
Alexander Motin	2360c716f9	MFV r294800: 6385 Fix unlocking order in zfs_zget Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan Kimmel <dan.kimmel@delphix.com> Reviewed by: Andriy Gapon <avg@freebsd.org> Approved by: Robert Mustacchi <rm@joyent.com> Author: Richard Yao <ryao@gentoo.org> illumos/illumos-gate@eaef6a96de	2016-01-26 12:44:49 +00:00
Alexander Motin	81754f9788	MFV r294798: 6292 exporting a pool while an async destroy is running can leave entries in the deferred tree Reviewed by: Paul Dagnelie <pcd@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Andriy Gapon <avg@FreeBSD.org> Reviewed by: Fabian Keil <fk@fabiankeil.de> Approved by: Gordon Ross <gordon.ross@nexenta.com> illumos/illumos-gate@a443cc80c7	2016-01-26 12:37:23 +00:00
Alexander Motin	82abccb272	MFV r294796: 6319 assertion failed in zio_ddt_write: bp->blk_birth == txg Reviewed by: George Wilson <george.wilson@delphix.com> Approved by: Dan McDonald <danmcd@omniti.com> illumos/illumos-gate@b39b744be7 This is revert of 5693.	2016-01-26 12:33:58 +00:00
Alexander Motin	27a8d05bd7	MFV r294793: 6367 spa_config_tryenter incorrectly handles the multiple-lock case Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com> Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com> Reviewed by: Prashanth Sreenivasa <prashksp@gmail.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Steven Hartland <steven.hartland@multiplay.co.uk> Approved by: Matthew Ahrens <mahrens@delphix.com> illumos/illumos-gate@e495b6e673	2016-01-26 12:28:53 +00:00
Edward Tomasz Napierala	aa9b057c08	Fix ru_oublocks accounting for ZFS. There are two code paths that can be called from zfs_write() - one of them, through dmu_write(), was handled correctly; the other wasn't. Reviewed by: avg@ MFC after: 1 month Sponsored by: The FreeBSD Foundation Differential Revision: https://reviews.freebsd.org/D4923	2016-01-23 12:13:09 +00:00
Alan Somers	34a484f353	Quell harmless CID about unchecked return value in nvlist_get_guids. The return value doesn't need to be checked, because nvlist_get_guid's callers check the returned values of the guids. Coverity CID: 1341869 MFC after: 1 week X-MFC-With: 292066 Sponsored by: Spectra Logic Corp	2016-01-19 23:16:24 +00:00
Alan Somers	f7b60097b5	Disallow zvol-backed ZFS pools Using zvols as backing devices for ZFS pools is fraught with panics and deadlocks. For example, attempting to online a missing device in the presence of a zvol can cause a panic when vdev_geom tastes the zvol. Better to completely disable vdev_geom from ever opening a zvol. The solution relies on setting a thread-local variable during vdev_geom_open, and returning EOPNOTSUPP during zvol_open if that thread-local variable is set. Remove the check for MUTEX_HELD(&zfsdev_state_lock) in zvol_open. Its intent was to prevent a recursive mutex acquisition panic. However, the new check for the thread-local variable also fixes that problem. Also, fix a panic in vdev_geom_taste_orphan. For an unknown reason, this function was set to panic. But it can occur that a device disappears during tasting, and it causes no problems to ignore this departure. Reviewed by: delphij MFC after: 1 week Relnotes: yes Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D4986	2016-01-19 17:00:25 +00:00
Dimitry Andric	9516209bf2	MFV r294101: 6527 Possible access beyond end of string in zpool comment Reviewed by: George Wilson <george.wilson@delphix.com> Reviewed by: Matthew Ahrens <mahrens@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Approved by: Gordon Ross <gwr@nexenta.com> illumos/illumos-gate@2bd7a8d078 This fixes erroneous double increments of the 'check' variable in a loop in spa_prop_validate(). I ran into this in the clang380-import branch, where clang 3.8.0 warns about it. (It is already fixed there.) MFC after: 3 days	2016-01-15 21:45:53 +00:00
Alan Somers	cbedc01c9a	Fix race condition involving ZFS remove events When a ZFS drive disappears, ZFS sends a resource.fs.zfs.removed event to userland. A userland program like zfsd(8) can use that event, for example to activate a hotspare. The current code contains a race condition: vdev_geom will sent the sysevent _before_ spa.c would update the vdev's status, causing userland processes to see pool state that does not reflect the device removal. This change moves the sysevent to spa.c, closing the race. Reviewed by: delphij, Sean Eric Fagan MFC after: 4 weeks Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D4902	2016-01-14 18:19:05 +00:00
Alan Somers	53f6862723	Fix importing l2arc device by guid After r292066, vdev_geom verifies both the vdev and pool guids of device labels during open. However, spare and l2arc devices don't have pool guids, so opening them by guid will fail (opening by path, when the pathname is known, still succeeds). This change allows a vdev to be opened by guid if the label contains no pool_guid, which is the case for inactive spares and l2arc devices. PR: 292066 Reported by: delphij Reviewed by: delphij, smh MFC after: 2 weeks Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D4861	2016-01-11 22:15:46 +00:00
Alan Somers	4e7787a9e9	Record physical path information in ZFS Vdevs sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c: If available, record the physical path of a vdev in ZFS meta-data. Do this both when opening the vdev, and when receiving an attribute change notification from GEOM. Make vdev_geom_close() synchronous instead of deferring its work to a GEOM event handler. There is no benefit to deferring the work and this prevents a future open call from referencing a consumer that is scheduled for destruction. The close followed by an immediate open will occur during a vdev reprobe triggered by any type of I/O error. Consolidate vdev_geom_close() and vdev_geom_detach() into vdev_geom_close() and vdev_geom_close_locked(). This also moves the cross linking operations between vdev and GEOM consumer into a single place (linking in vdev_geom_attach() and unlinking in vdev_geom_close_locked()). Submitted by: gibbs, asomers MFC after: 4 weeks Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D4524	2016-01-11 17:57:26 +00:00
Steven Hartland	9697b154f2	Fix const conversion warning in lz4_decompress Fix const conversion warning in lz4_decompress which shows when warnings are enabled (to be done later). MFC after: 2 weeks X-MFC-With: r293268 Sponsored by: Multiplay	2016-01-06 20:28:09 +00:00
Allan Jude	7a3f5d11fb	Replace sys/crypto/sha2/sha2.c with lib/libmd/sha512c.c cperciva's libmd implementation is 5-30% faster The same was done for SHA256 previously in r263218 cperciva's implementation was lacking SHA-384 which I implemented, validated against OpenSSL and the NIST documentation Extend sbin/md5 to create sha384(1) Chase dependancies on sys/crypto/sha2/sha2.{c,h} and replace them with sha512{c.c,.h} Reviewed by: cperciva, des, delphij Approved by: secteam, bapt (mentor) MFC after: 2 weeks Sponsored by: ScaleEngine Inc. Differential Revision: https://reviews.freebsd.org/D3929	2015-12-27 17:33:59 +00:00
Andrew Turner	06ef48781d	Be stricter on which functions we can probe with FBT. We now only check the first instruction to see if it's either a pushm with lr, or a sub with sp. The former is the common case, with the latter used with va_args. This removes 12 probes. These are all hand-written assembly, with a few C functions with no stack usage. Submitted by: Howard Su <howard0su@gmail.com> Differential Revision: https://reviews.freebsd.org/D4419	2015-12-23 17:54:19 +00:00
Mark Johnston	8ff6d9dd22	Support an arbitrary number of arguments to DTrace syscall probes. Rather than pushing all eight possible arguments into dtrace_probe()'s stack frame, make the syscall_args struct for the current syscall available via the current thread. Using a custom getargval method for the systrace provider, this allows any syscall argument to be fetched, even in kernels that have modified the maximum number of system call arguments. Sponsored by: EMC / Isilon Storage Division	2015-12-17 00:00:27 +00:00
Gleb Smirnoff	f17f88d3e0	Fix breakage caused by r292373 in ZFS/FUSE/NFS/SMBFS. With the new VOP_GETPAGES() KPI the "count" argument counts pages already, and doesn't need to be translated from bytes to pages. While here make it consistent that rbehind and rahead are updated only if we doesn't return error. Pointy hat to: glebius	2015-12-16 23:48:50 +00:00
Mark Johnston	feea513564	Remove the unused systrace device file and fix style bugs. MFC after: 1 week	2015-12-16 23:46:27 +00:00
Gleb Smirnoff	b0cd20172d	A change to KPI of vm_pager_get_pages() and underlying VOP_GETPAGES(). o With new KPI consumers can request contiguous ranges of pages, and unlike before, all pages will be kept busied on return, like it was done before with the 'reqpage' only. Now the reqpage goes away. With new interface it is easier to implement code protected from race conditions. Such arrayed requests for now should be preceeded by a call to vm_pager_haspage() to make sure that request is possible. This could be improved later, making vm_pager_haspage() obsolete. Strenghtening the promises on the business of the array of pages allows us to remove such hacks as swp_pager_free_nrpage() and vm_pager_free_nonreq(). o New KPI accepts two integer pointers that may optionally point at values for read ahead and read behind, that a pager may do, if it can. These pages are completely owned by pager, and not controlled by the caller. This shifts the UFS-specific readahead logic from vm_fault.c, which should be file system agnostic, into vnode_pager.c. It also removes one VOP_BMAP() request per hard fault. Discussed with: kib, alc, jeff, scottl Sponsored by: Nginx, Inc. Sponsored by: Netflix	2015-12-16 21:30:45 +00:00
Alan Somers	670ffd5e5c	Change an important error message from ZFS_LOG to printf Submitted by: gibbs MFC after: 4 weeks Sponsored by: Spectra Logic Corp	2015-12-11 00:04:13 +00:00
Alan Somers	62ac7dd2bf	During vdev_geom_open, require that the vdev guids match the device's label except during split, add, or create operations. This fixes a bug where the wrong disk could be returned, and higher layers of ZFS would immediately eject it again. sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c: o When opening by GUID, require both the pool and vdev GUIDs to match. While it is highly unlikely for two vdevs to have the same vdev GUIDs, the ZFS storage pool allocator only guarantees they are unique within a pool. o Modify the open behavior to: - If we are opening a vdev that hasn't previously been opened, open by path without checking GUIDs. - Otherwise, open by path and verify GUIDs. - If that fails, search all geom providers for a device with matching GUIDs. - If that fails, return ENOENT. Submitted by: gibbs, asomers Reviewed by: smh MFC after: 4 weeks Sponsored by: Spectra Logic Corp Differential Revision: https://reviews.freebsd.org/D4486	2015-12-10 21:46:21 +00:00
Mark Johnston	1639290749	MFV r289003: 6271 dtrace caused excessive fork time Author: Bryan Cantrill <bryan@joyent.com> Reviewed by: Adam Leventhal <ahl@delphix.com> Reviewed by: Dan McDonald <danmcd@omniti.com> Reviewed by: Richard Lowe <richlowe@richlowe.net> Approved by: Gordon Ross <gwr@nexenta.com> illumos/illumos-gate@7bd3c1d12d	2015-12-07 21:49:32 +00:00
Mark Johnston	6e0f204c3f	Modify DTRACEHIOC_ADDDOF to copy the DOF section from the target process. r281257 added support for lazyload mode by allowing dtrace(1) to register a DOF section on behalf of a traced process. This was implemented by having libdtrace copy the DOF section into a heap-allocated buffer and passing its address to the ioctl handler. However, DTrace uses the DOF section address as a lookup key in certain cases, so the ioctl handler should be given the target process' DOF section address instead. This change modifies the ADDDOF handler to copy the DOF section in from the target process, rather than from dtrace(1).	2015-12-07 21:44:05 +00:00
Mark Johnston	711fbd17ec	Add helper functions proc_readmem() and proc_writemem(). These helper functions can be used to read in or write a buffer from or to an arbitrary process' address space. Without them, this can only be done using proc_rwmem(), which requires the caller to fill out a uio. This is onerous and results in code duplication; the new functions provide a simpler interface which is sufficient for most existing callers of proc_rwmem(). This change also adds a manual page for proc_rwmem() and the new functions. Reviewed by: jhb, kib Differential Revision: https://reviews.freebsd.org/D4245	2015-12-07 21:33:15 +00:00
Andrew Turner	5cde34a0ff	Allow the artificial profile frames to be adjusted as needed by the user. While here update for armv6 to a tested value. Submitted by: Howard Su <howard0su@gmail.com> Reviewed by: stat Differential Revision: https://reviews.freebsd.org/D4315	2015-12-05 10:00:01 +00:00
Andrew Turner	c218815337	Move the check to see if we are tracing a function with the DTrace Function Boundary Trace to assembly to reduce the overhead of these checks. Submitted by: Howard Su <howard0su@gmail.com> Relnotes: Yes Differential Revision: https://reviews.freebsd.org/D4266	2015-12-05 09:32:36 +00:00
Stanislav Sedov	314eeef290	Make the number of fasttrap probes and the size of the trace points hash table tunable via sysctl or kernel tunables. Illumos allows this parameters to be changed via the fasttrap.conf configuration file, but FreeBSD code hardcoded the parameters. Expose them under the kern.dtrace.fasttrap sysctl tree. MFC after: 2 weeks	2015-12-01 00:24:54 +00:00
Mark Johnston	4d7296f9aa	Fix a bug in the amd64 dtrace_getarg() implementation: when unwinding the stack, take into account the copy of rsi pushed between the breakpoint trapframe and the dtrace_invop frame. Prior to r287644, this was covered by the fact that sizeof(struct amd64_frame) was 24 rather than 16. Reported by: smh	2015-11-19 05:33:15 +00:00
Steven Hartland	465fed1c17	Switch zfs_panic_recover to panic for bad DVA As reported by Coverity a null pointer de-reference panic would be triggered when zfs_recover was set so switch to straight panic as it can never be recovered. Reported by: Coverity Scan MFC after: 1 X-MFC-With: r290401 Sponsored by: Multiplay	2015-11-06 20:45:19 +00:00
Steven Hartland	d0d400133f	Provide information about bad DVA Provide information about which vdev has an issue with a bad DVA. MFC after: 1 week Sponsored by: Multiplay	2015-11-05 17:12:41 +00:00
Steven Hartland	ab66c9067a	Allow zfs_recover to be changed at runtime MFC after: 1 week Sponsored by: Multiplay	2015-11-05 17:00:42 +00:00
Andrew Turner	e5ca5f2abd	Fix the open solaris atomic functions on arm64. Without this we may use the wrong value in the comparison, leading to incorrectly setting the new value. This has been observed in the ZFS code. Without this we can lose track of the reference count in a zrlock object. We should move to use the generic atomic functions, however as this has been observed I would prefer to have this working, then move to the generic functions. PR: 204037 Sponsored by: ABT Systems Ltd	2015-11-05 16:55:27 +00:00
Andriy Gapon	c34d46ff59	zfs: allow the lookup of extended attributes of an unlinked file That's required for extattr_get_fd(2) and the like to work properly. PR: 203201 MFC after: 17 days	2015-11-02 10:07:21 +00:00
Andriy Gapon	abc37121c4	l2arc: do not call trim_map_free() for blocks with zero b_asize b_asize can be zero if the block is compressed into an empty block (ZIO_COMPRESS_EMPTY) and the trim code asserts that meaningless zero-sized trimming is not attempted. The logic for calling trim_map_free() is extracted into a new function l2arc_trim() to minimize code duplication. PR: 203473 Reported by: Willem Jan Withagen <wjw@digiware.nl> Tested by: Willem Jan Withagen <wjw@digiware.nl> MFC after: 11 days	2015-10-30 12:00:34 +00:00

... 3 4 5 6 7 ...

1822 Commits