Merge vendor bugfix for ZFS test suite that triggers false positives.
Illumos ZFS issues:
3949 ztest fault injection should avoid resilvering devices
3950 ztest: deadman fires when we're doing a scan
3951 ztest hang when running dedup test
3952 ztest: ztest_reguid test and ztest_fault_inject don't place nice together
Before executing any subcommand, the zpool tool fetches the pool
configuration from the kernel. Before feature support was added, the
kernel regenerated that configuration from data that is always present
in memory. Unfortunately, the pool feature list and activity counters
are not: they are stored in the ZAP, which normally resides in the ARC
but may be evicted under heavy memory pressure. If the pool is
suspended at that point, there is no way to recover it, since any
zpool command will get stuck.
This change has one predictable flaw: `zpool upgrade` always wishes to
upgrade suspended pools, but fortunately it cannot do so due to the
suspension.
disable TRIM otherwise.
r252840 (illumos bug 3836) is based on the assumption that
zio_free_sync() has no lock dependencies and should complete
immediately. Unfortunately, with our TRIM implementation that is not
true, because of the ZIO_STAGE_VDEV_IO_START stage added to
ZIO_FREE_PIPELINE: while it does not really access devices, it still
acquires the SCL_ZIO lock for read to be sure devices won't disappear.
When TRIM is disabled, this patch enables the direct free execution
from r252840 and removes the ZIO_STAGE_VDEV_IO_START and
ZIO_STAGE_VDEV_IO_ASSESS stages from the pipeline to avoid the lock
acquisition. Otherwise it queues free requests as it did before
r252840.
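To illustrate the idea, here is a minimal sketch (not the committed
diff; `zfs_trim_enabled` stands in for whatever tunable gates TRIM) of
masking the VDEV I/O stages out of the free pipeline when TRIM is off:

    /*
     * Sketch only: strip the VDEV I/O stages so zio_free_sync() never
     * takes SCL_ZIO and can complete inline.
     */
    enum zio_stage pipeline = ZIO_FREE_PIPELINE;

    if (!zfs_trim_enabled)
            pipeline &= ~(ZIO_STAGE_VDEV_IO_START |
                ZIO_STAGE_VDEV_IO_ASSESS);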
The existing async thread runs only on successful spa_sync()
completion, which is impossible when the pool loses its required
(last) disk(s). That indefinite delay of SPA_ASYNC_REMOVE processing
caused ZFS not to close the lost disks, preventing GEOM/CAM from
destroying the devices and reusing their names on a later disk
reattach.
In an earlier version of the patch I tried to simply run the existing
thread immediately, independent of spa_sync() completion, but that
exposed a number of situations where it could get stuck due to locks
held by the stuck spa_sync() that are required for other kinds of
async events.
Experiments with an OpenIndiana snapshot confirmed that it also has
this issue with reattaching lost disks.
We cannot busy a page before doing pagefaults.
In fact, it can deadlock against the vnode lock, as it tries to vget().
Other functions, right now, have an opposite lock ordering, like
vm_object_sync(), which acquires the vnode lock first and then
sleeps on the busy mechanism.
Before this patch is reinserted we need to break this ordering.
Sponsored by: EMC / Isilon storage division
Reported by: kib
- It does not let pages respect the LRU policy
- It bloats the active/inactive queues of few pages
Try to avoid it as much as possible with the long-term target to
completely remove it.
Use the soft-busy mechanism to protect page content accesses during
short-term operations (like uiomove_fromphys()).
After this change only vm_fault_quick_hold_pages() is still using the
hold mechanism for page content access.
There is additional complexity here, as the quick path cannot
immediately access the page object in order to busy the page, while
the slow path cannot busy more than one page at a time (to avoid
deadlocks).
Fixing this primitive could lead to the complete removal of the page
hold mechanism.
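A minimal sketch of the soft-busy pattern used for a short copy such
as uiomove_fromphys() follows; it assumes the vm_page_sbusy()/
vm_page_sunbusy() interface, the object, index and length are
placeholders, and real code must also cope with a page that is
already exclusively busied:

    vm_page_t m;

    VM_OBJECT_WLOCK(obj);
    m = vm_page_lookup(obj, pindex);
    vm_page_sbusy(m);           /* shared busy: content access only */
    VM_OBJECT_WUNLOCK(obj);     /* busy keeps the page from going away */

    error = uiomove_fromphys(&m, offset, len, uio);

    vm_page_sunbusy(m);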
Sponsored by: EMC / Isilon storage division
Discussed with: alc
Reviewed by: jeff
Tested by: pho
Skip the eviction step when processing free records during a ZFS
receive, to avoid the expensive search for non-existent dbufs in
dn_dbufs.
Illumos ZFS issues:
3834 incremental replication of 'holey' file systems is slow
MFC after: 2 weeks
To quote Illumos issue #3888:
When 'zfs recv -F' is used with an incremental recv it rolls
back any changes made since the last snapshot in case new
changes were made to the file system while the recv is in
progress (without -F the recv would fail when it does its
final check to commit the recv-ed data, as the recv-ed data
conflicts with the newly written data).
However, if there is a snapshot taken after the recv began,
rolling back to the 'latest' snapshot will not help and the
recv will still fail. 'zfs recv -F' should be extended to
destroy any snapshots created since the source snapshot when
finishing the recv (effectively rolling back through all
snapshots, instead of just to the latest snapshot).
Illumos ZFS issues:
3888 zfs recv -F should destroy any snapshots created since the
incremental source
MFC after: 2 weeks
To quote Illumos #3875:
The problem here is that if we ever end up in the error
path, we drop the locks protecting access to the zfsvfs_t
prior to forcibly unmounting the filesystem. Because z_os
is NULL, any thread that had already picked up the zfsvfs_t
and was sitting in ZFS_ENTER() when we dropped our locks
in zfs_resume_fs() will now acquire the lock, attempt to
use z_os, and panic.
Illumos ZFS issues:
3875 panic in zfs_root() after failed rollback
MFC after: 2 weeks
existed before the IOCTL code refactoring that merged change
4445fffb from illumos at r248571.
This change allows `zpool clear` to be used again to recover a
suspended pool. It seems to be the only way intended by the code to
restore pool operation after reconnecting lost disks that were
required for data completeness. There are still cases where the
`zpool clear` command can simply get stuck due to deadlocks inside the
ZFS kernel part, but that is probably better than having no chance to
recover at all.
- move init and fini code into separate functions (as is done upstream)
- invoke fini code via shutdown_post_sync event hook
This should make zfs close its underlying devices during shutdown,
which may be important for their drivers.
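Roughly, the hook-up looks like the sketch below (the handler and
fini names are placeholders, not necessarily those used by the
change), registered from the init path:

    /* Run the ZFS fini path after the final filesystem sync. */
    static void
    zfs_shutdown(void *arg __unused, int howto __unused)
    {
            zfs__fini();        /* assumed name of the split-out fini code */
    }

    EVENTHANDLER_REGISTER(shutdown_post_sync, zfs_shutdown, NULL,
        EVENTHANDLER_PRI_ANY);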
MFC after: 20 days
All other places where a znode is allocated do not need z_vnode at all.
These are:
- zfs_create_share_dir
- zfs_create_fs
This change ensures two things:
- VN_LOCK_ASHARE is not erroneously called for VFIFO vnodes
- vn_lock is called on a fully constructed vnode with correct v_ops
The change also makes it possible to turn zfs_znode_cache_constructor
back into a normal kmem_cache constructor (as it is upstream).
This avoids a problem where zfs_znode_cache_destructor
may be called on un-constructed znodes.
MFC after: 17 days
Unconditionally freeing a page is not good, especially if it is the page
that was wired by the caller. The checks are picked up from
kern_sendfile.
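A simplified sketch of the sendfile-style checks (not the exact
committed code; the field and helper names are from the VM layer of
that era and are shown for illustration only):

    vm_page_lock(pg);
    vm_page_unwire(pg, 0);      /* 0: do not reactivate the page */
    /* Free the page only if nobody else can possibly know about it. */
    if (pg->wire_count == 0 && pg->valid == 0 && pg->hold_count == 0)
            vm_page_free(pg);
    vm_page_unlock(pg);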
MFC after: 3 weeks
Quoting illumos issue #3836:
Currently zio_free() always puts the zio on a list for subsequent
processing by zio_free_sync(). This is only necessary for frees that
might need to issue reads (gang and dedup blocks).
By processing the majority of the frees as we encounter them, we reduce
the amount of time that the spa_sync() thread spends burning CPU and
not doing any i/o, thus increasing the overall write throughput of the
system.
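In sketch form (close to the illumos change as described above, but
not the verbatim code), the dispatch decision is:

    void
    zio_free(spa_t *spa, uint64_t txg, const blkptr_t *bp)
    {
            /*
             * Gang and dedup blocks may need reads to be freed, so
             * defer them to the per-txg list that zio_free_sync()
             * processes later; free everything else immediately.
             */
            if (BP_IS_GANG(bp) || BP_GET_DEDUP(bp)) {
                    bplist_append(&spa->spa_free_bplist[txg & TXG_MASK],
                        bp);
            } else {
                    VERIFY0(zio_wait(zio_free_sync(NULL, spa, txg,
                        bp, 0)));
            }
    }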
Illumos ZFS issues:
3836 zio_free() can be processed immediately in the common case
MFC after: 1 week
I missed registering zfs_ioc_jail and zfs_ioc_unjail as legacy ioctls
with the new zfs_ioctl_register_legacy() function.
These operations do not modify pools or datasets, so there is no need
to log them to pool history.
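For reference, such a registration looks roughly like the sketch
below; the secpolicy, namecheck and pool-check arguments are
assumptions, not necessarily those used by the actual fix:

    zfs_ioctl_register_legacy(ZFS_IOC_JAIL, zfs_ioc_jail,
        zfs_secpolicy_config, DATASET_NAME,
        B_FALSE /* no pool history */, POOL_CHECK_NONE);
    zfs_ioctl_register_legacy(ZFS_IOC_UNJAIL, zfs_ioc_unjail,
        zfs_secpolicy_config, DATASET_NAME,
        B_FALSE /* no pool history */, POOL_CHECK_NONE);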
Reported by: Alexander Leidinger <ale@FreeBSD.org> and others on current@
MFC after: 3 days
This could happen if a thread doing a page-in loses a ZFS range lock
race to a thread writing to the same range.
This fixes "panic: vm_page_alloc: pindex already allocated" in
http://docs.FreeBSD.org/cgi/mid.cgi?1372165971.96049.42.camel
Submitted by: avg
MFC after: 1 week
Restore the previous behavior from before r251646, where, when
destroying ZFS snapshots, the ioctl would return ENOENT when it hit
any entry in the errlist (the new behavior was to return ENOENT only
when all of them return an error).
Illumos ZFS issues:
3829 fix for 3740 changed behavior of zfs destroy/hold/release ioctl
MFC after: 1 week
ZFS event processing should work on R/O root filesystems
Illumos ZFS issues:
3749 zfs event processing should work on R/O root filesystems
MFC after: 2 weeks
* Illumos ZFS issue #3805 arc shouldn't cache freed blocks
Quote from the Illumos issue:
ZFS should proactively evict freed blocks from the cache.
Even though these freed blocks will never be used again, and thus
will eventually be evicted, this causes us to use memory
inefficiently for 2 reasons:
1. A block that is freed has no chance of being accessed again, but
will be kept in memory preferentially to a block that was accessed
before it (and is thus older) but has not been freed and thus has
at least some chance of being accessed again.
2. We partition the ARC into several buckets:
user data that has been accessed only once (MRU)
metadata that has been accessed only once (MRU)
user data that has been accessed more than once (MFU)
metadata that has been accessed more than once (MFU)
The user data vs metadata split is somewhat arbitrary, and the
primary control on how much memory is used to cache data vs metadata
is to simply try to keep the proportion the same as it has been in the
past (each bucket "evicts against" itself). The secondary control is
to evict data before evicting metadata.
Because of this bucketing, we may end up with one bucket mostly
containing freed blocks that are very old, while another bucket has
more recently accessed, still-allocated blocks. Data in the useful
bucket (with still-allocated blocks) may be evicted in preference to
data in the useless bucket (with old, freed blocks).
On dcenter, we saw that the MFU metadata bucket was 230MB, while the
MFU data bucket was 27GB and the MRU metadata bucket was 256GB.
However, the vast majority of data in the MRU metadata bucket (256GB)
was freed blocks, and thus useless. Meanwhile, the MFU metadata bucket
(230MB) was constantly evicting useful blocks that will be soon needed.
The problem of cache segmentation is a larger problem that needs more
investigation. However, if we stop caching freed blocks, it should
reduce the impact of this more fundamental issue.
MFC after: 2 weeks
* Illumos zfs issue #3137 L2ARC compression
Whether or not to compress buffers entering the L2ARC is
controlled by the "compression" setting on the dataset: when
compression is not "off", L2ARC compression is enabled.
The compression method is always LZ4 for the L2ARC when enabled,
because it works best for this scenario.
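The selection boils down to the following conceptual sketch (the
function itself is hypothetical; only the zio_compress values are
real ZFS identifiers):

    static enum zio_compress
    l2arc_select_compress(enum zio_compress dataset_compress)
    {
            /* compression=off on the dataset disables L2ARC compression */
            if (dataset_compress == ZIO_COMPRESS_OFF)
                    return (ZIO_COMPRESS_OFF);
            /* otherwise always LZ4, regardless of the dataset algorithm */
            return (ZIO_COMPRESS_LZ4);
    }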
MFC after: 2 weeks
USDT probes exits. This was previously done with a callout; however, it is
possible to sleep while holding the DTrace mutexes, so a panic will occur
on INVARIANTS kernels if the callout handler can't immediately acquire one
of these mutexes. This panic will be frequently triggered on systems where
a USDT-enabled program (perl, for instance) is often run.
This revision changes the fasttrap cleanup mechanism so that a dedicated
thread is used instead of a callout. The old behaviour is otherwise
preserved.
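In outline (identifiers are placeholders, not the committed code, and
termination handling is omitted), the dedicated thread replaces the
callout like this:

    /* A kernel process that, unlike a callout, may sleep on the
     * DTrace locks while performing the cleanup. */
    static void
    fasttrap_cleanup_thread(void *arg __unused)
    {
            for (;;) {
                    /* Wait for a wakeup or the periodic interval. */
                    tsleep(&fasttrap_cleanup_thread, PWAIT, "ftclean", hz);
                    fasttrap_pid_cleanup_cb(NULL);  /* may block on locks */
            }
    }

    /* Started once at provider load, replacing the old callout. */
    kproc_create(fasttrap_cleanup_thread, NULL, NULL, 0, 0, "fasttrap");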
Reviewed by: rpaulo
MFC after: 1 month
users to guarantee that the output of DTrace scripts will be time-ordered.
This option is enabled by adding the line
#pragma D option temporal
to the beginning of a script, or by adding '-x temporal' to the arguments of
dtrace(1).
This change fixes a bug in the original port of the temporal option. This
bug was causing some assertions to fail, so they had been disabled; in this
revision the assertions are working properly and are enabled.
The DTrace version number has been bumped from 1.9.0 to 1.9.1 to reflect
the language change that's being introduced.
This change corresponds to part of illumos-gate commit e5803b76927480:
3021 option for time-ordered output from dtrace(1M)
Reviewed by: pfg
Obtained from: illumos
MFC after: 1 month
Merge vendor bugfix for a possible deadlock related to async destroy
and improve write performance by introducing a new lock protecting
tx_open_txg.
Illumos ZFS issues:
3642 dsl_scan_active() should not issue I/O to determine if async
destroying is active
3643 txg_delay should not hold the tc_lock
MFC after: 1 week
impossible to set quota and reservation on pools lower than version 22.
The problem has been reported and a solution discussed with the vendor.
Illumos ZFS issues:
3739 cannot set zfs quota or reservation on pool version < 22
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reported by: Steve Wills <swills@FreeBSD.org>
MFC after: 3 days
The following change from illumos caused DTrace to
pause in an interactive environment:
3026 libdtrace should set LD_NOLAZYLOAD=1 to help the pid provider
This was not detected during testing because it doesn't
affect scripts.
We shouldn't be changing the environment, especially since the
LD_NOLAZYLOAD option doesn't apply to our (GNU) ld.
Unfortunately the change from upstream was made in such a way
that it is very difficult to separate this change from the
others so, at least for now, it's better to just revert
everything.
Reference:
https://www.illumos.org/issues/3026
Reported by: Navdeep Parhar and Mark Johnston
Merge changes from illumos:
3021 option for time-ordered output from dtrace(1M)
3022 DTrace: keys should not affect the sort order when sorting by value
3023 it should be possible to dereference dynamic variables
3024 D integer narrowing needs some work
3025 register leak in D code generation
3026 libdtrace should set LD_NOLAZYLOAD=1 to help the pid provider
This brings yet another feature implemented in upstream DTrace.
A complete description is available here:
http://dtrace.org/blogs/ahl/2012/07/28/my-new-dtrace-favorite/
This change bumps the DT_VERS_* number to 1.9.1 in
accordance with what is done in illumos.
This change was somewhat complicated because upstream mixed many
changes into a single commit and some of the tests don't really
apply to us.
There also appear to be differences in timestamping compared with
Solaris, so we had to work around some assertions while making sure
no regression happened.
Special thanks to Fabian Keil for changes and testing.
Illumos Revisions: 13758:23432da34147
Reference:
https://www.illumos.org/issues/3021
https://www.illumos.org/issues/3022
https://www.illumos.org/issues/3023
https://www.illumos.org/issues/3024
https://www.illumos.org/issues/3025
https://www.illumos.org/issues/1694
Tested by: Fabian Keil
Obtained from: Illumos
MFC after: 1 month
Merge bugfixes accepted and integrated by vendor. Underlying problems
have been reported by us and fixed in r240942 and r249196.
Illumos ZFS issues:
3645 dmu_send_impl: possibilty of pool hold leak
3692 Panic on zfs receive of a recursive deduplicated stream
MFC after: 8 days