Unconditionally freeing a page is not good, especially if it is the page
that was wired by the caller. The checks are picked up from
kern_sendfile.
MFC after: 3 weeks
Quoting illumos issue #3836:
Currently zio_free() always puts the zio on a list for subsequent
processing by zio_free_sync(). This is only necessary for frees that
might need to issue reads (gang and dedup blocks).
By processing the majority of the frees as we encounter them, we reduce
the amount of time that the spa_sync() thread spends burning CPU and
not doing any i/o, thus increasing the overall write throughput of the
system.
Illumos ZFS issues:
3836 zio_free() can be processed immediately in the common case
MFC after: 1 week
I missed to register zfs_ioc_jail and zfs_ioc_unjail as legacy ioctl's
with the new zfs_ioctl_register_legacy() function.
These operations do not modify pools or datasets so there is no need to
log them to pool history.
Reported by: Alexander Leidinger <ale@FreeBSD.org> and others on current@
MFC after: 3 days
This could happen if a thread doing a page-in loses a ZFS range lock
race to a thread writing to the same range
This fixes "panic: vm_page_alloc: pindex already allocated" in
http://docs.FreeBSD.org/cgi/mid.cgi?1372165971.96049.42.camel
Submitted by: avg
MFC after: 1 week
Restore a previous behavior before r251646, where when destructing
ZFS snapshot, the ioctl would return ENOENT when it hit any of
them in the errlist (the new behavior was only return ENOENT when
all returns error).
Illumos ZFS issues:
3829 fix for 3740 changed behavior of zfs destroy/hold/release ioctl
MFC after: 1 week
ZFS event processing should work on R/O root filesystems
Illumos ZFS issues:
3749 zfs event processing should work on R/O root filesystems
MFC after: 2 weeks
* Illumos ZFS issue #3805 arc shouldn't cache freed blocks
Quote from the Illumos issue:
ZFS should proactively evict freed blocks from the cache.
Even though these freed blocks will never be used again, and thus
will eventually be evicted, this causes us to use memory
inefficiently for 2 reasons:
1. A block that is freed has no chance of being accessed again, but
will be kept in memory preferentially to a block that was accessed
before it (and is thus older) but has not been freed and thus has
at least some chance of being accessed again.
2. We partition the ARC into several buckets:
user data that has been accessed only once (MRU)
metadata that has been accessed only once (MRU)
user data that has been accessed more than once (MFU)
metadata that has been accessed more than once (MFU)
The user data vs metadata split is somewhat arbitrary, and the
primary control on how much memory is used to cache data vs metadata
is to simply try to keep the proportion the same as it has been in the
past (each bucket "evicts against" itself). The secondary control is
to evict data before evicting metadata.
Because of this bucketing, we may end up with one bucket mostly
containing freed blocks that are very old, while another bucket has
more recently accessed, still-allocated blocks. Data in the useful
bucket (with still-allocated blocks) may be evicted in preference to
data in the useless bucket (with old, freed blocks).
On dcenter, we saw that the MFU metadata bucket was 230MB, while the
MFU data bucket was 27GB and the MRU metadata bucket was 256GB.
However, the vast majority of data in the MRU metadata bucket (256GB)
was freed blocks, and thus useless. Meanwhile, the MFU metadata bucket
(230MB) was constantly evicting useful blocks that will be soon needed.
The problem of cache segmentation is a larger problem that needs more
investigation. However, if we stop caching freed blocks, it should
reduce the impact of this more fundamental issue.
MFC after: 2 weeks
* Illumos zfs issue #3137 L2ARC compression
Whether or not to compress buffers entering the L2ARC is
controlled by "compression" setting on the dataset, when
compression is not "off", L2ARC compression is enabled.
The compress method is always LZ4 for L2ARC when enabled
because it works best for the scenario.
MFC after: 2 weeks
Merge vendor bugfix for a possible deadlock related to async destroy
and improve write performance by introducing a new lock protecting
tx_open_txg.
Illumos ZFS issues:
3642 dsl_scan_active() should not issue I/O to determine if async
destroying is active
3643 txg_delay should not hold the tc_lock
MFC after: 1 week
impossible to set quota and reservation on pools lower than version 22.
Problem has been reported and a solution discussed with vendor.
Illumos ZFS issues:
3739 cannot set zfs quota or reservation on pool version < 22
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reported by: Steve Wills <swills@FreeBSD.org>
MFC after: 3 days
Merge bugfixes accepted and integrated by vendor. Underlying problems
have been reported by us and fixed in r240942 and r249196.
Illumos ZFS issues:
3645 dmu_send_impl: possibilty of pool hold leak
3692 Panic on zfs receive of a recursive deduplicated stream
MFC after: 8 days
doesn't copyout in this case.
To solve this issue a new struct zfs_iocparm_t is introduced consisting of:
- zfs_ioctl_version (future backwards compatibility purposes)
- user space pointer to zfs_cmd_t (copyin and copyout)
- size of zfs_cmd_t (verification purposes)
The copyin and copyout of zfs_cmd_t is now done the illumos (vendor) way
what makes porting of new changes easier and ensures correct behavior if
returning an error.
MFC after: 10 days
Do not list read-only pools in zpool.cache
Reduce diff against vendor in unused vdev_disk.c
Illumos ZFS issues:
3639 zpool.cache should skip over readonly pools
3640 want automatic devid updates
MFC after: 1 week
Merge change from vendor to reduce diff only.
ZFS dtrace probes are not supported on FreeBSD yet.
Illumos ZFS issues:
3598 want to dtrace when errors are generated in zfs
MFC after: 3 weeks
Import vendor change to reduce diff, no effect on FreeBSD.
Illumos ZFS issues:
3517 importing pool with autoreplace=on and "hole" vdevs crashes syseventd
Prior to r248571 spa_open was always called with a bare pool name,
but now it is called with a dataset name instead (spa_lookup handles
that).
So, when a ZFS root is mounted spa_open is called with a name of a root
dataset, which can very well be different from the pool name.
But zvol_create_minors should be called with the pool name, because it
performs a recursive traversal of all datasets under the name to find
all those that are volumes.
MFC after: 7 days
This particular scenario was easily reproduced using a NFS export. When the
first 'zfs unmount' occurred, it returned EBUSY via this path, while
vflush() had flushed references on the filesystem's root vnode, which in
turn caused its v_interlock to be destroyed. The next time 'zfs unmount'
was called, vflush() tried to obtain this lock, which caused this panic.
Since vflush() on FreeBSD is a definitive call, there is no need to check
vfsp->vfs_count after it completes. Simply #ifdef sun this check.
Submitted by: avg
Reviewed by: avg
Approved by: ken (mentor)
MFC after: 1 month
Previously TRIM processing was very bursty. This was made worse by the fact
that TRIM requests on SSD's are typically much slower than reads or writes.
This often resulted in stalls while large numbers of TRIM's where processed.
In addition due to the way the TRIM thread was only woken by writes, deletes
could stall in the queue for extensive periods of time.
This patch adds a number of controls to how often the TRIM thread for each
SPA processes its outstanding delete requests.
vfs.zfs.trim.timeout: Delay TRIMs by up to this many seconds
vfs.zfs.trim.txg_delay: Delay TRIMs by up to this many TXGs (reduced to 32)
vfs.zfs.vdev.trim_max_bytes: Maximum pending TRIM bytes for a vdev
vfs.zfs.vdev.trim_max_pending: Maximum pending TRIM segments for a vdev
vfs.zfs.trim.max_interval: Maximum interval between TRIM queue processing
(seconds)
Given the most common TRIM implementation is ATA TRIM the current defaults
are targeted at that.
Reviewed by: pjd (mentor)
Approved by: pjd (mentor)
MFC after: 2 weeks
Currently, the trim module uses the same algorithm for data and cache
devices when deciding to issue TRIM requests, based on how far in the
past the TXG is.
Unfortunately, this is not ideal for cache devices, because the L2ARC
doesn't use the concept of TXGs at all. In fact, when using a pool for
reading only, the L2ARC is written but the TXG counter doesn't
increase, and so no new TRIM requests are issued to the cache device.
This patch fixes the issue by using time instead of the TXG number as
the criteria for trimming on cache devices. The basic delay principle
stays the same, but parameters are expressed in seconds instead of
TXGs. The new parameters are named trim_l2arc_limit and
trim_l2arc_batch, and both default to 30 second.
Reviewed by: pjd (mentor)
Approved by: pjd (mentor)
Obtained from: 17122c31ac
MFC after: 2 weeks
This patch adds some improvements to the way the trim module considers
TXGs:
- Free ZIOs are registered with the TXG from the ZIO itself, not the
current SPA syncing TXG (which may be out of date);
- L2ARC are registered with a zero TXG number, as L2ARC has no concept
of TXGs;
- The TXG limit for issuing TRIMs is now computed from the last synced
TXG, not the currently syncing TXG. Indeed, under extremely unlikely
race conditions, there is a risk we could trim blocks which have been
freed in a TXG that has not finished syncing, resulting in potential
data corruption in case of a crash.
Reviewed by: pjd (mentor)
Approved by: pjd (mentor)
Obtained from: 5b46ad40d9
MFC after: 2 weeks
The trim map inflight writes tree assumes non-conflicting writes, i.e.
that there will never be two simultaneous write I/Os to the same range
on the same vdev. This seemed like a sane assumption; however, in
actual testing, it appears that repair I/Os can very well conflict
with "normal" writes.
I'm not quite sure if these conflicting writes are supposed to happen
or not, but in the mean time, let's ignore repair writes for now. This
should be safe considering that, by definition, we never repair blocks
that are freed.
Reviewed by: pjd (mentor)
Approved by: pjd (mentor)
Obtained from: Source: 6a3cebaf7c
This adds TRIM support to cache vdevs. When ARC buffers are removed
from the L2ARC in arc_hdr_destroy(), arc_release() or l2arc_evict(),
the size previously occupied by the buffer gets scheduled for TRIMming.
As always, actual TRIMs are only issued to the L2ARC after
txg_trim_limit.
Reviewed by: pjd (mentor)
Approved by: pjd (mentor)
Obtained from: 31aae37399
MFC after: 2 weeks