Commit Graph

50 Commits

Author SHA1 Message Date
Andriy Gapon
7bc07f0575 fix a serious bug in r258632: offset parameter must be set in zio
In illumos all ioctl zio-s are "global" at the moment.  That is they act
on a whole disk, e.g. a cache flush command, and thus do not need either
offset or size parameters.
FreeBSD, on the other hand, has support for TRIM command and that
command requires proper offset and size parameters.
Without this fix all TRIM commands act on the start of any disk or
partition used by ZFS destroying any data there.

Pointyhat to:	avg
Tested by:	sbruno
MFC after:	3 days
X-MFC with:	r258632
Sponsored by:	HybridCluster
2013-11-28 08:48:49 +00:00
Andriy Gapon
fd51e905e2 MFV r255256: 3954 metaslabs continue to load even after hitting zfs_mg_alloc_failure limit
4080 zpool clear fails to clear pool
4081 need zfs_mg_noalloc_threshold

illumos/illumos-gate@22e30981d8

MFC after:	10 days
Sponsored by:	HybridCluster [merge]
2013-11-26 10:02:02 +00:00
Andriy Gapon
2a4704ab01 MFV r255255: 4045 zfs write throttle & i/o scheduler performance work
illumos/illumos-gate@69962b5647

Please note the following changes:
- zio_ioctl has lost its priority parameter and now TRIM is executed
  with 'now' priority
- some knobs are gone and some new knobs are added; not all of them are
  exposed as tunables / sysctls yet

MFC after:	10 days
Sponsored by:	HybridCluster [merge]
2013-11-26 09:57:14 +00:00
Andriy Gapon
fb8171c240 MFV r247578: 3581 spa_zio_taskq[ZIO_TYPE_FREE][ZIO_TASKQ_ISSUE]->tq_lock is piping hot
illumos/illumos-gate@ec94d32216

MFC after:	9 days
Sponsored by:	HybridCluster [merge]
2013-11-26 09:45:48 +00:00
Andriy Gapon
34140e78ab 734 taskq_dispatch_prealloc() desired
943 zio_interrupt ends up calling taskq_dispatch with TQ_SLEEP
illumos/illumos-gate@5aeb94743e

Essentially FreeBSD taskqueues already operate in a mode that
was added to Illumos with taskq_dispatch_ent change.
We even exposed the superior FreeBSD interface as taskq_dispatch_safe.
Now we just rename taskq_dispatch_safe to taskq_dispatch_ent and
struct struct ostask to taskq_ent_t, so that code differences will be
minimal.

After this change sys/cddl/compat/opensolaris/sys/taskq.h header is no
longer needed.

Note that this commit is not an MFV because the upstream change was not
individually committed to the vendor area.

MFC after:	8 days
2013-11-26 09:26:18 +00:00
Alexander Motin
c5068af559 Reenable vfs.zfs.zio.use_uma for amd64, disabled at r209261.
On machines with seveal CPUs and enough RAM this can easily twice improve
ZFS performance or twice reduce CPU usage.  It was disabled three years
ago due to memory and KVA exhaustion reports, but our VM subsystem got
improved a lot since that time, hopefully enough to make another try.
2013-11-19 11:19:07 +00:00
Justin T. Gibbs
439d30d121 Enhance the ZFS vdev layer to maintain both a logical and a physical
minimum allocation size for devices.  Use this information to
automatically increase ZFS's minimum allocation size for new top-level
vdevs to a value that more closely matches the optimum device
allocation size.

Use GEOM's stripesize attribute, if set, as the physical sector
size of the GEOM.

Calculate the minimum blocksize of each metaslab class.  Use the
calculated value instead of SPA_MINBLOCKSIZE (512b) when determining
the likelyhood of compression yeilding a reduction in physical space
usage.

Report devices with sub-optimal block size configuration in "zpool
status".  Also properly fail attempts to attach devices with a
logical block size greater than 8kB, since this will cause corruption
to ZFS's label area.

Sponsored by:	Spectra Logic Corporaion
MFC after:	2 weeks

Background
==========
Many modern devices use physical allocation units that are much
larger than the minimum logical allocation size accessible by
external commands.  Two prevalent examples of this are 512e disk
drives (512b logical sector, 4K physical sector) and flash devices
(512b logical sector, 4K or larger allocation block size, and 128k
or larger erase block size).  Operations that modify less than the
physical sector size result in a costly read-modify-write or garbage
collection sequence on these devices.

Simply exporting the true physical sector of the device to ZFS would
yield optimal performance, but has two serious drawbacks:

1) Existing pools created with devices that have different logical
   and physical block sizes, but were configured to use the logical
   block size (e.g. because the OS version used for pool construction
   reported the logical block size instead of the physical block
   size) will suddenly find that the vdev allocation size has
   increased.  This can be easily tolerated for active members of
   the array, but ZFS would prevent replacement of a vdev with
   another identical device because it now appears that the smaller
   allocation size required by the pool is not supported by the new
   device.

2) The device's physical block size may be too large to be supported
   by ZFS.  The optimal allocation size for the vdev may be quite
   large.  For example, a RAID controller may export a vdev that
   requires read-modify-write cycles unless accessed using 64k
   aligned/sized requests.  ZFS currently has an 8k minimum block
   size limit.

Reporting both the logical and physical allocation sizes for vdevs
solves these problems.  A device may be used so long as the logical
block size is compatible with the configuration.  By comparing the
logical and physical block sizes, new configurations can be optimized
and administrators can be notified of any existing pools that are
sub-optimal.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa.h:
	Add the SPA_ASHIFT constant.  ZFS currently has a hard upper
	limit of 13 (8k) for ashift and this constant is used to
	both document and enforce this limit.

sys/cddl/contrib/opensolaris/uts/common/sys/fs/zfs.h:
	Add the VDEV_AUX_ASHIFT_TOO_BIG error code.

	Add fields for exporting the configured, logical, and
	physical ashift to the vdev_stat_t structure.

	Add VDEV_STAT_VALID() macro which can be used to verify the
	presence of required vdev_stat_t fields in nvlist data.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c:
	Provide a SYSCTL_PROC handler for "max_auto_ashift".  Since
	the limit is only referenced long after boot when a create
	operation occurs, there's no compelling need for it to be
	a boot time configurable tunable.  This also allows the
	validation code for the max_auto_ashift value to be contained
	within the sysctl handler.

	Populate the new fields in the vdev_stat_t structure.

	Fail vdev opens if the vdev reports an ashift larger than
	SPA_MAXASHIFT.

	Propogate vdev_logical_ashift and vdev_physical_ashift between
	child and parent vdevs as is done for vdev_ashift.

	In vdev_open(), restore code that fails opens for devices
	where vdev_ashift grows.  This can only happen now if the
	device's logical ashift grows, which means it really isn't
	safe to use the device.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_file.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_mirror.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_missing.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_root.c:
	Update the vdev_open() API so that both logical (what was
	just ashift before) and physical ashift are reported.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h:
	Add two new fields, vdev_physical_ashift and vdev_logical_ashift,
	to vdev_t.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_config.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:
	Add vdev_ashift_optimize().  Call it anytime a new top-level
	vdev is allocated.

cddl/contrib/opensolaris/cmd/zpool/zpool_main.c:
	Add text for the VDEV_AUX_ASHIFT_TOO_BIG error.

	For each sub-optimally configured leaf vdev, report configured
	and native block sizes.

cddl/contrib/opensolaris/cmd/zpool/zpool_main.c:
cddl/contrib/opensolaris/lib/libzfs/common/libzfs.h:
cddl/contrib/opensolaris/lib/libzfs/common/libzfs_status.c:
	Introduce a new zpool status: ZPOOL_STATUS_NON_NATIVE_ASHIFT.
	This status is reported on healthy pools containing vdevs
	configured to use a block size smaller than their reported
	physical block size.

cddl/contrib/opensolaris/lib/libzfs/common/libzfs_status.c:
	Update find_vdev_problem() and supporting functions to
	provide the full vdev_stat_t structure to problem checking
	routines, and to allow decent into replacing vdevs.

	Add a vdev_non_native_ashift() validator which is used on
	the full vdev tree to check for ZPOOL_STATUS_NON_NATIVE_ASHIFT.

cddl/contrib/opensolaris/lib/libzpool/common/kernel.c:
cddl/contrib/opensolaris/lib/libzpool/common/sys/zfs_context.h:
	Enhance sysctl userland stubs now that a SYSCTL_PROC handler
	is used in vdev.c.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab_impl.h:
	When the group membership of a metaslab class changes (i.e.
	when a vdev is added or removed from a pool), walk the group
	list to determine the smallest block size currently available
	and record this in the metaslab class.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c:
	Add the metaslab_class_get_minblocksize() accessor.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio_compress.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio_compress.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:
	In zio_compress_data(), take the minimum blocksize as an
	input parameter instead of assuming SPA_MINBLOCKSIZE.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:
	In l2arc_compress_buf(), pass SPA_MINBLOCKSIZE as the minimum
	blocksize of the device.  The l2arc code performs has it's own
	code for deciding if compression is worth while, so this
	effectively disables zio_compress_data() from second guessing
	the original decision.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:
	In zio_write_bp_init(), use the minimum blocksize of the
	normal metaslab class when compressing data.
2013-08-21 04:10:24 +00:00
Alexander Motin
f8dcf872c4 Disable r252840 when ZFS TRIM is enabled (vfs.zfs.trim.enabled=1) and really
disable TRIM otherwise.

r252840 (illumos bug 3836) is based on assumption that zio_free_sync() has
no lock dependencies and should complete immediately. Unfortunately, with our
TRIM implementation that is not true due to ZIO_STAGE_VDEV_IO_START added
to the ZIO_FREE_PIPELINE, which, while not really accessing devices, still
acquires SCL_ZIO lock for read to be sure devices won't disappear.

When TRIM is disabled, this patch enables direct free execution from r252840
and removes ZIO_STAGE_VDEV_IO_START and ZIO_STAGE_VDEV_IO_ASSESS stages from
the pipeline to avoid lock acquisition.  Otherwise it queues free request as
it was before r252840.
2013-08-06 14:30:28 +00:00
Martin Matuska
12df7d65b0 MFV r252839:
Quoting illumos issue #3836:
  Currently zio_free() always puts the zio on a list for subsequent
  processing by zio_free_sync().  This is only necessary for frees that
  might need to issue reads (gang and dedup blocks).

  By processing the majority of the frees as we encounter them, we reduce
  the amount of time that the spa_sync() thread spends burning CPU and
  not doing any i/o, thus increasing the overall write throughput of the
  system.

Illumos ZFS issues:
  3836 zio_free() can be processed immediately in the common case

MFC after:	1 week
2013-07-05 21:29:59 +00:00
Xin LI
a91afe8a8d MFV r251620:
ZFS comments need cleaner, more consistent style

Illumos ZFS issues:
  3741 zfs comments need cleaner, more consistent style

MFC after:      2 weeks
2013-06-11 19:12:06 +00:00
Xin LI
9e43a32a5c MFV r251519:
* Illumos ZFS issue #3805 arc shouldn't cache freed blocks

Quote from the Illumos issue:

    ZFS should proactively evict freed blocks from the cache.

    Even though these freed blocks will never be used again, and thus
    will eventually be evicted, this causes us to use memory
    inefficiently for 2 reasons:

    1. A block that is freed has no chance of being accessed again, but
       will be kept in memory preferentially to a block that was accessed
       before it (and is thus older) but has not been freed and thus has
       at least some chance of being accessed again.

    2. We partition the ARC into several buckets:
       user data that has been accessed only once (MRU)
       metadata that has been accessed only once (MRU)
       user data that has been accessed more than once (MFU)
       metadata that has been accessed more than once (MFU)

    The user data vs metadata split is somewhat arbitrary, and the
    primary control on how much memory is used to cache data vs metadata
    is to simply try to keep the proportion the same as it has been in the
    past (each bucket "evicts against" itself).  The secondary control is
    to evict data before evicting metadata.

    Because of this bucketing, we may end up with one bucket mostly
    containing freed blocks that are very old, while another bucket has
    more recently accessed, still-allocated blocks.  Data in the useful
    bucket (with still-allocated blocks) may be evicted in preference to
    data in the useless bucket (with old, freed blocks).

    On dcenter, we saw that the MFU metadata bucket was 230MB, while the
    MFU data bucket was 27GB and the MRU metadata bucket was 256GB.
    However, the vast majority of data in the MRU metadata bucket (256GB)
    was freed blocks, and thus useless.  Meanwhile, the MFU metadata bucket
    (230MB) was constantly evicting useful blocks that will be soon needed.

    The problem of cache segmentation is a larger problem that needs more
    investigation.  However, if we stop caching freed blocks, it should
    reduce the impact of this more fundamental issue.

MFC after:	2 weeks
2013-06-08 09:11:20 +00:00
Davide Italiano
a28d25df31 In case ZFS doesn't use UMA for buffers there's no need to waste memory
creating zones that will remain empty.

Reviewed by:	pjd
2013-05-01 17:34:44 +00:00
Martin Matuska
f1b5c26470 MFV r248217:
Merge change from vendor to reduce diff only.
ZFS dtrace probes are not supported on FreeBSD yet.

Illumos ZFS issues:
  3598 want to dtrace when errors are generated in zfs

MFC after:	3 weeks
2013-04-06 10:39:38 +00:00
Steven Hartland
78ad0c1c80 Improve TXG handling in the TRIM module.
This patch adds some improvements to the way the trim module considers
TXGs:

 - Free ZIOs are registered with the TXG from the ZIO itself, not the
   current SPA syncing TXG (which may be out of date);
 - L2ARC are registered with a zero TXG number, as L2ARC has no concept
   of TXGs;
 - The TXG limit for issuing TRIMs is now computed from the last synced
   TXG, not the currently syncing TXG. Indeed, under extremely unlikely
   race conditions, there is a risk we could trim blocks which have been
   freed in a TXG that has not finished syncing, resulting in potential
   data corruption in case of a crash.

Reviewed by:	pjd (mentor)
Approved by:	pjd (mentor)
Obtained from:	5b46ad40d9
MFC after:	2 weeks
2013-03-21 10:16:10 +00:00
Steven Hartland
e07e3a3792 Don't register repair writes in the trim map.
The trim map inflight writes tree assumes non-conflicting writes, i.e.
that there will never be two simultaneous write I/Os to the same range
on the same vdev. This seemed like a sane assumption; however, in
actual testing, it appears that repair I/Os can very well conflict
with "normal" writes.

I'm not quite sure if these conflicting writes are supposed to happen
or not, but in the mean time, let's ignore repair writes for now. This
should be safe considering that, by definition, we never repair blocks
that are freed.

Reviewed by:	pjd (mentor)
Approved by:	pjd (mentor)
Obtained from:	Source: 6a3cebaf7c
2013-03-21 10:02:32 +00:00
Steven Hartland
e05aad2d33 Add TRIM support for L2ARC.
This adds TRIM support to cache vdevs. When ARC buffers are removed
from the L2ARC in arc_hdr_destroy(), arc_release() or l2arc_evict(),
the size previously occupied by the buffer gets scheduled for TRIMming.
As always, actual TRIMs are only issued to the L2ARC after
txg_trim_limit.

Reviewed by:	pjd (mentor)
Approved by:	pjd (mentor)
Obtained from:	31aae37399
MFC after:	2 weeks
2013-03-21 09:34:41 +00:00
Martin Matuska
07091d8f14 MFV r247580:
Merge synctask code restructuring from vendor.

Modify forward and backward compatibility to support new change.

Illumos ZFS issues:
  3464 zfs synctask code needs restructuring

Sponsored by:	Hybrid Logic Ltd.
2013-03-19 12:51:18 +00:00
Martin Matuska
ff20578569 MFV r246392:
Import vendor ZFS bugfix fixing a possible deadlock in arc_read().

Illumos ZFS issues:
  3498 panic in arc_read(): !refcount_is_zero(&pbuf->b_hdr->b_refcnt)

References:
  https://www.illumos.org/issues/3498

MFC after:	2 weeks
2013-02-11 12:42:11 +00:00
Steven Hartland
c440a359ca Upgrades trim free request sizes before inserting them into to free map,
making range consolidation much more effective particularly for small
deletes.

This reduces memory used by the free map as well as reducing the number
of bio requests down to geom required to process all deletes.

In tests this achieved a factor of 10 reduction of trim ranges / geom
call downs.

While I'm here correct the description of zio_vdev_io_start.

PR:		kern/173254
Submitted by:	Steven Hartland
Approved by:	pjd (mentor)
2012-12-13 17:06:38 +00:00
Steven Hartland
7150222c0a Renamed zfs trim stats removing duplicate zio_trim identifier from the name
Added description option to kstats.
Added descriptions for zio_trim kstats

PR:		kern/173113
Submitted by:	Steven Hartland
Reviewed by:	pjd
Approved by:	pjd
MFC after:	2 weeks
2012-12-12 16:14:14 +00:00
Martin Matuska
dd801aa546 MFV r243013 and r243267:
Import the zio nop-write improvement from Illumos. To reduce I/O,
nop-write omits overwriting data if the checksum (cryptographically
secure) of new data matches the checksum of existing data.
It also saves space if snapshots are in use.

It currently works only on datasets with enabled compression, disabled
deduplication and sha256 checksums.

IllumOS 13887:196932ec9e6a and 13888:7204b3392a58
3236 zio nop-write

References:
https://www.illumos.org/issues/3236

MFC after:	2 weeks
2012-11-25 16:32:07 +00:00
Martin Matuska
2b8d4033cc MFV r242735:
Illumos 13879:4eac7a87eff2:
3329 spa_sync() spends 10-20% of its time in spa_free_sync_cb()
3330 space_seg_t should have its own kmem_cache
3331 deferred frees should happen after sync_pass 1
3335 make SYNC_PASS_* constants tunable

New loader-only tunables:
vfs.zfs.sync_pass_deferred_free
vfs.zfs.sync_pass_dont_compress
vfs.zfs.sync_pass_rewrite

References:
https://www.illumos.org/issues/3329
https://www.illumos.org/issues/3330
https://www.illumos.org/issues/3331
https://www.illumos.org/issues/3335

MFC after:	2 weeks
2012-11-25 09:06:32 +00:00
Pawel Jakub Dawidek
bcb77be2b7 Add TRIM support.
The code builds a map of regions that were freed. On every write the
code consults the map and eventually removes ranges that were freed
before, but are now overwritten.

Freed blocks are not TRIMed immediately. There is a tunable that defines
how many txg we should wait with TRIMming freed blocks (64 by default).

There is a low priority thread that TRIMs ranges when the time comes.
During TRIM we keep in-flight ranges on a list to detect colliding
writes - we have to delay writes that collide with in-flight TRIMs in
case something will be reordered and write will reached the disk before
the TRIM. We don't have to do the same for in-flight writes, as
colliding writes just remove ranges to TRIM.

Sponsored by:	multiplay.co.uk

This work includes some important fixes and some improvements obtained
from the zfsonlinux project, including TRIMming entire vdevs on pool
create/add/attach and on pool import for spare and cache vdevs.

Obtained from:	zfsonlinux
Submitted by:	Etienne Dechamps <etienne.dechamps@ovh.net>
2012-09-23 19:40:58 +00:00
Martin Matuska
4c5238d576 Merge recent zfs vendor changes, sync code and adjust userland DEBUG.
Illumos issued covered:
1884 Empty "used" field for zfs *space commands
3006 VERIFY[S,U,P] and ASSERT[S,U,P] frequently check if first argument
     is zero
3028 zfs {group,user}space -n prints (null) instead of numeric GID/UID
3048 zfs {user,group}space [-s|-S] is broken
3049 zfs {user,group}space -t doesn't really filter the results
3060 zfs {user,group}space -H output isn't tab-delimited
3061 zfs {user,group}space -o doesn't use specified fields order
3064 usr/src/cmd/zpool/zpool_main.c misspells "successful"
3093 zfs {user,group}space's -i is noop
3098 zfs userspace/groupspace fail without saying why when run as non-root

References:
  https://www.illumos.org/issues/ + [issue_id]

Obtained from:	illumos (vendor/illumos, vendor/illumos-sys)
MFC after:	2 weeks
2012-09-12 18:05:43 +00:00
Martin Matuska
4a24a25b2f Merge recent vendor changes and sync code:
1862 incremental zfs receive fails for sparse file > 8PB
3112 ztest does not honor ZFS_DEBUG
3122 zfs destroy filesystem should prefetch blocks
3129 'zpool reopen' restarts resilvers
3130 ztest failure: Assertion failed:
       0 == dmu_objset_destroy(name, B_FALSE) (0x0 == 0x10)

References:
  https://www.illumos.org/issues/1862
  https://www.illumos.org/issues/3112
  https://www.illumos.org/issues/3122
  https://www.illumos.org/issues/3129
  https://www.illumos.org/issues/3130

Obtained from:	illumos (vendor/illumos, vendor/illumos-sys)
MFC after:	2 weeks
2012-09-05 12:02:09 +00:00
Martin Matuska
2d9cf57e18 Introduce "feature flags" for ZFS pools (bump SPA version to 5000).
Add first feature "com.delphix:async_destroy" (asynchronous destroy
of ZFS datasets).
Implement features support in ZFS boot code.

Illumos revisions merged:
13700:2889e2596bd6
13701:1949b688d5fb
2619 asynchronous destruction of ZFS file systems
2747 SPA versioning with zfs feature flags

References:
https://www.illumos.org/issues/2619
https://www.illumos.org/issues/2747

Obtained from:	illumos (issue #2619, #2747)
MFC after:	1 month
2012-06-11 11:35:22 +00:00
Kip Macy
ad69d4e266 always exclude data bufs regardless of debug settings 2012-01-29 00:19:19 +00:00
Kip Macy
cc0021eb34 add tunable for developers working on areas outside of ZFS
to further reduce core size by excluding ARC metadata buffers
from core dumps
2012-01-28 17:41:42 +00:00
Kip Macy
263811f724 exclude kmem_alloc'ed ARC data buffers from kernel minidumps on amd64
excluding other allocations including UMA now entails the addition of
a single flag to kmem_alloc or uma zone create

Reviewed by:	alc, avg
MFC after:	2 weeks
2012-01-27 20:18:31 +00:00
Martin Matuska
538251bbf6 Merge illumos revisions 13572, 13573, 13574:
Rev. 13572:
disk sync write perf regression when slog is used post oi_148 [1]

Rev. 13573:
crash during reguid causes stale config [2]
allow and unallow missing from zpool history since removal of pyzfs [5]

Rev. 13574:
leaking a vdev when removing an l2cache device [3]
memory leak when adding a file-based l2arc device [4]
leak in ZFS from metaslab_group_create and zfs_ereport_checksum [6]

References:
https://www.illumos.org/issues/1909 [1]
https://www.illumos.org/issues/1949 [2]
https://www.illumos.org/issues/1951 [3]
https://www.illumos.org/issues/1952 [4]
https://www.illumos.org/issues/1953 [5]
https://www.illumos.org/issues/1954 [6]

Obtained from:	illumos (issues #1909, #1949, #1951, #1952, #1953, #1954)
MFC after:	2 weeks
2012-01-24 23:09:54 +00:00
Martin Matuska
1bc399c4b1 ZFS tries to allocate blocks evenly across all devices. This means when
devices are imbalanced zfs will lots of CPU searching for space on devices
which tend to be pretty full. It should instead fail quickly on the full
devices and move onto devices which have more availability.

New loader tunable: vfs.zfs.mg_alloc_failures (min = 8)

Illumos-gate changeset:	13379:4df42cc92254

Obtained from:	Illumos (Bug #1051)
MFC after:	2 weeks
2011-07-18 08:29:49 +00:00
Pawel Jakub Dawidek
541c60d988 Don't access task structure once we call task function.
The task structure might be no longer available.
This also allows to eliminates the need for two tasks in the zio structure.

Submitted by:	anonymous
MFC after:	2 weeks
2011-05-24 20:07:15 +00:00
Pawel Jakub Dawidek
10b9d77bf1 Finally... Import the latest open-source ZFS version - (SPA) 28.
Few new things available from now on:

- Data deduplication.
- Triple parity RAIDZ (RAIDZ3).
- zfs diff.
- zpool split.
- Snapshot holds.
- zpool import -F. Allows to rewind corrupted pool to earlier
  transaction group.
- Possibility to import pool in read-only mode.

MFC after:	1 month
2011-02-27 19:41:40 +00:00
Martin Matuska
df06a59a77 MFp4 r186485, r186859:
Fix a race by defining two tasks in the zio structure
as we can still be returning from issue task when interrupt task is used.

Tested by:	pjd
Approved by:	pjd, delphij (mentor)
MFC after:	3 days
2011-01-03 12:57:07 +00:00
Martin Matuska
aa007a9f0e Properly handle IO with B_FAILFAST
Retry IO once with ZIO_FLAG_TRYHARD before declaring a pool faulted

OpenSolaris revision and Bug IDs:

9725:0bf7402e8022
6843014 ZFS B_FAILFAST handling is broken

Approved by:	delphij (mentor)
Obtained from:	OpenSolaris (Bug ID 6843014)
MFC after:	3 weeks
2010-09-27 09:42:31 +00:00
Martin Matuska
abe5837f7c Update ZFS metaslab code from OpenSolaris.
This provides a noticeable write speedup, especially on pools with
less than 30% of free space.

Detailed information (OpenSolaris onnv changesets and Bug IDs):

11146:7e58f40bcb1c
6826241	Sync write IOPS drops dramatically during TXG sync
6869229	zfs should switch to shiny new metaslabs more frequently

11728:59fdb3b856f6
6918420	zdb -m has issues printing metaslab statistics

12047:7c1fcc8419ca
6917066	zfs block picking can be improved

Approved by:	delphij (mentor)
Obtained from:	OpenSolaris (Bug ID 6826241, 6869229, 6918420, 6917066)
MFC after:	2 weeks
2010-08-28 08:59:55 +00:00
Martin Matuska
8fc257994d Merge ZFS version 15 and almost all OpenSolaris bugfixes referenced
in Solaris 10 updates 141445-09 and 142901-14.

Detailed information:
(OpenSolaris revisions and Bug IDs, Solaris 10 patch numbers)

7844:effed23820ae
6755435	zfs_open() and zfs_close() needs to use ZFS_ENTER/ZFS_VERIFY_ZP (141445-01)

7897:e520d8258820
6748436	inconsistent zpool.cache in boot_archive could panic a zfs root filesystem upon boot-up (141445-01)

7965:b795da521357
6740164	zpool attach can create an illegal root pool (141909-02)

8084:b811cc60d650
6769612	zpool_import() will continue to write to cachefile even if altroot is set (N/A)

8121:7fd09d4ebd9c
6757430	want an option for zdb to disable space map loading and leak tracking (141445-01)

8129:e4f45a0bfbb0
6542860	ASSERT: reason != VDEV_LABEL_REMOVE||vdev_inuse(vd, crtxg, reason, 0) (141445-01)

8188:fd00c0a81e80
6761100	want zdb option to select older uberblocks (141445-01)

8190:6eeea43ced42
6774886	zfs_setattr() won't allow ndmp to restore SUNWattr_rw (141445-01)

8225:59a9961c2aeb
6737463	panic while trying to write out config file if root pool import fails (141445-01)

8227:f7d7be9b1f56
6765294	Refactor replay (141445-01)

8228:51e9ca9ee3a5
6572357	libzfs should do more to avoid mnttab lookups (141909-01)
6572376	zfs_iter_filesystems and zfs_iter_snapshots get objset stats twice (141909-01)

8241:5a60f16123ba
6328632	zpool offline is a bit too conservative (141445-01)
6739487	ASSERT: txg <= spa_final_txg due to scrub/export race (141445-01)
6767129	ASSERT: cvd->vdev_isspare, in spa_vdev_detach() (141445-01)
6747698	checksum failures after offline -t / export / import / scrub (141445-01)
6745863	ZFS writes to disk after it has been offlined (141445-01)
6722540	50% slowdown on scrub/resilver with certain vdev configurations (141445-01)
6759999	resilver logic rewrites ditto blocks on both source and destination (141445-01)
6758107	I/O should never suspend during spa_load() (141445-01)
6776548	codereview(1) runs off the page when faced with multi-line comments (N/A)
6761406	AMD errata 91 workaround doesn't work on 64-bit systems (141445-01)

8242:e46e4b2f0a03
6770866	GRUB/ZFS should require physical path or devid, but not both (141445-01)

8269:03a7e9050cfd
6674216	"zfs share" doesn't work, but "zfs set sharenfs=on" does (141445-01)
6621164	$SRC/cmd/zfs/zfs_main.c seems to have a syntax error in the translation note (141445-01)
6635482	i18n problems in libzfs_dataset.c and zfs_main.c (141445-01)
6595194	"zfs get" VALUE column is as wide as NAME (141445-01)
6722991	vdev_disk.c: error checking for ddi_pathname_to_dev_t() must test for NODEV (141445-01)
6396518	ASSERT strings shouldn't be pre-processed (141445-01)

8274:846b39508aff
6713916	scrub/resilver needlessly decompress data (141445-01)

8343:655db2375fed
6739553	libzfs_status msgid table is out of sync (141445-01)
6784104	libzfs unfairly rejects numerical values greater than 2^63 (141445-01)
6784108	zfs_realloc() should not free original memory on failure (141445-01)

8525:e0e0e525d0f8
6788830	set large value to reservation cause core dump (141445-01)
6791064	want sysevents for ZFS scrub (141445-01)
6791066	need to be able to set cachefile on faulted pools (141445-01)
6791071	zpool_do_import() should not enable datasets on faulted pools (141445-01)
6792134	getting multiple properties on a faulted pool leads to confusion (141445-01)

8547:bcc7b46e5ff7
6792884	Vista clients cannot access .zfs (141445-01)

8632:36ef517870a3
6798384	It can take a village to raise a zio (141445-01)

8636:7e4ce9158df3
6551866	deadlock between zfs_write(), zfs_freesp(), and zfs_putapage() (141909-01)
6504953	zfs_getpage() misunderstands VOP_GETPAGE() interface (141909-01)
6702206	ZFS read/writer lock contention throttles sendfile() benchmark (141445-01)
6780491	Zone on a ZFS filesystem has poor fork/exec performance (141445-01)
6747596	assertion failed: DVA_EQUAL(BP_IDENTITY(&zio->io_bp_orig), BP_IDENTITY(zio->io_bp))); (141445-01)

8692:692d4668b40d
6801507	ZFS read aggregation should not mind the gap (141445-01)

8697:e62d2612c14d
6633095	creating a filesystem with many properties set is slow (141445-01)

8768:dfecfdbb27ed
6775697	oracle crashes when overwriting after hitting quota on zfs (141909-01)

8811:f8deccf701cf
6790687	libzfs mnttab caching ignores external changes (141445-01)
6791101	memory leak from libzfs_mnttab_init (141445-01)

8845:91af0d9c0790
6800942	smb_session_create() incorrectly stores IP addresses (N/A)
6582163	Access Control List (ACL) for shares (141445-01)
6804954	smb_search - shortname field should be space padded following the NULL terminator (N/A)
6800184	Panic at smb_oplock_conflict+0x35() (N/A)

8876:59d2e67b4b65
6803822	Reboot after replacement of system disk in a ZFS mirror drops to grub> prompt (141445-01)

8924:5af812f84759
6789318	coredump when issue zdb -uuuu poolname/ (141445-01)
6790345 zdb -dddd -e poolname coredump (141445-01)
6797109 zdb: 'zdb -dddddd pool_name/fs_name inode' coredump if the file with inode was deleted (141445-01)
6797118 zdb: 'zdb -dddddd poolname inum' coredump if I miss the fs name (141445-01)
6803343 shareiscsi=on failed, iscsitgtd failed request to share (141445-01)

9030:243fd360d81f
6815893	hang mounting a dataset after booting into a new boot environment (141445-01)

9056:826e1858a846
6809691	'zpool create -f' no longer overwrites ufs infomation (141445-01)

9179:d8fbd96b79b3
6790064	zfs needs to determine uid and gid earlier in create process (141445-01)

9214:8d350e5d04aa
6604992	forced unmount + being in .zfs/snapshot/<snap1> = not happy (141909-01)
6810367	assertion failed: dvp->v_flag & VROOT, file: ../../common/fs/gfs.c, line: 426 (141909-01)

9229:e3f8b41e5db4
6807765	ztest_dsl_dataset_promote_busy needs to clean up after ENOSPC (141445-01)

9230:e4561e3eb1ef
6821169	offlining a device results in checksum errors (141445-01)
6821170	ZFS should not increment error stats for unavailable devices (141445-01)
6824006	need to increase issue and interrupt taskqs threads in zfs (141445-01)

9234:bffdc4fc05c4
6792139	recovering from a suspended pool needs some work (141445-01)
6794830	reboot command hangs on a failed zfs pool (141445-01)

9246:67c03c93c071
6824062	System panicked in zfs_mount due to NULL pointer dereference when running btts and svvs tests (141909-01)

9276:a8a7fc849933
6816124	System crash running zpool destroy on broken zpool (141445-03)

9355:09928982c591
6818183	zfs snapshot -r is slow due to set_snap_props() doing txg_wait_synced() for each new snapshot (141445-03)

9391:413d0661ef33
6710376	log device can show incorrect status when other parts of pool are degraded (141445-03)

9396:f41cf682d0d3 (part already merged)
6501037	want user/group quotas on ZFS (141445-03)
6827260	assertion failed in arc_read(): hdr == pbuf->b_hdr (141445-03)
6815592	panic: No such hold X on refcount Y from zfs_znode_move (141445-03)
6759986	zfs list shows temporary %clone when doing online zfs recv (141445-03)

9404:319573cd93f8
6774713	zfs ignores canmount=noauto when sharenfs property != off (141445-03)

9412:4aefd8704ce0
6717022	ZFS DMU needs zero-copy support (141445-03)

9425:e7ffacaec3a8
6799895	spa_add_spares() needs to be protected by config lock (141445-03)
6826466	want to post sysevents on hot spare activation (141445-03)
6826468	spa 'allowfaulted' needs some work (141445-03)
6826469	kernel support for storing vdev FRU information (141445-03)
6826470	skip posting checksum errors from DTL regions of leaf vdevs (141445-03)
6826471	I/O errors after device remove probe can confuse FMA (141445-03)
6826472	spares should enjoy some of the benefits of cache devices (141445-03)

9443:2a96d8478e95
6833711	gang leaders shouldn't have to be logical (141445-03)

9463:d0bd231c7518
6764124	want zdb to be able to checksum metadata blocks only (141445-03)

9465:8372081b8019
6830237	zfs panic in zfs_groupmember() (141445-03)

9466:1fdfd1fed9c4
6833162	phantom log device in zpool status (141445-03)

9469:4f68f041ddcd
6824968	add ZFS userquota support to rquotad (141445-03)

9470:6d827468d7b5
6834217	godfather I/O should reexecute (141445-03)

9480:fcff33da767f
6596237	Stop looking and start ganging (141909-02)

9493:9933d599bc93
6623978	lwb->lwb_buf != NULL, file ../../../uts/common/fs/zfs/zil.c, line 787, function zil_lwb_commit (141445-06)

9512:64cafcbcc337
6801810	Commit of aligned streaming rewrites to ZIL device causes unwanted disk reads (N/A)

9515:d3b739d9d043
6586537	async zio taskqs can block out userland commands (142901-09)

9554:787363635b6a
6836768	zfs_userspace() callback has no way to indicate failure (N/A)

9574:1eb6a6ab2c57
6838062	zfs panics when an error is encountered in space_map_load() (141909-02)

9583:b0696cd037cc
6794136	Panic BAD TRAP: type=e when importing degraded zraid pool. (141909-03)

9630:e25a03f552e0
6776104	"zfs import" deadlock between spa_unload() and spa_async_thread() (141445-06)

9653:a70048a304d1
6664765	Unable to remove files when using fat-zap and quota exceeded on ZFS filesystem (141445-06)

9688:127be1845343
6841321	zfs userspace / zfs get userused@ doesn't work on mounted snapshot (N/A)
6843069	zfs get userused@S-1-... doesn't work (N/A)

9873:8ddc892eca6e
6847229	assertion failed: refcount_count(&tx->tx_space_written) + delta <= tx->tx_space_towrite in dmu_tx.c (141445-06)

9904:d260bd3fd47c
6838344	kernel heap corruption detected on zil while stress testing (141445-06)

9951:a4895b3dd543
6844900	zfs_ioc_userspace_upgrade leaks (N/A)

10040:38b25aeeaf7a
6857012	zfs panics on zpool import (141445-06)

10000:241a51d8720c
6848242	zdb -e no longer works as expected (N/A)

10100:4a6965f6bef8
6856634	snv_117 not booting: zfs_parse_bootfs: error2 (141445-07)

10160:a45b03783d44
6861983	zfs should use new name <-> SID interfaces (N/A)
6862984	userquota commands can hang (141445-06)

10299:80845694147f
6696858	zfs receive of incremental replication stream can dereference NULL pointer and crash (N/A)

10302:a9e3d1987706
6696858	zfs receive of incremental replication stream can dereference NULL pointer and crash (fix lint) (N/A)

10575:2a8816c5173b (partial merge)
6882227 spa_async_remove() shouldn't do a full clear (142901-14)

10800:469478b180d9
6880764	fsync on zfs is broken if writes are greater than 32kb on a hard crash and no log attached (142901-09)
6793430 zdb -ivvvv assertion failure: bp->blk_cksum.zc_word[2] == dmu_objset_id(zilog->zl_os) (N/A)

10801:e0bf032e8673 (partial merge)
6822816 assertion failed: zap_remove_int(ds_next_clones_obj) returns ENOENT (142901-09)

10810:b6b161a6ae4a
6892298 buf->b_hdr->b_state != arc_anon, file: ../../common/fs/zfs/arc.c, line: 2849 (142901-09)

10890:499786962772
6807339	spurious checksum errors when replacing a vdev (142901-13)

11249:6c30f7dfc97b
6906110 bad trap panic in zil_replay_log_record (142901-13)
6906946 zfs replay isn't handling uid/gid correctly (142901-13)

11454:6e69bacc1a5a
6898245 suspended zpool should not cause rest of the zfs/zpool commands to hang (142901-10)

11546:42ea6be8961b (partial merge)
6833999 3-way deadlock in dsl_dataset_hold_ref() and dsl_sync_task_group_sync() (142901-09)

Discussed with:	pjd
Approved by:	delphij (mentor)
Obtained from:	OpenSolaris (multiple Bug IDs)
MFC after:	2 months
2010-07-12 23:49:04 +00:00
Pawel Jakub Dawidek
653e034db5 Turn off UMA allocations on all archs by default. It isn't stable even on
amd64.

Reported by:	many
MFC after:	3 days
2010-06-17 17:41:42 +00:00
Martin Matuska
1aa2ebdd23 Fix vdev_probe() starvation brings txg train to a screeching halt
OpenSolaris onnv-revision:	9722:e3866bad4e96

Obtained from:	OpenSolaris (Bug ID 6844069)
Approved by:	pjd, delphij (mentor)
MFC after:	3 days
2010-06-12 11:21:37 +00:00
Pawel Jakub Dawidek
4e8c7af455 Create UMA zones unconditionally.
MFC after:	3 days
2010-05-23 19:10:06 +00:00
Pawel Jakub Dawidek
ed3c664257 Allow to configure UMA usage for ZIO data via loader and turn it on by
default for amd64. On i386 I saw performance degradation when UMA was used,
but for amd64 it should help.

MFC after:	3 days
2010-05-16 15:14:59 +00:00
Pawel Jakub Dawidek
cfb3e98d37 Add task structure to zio and use it instead of allocating one.
This eliminates the only place where we can sleep when calling zio_interrupt().
As a side-effect this can actually improve performance a little as we
allocate one less thing for every I/O.

Prodded by:	kib
MFC after:	1 week
2010-05-16 15:12:34 +00:00
Kip Macy
03af82ac5e fix compilation under ZIO_USE_UMA 2010-03-13 21:52:21 +00:00
Kip Macy
e95d34711b - back out direct map hack
- it is no longer needed
2009-05-19 01:14:37 +00:00
Kip Macy
32237d8492 apply band-aid to x86_64 systems with more physical memory than kmem by allocating from the direct map 2009-05-16 19:17:15 +00:00
Pawel Jakub Dawidek
1ba4a712dd Update ZFS from version 6 to 13 and bring some FreeBSD-specific changes.
This bring huge amount of changes, I'll enumerate only user-visible changes:

- Delegated Administration

	Allows regular users to perform ZFS operations, like file system
	creation, snapshot creation, etc.

- L2ARC

	Level 2 cache for ZFS - allows to use additional disks for cache.
	Huge performance improvements mostly for random read of mostly
	static content.

- slog

	Allow to use additional disks for ZFS Intent Log to speed up
	operations like fsync(2).

- vfs.zfs.super_owner

	Allows regular users to perform privileged operations on files stored
	on ZFS file systems owned by him. Very careful with this one.

- chflags(2)

	Not all the flags are supported. This still needs work.

- ZFSBoot

	Support to boot off of ZFS pool. Not finished, AFAIK.

	Submitted by:	dfr

- Snapshot properties

- New failure modes

	Before if write requested failed, system paniced. Now one
	can select from one of three failure modes:
	- panic - panic on write error
	- wait - wait for disk to reappear
	- continue - serve read requests if possible, block write requests

- Refquota, refreservation properties

	Just quota and reservation properties, but don't count space consumed
	by children file systems, clones and snapshots.

- Sparse volumes

	ZVOLs that don't reserve space in the pool.

- External attributes

	Compatible with extattr(2).

- NFSv4-ACLs

	Not sure about the status, might not be complete yet.

	Submitted by:	trasz

- Creation-time properties

- Regression tests for zpool(8) command.

Obtained from:	OpenSolaris
2008-11-17 20:49:29 +00:00
John Birrell
b468fe2bce * Check endianness the FreeBSD way.
* Use LBOLT rather than lbolt to avoid a clash with a FreeBSD global
  variable.
2007-11-28 22:16:00 +00:00
Pawel Jakub Dawidek
6a7309390f - Add missing lock destruction and remove duplicate initializations.
With this change it is possible to unload zfs.ko module from
  WITNESS-enabled kernel.
- Remove bogus comment.
2007-05-06 19:05:37 +00:00
Pawel Jakub Dawidek
9de81c7273 MFp4:
@118370	Correct typo.

@118371	Integrate changes from vendor.

@118491	Show backtrace on unexpected code paths.

@118494	Integrate changes from vendor.

@118504	Fix sendfile(2). I had two ways of fixing it:
	1. Fixing sendfile(2) itself to use VOP_GETPAGES() instead of
	   hacking around with vn_rdwr(UIO_NOCOPY), which was suggested
	   by ups.
	2. Modify ZFS behaviour to handle this special case.

	Although 1 is more correct, I've choosen 2, because hack from 1
	have a side-effect of beeing faster - it reads ahead MAXBSIZE
	bytes instead of reading page by page. This is not easy to implement
	with VOP_GETPAGES(), at least not for me in this very moment.

	Reported by:	Andrey V. Elsukov <bu7cher@yandex.ru>

@118525	Reorganize the code to reduce diff.

@118526	This code path is expected. It is simply when file is opened with
	O_FSYNC flag.

	Reported by:	kris
	Reported by:	Michal Suszko <dry@dry.pl>
2007-04-21 12:02:57 +00:00
Pawel Jakub Dawidek
f0a75d274a Please welcome ZFS - The last word in file systems.
ZFS file system was ported from OpenSolaris operating system. The code in under
CDDL license.

I'd like to thank all SUN developers that created this great piece of software.

Supported by:	Wheel LTD (http://www.wheel.pl/)
Supported by:	The FreeBSD Foundation (http://www.freebsdfoundation.org/)
Supported by:	Sentex (http://www.sentex.net/)
2007-04-06 01:09:06 +00:00