Commit Graph

734 Commits

Author SHA1 Message Date
Xin LI
7c1db36b28 MFV r270196:
Illumos issue:
    5047 don't use atomic_*_nv if you discard the return value

MFC after:	2 weeks
2014-08-20 22:39:26 +00:00
Xin LI
249ddb42f6 MFC r270195:
Illumos issue:
    5045 use atomic_{inc,dec}_* instead of atomic_add_*

MFC after:	2 weeks
2014-08-20 21:44:48 +00:00
Xin LI
60723bfe21 MFV r269542:
In vdev_get_stats, check that the vdev is not a hole before computing the
fragmentation.  This fixes a panic when removing log device.

Illumos issue:
    5049 panic when removing log device

Author:		Alex Reece <alex@delphix.com>
MFC after:	2 weeks
2014-08-05 00:07:21 +00:00
Xin LI
cd741a5e1d Revert r269404 and use cpu_ticks() for dbuf allocation.
Encode CPU's number by XOR'ing the CPU ID against the 64-bit cpu_ticks().

Reviewed by:	mav, gibbs
Differential Revision: https://phabric.freebsd.org/D521
MFC after:	2 weeks
2014-08-03 09:47:51 +00:00
Xin LI
1dcef10eac MFV r269427:
In dnode_children_t, use C99's "[]" idiom for declaring the variable
sized array dnc_children at the end of the structure.

This prevents the compiler from mistakenly optimizing away accesses
beyond the array's defined size.

Illumos issue:
    5038 Remove "old-style" flexible array usage in ZFS.
    Author: Justin T. Gibbs <justing@spectralogic.com>

MFC after:	2 weeks
2014-08-02 08:34:22 +00:00
Steven Hartland
6a369c018c Don't return ZIO_PIPELINE_CONTINUE from vdev_op_io_start methods
This prevents recursion of vdev_queue_io_done as per r265321 but
using a different method as recommended on the openzfs list.

We now use zio_interrupt(zio) and return ZIO_PIPELINE_STOP instead
of returning ZIO_PIPELINE_CONTINUE from vdev_*_io_start methods.

zio_vdev_io_start now ASSERTS the that vdev_op_io_start returns
ZIO_PIPELINE_STOP to ensure future changes don't reintroduce
ZIO_PIPELINE_CONTINUE returns.

Cleanup flow in vdev_geom_io_start while I'm here.

Also fix some cases not using SET_ERROR(..)

MFC after:	2 weeks
X-MFC-With:	r265321
2014-08-01 23:16:48 +00:00
Xin LI
9b046b421f MFV r269224:
Increase default ARC buf_hash_table size.  When typical block size is small,
the hash table could be too small, which would lead to long hash chains and
limit performance for cached reads.

A new loader tunable, vfs.zfs.arc_average_blocksize, have been added which
allows users to override the default assumption of average (typical) block
size.  Old default was 65536 (64 KiB) and new default is 8192 (8 KiB).

Illumos issue:
    5034 ARC's buf_hash_table is too small

MFC after:	2 weeks
2014-07-29 09:36:48 +00:00
Xin LI
a3cbca537e MFV r269223:
Change dn->dn_dbufs from linked list to AVL tree.

Illumos issues:
  4873 zvol unmap calls can take a very long time for larger datasets

MFC after:	2 weeks
2014-07-29 08:42:22 +00:00
Xin LI
343c95a24e Reschedule the 'deadman' callout after handling, this makes our
code behave more like it is on Solaris.

Reported by:	avg
Reviewed by:	avg, mav (but bugs are mine)

Differential Revision: https://phabric.freebsd.org/D457
2014-07-29 06:57:13 +00:00
Konstantin Belousov
fe0e9a63e0 Initialize zfs vnode v_hash when the vnode is allocated, instead of
postponing it to zfs_vget().  zfs_root() returned vnode with the
default value of v_hash, which caused inconsistent v_hash value when
root vnode was obtained from zfs_vget().

Nullfs allocated two upper vnodes for the root zfs vnode due to
different hashes, causing consistency problems.

Reported and tested by:	Harald Schmalzbauer <h.schmalzbauer@omnilan.de>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-07-28 14:24:18 +00:00
Xin LI
50b74c6ef1 Add two sysctls for newly added tunables.
MFC after:	2 weeks
2014-07-26 19:07:08 +00:00
Xin LI
7e37b1e609 MFV r269010:
Import Illumos changes to address the following Illumos issues:
  4976 zfs should only avoid writing to a failing non-redundant
       top-level vdev
  4978 ztest fails in get_metaslab_refcount()
  4979 extend free space histogram to device and pool
  4980 metaslabs should have a fragmentation metric
  4981 remove fragmented ops vector from block allocator
  4982 space_map object should proactively upgrade when feature
       is enabled
  4984 device selection should use fragmentation metric

MFC after:	2 weeks
2014-07-26 10:20:48 +00:00
Alexander Motin
1bc04f6a8c Make sysctls under vfs.zfs.zfetch writeable.
I don't see any reason for them to be read-only, while tuning them without
reboot is much more convenient for experiments.

MFC after:	2 weeks
2014-07-26 09:09:14 +00:00
Xin LI
0aa4ce9b7d Transform the I/O when vdev_physical_ashift is greater than
SPA_MINBLOCKSHIFT.

MFC after:	2 weeks
2014-07-25 18:41:56 +00:00
Xin LI
883d80c104 As of r268075, the responsibility of rounding up buffer to optimal size have
been transferred from zio_compress_data to its caller.  Therefore, passing
the 'minblocksize' down will be a no-op.

Eliminate the parameter to reduce diff against upstream.

MFC after:	2 weeks
2014-07-25 06:53:20 +00:00
Xin LI
3d4d6b0883 Correct typo introduced with r268855.
MFC after:	10 days
X-MFC with:	r268855
2014-07-22 08:37:01 +00:00
Xin LI
b4bb49887b Reduce lock contention on the z_teardown_lock under heavily cached
read workload by splitting the single teardown rrw lock into
RRM_NUM_LOCKS (17) of them.

Read acquisitions are randomly distributed among these locks based
on curthread pointer.  Write acquisitions are going to all the
locks, which for the usage of this type of lock should be rare.

Illumos issue:
    5008 lock contention (rrw_exit) while running a read only load

MFC after:	2 weeks
2014-07-19 00:26:03 +00:00
Xin LI
82599d31fe MFV r268851:
When a sync task is waiting for a txg to complete, we should hurry it along
by increasing the number of outstanding async writes (i.e. make
vdev_queue_max_async_writes() return a larger number).

Illumos issue:
    4753 increase number of outstanding async writes when sync task is waiting

MFC after:	2 weeks
2014-07-18 22:34:01 +00:00
Xin LI
f886b6e3bc MFV r268850:
Change the interaction between the DMU and ARC so that when the DMU is
shutting down an objset, we do not evict the data from the ARC.  Instead
we simply coordinate the destruction of the DMU's data with the ARC.

The only case where we actually need to explicitly evict from the ARC is
when dbuf_rele_and_unlock() determines that the administrator has requested
that it not be kept in memory, via the primarycache/secondarycache properties.
In this case, we evict the data from the ARC by its blkptr_t, the same way
as when a block is freed we explicitly evict it from the ARC.

Illumos issue:
    4631 zvol_get_stats triggering too many reads

MFC after:	2 weeks
2014-07-18 22:04:21 +00:00
Xin LI
7882b61f60 MFV r268848:
Instead of asserting all zio's be properly aligned, only assert
on the logical ones.

Cap uberblocks at 8k, otherwise with ashift=17, there would be
only one uberblock.

This fixes a problem that zdb would trip assert on pools with
ashift >= 0xe (8k).

While there, also change the code so it only attempt to condense
space map unless the uncondensed size consumes greater than
zfs_metaslab_condense_block_threshold blocks.

Illumos issue:
  4958 zdb trips assert on pools with ashift >= 0xe

MFC after:	2 weeks
2014-07-18 20:41:40 +00:00
Xin LI
7079d5877c MFV r268714:
Improve extreme rewind import.

When doing an "extreme rewind" import ("zpool import -XF"), we attempt
to verify all data in the pool, essentially scrubbing the entire pool.
The problem is that spa_load_verify_cb() issues an unbounded number of
concurrent scrub i/os.  This can lead to all of memory being used for
these zio's, wedging the system. Like normal scrub, we need to put a
cap on the number of outstanding i/os, and have the traverse thread
block when we reach this cap.

For this purpose the cap can be very large (10,000) to optimize the
elevator algorithm.  Three kernel tunables have been added:

	vfs.zfs.spa_load_verify_maxinflight
	vfs.zfs.spa_load_verify_metadata
	vfs.zfs.spa_load_verify_data

The latter two tunables controls whether metadata and/or user data
when doing extreme rewind.

Make 'zpool import -T' imply scrub.

Make zpool import -T <txg> accept hexadecimal values for the txg when
prefixed with 0x.

Skip txg's for which there is no uberblock when doing extreme rewind.

Skip reading all user data twice by skipping prefetches when doing
extreme rewinds as we do not access via the ARC.

Illumos issues:
  4970 need controls on i/o issued by zpool import -XF
  4971 zpool import -T should accept hex values
  4972 zpool import -T implies extreme rewind, and thus a scrub
  4973 spa_load_retry retries the same txg
  4974 spa_load_verify() reads all data twice

MFC after:	2 weeks
2014-07-15 22:44:04 +00:00
Xin LI
eb75155228 MFV r268702:
Add missing *_destroy() calls in various places with ZFS.

Illumos issue:
  4975 missing mutex_destroy() calls in zfs

MFC after:	2 weeks
2014-07-15 20:32:23 +00:00
Xin LI
1b174fa1eb MFV r268455:
Use reserved space for ZFS administrative commands.

We reserve 1/2^spa_slop_shift = 1/32 or 3.125% of pool space (or 32MB at
least) for system use.  Most ZPL operations, e.g. write(2), creat(2), will
fail with ENOSPC if we fall below this.

Certain operations, e.g. file removal and most administrative actions,
still permitted until half of the slop space is used.  This would allow
users to use these operations to free up space in the pool when pool is
close to full but half of slop space is still free.

A very restricted set of operations that frees up space or change quota
are always permitted, regardless of the amount of free space.

MFC after:	 2 weeks
2014-07-09 23:14:59 +00:00
Xin LI
fdc0ee2cf5 MFV r268452:
Explicitly mark file removal transactions as "presumed to result
in a net free of space" so they will not fail with ENOSPC.

Illumos issue:	4950 files sometimes can't be removed from a full
		filesystem
MFC after:	2 weeks
2014-07-09 18:32:40 +00:00
Alexander Motin
e327a057a7 Remove IO_SYNC flag when writing extended file attributes on ZFS.
While it is possible to create and write file, modify its permissions, etc.
without ever doing sync, it looks odd that it is required for setting
extended file attributes on ZFS.  UFS does not do sync there too.

Samba uses those extended attributes to store some its data, and doing it
synchronously by many times reduces file creation performance for systems
without SLOG device.

Reviewed by:	delphij, jpaetzel, silence on fs@
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2014-07-08 17:26:08 +00:00
Alexander Motin
5a178afd41 Fix bug in sync control in new "dev" mode of ZVOL (r265678).
Don't check ZVOL_WCE flag, used in Solaris to control device "write cache".
It is not applicable on FreeBSD and by default set to "disable".

MFC after:	3 days
2014-07-02 21:25:32 +00:00
Xin LI
30324e945a MFV r268122:
4929 want prevsnap property

illumos/illumos-gate@b461c7460e

MFC after:	2 weeks
2014-07-01 22:42:53 +00:00
Xin LI
9cc8a15b2e MFV r268121:
4924 LZ4 Compression for metadata

illumos/illumos-gate@b8289d24d8

MFC after:	2 weeks
2014-07-01 22:31:09 +00:00
Xin LI
aa882b9048 MFV r268119:
4914 zfs on-disk bookmark structure should be named *_phys_t

illumos/illumos-gate@7802d7bf98

MFC after:	2 weeks
2014-07-01 21:51:30 +00:00
Xin LI
55f6421982 - Fix handling of "new" style of ioctl in compatiblity mode [1];
- Reorganize code and reduce diff from upstream;
 - Improve forward compatibility shims for previous kernel;

Reported by:	sbruno [1]
X-MFC-With:	r268075
2014-07-01 20:57:39 +00:00
Xin LI
be78a8db97 MFV r267570:
4756 metaslab_group_preload() could deadlock

illumos/illumos-gate@30beaff42d

MFC after:	2 weeks
2014-07-01 08:36:56 +00:00
Xin LI
3a0f8ff95e MFV r267569:
4897 Space accounting mismatch in L2ARC/zpool

illumos/illumos-dist@3038a2b421

MFC after:	2 weeks
2014-07-01 08:28:49 +00:00
Xin LI
93b8d53c09 MFV r267567:
4881 zfs send performance degradation when embedded block pointers are
     encountered

illumos/illumos-gate@06315b795c

MFC after:	2 weeks
2014-07-01 07:56:07 +00:00
Xin LI
71eaf0fda7 MFV r267566:
4390 i/o errors when deleting filesystem/zvol can lead to space map corruption

MFC after:	2 weeks
2014-07-01 07:29:42 +00:00
Xin LI
29441ba3fa MFV r267565:
4757 ZFS embedded-data block pointers ("zero block compression")
4913 zfs release should not be subject to space checks

MFC after:	2 weeks
2014-07-01 06:43:15 +00:00
Hans Petter Selasky
af3b2549c4 Pull in r267961 and r267973 again. Fix for issues reported will follow. 2014-06-28 03:56:17 +00:00
Glen Barber
37a107a407 Revert r267961, r267973:
These changes prevent sysctl(8) from returning proper output,
such as:

 1) no output from sysctl(8)
 2) erroneously returning ENOMEM with tools like truss(1)
    or uname(1)
 truss: can not get etype: Cannot allocate memory
2014-06-27 22:05:21 +00:00
Hans Petter Selasky
3da1cf1e88 Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after:	2 weeks
Sponsored by:	Mellanox Technologies
2014-06-27 16:33:43 +00:00
Steven Hartland
74ddec2b18 Removed stale comment about multi-vdev root pool config not working
MFC after:	1 week
2014-06-09 13:04:58 +00:00
Bryan Drewery
f3a7518361 - Naively fix build by partially reverting r267029 to still use
gethrtime() when building libzpool.

X-MFC-With:	267029
2014-06-04 05:04:15 +00:00
Alexander Motin
4220ebcf71 Replace gethrtime() with cpu_ticks(), as source of random for the taskqueue
selection.  gethrtime() in our port updated with HZ rate, so unusable for
this specific purpose, completely draining benefit of multiple taskqueues.

MFC after:	2 weeks
2014-06-03 21:06:03 +00:00
Xin LI
f4c7dd6dd0 MFV 266913+266914:
3897 zfs filesystem and snapshot limits (fix leak)
4901 zfs filesystem/snapshot limit leaks

MFC after:	3 days
2014-05-31 01:00:22 +00:00
Xin LI
2bdf7f79bc MFV r266766:
Add a new zfs property, "redundant_metadata" which can have values "all" or
"most".  The default will be "all", which is the current behavior.  When set
to all, ZFS stores an extra copy of all metadata.  If a single on-disk block
is corrupt, at worst a single block of user data (which is recordsize bytes
long) can be lost.

Setting to "most" will cause us to only store 1 copy of level-1 indirect
blocks of user data files.  This can improve performance of random writes,
because less metadata has to be written.  In practice,  at worst about
100 blocks (of recordsize bytes each) of user data can be lost if a single
on-disk block is corrupt.

The exact behavior of which metadata blocks are stored redundantly may change
in future releases.

Illumos issue: 3835 zfs need not store 2 copies of all metadata

MFC after:	2 weeks
2014-05-27 19:46:11 +00:00
Allan Jude
ecd9567c1a Improve sysctl descriptions for new ZFS sysctls:
vfs.zfs.dirty_data_max
vfs.zfs.dirty_data_max_max
vfs.zfs.dirty_data_sync

Reviewed by:	smh
Approved by:	wblock (mentor)
2014-05-22 05:30:38 +00:00
Steven Hartland
df23182a62 Added sysctls / tunables for ZFS dirty data tuning
Added the following new sysctls / tunables:
* vfs.zfs.dirty_data_max
* vfs.zfs.dirty_data_max_max
* vfs.zfs.dirty_data_max_percent
* vfs.zfs.dirty_data_sync
* vfs.zfs.delay_min_dirty_percent
* vfs.zfs.delay_scale

PR:		kern/189865
MFC after:	2 weeks
2014-05-21 13:36:04 +00:00
Xin LI
b8cdcb8ad8 Import George Wilson's change for Illumos #4730:
4730 metaslab group taskq should be destroyed in metaslab_group_destroy()
	Reviewed by: Alex Reece <alex.reece@delphix.com>
	Reviewed by: Matthew Ahrens <mahrens@delphix.com>
	Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>

	Original author: George Wilson

MFC after:	3 days
2014-05-06 19:03:04 +00:00
Steven Hartland
4f64781818 Use a zio flag to prevent recursion of vdev_queue_io_done which can
cause stack overflow for IO's which return ZIO_PIPELINE_CONTINUE
from the zio_vdev_io_start stage and hence don't suspend and complete
in a different thread.

This prevents double fault panic on slow machines running ZFS on
GELI volumes which return EOPNOTSUPP directly to BIO_DELETE requests.

MFC after:	1 month
X-MFC-With:	r265152
2014-05-04 14:05:14 +00:00
Steven Hartland
573621a6d6 Don't treat TRIM requests returning ENOTSUP as an unexpected error.
MFC after:	1 month
X-MFC-With:	r265152
2014-05-03 02:30:01 +00:00
Steven Hartland
10138166cf Removed pointless / duplicated call to trim_map_first.
MFC after:	1 month
X-MFC-With:	r265152
2014-05-02 09:31:21 +00:00
Steven Hartland
82ce008538 Reintroduce priority for the TRIM ZIOs instead of using the "NOW" priority
The changes how TRIM requests are generated to use ZIO_TYPE_FREE + a priority
instead of ZIO_TYPE_IOCTL, until processed by vdev_geom; only then is it
translated the required geom values. This reduces the amount of changes
required for FREE requests to be supported by the new IO scheduler. This
also eliminates the need for a specific DKIOCTRIM.

Also fixed FREE vdev child IO's from running ZIO_STAGE_VDEV_IO_DONE as part
of their schedule.

As the new IO scheduler can result in a request to execute one type of IO to
actually run a different type of IO it requires that zio_trim requests are
processed without holding the trim map lock (tm->tm_lock), as the free request
execute call may result in write request running hence triggering a
trim_map_write_start call, which takes the trim map lock and hence would result
in recused on no-recursive sx lock.

This is based off avg's original work, so credit to him.

MFC after:	1 month
2014-04-30 17:46:29 +00:00