Commit Graph

1144 Commits

Author SHA1 Message Date
Andriy Gapon
0908b20b7e Revert r269093 which introduced physical zio alignment transform
Size of physical ZIOs must never be implicitly adjusted, it's
a responsibility of a caller to make sure that such a ZIO has proper offset
and size.

Discussed with:	delphij, gibbs
MFC after:	2 weeks
2014-11-17 14:16:02 +00:00
Steven Hartland
a559adfbce Disable TRIM on file backed ZFS vdevs and fix TRIM on init
After r265152 TRIM requests are ZIO_TYPE_FREE instead of ZIO_TYPE_IOCTL
this meant file backed vdevs to attempted to process the ZIO as a write
causing a panic.

We now disable TRIM on file backed vdevs and ASSERT the ZIO types supported
by each vdev type to ensure we explicity support the ZIO type being
processed.

Also ensure that TRIM on init is not procesed for devices which declare they
didn't support TRIM via vdev_notrim.

PR:		195061, 194976, 191573
Sponsored by:	Multiplay
2014-11-17 11:32:10 +00:00
Konstantin Belousov
6e646651d3 Remove the no-at variants of the kern_xx() syscall helpers. E.g., we
have both kern_open() and kern_openat(); change the callers to use
kern_openat().

This removes one (sometimes two) levels of indirection and
consolidates arguments checks.

Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-11-13 18:01:51 +00:00
Xin LI
8bcd603968 MFV r274273:
ZFS large block support.

Please note that booting from datasets that have recordsize greater
than 128KB is not supported (but it's Okay to enable the feature on
the pool).  This *may* remain unchanged because of memory constraint.

Limited safety belt is provided for mounted root filesystem but use
caution is advised.

Illumos issue:
    5027 zfs large block support

MFC after:	1 month
2014-11-10 08:20:21 +00:00
Xin LI
42350b6bde MFV r274272 and diff reduction with upstream.
Illumos issue:
    5244 zio pipeline callers should explicitly invoke next stage

Tested with:	ztest plus ZFS over GELI configuration
MFC after:	1 month
2014-11-09 07:37:00 +00:00
Xin LI
81f1255e58 MFV r274271:
Improve zdb -b performance:

 - Reduce gethrtime() call to 1/100th of blkptr's;
 - Skip manipulating the size-ordered tree;
 - Issue more (10, previously 3) async reads;
 - Use lighter weight testing in traverse_visitbp();

Illumos issue:
    5243 zdb -b could be much faster

MFC after:	2 weeks
2014-11-08 07:30:40 +00:00
Andriy Gapon
2fd3cc0cb2 fix l2arc compression buffers leak
We have observed that arc_release() can be called concurrently with a
l2arc in-flight write.
Also, we have observed that arc_hdr_destroy() can be called from
arc_write_done() for a zio with ZIO_FLAG_IO_REWRITE flag in similar
circumstances.

Previously the l2arc headers would be freed while leaking their
associated compression buffers.  Now the buffers are placed on
l2arc_free_on_write list for delayed freeing.  This is similar to what
was already done to arc buffers that were supposed to be freed
concurrently with in-flight writes of those buffers.

In addition to fixing the discovered leaks this change also adds some
protective code to assert that a compression buffer associated with a
l2arc header is never leaked.

A new kstat l2_cdata_free_on_write is added.  It keeps a count of
delayed compression buffer frees which previously would have been leaks.

Tested by:	Vitalij Satanivskij <satan@ukr.net> et al
Requested by:	many
MFC after:	2 weeks
Sponsored by:	HybridCluster / ClusterHQ
2014-11-06 11:08:02 +00:00
Alexander Motin
c3e7ba3e6d Add to CTL support for logical block provisioning threshold notifications.
For ZVOL-backed LUNs this allows to inform initiators if storage's used or
available spaces get above/below the configured thresholds.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2014-11-06 00:48:36 +00:00
Josh Paetzel
14127f5b21 This change addresses 4 bugs in ZFS exposed by Richard Kojedzinszky's
crash.sh script attached to FreeNAS bug 4109:
https://bugs.freenas.org/issues/4109

Three are in the snapshot layer:
a) AVG explains in his notes: https://wiki.freebsd.org/AvgVfsSolarisVsFreeBSD

"VOP_INACTIVE must not do any destructive actions to a vnode
and its filesystem node, nor invalidate them in any way."
gfs_vop_inactive and zfsctl_snapshot_inactive did just that. In
OpenSolaris VOP_INACTIVE is much closer to FreeBSD's VOP_RECLAIM.
Rename & move them to gfs_vop_reclaim and zfsctl_snapshot_reclaim
and merge in the requisite vnode_destroy from zfsctl_common_reclaim.

b) gfs_lookup_dot and various zfsctl functions do not honor the
FreeBSD VFS convention of only locking from the root downward. When
looking up ".." the convention is to drop the current leaf vnode lock before
acquiring the directory vnode and then subsequently re-acquiring the lock on the
leaf vnode. This fixes that in all the places that our exercised by crash.sh.

c) The snapshot may already be unmounted when the directory vnode is reclaimed.
Check for this case and return.

One in the common layer:
d) Callers of traverse expect the reference to the vnode passed in to be
maintained. Don't release it.

This last one may be an unclear contract. There may in fact be some callers that
do expect the reference to be dropped on success in addition to callers that
expect it to be released. In this case a further audit of the callers is needed
and a consensus on the correct behavior.

PR:	184677
Submitted by:	kmacy
Reviewed by:	delphij, will, avg
MFC after:	2 weeks
Sponsored by:	iXsystems
2014-10-25 17:42:44 +00:00
Justin Hibbits
3ff2096995 Whitespace
X-MFC-with:	r273570
MFC after:	1 week
2014-10-24 03:34:21 +00:00
Justin Hibbits
24d5dfb116 Three updates to PowerPC FBT:
* Use a constant to define the number of stack frames in a probe exception.
* Only allow function symbols in powerpc64 ('.' prefixed)
* Set the fbtp_roffset for return probes, so the correct dtrace_probe call is
  made.

MFC after:	1 week
2014-10-24 03:33:01 +00:00
Hans Petter Selasky
f0188618f2 Fix multiple incorrect SYSCTL arguments in the kernel:
- Wrong integer type was specified.

- Wrong or missing "access" specifier. The "access" specifier
sometimes included the SYSCTL type, which it should not, except for
procedural SYSCTL nodes.

- Logical OR where binary OR was expected.

- Properly assert the "access" argument passed to all SYSCTL macros,
using the CTASSERT macro. This applies to both static- and dynamically
created SYSCTLs.

- Properly assert the the data type for both static and dynamic
SYSCTLs. In the case of static SYSCTLs we only assert that the data
pointed to by the SYSCTL data pointer has the correct size, hence
there is no easy way to assert types in the C language outside a
C-function.

- Rewrote some code which doesn't pass a constant "access" specifier
when creating dynamic SYSCTL nodes, which is now a requirement.

- Updated "EXAMPLES" section in SYSCTL manual page.

MFC after:	3 days
Sponsored by:	Mellanox Technologies
2014-10-21 07:31:21 +00:00
Xin LI
78701de4b7 Add tunable vfs.zfs.space_map_blksz for space map's maximum block size.
MFC after:	2 weeks
2014-10-18 22:11:10 +00:00
Davide Italiano
2be111bf7d Follow up to r225617. In order to maximize the re-usability of kernel code
in userland rename in-kernel getenv()/setenv() to kern_setenv()/kern_getenv().
This fixes a namespace collision with libc symbols.

Submitted by:   kmacy
Tested by:      make universe
2014-10-16 18:04:43 +00:00
Steven Hartland
ca6505b818 Prevent ZFS leaking pool free space
When processing async destroys ZFS would leak space every txg timeout
(5 seconds by default), if no writes occurred, until the pool is totally
full. At this point it would be unfixable without a pool recreation.

In addition if the machine was rebooted with the pool in this situation
would fail to import on boot, hanging indefinitely, as the import process
requires the ability to write data to the pool. Any attempts to query the
pool status during the hung import would not return as the import holds
the pool lock.

The only way to import such a pool would be to specify -o readonly=on
to the zpool import.

zdb -bb <pool> can be used to check for "deferred free" size which is where
this lost space will be counted.

MFC after:	3 days
Sponsored by:	Multiplay
2014-10-16 02:23:27 +00:00
Xin LI
ba6e85e0cf Use write_psize instead of write_asize when doing vdev_space_update.
Without this change the accounting of L2ARC usage would be wrong and
give 16EB free space because the number became negative and overflows.

Obtained from:	FreeNAS (issue #6239)
MFC after:	2 weeks
2014-10-13 20:39:51 +00:00
Xin LI
a4f5b8db9f Add a tunable for arc_shrink_shift (vfs.zfs.arc_shrink_shift) that
controls how much fraction, 1/2^arc_shrink_shift, should be reclaimed
when there is memory pressure.

Submitted by:	Richard Kojedzinszky <krichy at tvnetwork.hu>
MFC after:	2 weeks
2014-10-13 05:34:10 +00:00
Xin LI
eba15cf463 MFV r272804:
Refactor the code and stop restore_object from creating two transactions.

Illumos issue:
    3693 restore_object uses at least two transactions to restore an object

MFC after:	2 weeks
2014-10-09 07:52:51 +00:00
Xin LI
ce44f14b41 MFV r272803:
Illumos issue:
    5175 implement dmu_read_uio_dbuf() to improve cached read performance

MFC after:	2 weeks
2014-10-09 07:18:40 +00:00
Andriy Gapon
c3d1d2e104 l2arc_write_buffers: reduce headroom value
FreeBSD has ARC_BUFC_NUMMETADATALISTS metadata lists and ARC_BUFC_NUMDATALISTS
data lists (currently both are 16) while illumos has just a single list
of each kind.

headroom determines how much data is scanned on a single list
during each run of the l2arc feed thread.
Because FreeBSD has more lists we proportionally decrease the limit.

Reviewed by:	Brendan Gregg (earlier version)
MFC after:	2 weeks
Sponsored by:	HybridCluster
2014-10-07 16:08:21 +00:00
Andriy Gapon
9f96723ec5 revert r272702: wrong (earlier) change was committed 2014-10-07 16:06:10 +00:00
Andriy Gapon
4c3b02bfce reduce L2ARC_WRITE_SIZE on FreeBSD
FreeBSD has ARC_BUFC_NUMMETADATALISTS metadata lists and ARC_BUFC_NUMDATALISTS
data lists (currently both are 16) while illumos has just a single list
of each kind.

L2ARC_WRITE_SIZE determines the default value of l2arc_write_max which
defines limits on how much data is scanned and written to a cache device
during each run of the l2arc feed thread.  The limits are applied on the
per buffer list basis.
Because FreeBSD has more lists we proportionally reduce the limits.

Reviewed by:	Brendan Gregg (earlier version)
MFC after:	2 weeks
Sponsored by:	HybridCluster
2014-10-07 14:30:24 +00:00
Andriy Gapon
ab26525af2 make userland __assfail from opensolaris compat honor 'aok' variable
This should allow zdb -A option to actually make difference.

MFC after:	2 weeks
2014-10-07 14:15:50 +00:00
Xin LI
1b5bcb8425 MFV r272591:
Use loaned ARC buffer for zfs receive to avoid copy.

Illumos issue:
    5162 zfs recv should use loaned arc buffer to avoid copy

MFC after:	2 weeks
2014-10-06 07:29:17 +00:00
Xin LI
8fb26f5aef MFV r272585:
Split the godfather zio into CPU number's to reduce lock
contention.

Illumos issue:
    5176 lock contention on godfather zio

MFC after:	2 weeks
2014-10-06 07:03:17 +00:00
Xin LI
dcb20006f0 MFV r272501:
Illumos issue:
    5177 remove dead code from dsl_scan.c

MFC after:	2 weeks
2014-10-06 05:46:51 +00:00
Xin LI
00769ce74d MFV r272500:
Don't inherit flags other than DS_FLAG_CI_DATASET and DS_FLAG_INCONSISTENT
when cloning.  This prevents DS_FLAG_DEFER_DESTROY being inherited from a
clone that is marked for deferred destroy, which causes snapshots of the
clone being destroyed when getting a hold or clone.

Illumos issue:
    5150 zfs clone of a defer_destroy snapshot causes strangeness

MFC after:	1 week
2014-10-06 05:42:20 +00:00
Xin LI
4bb264ae15 Don't make nested definition for range_seg_cache.
Reported by:	ian
MFC after:	1 week
X-MFC-With:	r272506
2014-10-04 15:42:52 +00:00
Xin LI
4750c382a9 MFV r272499:
Illumos issue:
    5174 add sdt probe for blocked read in dbuf_read()

MFC after:	2 weeks
2014-10-04 08:55:08 +00:00
Xin LI
eb0b70068c Add a new sysctl, vfs.zfs.vol.unmap_enabled, which allows the system
administrator to toggle whether ZFS should ignore UNMAP requests.

Illumos issue:
    5149 zvols need a way to ignore DKIOCFREE

MFC after:	2 weeks
2014-10-04 08:51:57 +00:00
Xin LI
2d36d67c72 Diff reduction with upstream. The code change is not really applicable
to FreeBSD.

Illumos issue:
    5148 zvol's DKIOCFREE holds zfsdev_state_lock too long

MFC after:	1 month
2014-10-04 08:41:23 +00:00
Xin LI
523b4c7fdf MFV r272496:
Add tunable for number of metaslabs per vdev
(vfs.zfs.vdev.metaslabs_per_vdev).  The default remains
at 200.

Illumos issue:
    5161 add tunable for number of metaslabs per vdev

MFC after:	2 weeks
2014-10-04 08:29:48 +00:00
Xin LI
a8d7512709 MFV r272495:
In arc_kmem_reap_now(), reap range_seg_cache too to reclaim memory in
response of memory pressure.

Illumos issue:
    5163 arc should reap range_seg_cache

MFC after:	1 week
2014-10-04 08:14:10 +00:00
Xin LI
8c20e2ff11 MFV r272494:
Make space_map_truncate() always do space_map_reallocate().  Without
this, setting space_map_max_blksz would cause panic for existing pool,
as dmu_objset_set_blocksize would fail if the object have multiple blocks.

Illumos issues:
   5164 space_map_max_blksz causes panic, does not work
   5165 zdb fails assertion when run on pool with recently-enabled
	spacemap_histogram feature

MFC after:	2 weeks
2014-10-04 08:05:39 +00:00
Steven Hartland
14a0d74ea8 Refactor ZFS ARC reclaim checks and limits
Remove previously added kmem methods in favour of defines which
allow diff minimisation between upstream code base.

Rebalance ARC free target to be vm_pageout_wakeup_thresh by default
which eliminates issue where ARC gets minimised instead of balancing
with VM pageout. The restores the target point prior to r270759.

Bring in missing upstream only changes which move unused code to
further eliminate code differences.

Add additional DTRACE probe to aid monitoring of ARC behaviour.

Enable upstream i386 code paths on platforms which don't define
UMA_MD_SMALL_ALLOC.

Fix mixture of byte an page values in arc_memory_throttle i386 code
path value assignment of available_memory.

PR:		187594
Review:		D702
Reviewed by:	avg
MFC after:	1 week
X-MFC-With:	r270759 & r270861
Sponsored by:	Multiplay
2014-10-03 20:34:55 +00:00
Steven Hartland
99140218aa Fix various issues with zvols
When performing snapshot renames we could deadlock due to the locking
in zvol_rename_minors. In order to avoid this use the same workaround
as zvol_open in zvol_rename_minors.

Add missing zvol_rename_minors to dsl_dataset_promote_sync.

Protect against invalid index into zv_name in zvol_remove_minors.

Replace zvol_remove_minor calls with zvol_remove_minors to ensure
any potential children are also renamed.

Don't fail zvol_create_minors if zvol_create_minor returns EEXIST.

Restore the valid pool check in zfs_ioc_destroy_snaps to ensure we
don't call zvol_remove_minors when zfs_unmount_snap fails.

PR:		193803
MFC after:	1 week
Sponsored by:	Multiplay
2014-10-03 14:49:48 +00:00
Marcelo Araujo
d8a5961f88 Fix failures and warnings reported by newpynfs20090424 test tool.
This fix addresses only issues with the pynfs reports, none of these
issues are know to create problems for extant real clients.

Submitted by:	Bart Hsiao <bart.hsiao@gmail.com>
Reworked by:	myself
Reviewed by:	rmacklem
Approved by:	rmacklem
Sponsored by:	QNAP Systems Inc.
2014-10-03 02:24:41 +00:00
Xin LI
43ac3722ac Diff reduction with kernel code: instruct the compiler that the data of
these types may be unaligned to their "normal" alignment and exercise
caution when accessing them.

PR:		194071
MFC after:	3 days
2014-10-02 00:13:08 +00:00
Will Andrews
fbce0221eb zfsvfs_create(): Refuse to mount datasets whose names are too long.
This is checked for in the zfs_snapshot_004_neg STF/ATF test (currently
still in projects/zfsd rather than head).

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c:
- zfsvfs_create(): Check whether the objset name fits into
  statfs.f_mntfromname, and return ENAMETOOLONG if not.  Although
  the filesystem can be unmounted via the umount(8) command, any
  interface that relies on iterating on statfs (e.g. libzfs) will
  fail to find the filesystem by its objset name, and thus assume
  it's not mounted.  This causes "zfs unmount", "zfs destroy",
  etc. to fail on these filesystems, whether or not -f is passed.

MFC after:	1 month
Sponsored by:	Spectra Logic
MFSpectraBSD:	974872 on 2013/08/09
2014-10-01 14:12:02 +00:00
Xin LI
0b66c7c514 Fix a mismerge in r260183 which prevents snapshot zvol devices being
removed and re-instate the fix in r242862.

Reported by:	Leon Dang <ldang nahannisys com>, smh
MFC after:	3 days
2014-09-30 18:50:45 +00:00
Steven Hartland
8caa3daf35 Remove sys/types.h include as per style (9)
SDT requries sys/param.h due to use of NULL

Reported by:	Garrett
Sponsored by:	Multiplay
2014-09-18 20:38:18 +00:00
Steven Hartland
71f3caaf31 Add dtrace probe support for zfs SET_ERROR(..)
MFC after:	1 week
Sponsored by:	Multiplay
2014-09-18 20:00:36 +00:00
Will Andrews
91dda985cc Remove debug.zfs_flags in favor of the new vfs.zfs.debug_flags.
Replace TUNABLE_INT with CTLFLAG_RWTUN.

Submitted by:	avg (debug.zfs_flags removal), smh (TUNABLE_INT replacement)
2014-09-18 18:46:38 +00:00
Will Andrews
f8c2f66a6c Enable ZFS debug flags to be modified via vfs.zfs.debug_flags.
This is primarily only of interest to ZFS developers, but it makes it
easier to get additional debugging.

Submitted by:	gibbs
MFC after:	1 month
Sponsored by:	Spectra Logic
MFSpectraBSD:	517074 on 2011/12/15 (by will), 662343 on 2013/03/20 (by gibbs)
2014-09-18 16:55:41 +00:00
Will Andrews
cf0a1157d7 Reorder sysctls for spa.c global tunables; add sysctl for ccw_retry_interval.
MFC after:	1 month
Sponsored by:	Spectra Logic
2014-09-18 16:38:03 +00:00
Will Andrews
cf7a096e72 bpobj_iterate_impl(): Close a refcount leak iterating on a sublist.
If bpobj_space() returned non-zero here, the sublist would have been
left open, along with the bonus buffer hold it requires.  This call
does not invoke any calls to bpobj_close() itself.

This bug doesn't have any known vector, but was found on inspection.

MFC after:	1 week
Sponsored by:	Spectra Logic
Affects:	All ZFS versions starting 21 May 2010 (illumos cde58dbc)
MFSpectraBSD:	r1050998 on 2014/03/26
2014-09-18 15:37:53 +00:00
Steven Hartland
d1d469e22b Remove unused ZFS ARC functions
* arc_data_buf_alloc
* arc_data_buf_free

MFC after:	1 week
Sponsored by:	Multiplay
2014-09-18 10:46:51 +00:00
Justin Hibbits
e40a5cd3ec Fix the stack tracing for dtrace/powerpc.
Summary:
Fix the stack tracing for dtrace/powerpc by using the trapexit/asttrapexit
return address sentinels instead of checking within the kernel address space.

As part of this, I had to add new inline functions.  FBT traces the kernel, so
we have to have special case handling for this, since a trap will create a full
new trap frame, and there's no way to pass around the 'real' stack.  I handle
this by special-casing 'aframes == 0' with the trap frame.  If aframes counts
out to the trap frame, then assume we're looking for the full kernel trap frame,
so switch to the real stack pointer.

Test Plan: Tested on powerpc64

Reviewers: rpaulo, markj, nwhitehorn

Reviewed By: markj, nwhitehorn

Differential Revision: https://reviews.freebsd.org/D788

MFC after:	3 week
Relnotes:	Yes
2014-09-17 02:43:47 +00:00
Steven Hartland
a889b18c52 Added missing ZFS sysctls
* vfs.zfs.vdev.async_write_active_min_dirty_percent
* vfs.zfs.vdev.async_write_active_max_dirty_percent

Added validation of min / max for ZFS sysctl
* vfs.zfs.dirty_data_max_percent

MFC after:	3 days
2014-09-14 12:23:00 +00:00
Xin LI
f9290bc2c9 MFV r271518:
Correctly report hole at end of file.

When asked to find a hole, the DMU sees that there are no holes in the
object, and returns ESRCH.  The ZPL interprets this as "no holes before
the end of the file", and therefore inserts the "virtual hole" at the
end of the file.  Because DMU and ZPL have different ideas of where the
end of an object/file is, we will end up returning the end of file,
which is generally larger, instead of returning the end of object.

The fix is to handle the "virtual hole" in the DMU. If no hole is found,
the DMU will return a hole at the end of the file, rather than an error.

Illumos issue:
    5139 SEEK_HOLE failed to report a hole at end of file

MFC after:	1 week
2014-09-13 17:48:44 +00:00