719 Commits

Author SHA1 Message Date
Mark Johnston
165de3f338 Fix a couple of bugs in the fasttrap emulation of a "push %rbp" instruction:
the code was trying to save the stack pointer rather than the frame pointer,
and the arguments to copyout(9) were reversed, so nothing ended up being
saved on the stack. This would cause process crashes when the pid provider
was being used to instrument calls of a function starting with this
instruction.
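
The corrected behaviour can be sketched in userland C. This is only an illustration under stated assumptions: struct regs and emulate_push_rbp() are hypothetical names, a byte buffer stands in for the traced process's stack, and memcpy() stands in for copyout(9), which takes the kernel source first and the user destination second.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical register file; only the registers this sketch needs. */
struct regs {
	uint64_t rsp;	/* stack pointer, as a byte offset into the buffer */
	uint64_t rbp;	/* frame pointer */
};

/*
 * Emulate "push %rbp": make room for 8 bytes below the stack pointer and
 * store the frame pointer there.  Returns 0 on success, -1 if the push
 * would fall outside the buffer.
 */
int
emulate_push_rbp(struct regs *r, uint8_t *stack, size_t stack_size)
{
	uint64_t new_sp;

	assert(r != NULL && stack != NULL);
	if (r->rsp < sizeof(uint64_t) || r->rsp > stack_size)
		return (-1);
	new_sp = r->rsp - sizeof(uint64_t);
	/* Save the frame pointer (not the stack pointer) at the new top. */
	memcpy(stack + new_sp, &r->rbp, sizeof(r->rbp));
	r->rsp = new_sp;
	return (0);
}
```

Copying in the wrong direction, or copying %rsp instead of %rbp, leaves the slot at the new top of stack holding the wrong value, which is consistent with the crashes described above.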

Reported by:	symbolics@gmx.com
Tested by:	symbolics@gmx.com (earlier version)
MFC after:	2 weeks
2013-10-26 03:21:54 +00:00
Steven Hartland
c28078e903 Improve ZFS N-way mirror read performance by using load and locality
information.

The existing algorithm selects a preferred leaf vdev based on offset of the zio
request modulo the number of members in the mirror. It assumes the devices are
of equal performance and that spreading the requests randomly over both drives
will be sufficient to saturate them. In practice this results in the leaf vdevs
being underutilized.

The new algorithm takes into account the following additional factors:
* Load of the vdevs (number of outstanding I/O requests)
* The locality of last queued I/O vs the new I/O request.

Within the locality calculation, additional knowledge about the underlying vdev
is considered, such as whether the device backing the vdev is a rotating media
device.

This results in performance increases across the board as well as significant
increases for predominantly streaming loads and for configurations which don't
have evenly performing devices.

The following are results from a 3-way mirror with 2 x HDs and
1 x SSD, from a basic test running multiple parallel dd's.

With pre-fetch disabled (vfs.zfs.prefetch_disable=1):

== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 161 seconds @ 95 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 297 seconds @ 51 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 54 seconds @ 284 MB/s

With pre-fetch enabled (vfs.zfs.prefetch_disable=0):

== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 91 seconds @ 168 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 108 seconds @ 142 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 48 seconds @ 320 MB/s

In addition to the performance changes the code was also restructured, with
the help of Justin Gibbs, to provide a more logical flow which also ensures
vdev loads are only calculated from the set of valid candidates.

The following additional sysctls were added to allow the administrator
to tune the behaviour of the load algorithm:
* vfs.zfs.vdev.mirror.rotating_inc
* vfs.zfs.vdev.mirror.rotating_seek_inc
* vfs.zfs.vdev.mirror.rotating_seek_offset
* vfs.zfs.vdev.mirror.non_rotating_inc
* vfs.zfs.vdev.mirror.non_rotating_seek_inc
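
How load and locality might combine into a single score can be sketched as follows. This is a hedged illustration, not the committed code: the struct and function names are hypothetical, the tunable fields merely borrow the sysctl names above, and the rule of charging half the seek penalty for nearby offsets is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields named after the sysctls above; values are supplied by the caller. */
struct mirror_tunables {
	int rotating_inc;
	int rotating_seek_inc;
	uint64_t rotating_seek_offset;
	int non_rotating_inc;
	int non_rotating_seek_inc;
};

/* Hypothetical per-child state. */
struct mirror_child {
	int pending;		/* outstanding I/O requests */
	uint64_t last_offset;	/* offset just past the last queued I/O */
	int rotating;		/* nonzero for rotating media */
};

/* Score a child for a read at `offset`: lower is better. */
int
child_load(const struct mirror_child *c, uint64_t offset,
    const struct mirror_tunables *t)
{
	uint64_t dist;

	if (!c->rotating)
		return (c->pending + (c->last_offset == offset ?
		    t->non_rotating_inc : t->non_rotating_seek_inc));
	if (c->last_offset == offset)
		return (c->pending + t->rotating_inc);
	dist = c->last_offset > offset ?
	    c->last_offset - offset : offset - c->last_offset;
	/* Assumption: a short seek costs half of a long one. */
	if (dist < t->rotating_seek_offset)
		return (c->pending + t->rotating_seek_inc / 2);
	return (c->pending + t->rotating_seek_inc);
}

/* Return the index of the child with the lowest score. */
size_t
pick_child(const struct mirror_child *c, size_t n, uint64_t offset,
    const struct mirror_tunables *t)
{
	size_t best = 0;
	int best_load, load;

	assert(n > 0);
	best_load = child_load(&c[0], offset, t);
	for (size_t i = 1; i < n; i++) {
		load = child_load(&c[i], offset, t);
		if (load < best_load) {
			best = i;
			best_load = load;
		}
	}
	return (best);
}
```

With a scheme of this shape, a disk that is already positioned at the requested offset wins a streaming read even when an SSD in the same mirror has a similar queue depth.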

These changes were based on work started by the zfsonlinux developers:
https://github.com/zfsonlinux/zfs/pull/1487

Reviewed by:	gibbs, mav, will
MFC after:	2 weeks
Sponsored by:	Multiplay
2013-10-23 09:54:58 +00:00
Steven Hartland
70c3432663 Use the vdev's ashift to calculate the supported min block size passed to
zio_compress_data(..) when compressing l2arc buffers.

This eliminates l2arc I/O errors, which resulted in very poor performance on
vdevs configured with a block size greater than 512b, because compression
assumed a smaller min block size than the vdev supports.
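
The relationship between ashift and the minimum block size is a power of two, which the following sketch makes concrete. The helper names are hypothetical and do not correspond to functions in the ZFS sources.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest block the vdev can write: 2^ashift bytes. */
uint64_t
min_block_size(unsigned ashift)
{
	assert(ashift < 64);
	return (UINT64_C(1) << ashift);
}

/* Round a compressed length up to a whole number of device blocks. */
uint64_t
round_up_to_block(uint64_t len, unsigned ashift)
{
	uint64_t bs = min_block_size(ashift);

	return ((len + bs - 1) & ~(bs - 1));
}
```

A buffer that compresses to 600 bytes occupies 1024 bytes on an ashift=9 device but a full 4096 bytes on an ashift=12 device, so sizing the write for 512-byte blocks is wrong for the latter.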

MFC after:	2 days
2013-10-22 13:31:36 +00:00
Alexander Motin
40ea77a036 Merge GEOM direct dispatch changes from the projects/camlock branch.
When safety requirements are met, I/O requests can bypass the GEOM
g_up/g_down threads and execute directly in the caller's context.
This avoids CPU bottlenecks in the g_up/g_down threads and saves
several context switches per I/O.

The currently defined safety requirements are:
 - caller should not hold any locks and should be reentrant;
 - callee should not depend on GEOM dual-threaded concurrency semantics;
 - on the way down, if request is unmapped while callee doesn't support it,
   the context should be sleepable;
 - kernel thread stack usage should be below 50%.

To keep compatibility with GEOM classes that do not meet the above
requirements, new provider and consumer flags were added:
 - G_CF_DIRECT_SEND -- consumer code meets caller requirements (request);
 - G_CF_DIRECT_RECEIVE -- consumer code meets callee requirements (done);
 - G_PF_DIRECT_SEND -- provider code meets caller requirements (done);
 - G_PF_DIRECT_RECEIVE -- provider code meets callee requirements (request).
A capable GEOM class can set them, allowing direct dispatch in cases where
it is safe.  If any of the requirements is not met, the request is queued to
the g_up or g_down thread as before.
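
The request-side decision can be pictured as a single predicate over the requirements listed above. This is a sketch under assumptions: the flag names come from this message, but the bit values, struct dispatch_ctx and direct_request_ok() are hypothetical and only illustrate how the conditions combine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag names are from the commit message; the bit values are placeholders. */
#define G_CF_DIRECT_SEND	0x1
#define G_CF_DIRECT_RECEIVE	0x2
#define G_PF_DIRECT_SEND	0x1
#define G_PF_DIRECT_RECEIVE	0x2

/* Hypothetical snapshot of everything the decision depends on. */
struct dispatch_ctx {
	unsigned consumer_flags;
	unsigned provider_flags;
	bool unmapped_request;		/* request carries unmapped pages */
	bool callee_accepts_unmapped;
	bool can_sleep;			/* current context is sleepable */
	size_t stack_used;
	size_t stack_size;
};

/* Decide whether a request on the way down may skip the g_down thread. */
bool
direct_request_ok(const struct dispatch_ctx *c)
{
	assert(c->stack_size > 0);
	if (!(c->consumer_flags & G_CF_DIRECT_SEND))
		return (false);
	if (!(c->provider_flags & G_PF_DIRECT_RECEIVE))
		return (false);
	/* An unmapped request the callee cannot take needs a sleepable context. */
	if (c->unmapped_request && !c->callee_accepts_unmapped && !c->can_sleep)
		return (false);
	/* Kernel thread stack usage must be below 50%. */
	return (c->stack_used * 2 < c->stack_size);
}
```

The completion path would mirror this with G_PF_DIRECT_SEND on the provider and G_CF_DIRECT_RECEIVE on the consumer.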

The following GEOM classes were reviewed and updated to support direct dispatch:
CONCAT, DEV, DISK, GATE, MD, MIRROR, MULTIPATH, NOP, PART, RAID, STRIPE,
VFS, ZERO, ZFS::VDEV, ZFS::ZVOL, all classes based on g_slice KPI (LABEL,
MAP, FLASHMAP, etc).

To declare direct completion capability, the disk(9) KPI gained a new flag
equivalent to G_PF_DIRECT_SEND: DISKFLAG_DIRECT_COMPLETION.  The da(4) and
ada(4) disk drivers now set it, thanks to earlier CAM locking work.

This change more than doubles peak block storage performance on
systems with many CPUs; together with earlier CAM locking changes it reaches
more than 1 million IOPS (512 byte raw reads from 16 SATA SSDs on 4 HBAs to
256 user-level threads).

Sponsored by:	iXsystems, Inc.
MFC after:	2 months
2013-10-22 08:22:19 +00:00
Mark Johnston
7e75d58610 When fetching function arguments out of a frame on amd64, explicitly select
the register based on the argument index rather than relying on the fields
in struct reg to be in the right order. This assumption is incorrect on
FreeBSD and generally led to bogus argument values for the sixth argument
of PID and USDT probes; the first five are passed directly to dtrace_probe()
via the fasttrap trap handler and so were correctly handled.
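
Selecting the register explicitly by argument index looks roughly like the sketch below. struct frame_regs and fetch_arg() are hypothetical; the only fact relied upon is the System V amd64 convention that the first six integer arguments travel in %rdi, %rsi, %rdx, %rcx, %r8 and %r9.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical saved-register frame.  The field order is deliberately not
 * the argument order, so indexing into the struct would give wrong values.
 */
struct frame_regs {
	uint64_t r9, r8, rcx, rdx, rsi, rdi;
};

/* Return integer argument `argno` (0-5) by naming its register. */
uint64_t
fetch_arg(const struct frame_regs *r, int argno)
{
	assert(argno >= 0 && argno < 6);
	switch (argno) {
	case 0:
		return (r->rdi);
	case 1:
		return (r->rsi);
	case 2:
		return (r->rdx);
	case 3:
		return (r->rcx);
	case 4:
		return (r->r8);
	default:
		return (r->r9);
	}
}
```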

MFC after:	2 weeks
2013-10-21 04:15:55 +00:00
Mark Johnston
e572bc11ec Add a function, memstr, which can be used to convert a buffer of
null-separated strings to a single string. This can be used to print the
full arguments of a process using execsnoop (from the DTrace toolkit) or
with the following one-liner:

dtrace -n 'syscall::execve:return {trace(curpsinfo->pr_psargs);}'

Note that this relies on the process arguments being cached via the struct
proc, which means that it will not work for argvs longer than
kern.ps_arg_cache_limit. However, the following rather non-portable
script can be used to extract any argv at exec time:

fbt::kern_execve:entry
{
    printf("%s", memstr(args[1]->begin_argv, ' ',
        args[1]->begin_envv - args[1]->begin_argv));
}

The debug.dtrace.memstr_max sysctl limits the maximum argument size to
memstr(). Thanks to Brendan Gregg for helpful comments on freebsd-dtrace.
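
For readers unfamiliar with the argv layout, the conversion memstr performs can be approximated in plain C. This is an illustrative userland sketch, not the DTrace implementation; join_nul_separated() is a hypothetical name, and leaving a trailing NUL untouched is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy `len` bytes from `src` to `dst`, replacing every NUL except a
 * trailing one with `sep`, then terminate the result.  `dst` must have
 * room for len + 1 bytes.
 */
void
join_nul_separated(char *dst, const char *src, size_t len, char sep)
{
	assert(dst != NULL && src != NULL);
	for (size_t i = 0; i < len; i++)
		dst[i] = (src[i] == '\0' && i + 1 < len) ? sep : src[i];
	dst[len] = '\0';
}
```

Applied to the buffer "ls\0-l\0/tmp\0" with a space separator, this yields the single string "ls -l /tmp".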

Tested by:	Fabian Keil (earlier version)
MFC after:	2 weeks
2013-10-16 01:39:26 +00:00
Justin Hibbits
30b318b92f Add fasttrap for PowerPC. This is the last piece of the dtrace/ppc puzzle.
It's incomplete: it doesn't contain full instruction emulation, but it should
be sufficient for most cases.

MFC after:	1 month
2013-10-15 15:00:29 +00:00
Andriy Gapon
5d8fac897e MFV r255257: 4082 zfs receive gets EFBIG from dmu_tx_hold_free()
illumos change 14172:be36a38bac3d:
  illumos ZFS issues:
    4082 zfs receive gets EFBIG from dmu_tx_hold_free()

Please note that this change is slightly different from r255257, because
it is merged out of order with other (larger) upstream changes.

PR:		kern/182570
Reported by:	Keith White <kwhite@site.uottawa.ca>
Tested by:	Keith White <kwhite@site.uottawa.ca>
Approved by:	re (glebius)
MFC after:	1 week
X-MFC after:	r254753
2013-10-10 09:53:46 +00:00
Mark Johnston
cb7320ce7b Initialize and free the DTrace taskqueue in the dtrace module load/unload
handlers rather than in the dtrace device open/close methods. The current
approach can cause a panic if the device is closed while the taskqueue
thread is active, or if a kernel module containing a provider is unloaded
while retained enablings are present and the dtrace device isn't opened.

Submitted by:	gibbs (original version)
Reviewed by:	gibbs
Approved by:	re (glebius)
MFC after:	2 weeks
2013-10-08 12:56:46 +00:00
Xin LI
6eb151f212 Improve lzjb decompress performance by reorganizing the code
to tighten the copy loop.

Submitted by:	Denis Ahrens <denis h3q com>
MFC after:	2 weeks
Approved by:	re (gjb)
2013-10-08 01:38:24 +00:00
Justin T. Gibbs
69d1b777e8 Optimize the block size used on ZFS cache devices as is already done
for data and log devices.

Reported by:	Dmitryy Makarov
Submitted by:	smh
Reviewed by:	gibbs
Approved by:	re (delphij)
MFC after:	2 weeks
2013-09-21 03:52:08 +00:00
Xin LI
253aa02fc3 MFV r254750:
Add support of Illumos dumps on zvol over RAID-Z.

Note that this only adds the features.  FreeBSD would
still need more work to support dumping on zvols.

Illumos ZFS issues:
  2932 support crash dumps to raidz, etc. pools

MFC after:	1 month
Approved by:	re (ZFS blanket)
2013-09-21 00:17:26 +00:00
Davide Italiano
a25a7e386a Fixup cross-device rename checks in ZFS. Add a check for the case
where 'fdvp' is a directory, 'tvp' is an already existing directory
and they have different mount points.

Reported by:	avg, pjd
Reviewed by:	pjd
Approved by:	re (rodrigc)
2013-09-20 23:22:00 +00:00
Xin LI
e8de677c74 MFV r247844 (illumos-gate 13975:ef6409bc370f)
Illumos ZFS issues:
  3582 zfs_delay() should support a variable resolution
  3584 DTrace sdt probes for ZFS txg states

Provide a compatibility shim for Solaris's cv_timedwait_hires
to help aid future porting.

Approved by:	re (ZFS blanket)
2013-09-10 01:46:47 +00:00
Pawel Jakub Dawidek
7e473ea146 Add sysctl/tunables for various metaslab variables. 2013-09-05 00:53:01 +00:00
Pawel Jakub Dawidek
7008be5bd7 Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

	struct cap_rights {
		uint64_t	cr_rights[CAP_RIGHTS_VERSION + 2];
	};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.

To define a new right, the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.

	#define	CAP_PDKILL	CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine a few rights, but the rights have to
belong to the same array element, eg:

	#define	CAP_LOOKUP	CAPRIGHT(0, 0x0000000000000400ULL)
	#define	CAP_FCHMOD	CAPRIGHT(0, 0x0000000000002000ULL)

	#define	CAP_FCHMODAT	(CAP_FCHMOD | CAP_LOOKUP)

There is a new API to manage the new cap_rights_t structure:

	cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
	void cap_rights_set(cap_rights_t *rights, ...);
	void cap_rights_clear(cap_rights_t *rights, ...);
	bool cap_rights_is_set(const cap_rights_t *rights, ...);

	bool cap_rights_is_valid(const cap_rights_t *rights);
	void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
	void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
	bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

	cap_rights_t rights;

	cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

	#define	cap_rights_set(rights, ...)				\
		__cap_rights_set((rights), __VA_ARGS__, 0ULL)
	void __cap_rights_set(cap_rights_t *rights, ...);
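
The termination trick is ordinary C and can be tried outside the kernel. In this sketch mask_set() and mask_set_impl() are hypothetical stand-ins that OR plain bit masks together; they only demonstrate how the macro appends the 0ULL sentinel so that callers never write it.

```c
#include <assert.h>
#include <stdarg.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for __cap_rights_set(): OR every argument into
 * *mask until the 0ULL sentinel is reached.  Every variadic argument must
 * be an unsigned long long.
 */
void
mask_set_impl(unsigned long long *mask, ...)
{
	va_list ap;
	unsigned long long bit;

	assert(mask != NULL);
	va_start(ap, mask);
	while ((bit = va_arg(ap, unsigned long long)) != 0)
		*mask |= bit;
	va_end(ap);
}

/* The macro appends the terminator on the caller's behalf. */
#define	mask_set(mask, ...)	mask_set_impl((mask), __VA_ARGS__, 0ULL)
```

One consequence of this design is that zero can never be a valid right, since it would end the list early.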

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

	cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belong to the same array element this way is
correct, but is not advised. It should only be used for alias definitions.
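
The bit layout described above is enough to show how such a mix-up can be detected. The sketch below assumes that CAPRIGHT() places a one-hot index in bits 57-61, as the description implies; CAPIDXBITS() and right_to_index() are hypothetical helpers, not part of the API listed above.

```c
#include <assert.h>

/*
 * Assumed layout, following the description: bits 63-62 hold the array
 * size minus 2, bits 61-57 hold a one-hot array index, bits 56-0 hold
 * the rights themselves.
 */
#define	CAPRIGHT(idx, bit)	((1ULL << (57 + (idx))) | (bit))
#define	CAPIDXBITS(right)	(((right) >> 57) & 0x1fULL)

#define	CAP_LOOKUP	CAPRIGHT(0, 0x0000000000000400ULL)
#define	CAP_FCHMOD	CAPRIGHT(0, 0x0000000000002000ULL)
#define	CAP_PDKILL	CAPRIGHT(1, 0x0000000000000800ULL)

/*
 * Return the array index encoded in `right`, or -1 if no index bit or
 * more than one index bit is set (rights from different elements mixed).
 */
int
right_to_index(unsigned long long right)
{
	unsigned long long idx = CAPIDXBITS(right);
	int n = 0;

	if (idx == 0 || (idx & (idx - 1)) != 0)
		return (-1);
	while ((idx >>= 1) != 0)
		n++;
	assert(n < 5);
	return (n);
}
```

CAP_LOOKUP | CAP_FCHMOD still decodes to element 0, while CAP_LOOKUP | CAP_PDKILL sets two index bits and is rejected.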

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by:	The FreeBSD Foundation
2013-09-05 00:09:56 +00:00
Xin LI
1c1075ed93 Previously, both zfs_rename and zfs_link checked whether
the passed vnode belongs to the same mount point (v_vfsp or also
known as v_mount in FreeBSD).  This check prevents the code from
proceeding further on vnodes that do not belong to ZFS, for
instance, on UFS or NULLFS.

A recent upstream change (merged as r254585) replaces the
check of v_vfsp with a check of the znode's z_zfsvfs.  On Illumos
this works because, when the vnode comes from lofs,
VOP_REALVP() returns the right vnode.  This is not true on
FreeBSD, where VOP_REALVP is a no-op, so tdvp is
not guaranteed to be a ZFS vnode and will later trigger a
failed assertion when the vnode is verified.

This changeset modifies our local shims (zfs_freebsd_rename and
zfs_freebsd_link) to check if v_mount matches before proceeding
further.

Reported by:		many
Diagnostic work by:	avg
2013-08-28 00:39:47 +00:00
Mark Johnston
29f4e216f2 Rename the kld_unload event handler to kld_unload_try, and add a new
kld_unload event handler which gets invoked after a linker file has been
successfully unloaded. The kld_unload and kld_load event handlers are now
invoked with the shared linker lock held, while kld_unload_try is invoked
with the lock exclusively held.

Convert hwpmc(4) to use these event handlers instead of having
kern_kldload() and kern_kldunload() invoke hwpmc(4) hooks whenever files are
loaded or unloaded. This has no functional effect, but simplifies the linker
code somewhat.

Reviewed by:	jhb
2013-08-24 21:13:38 +00:00
Xin LI
439024135c MFV r254749:
Don't hold dd_lock for long by breaking it when not doing dsl_dir
accounting.  It is not necessary to hold the lock while manipulating
the parent's accounting, because there is no interface for userland
to see a consistent picture of both parent and child at the same
time anyway.

Illumos ZFS issues:
  4046 dsl_dataset_t ds_dir->dd_lock is highly contended
2013-08-24 00:42:37 +00:00
Xin LI
00e37ef129 MFV r254747:
Fix a panic from dbuf_free_range() from dmu_free_object() while
doing zfs receive.  This is a regression from FreeBSD r253821.

Illumos ZFS issues:
  4047 panic from dbuf_free_range() from dmu_free_object() while
       doing zfs receive
2013-08-24 00:19:26 +00:00
Xin LI
3f0164abf3 MFV r254422:
Illumos DTrace issues:
  3089 want ::typedef
  3094 libctf should support removing a dynamic type
  3095 libctf does not validate arrays correctly
  3096 libctf does not validate function types correctly
2013-08-23 23:21:24 +00:00
Andriy Gapon
2073a41a42 zfs: do not reject any operations on a pool just because it's a boot pool
Unlike upstream, FreeBSD supports booting from all kinds of pools.

Requested by:	many
Tested by:	sbruno
MFC after:	12 days
2013-08-23 14:43:32 +00:00
Andriy Gapon
05869c0ea7 zfs: inline and remove zfs_vnode_lock
It didn't serve any useful purpose, but obscured file and line information
useful for debugging.

MFC after:	5 days
X-MFC with:	r254445
2013-08-23 14:40:09 +00:00
Konstantin Belousov
5944de8ecd Remove the deprecated VM_ALLOC_RETRY flag for the vm_page_grab(9).
The flag was mandatory since r209792, where vm_page_grab(9) was
changed to only support the alloc retry semantic.

Suggested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
2013-08-22 07:39:53 +00:00
Kenneth D. Merry
7da1a731c6 Expand the use of stat(2) flags to allow storing some Windows/DOS
and CIFS file attributes as BSD stat(2) flags.

This work is intended to be compatible with ZFS, the Solaris CIFS
server's interaction with ZFS, somewhat compatible with MacOS X,
and of course compatible with Windows.

The Windows attributes that are implemented were chosen based on
the attributes that ZFS already supports.

The summary of the flags is as follows:

UF_SYSTEM:	Command line name: "system" or "usystem"
		ZFS name: XAT_SYSTEM, ZFS_SYSTEM
		Windows: FILE_ATTRIBUTE_SYSTEM

		This flag means that the file is used by the
		operating system.  FreeBSD does not enforce any
		special handling when this flag is set.

UF_SPARSE:	Command line name: "sparse" or "usparse"
		ZFS name: XAT_SPARSE, ZFS_SPARSE
		Windows: FILE_ATTRIBUTE_SPARSE_FILE

		This flag means that the file is sparse.  Although
		ZFS may modify this in some situations, there is
		not generally any special handling for this flag.

UF_OFFLINE:	Command line name: "offline" or "uoffline"
		ZFS name: XAT_OFFLINE, ZFS_OFFLINE
		Windows: FILE_ATTRIBUTE_OFFLINE

		This flag means that the file has been moved to
		offline storage.  FreeBSD does not have any special
		handling for this flag.

UF_REPARSE:	Command line name: "reparse" or "ureparse"
		ZFS name: XAT_REPARSE, ZFS_REPARSE
		Windows: FILE_ATTRIBUTE_REPARSE_POINT

		This flag means that the file is a Windows reparse
		point.  ZFS has special handling code for reparse
		points, but we don't currently have the other
		supporting infrastructure for them.

UF_HIDDEN:	Command line name: "hidden" or "uhidden"
		ZFS name: XAT_HIDDEN, ZFS_HIDDEN
		Windows: FILE_ATTRIBUTE_HIDDEN

		This flag means that the file may be excluded from
		a directory listing if the application honors it.
		FreeBSD has no special handling for this flag.

		The name and bit definition for UF_HIDDEN are
		identical to the definition in MacOS X.

UF_READONLY:	Command line name: "urdonly", "rdonly", "readonly"
		ZFS name: XAT_READONLY, ZFS_READONLY
		Windows: FILE_ATTRIBUTE_READONLY

		This flag means that the file may not be written or
		appended, but its attributes may be changed.

		ZFS currently enforces this flag, but Illumos
		developers have discussed disabling enforcement.

		The behavior of this flag is different than MacOS X.
		MacOS X uses UF_IMMUTABLE to represent the DOS
		readonly permission, but that flag has a stronger
		meaning than the semantics of DOS readonly permissions.

UF_ARCHIVE:	Command line name: "uarch", "uarchive"
		ZFS name: XAT_ARCHIVE, ZFS_ARCHIVE
		Windows: FILE_ATTRIBUTE_ARCHIVE

		The UF_ARCHIVE flag means that the file has changed and
		needs to be archived.  The meaning is the same as
		the Windows FILE_ATTRIBUTE_ARCHIVE attribute, and
		the ZFS XAT_ARCHIVE and ZFS_ARCHIVE attributes.

		msdosfs and ZFS have special handling for this flag,
		i.e. they will set it when the file changes.

sys/param.h:		Bump __FreeBSD_version to 1000047 for the
			addition of new stat(2) flags.

chflags.1:		Document the new command line flag names
			(e.g. "system", "hidden") available to the
			user.

ls.1:			Reference chflags(1) for a list of file flags
			and their meanings.

strtofflags.c:		Implement the mapping between the new
			command line flag names and new stat(2)
			flags.

chflags.2:		Document all of the new stat(2) flags, and
			explain the intended behavior in a little
			more detail.  Explain how they map to
			Windows file attributes.

			Different filesystems behave differently
			with respect to flags, so warn the
			application developer to take care when
			using them.

zfs_vnops.c:		Add support for getting and setting the
			UF_ARCHIVE, UF_READONLY, UF_SYSTEM, UF_HIDDEN,
			UF_REPARSE, UF_OFFLINE, and UF_SPARSE flags.

			All of these flags are implemented using
			attributes that ZFS already supports, so
			the on-disk format has not changed.

			ZFS currently doesn't allow setting the
			UF_REPARSE flag, and we don't really have
			the other infrastructure to support reparse
			points.

msdosfs_denode.c,
msdosfs_vnops.c:	Add support for getting and setting
			UF_HIDDEN, UF_SYSTEM and UF_READONLY
			in MSDOSFS.

			It supported SF_ARCHIVED, but this has been
			changed to be UF_ARCHIVE, which has the same
			semantics as the DOS archive attribute instead
			of inverse semantics like SF_ARCHIVED.

			After discussion with Bruce Evans, change
			several things in the msdosfs behavior:

			Use UF_READONLY to indicate whether a file
			is writeable instead of file permissions, but
			don't actually enforce it.

			Refuse to change attributes on the root
			directory, because it is special in FAT
			filesystems, but allow most other attribute
			changes on directories.

			Don't set the archive attribute on a directory
			when its modification time is updated.
			Windows and DOS don't set the archive attribute
			in that scenario, so we are now bug-for-bug
			compatible.

smbfs_node.c,
smbfs_vnops.c:		Add support for UF_HIDDEN, UF_SYSTEM,
			UF_READONLY and UF_ARCHIVE in SMBFS.

			This is similar to changes that Apple has
			made in their version of SMBFS (as of
			smb-583.8, posted on opensource.apple.com),
			but not quite the same.

			We map SMB_FA_READONLY to UF_READONLY,
			because UF_READONLY is intended to match
			the semantics of the DOS readonly flag.
			The MacOS X code maps both UF_IMMUTABLE
			and SF_IMMUTABLE to SMB_FA_READONLY, but
			the immutable flags have stronger meaning
			than the DOS readonly bit.
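
The attribute mapping described for SMBFS can be sketched as a pair of conversion helpers. Treat everything here as illustrative: the DOS attribute bits are the standard FAT values, the SMB_FA_* names other than SMB_FA_READONLY are assumptions, the UF_* values are local to the sketch (the authoritative definitions live in sys/stat.h), and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* DOS/SMB attribute bits (standard FAT attribute values). */
#define	SMB_FA_READONLY	0x01
#define	SMB_FA_HIDDEN	0x02
#define	SMB_FA_SYSTEM	0x04
#define	SMB_FA_ARCHIVE	0x20

/* Illustrative values for this sketch only; see sys/stat.h for the real ones. */
#define	UF_SYSTEM	0x00000080
#define	UF_ARCHIVE	0x00000800
#define	UF_READONLY	0x00001000
#define	UF_HIDDEN	0x00008000

/* Translate DOS attribute bits into stat(2)-style user flags. */
uint32_t
dos_attr_to_uflags(uint8_t attr)
{
	uint32_t flags = 0;

	if (attr & SMB_FA_READONLY)
		flags |= UF_READONLY;
	if (attr & SMB_FA_HIDDEN)
		flags |= UF_HIDDEN;
	if (attr & SMB_FA_SYSTEM)
		flags |= UF_SYSTEM;
	if (attr & SMB_FA_ARCHIVE)
		flags |= UF_ARCHIVE;
	return (flags);
}

/* Translate user flags back into DOS attribute bits. */
uint8_t
uflags_to_dos_attr(uint32_t flags)
{
	uint8_t attr = 0;

	if (flags & UF_READONLY)
		attr |= SMB_FA_READONLY;
	if (flags & UF_HIDDEN)
		attr |= SMB_FA_HIDDEN;
	if (flags & UF_SYSTEM)
		attr |= SMB_FA_SYSTEM;
	if (flags & UF_ARCHIVE)
		attr |= SMB_FA_ARCHIVE;
	assert(dos_attr_to_uflags(attr) ==
	    (flags & (UF_READONLY | UF_HIDDEN | UF_SYSTEM | UF_ARCHIVE)));
	return (attr);
}
```

Because the mapping is one-to-one, the archive bit keeps the DOS meaning in both directions, unlike the inverse semantics of SF_ARCHIVED.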

stat.h:			Add definitions for UF_SYSTEM, UF_SPARSE,
			UF_OFFLINE, UF_REPARSE, UF_ARCHIVE, UF_READONLY
			and UF_HIDDEN.

			The definition of UF_HIDDEN is the same as
			the MacOS X definition.

			Add commented-out definitions of
			UF_COMPRESSED and UF_TRACKED.  They are
			defined in MacOS X (as of 10.8.2), but we
			do not implement them (yet).

ufs_vnops.c:		Add support for getting and setting
			UF_ARCHIVE, UF_HIDDEN, UF_OFFLINE, UF_READONLY,
			UF_REPARSE, UF_SPARSE, and UF_SYSTEM in UFS.
			Alphabetize the flags that are supported.

			These new flags are only stored; UFS does
			not take any action if the flag is set.

Sponsored by:	Spectra Logic
Reviewed by:	bde (earlier version)
2013-08-21 23:04:48 +00:00
Justin T. Gibbs
5119608387 Add kstat entries for ZFS compression statistics.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio_compress.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio_compress.c:
	Add module lifetime functions to allocate and teardown
	state data.

	Report:
		- Compression attempts.
		- Buffers found to be empty.
		- Compression calls that are skipped because
		  the data length is already less than or
		  equal to the minimum block length.
		- Compression attempts that fail to yield a 12.5%
		  compression ratio.
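
The last two counters correspond to a simple size test, which the following sketch reconstructs from the description alone. The function and enum names are hypothetical, and aligning the target length down to the minimum block size is an assumption rather than something this message states.

```c
#include <assert.h>
#include <stddef.h>

enum compress_outcome {
	COMPRESS_SKIPPED_SMALL,		/* source no larger than one block */
	COMPRESS_INSUFFICIENT_GAIN,	/* saved less than 12.5% */
	COMPRESS_OK
};

/*
 * Largest compressed size still worth keeping: the source length reduced
 * by one eighth (12.5%), aligned down to the minimum block size, which
 * must be a power of two.
 */
size_t
compress_target_len(size_t src_len, size_t min_block)
{
	size_t target;

	assert(min_block != 0 && (min_block & (min_block - 1)) == 0);
	target = src_len - (src_len >> 3);
	return (target & ~(min_block - 1));
}

/* Classify one compression attempt the way the counters above would. */
enum compress_outcome
classify_compression(size_t src_len, size_t compressed_len, size_t min_block)
{
	size_t target;

	if (src_len <= min_block)
		return (COMPRESS_SKIPPED_SMALL);
	target = compress_target_len(src_len, min_block);
	if (target == 0 || compressed_len > target)
		return (COMPRESS_INSUFFICIENT_GAIN);
	return (COMPRESS_OK);
}
```

For a 128k source the cutoff is 114688 bytes, so a result of 120000 bytes counts as a failed attempt while 100000 bytes is kept.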

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c:
	Add calls to the zio_compress.c module's init and fini
	functions.

Sponsored by:	Spectra Logic Corporation
MFC after:	2 weeks
2013-08-21 19:40:43 +00:00
Justin T. Gibbs
439d30d121 Enhance the ZFS vdev layer to maintain both a logical and a physical
minimum allocation size for devices.  Use this information to
automatically increase ZFS's minimum allocation size for new top-level
vdevs to a value that more closely matches the optimum device
allocation size.

Use GEOM's stripesize attribute, if set, as the physical sector
size of the GEOM.

Calculate the minimum blocksize of each metaslab class.  Use the
calculated value instead of SPA_MINBLOCKSIZE (512b) when determining
the likelihood of compression yielding a reduction in physical space
usage.

Report devices with sub-optimal block size configuration in "zpool
status".  Also properly fail attempts to attach devices with a
logical block size greater than 8kB, since this will cause corruption
to ZFS's label area.

Sponsored by:	Spectra Logic Corporation
MFC after:	2 weeks

Background
==========
Many modern devices use physical allocation units that are much
larger than the minimum logical allocation size accessible by
external commands.  Two prevalent examples of this are 512e disk
drives (512b logical sector, 4K physical sector) and flash devices
(512b logical sector, 4K or larger allocation block size, and 128k
or larger erase block size).  Operations that modify less than the
physical sector size result in a costly read-modify-write or garbage
collection sequence on these devices.

Simply exporting the true physical sector of the device to ZFS would
yield optimal performance, but has two serious drawbacks:

1) Existing pools created with devices that have different logical
   and physical block sizes, but were configured to use the logical
   block size (e.g. because the OS version used for pool construction
   reported the logical block size instead of the physical block
   size) will suddenly find that the vdev allocation size has
   increased.  This can be easily tolerated for active members of
   the array, but ZFS would prevent replacement of a vdev with
   another identical device because it now appears that the smaller
   allocation size required by the pool is not supported by the new
   device.

2) The device's physical block size may be too large to be supported
   by ZFS.  The optimal allocation size for the vdev may be quite
   large.  For example, a RAID controller may export a vdev that
   requires read-modify-write cycles unless accessed using 64k
   aligned/sized requests.  ZFS currently has an 8k minimum block
   size limit.

Reporting both the logical and physical allocation sizes for vdevs
solves these problems.  A device may be used so long as the logical
block size is compatible with the configuration.  By comparing the
logical and physical block sizes, new configurations can be optimized
and administrators can be notified of any existing pools that are
sub-optimal.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa.h:
	Add the SPA_MAXASHIFT constant.  ZFS currently has a hard upper
	limit of 13 (8k) for ashift and this constant is used to
	both document and enforce this limit.

sys/cddl/contrib/opensolaris/uts/common/sys/fs/zfs.h:
	Add the VDEV_AUX_ASHIFT_TOO_BIG error code.

	Add fields for exporting the configured, logical, and
	physical ashift to the vdev_stat_t structure.

	Add VDEV_STAT_VALID() macro which can be used to verify the
	presence of required vdev_stat_t fields in nvlist data.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c:
	Provide a SYSCTL_PROC handler for "max_auto_ashift".  Since
	the limit is only referenced long after boot when a create
	operation occurs, there's no compelling need for it to be
	a boot time configurable tunable.  This also allows the
	validation code for the max_auto_ashift value to be contained
	within the sysctl handler.

	Populate the new fields in the vdev_stat_t structure.

	Fail vdev opens if the vdev reports an ashift larger than
	SPA_MAXASHIFT.

	Propagate vdev_logical_ashift and vdev_physical_ashift between
	child and parent vdevs as is done for vdev_ashift.

	In vdev_open(), restore code that fails opens for devices
	where vdev_ashift grows.  This can only happen now if the
	device's logical ashift grows, which means it really isn't
	safe to use the device.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_file.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_mirror.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_missing.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_root.c:
	Update the vdev_open() API so that both logical (what was
	just ashift before) and physical ashift are reported.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h:
	Add two new fields, vdev_physical_ashift and vdev_logical_ashift,
	to vdev_t.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_config.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:
	Add vdev_ashift_optimize().  Call it anytime a new top-level
	vdev is allocated.
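
One plausible shape for the optimization is to raise the allocation shift toward the physical ashift without exceeding the max_auto_ashift limit and without dropping below the logical ashift. The sketch below is a guess at that policy, not the body of vdev_ashift_optimize(); the function name and SKETCH_MAX_ASHIFT are hypothetical.

```c
#include <assert.h>

/* The 8k hard upper limit on ashift mentioned above (2^13 bytes). */
#define	SKETCH_MAX_ASHIFT	13

/*
 * Pick the ashift for a new top-level vdev: prefer the physical ashift,
 * cap it at the administrator's max_auto_ashift, and never go below the
 * logical ashift the device requires.
 */
unsigned
optimized_ashift(unsigned logical, unsigned physical, unsigned max_auto)
{
	unsigned want;

	assert(logical <= SKETCH_MAX_ASHIFT);
	assert(max_auto <= SKETCH_MAX_ASHIFT);
	want = physical < max_auto ? physical : max_auto;
	return (want > logical ? want : logical);
}
```

Under this policy a 512e drive (logical 9, physical 12) gets ashift 12, while a device reporting a 64k physical block is capped at the configured maximum.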

cddl/contrib/opensolaris/cmd/zpool/zpool_main.c:
	Add text for the VDEV_AUX_ASHIFT_TOO_BIG error.

	For each sub-optimally configured leaf vdev, report configured
	and native block sizes.

cddl/contrib/opensolaris/cmd/zpool/zpool_main.c:
cddl/contrib/opensolaris/lib/libzfs/common/libzfs.h:
cddl/contrib/opensolaris/lib/libzfs/common/libzfs_status.c:
	Introduce a new zpool status: ZPOOL_STATUS_NON_NATIVE_ASHIFT.
	This status is reported on healthy pools containing vdevs
	configured to use a block size smaller than their reported
	physical block size.

cddl/contrib/opensolaris/lib/libzfs/common/libzfs_status.c:
	Update find_vdev_problem() and supporting functions to
	provide the full vdev_stat_t structure to problem checking
	routines, and to allow descent into replacing vdevs.

	Add a vdev_non_native_ashift() validator which is used on
	the full vdev tree to check for ZPOOL_STATUS_NON_NATIVE_ASHIFT.

cddl/contrib/opensolaris/lib/libzpool/common/kernel.c:
cddl/contrib/opensolaris/lib/libzpool/common/sys/zfs_context.h:
	Enhance sysctl userland stubs now that a SYSCTL_PROC handler
	is used in vdev.c.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab_impl.h:
	When the group membership of a metaslab class changes (i.e.
	when a vdev is added or removed from a pool), walk the group
	list to determine the smallest block size currently available
	and record this in the metaslab class.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c:
	Add the metaslab_class_get_minblocksize() accessor.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio_compress.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio_compress.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:
	In zio_compress_data(), take the minimum blocksize as an
	input parameter instead of assuming SPA_MINBLOCKSIZE.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:
	In l2arc_compress_buf(), pass SPA_MINBLOCKSIZE as the minimum
	blocksize of the device.  The l2arc code has its own
	code for deciding if compression is worthwhile, so this
	effectively prevents zio_compress_data() from second guessing
	the original decision.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:
	In zio_write_bp_init(), use the minimum blocksize of the
	normal metaslab class when compressing data.
2013-08-21 04:10:24 +00:00
Xin LI
2640fb93f5 MFV r254421:
Illumos ZFS issues:
  3996 want a libzfs_core API to rollback to latest snapshot
2013-08-21 00:04:31 +00:00
Xin LI
c21d9cfe3d MFV r254220:
Illumos ZFS issues:
  4039 zfs_rename()/zfs_link() needs stronger test for XDEV
2013-08-20 22:31:13 +00:00
Pawel Jakub Dawidek
2c40899ecc Remove redundant variable. 2013-08-17 14:09:46 +00:00
Mark Johnston
12ede07ab8 Use kld_{load,unload} instead of mod_{load,unload} for the linker file load
and unload event handlers added in r254266.

Reported by:	jhb
X-MFC with:	r254266
2013-08-14 00:42:21 +00:00
Mark Johnston
8776669b53 FreeBSD's DTrace implementation has a few problems with respect to handling
probes declared in a kernel module when that module is unloaded. In
particular,

* Unloading a module with active SDT probes will cause a panic. [1]
* A module's (FBT/SDT) probes aren't destroyed when the module is unloaded;
  trying to use them after the fact will generally cause a panic.

This change fixes both problems by porting the DTrace module load/unload
handlers from illumos and registering them with the corresponding
EVENTHANDLER(9) handlers. This allows the DTrace framework to destroy all
probes defined in a module when that module is unloaded, and to prevent a
module unload from proceeding if some of its probes are active. The latter
problem has already been fixed for FBT probes by checking lf->nenabled in
kern_kldunload(), but moving the check into the DTrace framework generalizes
it to all kernel providers and also fixes a race in the current
implementation (since a probe may be activated between the check and the
call to linker_file_unload()).
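
The veto-or-destroy behaviour described above can be sketched roughly as
follows. The structure and function names are hypothetical and do not
correspond to the real DTrace framework; in a kernel, both loops would have
to run under one lock so that no probe can be enabled between the check and
the teardown.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical probe record; not the real DTrace structures. */
struct sk_probe {
	int	sp_module;	/* id of the module that declared the probe */
	int	sp_enabled;	/* nonzero while a consumer has it enabled */
	int	sp_destroyed;	/* set once the probe has been torn down */
};

/*
 * Unload handler sketch: refuse the unload with EBUSY if any of the
 * module's probes is enabled, otherwise destroy all of its probes.
 */
static int
sk_module_unload(struct sk_probe *probes, size_t nprobes, int module)
{
	size_t i;

	for (i = 0; i < nprobes; i++) {
		if (probes[i].sp_module == module &&
		    !probes[i].sp_destroyed && probes[i].sp_enabled)
			return (EBUSY);
	}
	for (i = 0; i < nprobes; i++) {
		if (probes[i].sp_module == module)
			probes[i].sp_destroyed = 1;
	}
	return (0);
}
```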

Additionally, the SDT implementation has been reworked to define SDT
providers/probes/argtypes in linker sets rather than using SYSINIT/SYSUNINIT
to create and destroy SDT probes when a module is loaded or unloaded. This
simplifies things quite a bit since it means that pretty much all of the SDT
code can live in sdt.ko, and since it becomes easier to integrate SDT with
the DTrace framework. Furthermore, this allows FreeBSD to be quite flexible
in that SDT providers spanning multiple modules can be created on the fly
when a module is loaded; at the moment it looks like illumos' SDT
implementation requires all SDT probes to be statically defined in a single
kernel table.

PR:		166927, 166926, 166928
Reported by:	davide [1]
Reviewed by:	avg, trociny (earlier version)
MFC after:	1 month
2013-08-13 03:10:39 +00:00
Rui Paulo
e009490afc fasttrap_fork(): unlock the processes before removing the tracepoints.
In the future, we'll need to come up with new proc_*() functions that accept
locked processes. For now, this prevents postgresql + DTrace from crashing the
system.

MFC after:	1 month
2013-08-11 00:57:01 +00:00
Attilio Rao
c7aebda8a1 The soft and hard busy mechanisms rely on the vm object lock to work.
Unify the two concepts into a real, minimal sx lock, where the shared
acquisition represents the soft busy and the exclusive acquisition
represents the hard busy.
The old VPO_WANTED mechanism becomes the hard-path for this new lock
and it becomes per-page rather than per-object.
The vm_object lock becomes an interlock for this functionality:
it can be held in either read or write mode.
However, if the vm_object lock is held in read mode while acquiring
or releasing the busy state, the thread owner cannot make any
assumption about the busy state unless it is also busying it.
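
As a rough, single-threaded illustration of the shared/exclusive semantics
(hypothetical names, no atomics or sleeping, and not the real vm_page KPI):

```c
#include <stdbool.h>

/* Hypothetical per-page busy state; not the real vm_page layout. */
struct sk_page {
	unsigned int	sp_sharers;	/* number of shared (soft) busy holders */
	bool		sp_exclusive;	/* true while exclusively (hard) busied */
};

/* Shared busy succeeds unless the page is exclusively busied. */
static bool
sk_try_sbusy(struct sk_page *p)
{
	if (p->sp_exclusive)
		return (false);
	p->sp_sharers++;
	return (true);
}

static void
sk_sunbusy(struct sk_page *p)
{
	p->sp_sharers--;
}

/* Exclusive busy succeeds only when nobody else busies the page. */
static bool
sk_try_xbusy(struct sk_page *p)
{
	if (p->sp_exclusive || p->sp_sharers != 0)
		return (false);
	p->sp_exclusive = true;
	return (true);
}

static void
sk_xunbusy(struct sk_page *p)
{
	p->sp_exclusive = false;
}
```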

Also:
- Add a new flag to directly shared-busy pages while vm_page_alloc
  and vm_page_grab are being executed.  This will be very helpful
  once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag

The KPI is heavily changed, which is why the version is bumped.
It is very likely that some VM ports users will need to change
their own code.

Sponsored by:	EMC / Isilon storage division
Discussed with:	alc
Reviewed by:	jeff, kib
Tested by:	gavin, bapt (older version)
Tested by:	pho, scottl
2013-08-09 11:11:11 +00:00
Xin LI
43667c1f68 MFV r254079:
Illumos ZFS issues:
  3957 ztest should update the cachefile before killing itself
  3958 multiple scans can lead to partial resilvering
  3959 ddt entries are not always resilvered
  3960 dsl_scan can skip over dedup-ed blocks if
       physical birth != logical birth
  3961 freed gang blocks are not resilvered and can cause pool to suspend
  3962 ztest should print out zfs debug buffer before exiting
2013-08-08 23:38:31 +00:00
Xin LI
9d2f243aa6 MFV r254071:
Fix a regression introduced by fix for Illumos bug #3834.  Quote from
Matthew Ahrens on the Illumos issue:

ztest fails this assertion because ztest_dmu_read_write() does
        dmu_tx_hold_free(tx, bigobj, bigoff, bigsize);
and then
    dmu_object_set_checksum(os, bigobj,
        (enum zio_checksum)ztest_random_dsl_prop(ZFS_PROP_CHECKSUM), tx);

If the region to free is past the end of the file, the DMU assumes that there
will be nothing to do for this object.  However, ztest does set_checksum(),
which must modify the dnode.  The fix is for ztest to also call

    dmu_tx_hold_bonus(tx, bigobj);

so we can account for the dirty data associated with setting the checksum

Illumos ZFS issues:
  3955 ztest failure: assertion refcount_count(&tx->tx_space_written)
         + delta <= tx->tx_space_towrite
2013-08-07 22:21:00 +00:00
Xin LI
4f7b34578b MFV r254070:
Merge vendor bugfix for ZFS test suite that triggers false positives.

Illumos ZFS issues:
  3949 ztest fault injection should avoid resilvering devices
  3950 ztest: deadman fires when we're doing a scan
  3951 ztest hang when running dedup test
  3952 ztest: ztest_reguid test and ztest_fault_inject don't place nice together
2013-08-07 21:16:14 +00:00
Xin LI
c668ff330e MFV r254011:
This change has no effect on FreeBSD but is integrated for
completeness.

Illumos ZFS issues:
  348 ZFS should handle DKIOCGMEDIAINFOEXT failure
2013-08-06 21:36:01 +00:00
Alexander Motin
d9aca4ed74 Block reporting of ZFS features for suspended pools.
Before executing any subcommand, the zpool tool fetches the pools'
configuration from the kernel.  Before features support was added, the kernel
regenerated that configuration from data always present in memory.
Unfortunately, the pool features list and activity counters are not. They are
stored in ZAP, which normally resides in ARC but may be swapped out under
heavy memory pressure.  If the pool is suspended at that point, there is no
way to recover it, since any zpool command will get stuck.

This change has one predictable flaw: `zpool upgrade` always wishes to upgrade
suspended pools, but fortunately it can't do so due to the suspension.
2013-08-06 14:41:41 +00:00
Alexander Motin
f8dcf872c4 Disable r252840 when ZFS TRIM is enabled (vfs.zfs.trim.enabled=1) and really
disable TRIM otherwise.

r252840 (illumos bug 3836) is based on the assumption that zio_free_sync() has
no lock dependencies and should complete immediately. Unfortunately, with our
TRIM implementation that is not true due to ZIO_STAGE_VDEV_IO_START added
to the ZIO_FREE_PIPELINE, which, while not really accessing devices, still
acquires the SCL_ZIO lock for read to be sure devices won't disappear.

When TRIM is disabled, this patch enables direct free execution from r252840
and removes the ZIO_STAGE_VDEV_IO_START and ZIO_STAGE_VDEV_IO_ASSESS stages
from the pipeline to avoid lock acquisition.  Otherwise it queues the free
request as it did before r252840.
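
The decision can be pictured with the following sketch. The stage bits and
the sk_plan_free() helper are hypothetical stand-ins, not the real zio
pipeline definitions.

```c
#include <stdbool.h>

/* Hypothetical stage bits standing in for the real zio stages. */
enum {
	SK_STAGE_DVA_FREE	= 1 << 0,
	SK_STAGE_VDEV_IO_START	= 1 << 1,
	SK_STAGE_VDEV_IO_ASSESS	= 1 << 2
};

#define	SK_FREE_PIPELINE						\
	(SK_STAGE_DVA_FREE | SK_STAGE_VDEV_IO_START | SK_STAGE_VDEV_IO_ASSESS)

struct sk_free_plan {
	unsigned int	fp_pipeline;	/* stages the free will run through */
	bool		fp_direct;	/* true: execute now; false: queue it */
};

static struct sk_free_plan
sk_plan_free(bool trim_enabled)
{
	struct sk_free_plan fp;

	if (trim_enabled) {
		/* The vdev stages take a lock, so the free must be queued. */
		fp.fp_pipeline = SK_FREE_PIPELINE;
		fp.fp_direct = false;
	} else {
		/* No vdev stages, no lock dependency: free directly. */
		fp.fp_pipeline = SK_FREE_PIPELINE &
		    ~(SK_STAGE_VDEV_IO_START | SK_STAGE_VDEV_IO_ASSESS);
		fp.fp_direct = true;
	}
	return (fp);
}
```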
2013-08-06 14:30:28 +00:00
Alexander Motin
526bb4af8a Make zpool clear also reopen reconnected cache and spare devices.
Since `zpool status` reports such kinds of errors, it is strange
that they are not cleared by `zpool clear`.
2013-08-06 14:23:33 +00:00
Alexander Motin
ad727e8d64 Make ZFS use a separate thread to handle SPA_ASYNC_REMOVE async events.
The existing async thread runs only on successful spa_sync() completion,
which is impossible when a pool loses its required (last) disk(s).  That
indefinite delay of SPA_ASYNC_REMOVE processing kept ZFS from closing the
lost disks, preventing GEOM/CAM from destroying devices and reusing names
on later disk reattach.

In an earlier version of the patch I tried to just run the existing thread
immediately, independent of spa_sync() completion, but that exposed a number
of situations where it could get stuck due to locks held by a stuck
spa_sync() that are required for other kinds of async events.

Experiments with an OpenIndiana snapshot confirmed that it also has this
issue with reattaching lost disks.
2013-08-06 14:20:41 +00:00
Attilio Rao
be99683637 Revert r253939:
We cannot busy a page before doing pagefaults.
In fact, it can deadlock against the vnode lock, as it tries to vget().
Other functions, right now, have the opposite lock ordering, like
vm_object_sync(), which acquires the vnode lock first and then
sleeps on the busy mechanism.

Before this patch is reinserted we need to break this ordering.

Sponsored by:	EMC / Isilon storage division
Reported by:	kib
2013-08-05 08:55:35 +00:00
Attilio Rao
3b6714cacb The page hold mechanism is fast but it has a couple of drawbacks:
- It does not let pages respect the LRU policy
- It bloats the active/inactive queues with a few pages

Try to avoid it as much as possible, with the long-term goal of
removing it completely.
Use the soft-busy mechanism to protect page content accesses during
short-term operations (like uiomove_fromphys()).

After this change only vm_fault_quick_hold_pages() still uses the
hold mechanism for page content access.
There is additional complexity there, as the quick path cannot
immediately access the page object to busy the page, and the slow path
cannot busy more than one page at a time (to avoid deadlocks).

Fixing this primitive can lead to complete removal of the page hold
mechanism.

Sponsored by:	EMC / Isilon storage division
Discussed with:	alc
Reviewed by:	jeff
Tested by:	pho
2013-08-04 21:07:24 +00:00
Steven Hartland
e44e975c1b zfs_ioc_rename should not leave the value of zc_name passed in via zc altered
on return.

MFC after:	1 week
2013-08-04 11:38:08 +00:00
Xin LI
bd3d1456a5 MFV r253783:
Skip eviction step of processing free records when doing ZFS
receive to avoid the expensive search operation of non-existent
dbufs in dn_dbufs.

Illumos ZFS issues:
  3834 incremental replication of 'holey' file systems is slow

MFC after:      2 weeks
2013-07-30 21:35:02 +00:00
Xin LI
1c4ead73c6 MFV r253782:
To quote Illumos issue #3888:

When 'zfs recv -F' is used with an incremental recv it rolls
back any changes made since the last snapshot in case new
changes were made to the file system while the recv is in
progress (without -F the recv would fail when it does its
final check to commit the recv-ed data as the recv-ed data
conflicts with the newly written data).

However, if there is a snapshot taken after the recv began
rolling back to the 'latest' snapshot will not help and the
recv will still fail. 'zfs recv -F' should be extended to
destroy any snapshots created since the source snapshot when
finishing the recv (effectively rolling back through all
snapshots, instead of just to the latest snapshot).
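
The proposed behaviour amounts to destroying every snapshot newer than the
incremental source. A small sketch, using hypothetical types and a creation
transaction group number as the ordering key (an assumption of this sketch,
not something the issue text states):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot record; not a real ZFS structure. */
struct sk_snapshot {
	uint64_t	ss_txg;		/* creation order key (assumed) */
	int		ss_destroyed;	/* set once the snapshot is destroyed */
};

/*
 * Destroy every snapshot created after the incremental source and
 * return how many were destroyed.
 */
static size_t
sk_destroy_newer(struct sk_snapshot *snaps, size_t n, uint64_t source_txg)
{
	size_t i, destroyed = 0;

	for (i = 0; i < n; i++) {
		if (!snaps[i].ss_destroyed && snaps[i].ss_txg > source_txg) {
			snaps[i].ss_destroyed = 1;
			destroyed++;
		}
	}
	return (destroyed);
}
```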

Illumos ZFS issues:
  3888 zfs recv -F should destroy any snapshots created since the
       incremental source

MFC after:	2 weeks
2013-07-30 21:20:12 +00:00
Xin LI
d637247e1f MFV r253781 + r253871:
Illumos ZFS issues:
  3894 zfs should not allow snapshot of inconsistent dataset

MFC after:	2 weeks
2013-07-30 21:02:09 +00:00
Xin LI
44e362e207 MFV r253780:
To quote Illumos #3875:

The problem here is that if we ever end up in the error
path, we drop the locks protecting access to the zfsvfs_t
prior to forcibly unmounting the filesystem. Because z_os
is NULL, any thread that had already picked up the zfsvfs_t
and was sitting in ZFS_ENTER() when we dropped our locks
in zfs_resume_fs() will now acquire the lock, attempt to
use z_os, and panic.

Illumos ZFS issues:
  3875 panic in zfs_root() after failed rollback

MFC after:	2 weeks
2013-07-30 20:37:32 +00:00
Alexander Motin
ec4d2e0d96 Allow three IOCTLs to be used on a suspended pool, restoring the state
that existed before the IOCTL code refactoring merged change 4445fffb from
illumos at r248571.

This change allows `zpool clear` to be used again to recover a suspended pool.
It seems to be the only way supported by the code to restore pool operation
after reconnecting lost disks that were required for data completeness.  There
are still cases where the `zpool clear` command can get stuck due to
deadlocks inside the ZFS kernel part, but that is probably better than having
no chance to recover at all.
2013-07-30 14:50:44 +00:00