Commit Graph

1428 Commits

Author SHA1 Message Date
Xin LI
7c1db36b28 MFV r270196:
Illumos issue:
    5047 don't use atomic_*_nv if you discard the return value

MFC after:	2 weeks
2014-08-20 22:39:26 +00:00
Xin LI
249ddb42f6 MFC r270195:
Illumos issue:
    5045 use atomic_{inc,dec}_* instead of atomic_add_*

MFC after:	2 weeks
2014-08-20 21:44:48 +00:00
Xin LI
2bcc37f99c MFV r270193:
Illumos issues:
    5042 stop using deprecated atomic functions

MFC after:	2 weeks
2014-08-20 18:29:18 +00:00
Mark Johnston
266b4a78c2 Factor out the common code for function boundary tracing instead of
duplicating the entire implementation for both x86 and powerpc. This makes
it easier to add support for other architectures and has no functional
impact.

Phabric:	D613
Reviewed by:	gnn, jhibbits, rpaulo
Tested by:	jhibbits (powerpc)
MFC after:	2 weeks
2014-08-16 21:42:55 +00:00
Xin LI
60723bfe21 MFV r269542:
In vdev_get_stats, check that the vdev is not a hole before computing the
fragmentation.  This fixes a panic when removing log device.

Illumos issue:
    5049 panic when removing log device

Author:		Alex Reece <alex@delphix.com>
MFC after:	2 weeks
2014-08-05 00:07:21 +00:00
Mark Johnston
2661328745 Return 0 for the PPID of threads in process 0, as process 0 doesn't have a
parent process.

MFC after:	2 weeks
2014-08-04 19:02:30 +00:00
Xin LI
cd741a5e1d Revert r269404 and use cpu_ticks() for dbuf allocation.
Encode CPU's number by XOR'ing the CPU ID against the 64-bit cpu_ticks().

Reviewed by:	mav, gibbs
Differential Revision: https://phabric.freebsd.org/D521
MFC after:	2 weeks
2014-08-03 09:47:51 +00:00
Xin LI
1dcef10eac MFV r269427:
In dnode_children_t, use C99's "[]" idiom for declaring the variable
sized array dnc_children at the end of the structure.

This prevents the compiler from mistakenly optimizing away accesses
beyond the array's defined size.

Illumos issue:
    5038 Remove "old-style" flexible array usage in ZFS.
    Author: Justin T. Gibbs <justing@spectralogic.com>

MFC after:	2 weeks
2014-08-02 08:34:22 +00:00
Ian Lepore
c311f7078c When arm 64-bit atomic ops are available, define ARM_HAVE_ATOMIC64. Use
that symbol (which will be correct in both kernel and userland contexts)
rather than just __arm__ to decide whether to use a local implementation.
2014-08-02 03:44:27 +00:00
Ian Lepore
814f4c5896 Use the 64-bit atomics now provided by arm machine/atomic.h instead of
(conflicting) local versions.
2014-08-01 23:45:50 +00:00
Steven Hartland
6a369c018c Don't return ZIO_PIPELINE_CONTINUE from vdev_op_io_start methods
This prevents recursion of vdev_queue_io_done as per r265321 but
using a different method as recommended on the openzfs list.

We now use zio_interrupt(zio) and return ZIO_PIPELINE_STOP instead
of returning ZIO_PIPELINE_CONTINUE from vdev_*_io_start methods.

zio_vdev_io_start now ASSERTS the that vdev_op_io_start returns
ZIO_PIPELINE_STOP to ensure future changes don't reintroduce
ZIO_PIPELINE_CONTINUE returns.

Cleanup flow in vdev_geom_io_start while I'm here.

Also fix some cases not using SET_ERROR(..)

MFC after:	2 weeks
X-MFC-With:	r265321
2014-08-01 23:16:48 +00:00
Xin LI
125f68e708 Split gethrtime() and gethrtime_waitfree() and make the former use
nanouptime() instead of getnanouptime().  nanouptime(9) provides more
precise result at expense of being slower.

In r269223, gethrtime() is used as creation time of dbuf, which in turn
acts as portion of lookup key to maintain AVL invariant where there can
not be duplicate items.  Before this change, gethrtime() have preferred
better execution time by sacrificing precision, which may lead to panic
on busy systems with:

	panic: avl_find() succeeded inside avl_add()

Reported by:	allanjude, mav
PR:		kern/192284
MFC after:	11 days
X-MFC-with:	r269223
2014-08-01 22:33:23 +00:00
Rui Paulo
d18aa577d5 Copy strtolctype.h to sys/cddl/contrib/opensolaris/common/util to keep
the kernel self-contained.

Requested by:	jhb
2014-07-31 08:07:23 +00:00
Xin LI
9b046b421f MFV r269224:
Increase default ARC buf_hash_table size.  When typical block size is small,
the hash table could be too small, which would lead to long hash chains and
limit performance for cached reads.

A new loader tunable, vfs.zfs.arc_average_blocksize, have been added which
allows users to override the default assumption of average (typical) block
size.  Old default was 65536 (64 KiB) and new default is 8192 (8 KiB).

Illumos issue:
    5034 ARC's buf_hash_table is too small

MFC after:	2 weeks
2014-07-29 09:36:48 +00:00
Xin LI
a3cbca537e MFV r269223:
Change dn->dn_dbufs from linked list to AVL tree.

Illumos issues:
  4873 zvol unmap calls can take a very long time for larger datasets

MFC after:	2 weeks
2014-07-29 08:42:22 +00:00
Xin LI
343c95a24e Reschedule the 'deadman' callout after handling, this makes our
code behave more like it is on Solaris.

Reported by:	avg
Reviewed by:	avg, mav (but bugs are mine)

Differential Revision: https://phabric.freebsd.org/D457
2014-07-29 06:57:13 +00:00
Konstantin Belousov
fe0e9a63e0 Initialize zfs vnode v_hash when the vnode is allocated, instead of
postponing it to zfs_vget().  zfs_root() returned vnode with the
default value of v_hash, which caused inconsistent v_hash value when
root vnode was obtained from zfs_vget().

Nullfs allocated two upper vnodes for the root zfs vnode due to
different hashes, causing consistency problems.

Reported and tested by:	Harald Schmalzbauer <h.schmalzbauer@omnilan.de>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-07-28 14:24:18 +00:00
Xin LI
50b74c6ef1 Add two sysctls for newly added tunables.
MFC after:	2 weeks
2014-07-26 19:07:08 +00:00
Xin LI
7e37b1e609 MFV r269010:
Import Illumos changes to address the following Illumos issues:
  4976 zfs should only avoid writing to a failing non-redundant
       top-level vdev
  4978 ztest fails in get_metaslab_refcount()
  4979 extend free space histogram to device and pool
  4980 metaslabs should have a fragmentation metric
  4981 remove fragmented ops vector from block allocator
  4982 space_map object should proactively upgrade when feature
       is enabled
  4984 device selection should use fragmentation metric

MFC after:	2 weeks
2014-07-26 10:20:48 +00:00
Alexander Motin
1bc04f6a8c Make sysctls under vfs.zfs.zfetch writeable.
I don't see any reason for them to be read-only, while tuning them without
reboot is much more convenient for experiments.

MFC after:	2 weeks
2014-07-26 09:09:14 +00:00
Xin LI
0aa4ce9b7d Transform the I/O when vdev_physical_ashift is greater than
SPA_MINBLOCKSHIFT.

MFC after:	2 weeks
2014-07-25 18:41:56 +00:00
Xin LI
883d80c104 As of r268075, the responsibility of rounding up buffer to optimal size have
been transferred from zio_compress_data to its caller.  Therefore, passing
the 'minblocksize' down will be a no-op.

Eliminate the parameter to reduce diff against upstream.

MFC after:	2 weeks
2014-07-25 06:53:20 +00:00
Xin LI
3d4d6b0883 Correct typo introduced with r268855.
MFC after:	10 days
X-MFC with:	r268855
2014-07-22 08:37:01 +00:00
Mark Johnston
5a5f9d21dd Use a C wrapper for trap() instead of checking and calling the DTrace trap
hook in assembly.

Suggested by:	kib
Reviewed by:	kib (original version)
X-MFC-With:	r268600
2014-07-19 02:27:31 +00:00
Xin LI
b4bb49887b Reduce lock contention on the z_teardown_lock under heavily cached
read workload by splitting the single teardown rrw lock into
RRM_NUM_LOCKS (17) of them.

Read acquisitions are randomly distributed among these locks based
on curthread pointer.  Write acquisitions are going to all the
locks, which for the usage of this type of lock should be rare.

Illumos issue:
    5008 lock contention (rrw_exit) while running a read only load

MFC after:	2 weeks
2014-07-19 00:26:03 +00:00
Xin LI
82599d31fe MFV r268851:
When a sync task is waiting for a txg to complete, we should hurry it along
by increasing the number of outstanding async writes (i.e. make
vdev_queue_max_async_writes() return a larger number).

Illumos issue:
    4753 increase number of outstanding async writes when sync task is waiting

MFC after:	2 weeks
2014-07-18 22:34:01 +00:00
Xin LI
f886b6e3bc MFV r268850:
Change the interaction between the DMU and ARC so that when the DMU is
shutting down an objset, we do not evict the data from the ARC.  Instead
we simply coordinate the destruction of the DMU's data with the ARC.

The only case where we actually need to explicitly evict from the ARC is
when dbuf_rele_and_unlock() determines that the administrator has requested
that it not be kept in memory, via the primarycache/secondarycache properties.
In this case, we evict the data from the ARC by its blkptr_t, the same way
as when a block is freed we explicitly evict it from the ARC.

Illumos issue:
    4631 zvol_get_stats triggering too many reads

MFC after:	2 weeks
2014-07-18 22:04:21 +00:00
Xin LI
7882b61f60 MFV r268848:
Instead of asserting all zio's be properly aligned, only assert
on the logical ones.

Cap uberblocks at 8k, otherwise with ashift=17, there would be
only one uberblock.

This fixes a problem that zdb would trip assert on pools with
ashift >= 0xe (8k).

While there, also change the code so it only attempt to condense
space map unless the uncondensed size consumes greater than
zfs_metaslab_condense_block_threshold blocks.

Illumos issue:
  4958 zdb trips assert on pools with ashift >= 0xe

MFC after:	2 weeks
2014-07-18 20:41:40 +00:00
Xin LI
7079d5877c MFV r268714:
Improve extreme rewind import.

When doing an "extreme rewind" import ("zpool import -XF"), we attempt
to verify all data in the pool, essentially scrubbing the entire pool.
The problem is that spa_load_verify_cb() issues an unbounded number of
concurrent scrub i/os.  This can lead to all of memory being used for
these zio's, wedging the system. Like normal scrub, we need to put a
cap on the number of outstanding i/os, and have the traverse thread
block when we reach this cap.

For this purpose the cap can be very large (10,000) to optimize the
elevator algorithm.  Three kernel tunables have been added:

	vfs.zfs.spa_load_verify_maxinflight
	vfs.zfs.spa_load_verify_metadata
	vfs.zfs.spa_load_verify_data

The latter two tunables controls whether metadata and/or user data
when doing extreme rewind.

Make 'zpool import -T' imply scrub.

Make zpool import -T <txg> accept hexadecimal values for the txg when
prefixed with 0x.

Skip txg's for which there is no uberblock when doing extreme rewind.

Skip reading all user data twice by skipping prefetches when doing
extreme rewinds as we do not access via the ARC.

Illumos issues:
  4970 need controls on i/o issued by zpool import -XF
  4971 zpool import -T should accept hex values
  4972 zpool import -T implies extreme rewind, and thus a scrub
  4973 spa_load_retry retries the same txg
  4974 spa_load_verify() reads all data twice

MFC after:	2 weeks
2014-07-15 22:44:04 +00:00
Xin LI
eb75155228 MFV r268702:
Add missing *_destroy() calls in various places with ZFS.

Illumos issue:
  4975 missing mutex_destroy() calls in zfs

MFC after:	2 weeks
2014-07-15 20:32:23 +00:00
Mark Johnston
291624fdf6 Invoke the DTrace trap handler before calling trap() on amd64. This matches
the upstream implementation and helps ensure that a trap induced by tracing
fbt::trap:entry is handled without recursively generating another trap.

This makes it possible to run most (but not all) of the DTrace tests under
common/safety/ without triggering a kernel panic.

Submitted by:	Anton Rang <anton.rang@isilon.com> (original version)
Phabric:	D95
2014-07-14 04:38:17 +00:00
Xin LI
1b174fa1eb MFV r268455:
Use reserved space for ZFS administrative commands.

We reserve 1/2^spa_slop_shift = 1/32 or 3.125% of pool space (or 32MB at
least) for system use.  Most ZPL operations, e.g. write(2), creat(2), will
fail with ENOSPC if we fall below this.

Certain operations, e.g. file removal and most administrative actions,
still permitted until half of the slop space is used.  This would allow
users to use these operations to free up space in the pool when pool is
close to full but half of slop space is still free.

A very restricted set of operations that frees up space or change quota
are always permitted, regardless of the amount of free space.

MFC after:	 2 weeks
2014-07-09 23:14:59 +00:00
Xin LI
fdc0ee2cf5 MFV r268452:
Explicitly mark file removal transactions as "presumed to result
in a net free of space" so they will not fail with ENOSPC.

Illumos issue:	4950 files sometimes can't be removed from a full
		filesystem
MFC after:	2 weeks
2014-07-09 18:32:40 +00:00
Alexander Motin
e327a057a7 Remove IO_SYNC flag when writing extended file attributes on ZFS.
While it is possible to create and write file, modify its permissions, etc.
without ever doing sync, it looks odd that it is required for setting
extended file attributes on ZFS.  UFS does not do sync there too.

Samba uses those extended attributes to store some its data, and doing it
synchronously by many times reduces file creation performance for systems
without SLOG device.

Reviewed by:	delphij, jpaetzel, silence on fs@
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2014-07-08 17:26:08 +00:00
Marcel Moolenaar
e7d939bda2 Remove ia64.
This includes:
o   All directories named *ia64*
o   All files named *ia64*
o   All ia64-specific code guarded by __ia64__
o   All ia64-specific makefile logic
o   Mention of ia64 in comments and documentation

This excludes:
o   Everything under contrib/
o   Everything under crypto/
o   sys/xen/interface
o   sys/sys/elf_common.h

Discussed at: BSDcan
2014-07-07 00:27:09 +00:00
Pedro F. Giffuni
5f40879138 Merge from OpenSolaris (24-Jul-2010):
6679140 asymmetric alloc/dealloc activity can induce dynamic variable drops
6679193 dtrace_dynvar walker produces flood of dtrace_dynhash_sink

This finishes a set of merges from the older OpenSolaris releases.
Still the FreeBSD port has many differences that are difficult to
account for but that seems normal given that the kernels are different.

MFC after:	1 week
2014-07-05 15:36:17 +00:00
Pedro F. Giffuni
99d8c6efbd Merge from OpenSolaris (30-Jun-2009):
6851093 system drops to kmdb with anonymous dtrace probes + kmdb

This has no effect on FreeBSD (code is ifdef'ed) but is useful as
reference for future merges.

MFC after:	1 week
2014-07-03 19:25:24 +00:00
Pedro F. Giffuni
87e109c3e0 Merge from OpenSolaris (22-Apr-2008):
6823388 DTrace ioctl handlers must validate all structure members

MFC after:	1 week
2014-07-03 19:07:37 +00:00
Pedro F. Giffuni
e099b3a948 Merge from OpenSolaris (20-Apr-2008):
6822482 DOF validation needs to handle loadable sections flagged as unloadable

MFC after:	1 week
2014-07-03 17:36:59 +00:00
Alexander Motin
5a178afd41 Fix bug in sync control in new "dev" mode of ZVOL (r265678).
Don't check ZVOL_WCE flag, used in Solaris to control device "write cache".
It is not applicable on FreeBSD and by default set to "disable".

MFC after:	3 days
2014-07-02 21:25:32 +00:00
Pedro F. Giffuni
0b8f286e83 Merge from OpenSolaris (15-Sep-2008):
6735480 race between probe enabling and provider registration

MFC after:	1 week
2014-07-01 23:37:24 +00:00
Xin LI
30324e945a MFV r268122:
4929 want prevsnap property

illumos/illumos-gate@b461c7460e

MFC after:	2 weeks
2014-07-01 22:42:53 +00:00
Xin LI
9cc8a15b2e MFV r268121:
4924 LZ4 Compression for metadata

illumos/illumos-gate@b8289d24d8

MFC after:	2 weeks
2014-07-01 22:31:09 +00:00
Pedro F. Giffuni
f384ec379c Small merges from OpenSolaris:
These have no effect on FreeBSD, in fact they are ifdef'ed,
but make easier future merges:

6699767 panic in spec_open()

6718877 crgetzoneid() use can cause problems when forking processes with
USDT providers in a non global zone

MFC after:	3 days
2014-07-01 22:16:44 +00:00
Xin LI
aa882b9048 MFV r268119:
4914 zfs on-disk bookmark structure should be named *_phys_t

illumos/illumos-gate@7802d7bf98

MFC after:	2 weeks
2014-07-01 21:51:30 +00:00
Xin LI
55f6421982 - Fix handling of "new" style of ioctl in compatiblity mode [1];
- Reorganize code and reduce diff from upstream;
 - Improve forward compatibility shims for previous kernel;

Reported by:	sbruno [1]
X-MFC-With:	r268075
2014-07-01 20:57:39 +00:00
Pedro F. Giffuni
c6d712caf3 Revert r268007, and re-adapt MFV r260708:
4427 pid provider rejects probes with valid UTF-8 names

Use of u8_textprep.c required -Wno-cast-qual for powerpc.

MFC after:	2 weeks
2014-07-01 15:36:05 +00:00
Xin LI
be78a8db97 MFV r267570:
4756 metaslab_group_preload() could deadlock

illumos/illumos-gate@30beaff42d

MFC after:	2 weeks
2014-07-01 08:36:56 +00:00
Xin LI
3a0f8ff95e MFV r267569:
4897 Space accounting mismatch in L2ARC/zpool

illumos/illumos-dist@3038a2b421

MFC after:	2 weeks
2014-07-01 08:28:49 +00:00
Xin LI
93b8d53c09 MFV r267567:
4881 zfs send performance degradation when embedded block pointers are
     encountered

illumos/illumos-gate@06315b795c

MFC after:	2 weeks
2014-07-01 07:56:07 +00:00
Xin LI
71eaf0fda7 MFV r267566:
4390 i/o errors when deleting filesystem/zvol can lead to space map corruption

MFC after:	2 weeks
2014-07-01 07:29:42 +00:00
Xin LI
29441ba3fa MFV r267565:
4757 ZFS embedded-data block pointers ("zero block compression")
4913 zfs release should not be subject to space checks

MFC after:	2 weeks
2014-07-01 06:43:15 +00:00
Pedro F. Giffuni
0135aadfc3 Reduce some warnings in the Solaris unicode support.
Clean some warnings from parenthesis and minor style issues.

MFC after:	3 days
2014-06-29 02:28:05 +00:00
Pedro F. Giffuni
f34dd28f7d Revert r267869:
MFV	r260708
4427 pid provider rejects probes with valid UTF-8 names

Use of u8_textprep.c broke the build on powerpc.

Reported by:	bz, rpaulo and tinderbox.
Pointyhat:	me
2014-06-28 19:59:12 +00:00
Hans Petter Selasky
af3b2549c4 Pull in r267961 and r267973 again. Fix for issues reported will follow. 2014-06-28 03:56:17 +00:00
Glen Barber
37a107a407 Revert r267961, r267973:
These changes prevent sysctl(8) from returning proper output,
such as:

 1) no output from sysctl(8)
 2) erroneously returning ENOMEM with tools like truss(1)
    or uname(1)
 truss: can not get etype: Cannot allocate memory
2014-06-27 22:05:21 +00:00
Hans Petter Selasky
3da1cf1e88 Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after:	2 weeks
Sponsored by:	Mellanox Technologies
2014-06-27 16:33:43 +00:00
Rui Paulo
a43f0be9fe MFV illumos
4471 DTrace count() with histogram
4472 DTrace full width distribution histograms
4473 DTrace frequency trails

MFC after:	2 weeks
2014-06-26 23:24:59 +00:00
Rui Paulo
8e648814b0 MFV illumos
4474 DTrace Userland CTF Support
4475 DTrace userland Keyword
4476 DTrace tests should be better citizens
4479 pid provider types
4480 dof emulation is missing checks

MFC after:	2 weeks
2014-06-26 23:21:11 +00:00
Rui Paulo
b1f9167f94 MFV illumos
4477 DTrace should speak JSON

MFC after:	2 weeks
2014-06-26 21:45:49 +00:00
Rui Paulo
0c2b601953 MFV illumos r266986:
2915 DTrace in a zone should see "cpu", "curpsinfo", et al
2916 DTrace in a zone should be able to access fds[]
2917 DTrace in a zone should have limited provider access

MFC after:	2 weeks
2014-06-26 19:38:16 +00:00
Rui Paulo
dd9b2abed8 Revert r267898. 2014-06-26 17:34:42 +00:00
Rui Paulo
d8e37c5f72 Bring the following change from the illumos-joyent repository:
commit 78e24ab6803bbe11ba37642624e1498ede5b239d
Author: Bryan Cantrill <bryan@joyent.com>
Date:   Thu Oct 31 01:20:54 2013

    OS-1688 DTrace count() with histogram
    OS-2360 DTrace full width distribution histograms
    OS-2361 DTrace frequency trails

MFC after:	2 weeks
2014-06-26 07:06:43 +00:00
Pedro F. Giffuni
af8bd6e468 MFV r260708
4427 pid provider rejects probes with valid UTF-8 names

This make use of Solaris' u8_validate() which we happen to
use since r185029 for ZFS.

Illumos Revision:	1444d846b126463eb1059a572ff114d51f7562e5

Reference:
https://www.illumos.org/issues/4427

Obtained from:	Illumos
MFC after:	2 weeks
2014-06-25 14:23:30 +00:00
Davide Italiano
a99098e2ba Continue the crusade towards a dev_clone()-free kernel, removing its
usage from dtrace. The dtrace code already uses cdevpriv(9) since FreeBSD
8, so this change should be quite harmless.

Reviewed by:	markj
Approved by:	markj
MFC after:	never
2014-06-25 03:54:02 +00:00
Mark Johnston
efa1aff675 Fix some bugs when fetching probe arguments in i386. Firstly ensure that
the 4 byte-aligned dtrace_invop_callsite can be found and that it
immediately follows the call to dtrace_invop(). Secondly, fix some pointer
arithmetic to account for differences between struct i386_frame and illumos'
struct frame. Finally, ensure that dtrace_getarg() isn't inlined. It works
by following a fixed number of frame pointers to the probe site, so inlining
breaks it.

MFC after:	3 weeks
2014-06-23 02:00:14 +00:00
Mark Johnston
8382ec9e6a Fix a couple of bugs on amd64 when fetching probe arguments beyond the
first five for probes entered through a UD fault (i.e. FBT probes).

Specifically, handle the fact that dtrace_invop_callsite must be
16 byte-aligned and thus may not immediately follow the call to
dtrace_invop() in dtrace_invop_start(). Also fetch register arguments and
the stack pointer through a struct trapframe instead of a struct reg.

PR:		191260
Submitted by:	luke.tw@gmail.com
MFC after:	3 weeks
2014-06-23 01:10:56 +00:00
Mark Johnston
9338d20884 Allow creation of SDT probes from a module in which no providers are
defined. This ensures that the sdt:zfs:: probes appear despite the fact
the sdt provider is defined in the kernel rather than in zfs.ko.

Reported by:	hiren
Tested by:	hiren
MFC after:	2 weeks
2014-06-21 19:29:40 +00:00
Steven Hartland
74ddec2b18 Removed stale comment about multi-vdev root pool config not working
MFC after:	1 week
2014-06-09 13:04:58 +00:00
Bryan Drewery
f3a7518361 - Naively fix build by partially reverting r267029 to still use
gethrtime() when building libzpool.

X-MFC-With:	267029
2014-06-04 05:04:15 +00:00
Alexander Motin
4220ebcf71 Replace gethrtime() with cpu_ticks(), as source of random for the taskqueue
selection.  gethrtime() in our port updated with HZ rate, so unusable for
this specific purpose, completely draining benefit of multiple taskqueues.

MFC after:	2 weeks
2014-06-03 21:06:03 +00:00
Xin LI
f4c7dd6dd0 MFV 266913+266914:
3897 zfs filesystem and snapshot limits (fix leak)
4901 zfs filesystem/snapshot limit leaks

MFC after:	3 days
2014-05-31 01:00:22 +00:00
Xin LI
2bdf7f79bc MFV r266766:
Add a new zfs property, "redundant_metadata" which can have values "all" or
"most".  The default will be "all", which is the current behavior.  When set
to all, ZFS stores an extra copy of all metadata.  If a single on-disk block
is corrupt, at worst a single block of user data (which is recordsize bytes
long) can be lost.

Setting to "most" will cause us to only store 1 copy of level-1 indirect
blocks of user data files.  This can improve performance of random writes,
because less metadata has to be written.  In practice,  at worst about
100 blocks (of recordsize bytes each) of user data can be lost if a single
on-disk block is corrupt.

The exact behavior of which metadata blocks are stored redundantly may change
in future releases.

Illumos issue: 3835 zfs need not store 2 copies of all metadata

MFC after:	2 weeks
2014-05-27 19:46:11 +00:00
Allan Jude
ecd9567c1a Improve sysctl descriptions for new ZFS sysctls:
vfs.zfs.dirty_data_max
vfs.zfs.dirty_data_max_max
vfs.zfs.dirty_data_sync

Reviewed by:	smh
Approved by:	wblock (mentor)
2014-05-22 05:30:38 +00:00
Steven Hartland
df23182a62 Added sysctls / tunables for ZFS dirty data tuning
Added the following new sysctls / tunables:
* vfs.zfs.dirty_data_max
* vfs.zfs.dirty_data_max_max
* vfs.zfs.dirty_data_max_percent
* vfs.zfs.dirty_data_sync
* vfs.zfs.delay_min_dirty_percent
* vfs.zfs.delay_scale

PR:		kern/189865
MFC after:	2 weeks
2014-05-21 13:36:04 +00:00
Peter Grehan
c3ddb60e2d Update dis_tables.c to the latest Illumos version.
This includes decodes of recent Intel instructions, in particular
VT-x and related instructions. This allows the FBT provider to
locate the exit points of routines that include these new
instructions.

Illumos issues:
 3414 Need a new word of AT_SUN_HWCAP bits
 3415 Add isainfo support for f16c and rdrand
 3416 Need disassembler support for rdrand and f16c
 3413 isainfo -v overflows 80 columns
 3417 mdb disassembler confuses rdtscp for invlpg
 1518 dis should support AMD SVM/AMD-V/Pacifica instructions
 1096 i386 disassembler should understand complex nops
 1362 add kvmstat for monitoring of KVM statistics
 1363 add vmregs[] variable to DTrace
 1364 need disassembler support for VMX instructions
 1365 mdb needs 16-bit disassembler support

This corresponds to Illumos-gate (github) version
eb23829ff08a873c612ac45d191d559394b4b408

Reviewed by:	markj
MFC after:	1 week
2014-05-15 01:06:27 +00:00
Xin LI
b8cdcb8ad8 Import George Wilson's change for Illumos #4730:
4730 metaslab group taskq should be destroyed in metaslab_group_destroy()
	Reviewed by: Alex Reece <alex.reece@delphix.com>
	Reviewed by: Matthew Ahrens <mahrens@delphix.com>
	Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>

	Original author: George Wilson

MFC after:	3 days
2014-05-06 19:03:04 +00:00
Steven Hartland
4f64781818 Use a zio flag to prevent recursion of vdev_queue_io_done which can
cause stack overflow for IO's which return ZIO_PIPELINE_CONTINUE
from the zio_vdev_io_start stage and hence don't suspend and complete
in a different thread.

This prevents double fault panic on slow machines running ZFS on
GELI volumes which return EOPNOTSUPP directly to BIO_DELETE requests.

MFC after:	1 month
X-MFC-With:	r265152
2014-05-04 14:05:14 +00:00
Steven Hartland
573621a6d6 Don't treat TRIM requests returning ENOTSUP as an unexpected error.
MFC after:	1 month
X-MFC-With:	r265152
2014-05-03 02:30:01 +00:00
Steven Hartland
10138166cf Removed pointless / duplicated call to trim_map_first.
MFC after:	1 month
X-MFC-With:	r265152
2014-05-02 09:31:21 +00:00
Steven Hartland
82ce008538 Reintroduce priority for the TRIM ZIOs instead of using the "NOW" priority
The changes how TRIM requests are generated to use ZIO_TYPE_FREE + a priority
instead of ZIO_TYPE_IOCTL, until processed by vdev_geom; only then is it
translated the required geom values. This reduces the amount of changes
required for FREE requests to be supported by the new IO scheduler. This
also eliminates the need for a specific DKIOCTRIM.

Also fixed FREE vdev child IO's from running ZIO_STAGE_VDEV_IO_DONE as part
of their schedule.

As the new IO scheduler can result in a request to execute one type of IO to
actually run a different type of IO it requires that zio_trim requests are
processed without holding the trim map lock (tm->tm_lock), as the free request
execute call may result in write request running hence triggering a
trim_map_write_start call, which takes the trim map lock and hence would result
in recused on no-recursive sx lock.

This is based off avg's original work, so credit to him.

MFC after:	1 month
2014-04-30 17:46:29 +00:00
Steven Hartland
101dfa0ed4 Fix ZIO reordering done by vdev_queue_io causing panics when zio_vdev_io_start
returns ZIO_PIPELINE_CONTINUE from vdev_op_io_start to zio_execute resulting
in the wrong ZIO continuing its pipeline.

This is a serious issue which could cause data loss / corruption but appears
to be limited to error handling such as when vdev_readable(vd) returns false.

MFC after:	2 days
2014-04-28 09:00:00 +00:00
Steven Hartland
c2b2c5fc76 Eliminate duplicate checks in vdev_geom_io_intr error handling
MFC after:	1 month
2014-04-24 15:36:00 +00:00
Steven Hartland
5b245b8ae0 Add the ability to set a minimum ashift size for ZFS pool creation or root level
vdev addition.

Change max_auto_ashift sysctl to error when an invalid value is requested instead
of silently limiting it.
2014-04-24 01:06:03 +00:00
Xin LI
754180f4ae MFV r264830:
4745 fix AVL code misspellings

MFC after:	2 weeks
2014-04-23 20:32:39 +00:00
Xin LI
f8587167e4 MFV r264829:
3897 zfs filesystem and snapshot limits

MFC after:	2 weeks
2014-04-23 20:29:46 +00:00
Xin LI
18ab4bd8d9 MFV r264668:
4754 io issued to near-full luns even after setting noalloc threshold
4755 mg_alloc_failures is no longer needed

illumos/illumos@b6240e830b

MFC after:	2 weeks
2014-04-18 22:04:58 +00:00
Xin LI
d301d390a7 MFV r264667:
4752 fan out read zio taskqs

illumos/illumos-gate@1b497ab83e
2014-04-18 21:35:23 +00:00
Xin LI
613074ec08 MFV r264666:
4374 dn_free_ranges should use range_tree_t

illumos/illumos-gate@bf16b11e8d

MFC after:	2 weeks
2014-04-18 21:15:12 +00:00
Mark Johnston
38e6967f04 Ensure that all eight syscall arguments are available to dtrace_probe(),
rather than just the first five. This is done by calling dtrace_probe()
through a function pointer, as in illumos.

MFC after:	3 weeks
2014-04-14 00:23:18 +00:00
Mark Johnston
0626f3e435 DTrace's pid provider works by inserting breakpoint instructions at probe
sites and installing a hook at the kernel's trap handler. The fasttrap code
will emulate the overwritten instruction in some common cases, but otherwise
copies it out into some scratch space in the traced process' address space
and ensures that it's executed after returning from the trap.

In Solaris and illumos, this (per-thread) scratch space comes from some
reserved space in TLS, accessible via the fs segment register. This
approach is somewhat unappealing on FreeBSD since it would require some
modifications to rtld and jemalloc (for static TLS) to ensure that TLS is
executable, and would thus introduce dependencies on their implementation
details. I think it would also be impossible to safely trace static binaries
compiled without these modifications.

This change implements the functionality in a different way, by having
fasttrap map pages into the target process' address space on demand. Each
page is divided into 64-byte chunks for use by individual threads, and
fasttrap's process descriptor struct has been extended to keep track of
any scratch space allocated for the corresponding process.

With this change it's possible to trace all libc functions in a program,
e.g. with

  pid$target:libc.so.*::entry {@[probefunc] = count();}

Previously this would generally cause the victim process to crash, as
tracing memcpy on amd64 requires the functionality described above.

Tested by:	Prashanth Kumar <pra_udupi@yahoo.co.in> (earlier version)
MFC after:	6 weeks
2014-04-14 00:22:42 +00:00
Davide Italiano
2f9e29745c Fix a panic in zfs_rename().
this is due to a wrong dereference of a vnode when it's not locked and
can be (potentially) recycled. 'sdvp' cannot be locked on zfs_rename()
entry point because the VFS can't be sure that this scenario is
LOR-free (it might violate the parent->child lock acquisition rule).
Dereference 'tdvp' instead, which is already locked on entry, and access
'sdvp' fields only when it's safe, i.e. under ZFS_ENTER scope.

While at it, remove the usage of VOP_REALVP, as long as this is a NOP
on FreeBSD.

Discussed with:	avg
Reviewed by:	pjd
2014-04-13 01:15:37 +00:00
Alexander Motin
f6e1dc83c3 Create zvol devices on zfs clone.
While big and shiny patch is not ready, it is better to have something.

PR:		kern/178999
MFC after:	1 week
2014-04-11 11:56:16 +00:00
Alexander Motin
a96fefe042 In addition to r264077, tell GEOM that we do support BIO_DELETE now. 2014-04-06 16:31:28 +00:00
Alexander Motin
537650f54d Add property and sysctl to control how ZVOLs are exposed to OS.
New ZFS property volmode and sysctl vfs.zfs.vol.mode allow switching ZVOL
between three modes:
 geom -- existing fully functional behavior (default);
 dev -- exposing volumes only as raw disk device file in devfs;
 none -- not exposing volumes outside ZFS.

The "dev" mode is less functional (can't be partitioned, mounted, etc),
but it is faster, and in some scenarios with untrusted consumers safer.
It can be useful for NAS, VM block storages, etc.
The "none" mode may be convenient for backup servers, etc. that don't
need direct data access.

Due to the way ZVOL is integrated with main ZFS code, those property
and sysctl are checked only during pool import and volume creation.

MFC after:	1 month
Sponsored by:	iXsystems, Inc.
2014-04-05 13:01:44 +00:00
Alexander Motin
89e84aead6 MFV r258922:
3580 Want zvols to return volblocksize when queried for physical block size

illumos/illumos-gate@a0b60564df

It is irrelevant for FreeBSD, just reducing diff.
2014-04-03 20:18:55 +00:00
Alexander Motin
4a03e8b64d Add BIO_DELETE support to ZVOL.
It is an adapted merge from the vendor branch of:
    701 UNMAP support for COMSTAR  (in part related to ZFS)
    2130 zvol DKIOCFREE uses nested DMU transactions
2014-04-03 15:04:32 +00:00
Pedro F. Giffuni
23e4da439c MFV r258379;
4248 dtrace(1M) should never create DOF with empty probes section
4249 Only probes from the first DTrace object file will be included

Illumos Revision:	4a20ab41aadcb81c53e72fc65886e964e9add59

Reference:
https://www.illumos.org/issues/4248
https://www.illumos.org/issues/4249

Obtained from:	Illumos
MFC after:	1 month
2014-04-02 15:32:44 +00:00
Bryan Drewery
44f1c91610 Rename global cnt to vm_cnt to avoid shadowing.
To reduce the diff struct pcu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.

Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the
the global cnt variable.

Exp-run revealed no ports using it directly.

No objection from:	arch@
Sponsored by:	EMC / Isilon Storage Division
2014-03-22 10:26:09 +00:00
Robert Watson
4a14441044 Update kernel inclusions of capability.h to use capsicum.h instead; some
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.

MFC after:	3 weeks
2014-03-16 10:55:57 +00:00
Alexander Motin
68d17718e0 Report ZVOL block size as GEOM stripesize.
MFC after:	2 weeks
2014-03-13 19:26:26 +00:00
Xin LI
8e41e26f65 MFV r262983:
4638 Panic in ZFS via rfs3_setattr()/rfs3_write(): dirtying snapshot!

illumos/illumos-gate@2144b121c0
2014-03-11 00:23:50 +00:00
Xin LI
ba680558a0 All callers of static method load_nvlist() in spa.c handles error case,
so there is no reason to assert that we won't hit an error.  Instead,
just return that error to caller and have the upper layer handle it.

Obtained from:	FreeNAS
Reported by:	rodrigc
Reviewed by:	Matthew Ahrens
MFC after:	2 weeks
2014-03-02 02:41:33 +00:00
Mark Johnston
b53bfbba65 Expose a few DTrace parameters as sysctls under kern.dtrace and add
descriptions for several existing sysctls.

PR:		187027
Submitted by:	Fedor Indutny <fedor@indutny.com> (original version)
MFC after:	2 weeks
2014-03-01 19:06:43 +00:00
Mark Johnston
ae520d3dc4 Fix emulation of call and jmp instructions on i386 and for 32-bit processes
on amd64.

Submitted by:	Prashanth Kumar <pra_udupi@yahoo.co.in>
MFC after:	2 weeks
2014-03-01 17:55:20 +00:00
Mark Johnston
ae9f1a185c 4478 dtrace_dof_maxsize is far too small
illumos/illumos-gate@d339a29bb4

PR:		187027
MFC after:	1 week
2014-02-28 02:04:41 +00:00
Mark Johnston
c0c943de72 Fix the struct reg mappings for i386 and amd64, which differ between illumos
and FreeBSD.

Submitted by:	Prashanth Kumar <pra_udupi@yahoo.co.in>
MFC after:	2 weeks
2014-02-27 01:24:47 +00:00
Mark Johnston
0339a1c2b4 Move some files that are identical on i386 and amd64 to an x86 subdirectory
rather than keeping duplicate copies.

Discussed with:	avg
MFC after:	1 week
2014-02-27 01:04:35 +00:00
Mark Johnston
5bcd30f3b1 Revert r262466, as it does not compile on PowerPC.
Reported by:	jhibbits
2014-02-26 01:00:00 +00:00
Mark Johnston
68ac8d05d3 Make all 8 syscall arguments available to syscall probes in the same way
that this is done for SDT probes. This fixes the syscall/tst.args.d test,
which was failing because mmap(2)'s sixth argument wasn't available to the
probe.

MFC after:	2 weeks
2014-02-25 02:58:11 +00:00
Mark Johnston
33db01542c 1452 DTrace buffer autoscaling should be less violent
illumos/illumos-gate@6fb4854bed

This fixes the tst.resize1.d and tst.resize2.d DTrace tests, which have
been failing since r261122 since they were causing dtrace(1) to attempt to
allocate and use large amounts of memory, and get killed by the OOM killer
as a result.

MFC after:	1 month
2014-02-22 05:18:55 +00:00
Mark Johnston
dc0f030e51 Define the KM_NORMALPRI flag for kmem_alloc(), as it is used in some
upstream DTrace code. It indicates that the kernel memory allocator need not
attempt to satisfy non-blocking allocations in low-memory conditions. This
has no direct equivalent in the malloc(9) flags, so it is just defined to 0
for now.
2014-02-22 05:13:35 +00:00
Xin LI
5f62f8cdcb MFV r261619:
4574 get_clones_stat does not call zap_count in non-debug kernel

zap_count(...) is never called in non-DEBUG kernel.
As result "count" variable is always 0, and "goto fail" is always
reached.  This means get_clones_stat function never makes up list
of clones for "clones" properties.

MFC after:	2 weeks
2014-02-08 05:35:36 +00:00
Xin LI
bea6313e6b MFV r260834:
Fix memory leak of compressed buffers in l2arc_write_done (Illumos
#3995).
2014-01-18 01:45:39 +00:00
Andriy Gapon
6d03ca5789 traverse_visitbp: visit DMU_GROUPUSED_OBJECT before DMU_USERUSED_OBJECT
This is done to ensure that visited object IDs are always increasing.
Also, pass correct object ID to prefetch_dnode_metadata for
os_groupused_dnode.

Without this change we would hit an assert if traversal was paused on
a GROUPUSED object, which is unlikely but possible.

Apparently the same change was independently developed by Deplhix.

Reviewed by:	Matthew Ahrens <mahrens@delphix.com>
MFC after:	10 days
Sponsored by:	HybridCluster
2014-01-17 10:23:46 +00:00
Andriy Gapon
fec721bc43 fix a build problem with INVARIANTS enabled introduced in r260704
Reported by:	glebius
MFC after:	5 days
X-MFC with:	r260704
2014-01-16 13:44:37 +00:00
Andriy Gapon
876fa2c17b fix a bug in ZFS mirror code for handling multiple DVAa
The bug was introduced in r256956 "Improve ZFS N-way mirror read
performance".
The code in vdev_mirror_dva_select erroneously considers already
tried DVAs for the next attempt.  Thus, it is possible that a failing DVA
would be retried forever.
As a secondary effect, if the attempts fail with checksum error, then
checksum error reports are accumulated until the original request
ultimately fails or succeeds.  But because retrying is going on indefinitely
the cheksum reports accumulation will effectively be a memory leak.

Reviewed by:	gibbs
MFC after:	13 days
Sponsored by:	HybridCluster
2014-01-16 13:24:10 +00:00
Andriy Gapon
00126789e6 Revert r260705: wrong patch committed by accident
An earlier, less efficient version was committed by accident.
2014-01-16 13:20:20 +00:00
Andriy Gapon
19f5e9076b zfs_deleteextattr: name buffer from namei is needed by zfs_rename
If we prematurely free the name buffer and it gets quickly recycled,
then zfs_rename may see data from another lookup or even unmapped memory
via cn_nameptr.

MFC after:	6 days
Sponsored by:	HybridCluster
2014-01-16 12:31:27 +00:00
Andriy Gapon
2f9a31944f fix a bug in ZFS mirror code for handling multiple DVAa
The bug was introduced in r256956 "Improve ZFS N-way mirror read
performance".
The code in vdev_mirror_dva_select erroneously considers already
tried DVAs for the next attempt.  Thus, it is possible that a failing DVA
would be retried forever.
As a secondary effect, if the attempts fail with checksum error, then
checksum error reports are accumulated until the original request
ultimately fails or succeeds.  But because retrying is going on indefinitely
the cheksum reports accumulation will effectively be a memory leak.

Reviewed by:	gibbs
MFC after:	13 days
Sponsored by:	HybridCluster
2014-01-16 12:26:54 +00:00
Andriy Gapon
b8ca4667ed zfs: getnewvnode_reserve must be called outside of a zfs transaction
Otherwise we could run into the following deadlock.
A thread has a transaction open and assigned to a transaction group.
That would prevent the transaction group from be quiesced and synced.
The thread is blocked in getnewvnode_reserve waiting for a vnode to
a be reclaimed.  vnlru thread is blocked trying to enter ZFS VOP because
a filesystem is suspended by an ongoing rollback or receive operation.
In its turn the operation is waiting for the current transaction group
to be synced.

zfs_zget is always used outside of active transactions, but zfs_mknode
is always used in a transaction context.  Thus, we hoist
getnewvnode_reserve from zfs_mknode to its callers.

While there, assert that ZFS always calls getnewvnode while having
a vnode reserved.

Reported by:	adrian
Tested by:	adrian
MFC after:	17 days
Sponsored by:	HybridCluster
2014-01-16 12:22:46 +00:00
Marcel Moolenaar
642ebd6a18 In atomic_or_8_nv() load 1 and not 8 bytes from the address
given. Note that atomic_or_8_nv() is not used at this time.
2014-01-06 05:00:58 +00:00
Alexander Motin
77e2eaf5b8 Fix build after r260234 by converting ddi_get_lbolt64() from inline into
a macro.  Otherwise compiler complains that hz variable used there either
undefined or defined twice, thanks to header mess caused by compat shims.
2014-01-05 19:07:42 +00:00
Alexander Motin
ce05e707c4 In dmu_zfetch_stream_reclaim() replace division with multiplication and
move it out of the loop and lock.
2014-01-03 18:44:37 +00:00
Alexander Motin
99e2428636 Remove extra conversion to nanoseconds from ddi_get_lbolt64().
As result this uses one multiplication and shifts instead of one division
and two multiplications.
2014-01-03 18:08:31 +00:00
Xin LI
7c88e58f46 MFV r260155:
When we encounter an I/O error on a piece of metadata while deleting
a file system or zvol, we don't update the bptree_entry_phys_t's
bookmark.  This would lead to double free of bp's which will lead to
space map corruption.

Instead of tolerating and allowing the corruption, panic immediately.

See Illumos #4390 for more details.

4391 panic system rather than corrupting pool if we hit bug 4390

Illumos/illumos-gate@8b36997aa2

MFC after:	2 weeks
2014-01-02 08:10:35 +00:00
Xin LI
ab0b9f6b30 MFV r260154 + 260182:
4369 implement zfs bookmarks
4368 zfs send filesystems from readonly pools

Illumos/illumos-gate@78f1710053

MFC after:	2 weeks
2014-01-02 07:34:36 +00:00
Xin LI
6f2791f53a Fix build on platforms where atomic_swap_64 is not available. 2014-01-02 03:24:44 +00:00
Xin LI
647795d181 MFV r260153:
4121 vdev_label_init should treat request as succeeded when pool
     is read only

Illumos/illumos-gate@973c78e94b

MFC after:	2 weeks
2014-01-01 01:26:39 +00:00
Xin LI
f4c8ba8370 MFV r259170:
4370 avoid transmitting holes during zfs send

4371 DMU code clean up

illumos/illumos-gate@43466aae47

NOTE: Make sure the boot code is updated if a zpool upgrade is
done on boot zpool.

MFC after:	2 weeks
2014-01-01 00:45:28 +00:00
Xin LI
cca1e7c623 MFV r258385:
(Note: this change is not applicable to FreeBSD and the file
is not included in build.  It's integrated for completeness).

4128 disks in zpools never go away when pulled

illumos/illumos-gate@39cddb10a3

MFC after:	2 weeks
2013-12-31 21:24:00 +00:00
Xin LI
db2aff5f8b MFV r242733:
3306 zdb should be able to issue reads in parallel
3321 'zpool reopen' command should be documented in the man page
and help message

illumos/illumos-gate@31d7e8fa33

FreeBSD porting notes: the kernel part of this changeset depends
on Solaris buf(9S) interfaces and are not really applicable for
our use.  vdev_disk.c is patched as-is to reduce diverge from
upstream, but vdev_file.c is left intact.

MFC after:	2 weeks
2013-12-31 19:39:15 +00:00
Mark Johnston
b69b2ff588 Allocate the probe ID unrhdr before the DTrace kld_* event handlers are
registered. Otherwise there is a small window during which probe IDs may be
allocated before the unrhdr is allocated.

MFC after:	2 weeks
2013-12-31 15:41:16 +00:00
Mark Johnston
a333376bba Revert r260091. The vmem calls seem to be slower than the *_unr() calls that
they replaced, which is important considering that probe IDs are allocated
during process startup for USDT probes.
2013-12-31 15:37:51 +00:00
Mark Johnston
b9c04b396a Now that vmem(9) is available, use vmem arenas to allocate probe and
aggregation IDs, as is done in the upstream illumos code. This still
requires some FreeBSD-specific code, as our vmem API is not identical to the
one in illumos.

Submitted by:	Mike Ma <mikemandarine@gmail.com>
2013-12-30 17:37:32 +00:00
Xin LI
1aaa945f67 MFV r258374:
4171 clean up spa_feature_*() interfaces

4172 implement extensible_dataset feature for use by other zpool
features

illumos/illumos-gate@2acef22db7

MFC after:	2 weeks
2013-12-24 07:14:25 +00:00
Xin LI
ec097c1634 MFV r258373:
4168 ztest assertion failure in dbuf_undirty

4169 verbatim import causes zdb to segfa
4170 zhack leaves pool in ACTIVE state

illumos/illumos-gate@7fdd916c47

MFC after:	2 weeks
2013-12-24 06:56:17 +00:00
Justin Hibbits
ad5017b513 Fix a brain-o. I had misread the limit as a size, but it's a pointer.
Submitted by:	Howard Su
MFC after:	2 weeks
X-MFC-with:	r259668
2013-12-21 00:37:32 +00:00
Justin Hibbits
a76f5d59f4 Fix a couple bugs in FBT PowerPC. Clamp the size to a 'instruction size' not
'byte size', and fix a typo.

MFC after:	2 weeks
2013-12-20 23:18:14 +00:00
Pawel Jakub Dawidek
4106732882 MFV r258923: 4188 assertion failed in dmu_tx_hold_free(): dn_datablkshift != 0
illumos/illumos-gate@bb411a08b0

MFC after:	3 days
2013-12-18 21:45:46 +00:00
Mark Johnston
7159310fa6 The fasttrap fork handler is responsible for removing tracepoints in the
child process that were inherited from its parent. However, this should
not be done in the case of a vfork, since the fork handler ends up removing
the tracepoints from the shared vm space, and userland DTrace probes in the
parent will no longer fire as a result.

Now the child of a vfork may trigger userland DTrace probes enabled in its
parent, so modify the fasttrap probe handler to handle this case and handle
the child process in the same way that it would handle the traced process.
In particular, if once traces function foo() in a process that vforks, and
the child calls foo(), fasttrap will treat this call as having come from the
parent. This is the behaviour of the upstream code.

While here, add #ifdef guards to some code that isn't present upstream.

MFC after:	1 month
2013-12-18 01:41:52 +00:00
Alan Somers
cd730bd6b2 sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
When a da or ada device dissappears, outstanding IOs fail with
	ENXIO, not EIO.  The check for EIO was probably copied from Illumos,
	where that is indeed the correct errno.

	Without this change, pulling a busy drive from a zpool would usually
	turn it into UNAVAIL, even though pulling an idle drive would turn
	it into REMOVED.  With this change, it is REMOVED every time.

	Also, vdev_geom_io_intr shouldn't do zfs_post_remove, because that
	results in devd getting two resource.fs.zfs.removed events.  The
	comment said that the event had to be sent directly instead of
	through the async removal thread because "the DE engine is using
	this information to discard prevoius I/O errors".  However, the fact
	that vdev_geom_io_intr was never actually sending the events until
	now, and that vdev_geom_orphan never sent them at all, and that
	vdev_geom_orphan usually gets called about 2 seconds after the
	actual removal, means that FreeBSD's userland can cope with a late
	event just fine.

Approved by:	ken (mentor)
Sponsored by:	Spectra Logic Corporation
MFC after:	4 weeks
2013-12-12 00:27:22 +00:00
Mark Johnston
e53c69c1f5 Correct the check for errors from proc_rwmem().
MFC after:	2 weeks
2013-12-11 04:31:40 +00:00
Alexander Motin
f192c4873d Don't even try to read vdev labels from devices smaller then SPA_MINDEVSIZE
(64MB).  Even if we would find one somehow, ZFS kernel code rejects such
devices.  It is funny to look on attempts to read 4 256K vdev labels from
1.44MB floppy, though it is not very practical and quite slow.
2013-12-10 12:36:44 +00:00
Xin LI
9b11826d3d Expose spa_asize_inflation.
X-MFC-With:	r258632
2013-12-06 23:49:16 +00:00
Andriy Gapon
f77ffe1b22 zfs: add zfs_freebsd_putpages
this should be more optimal than writing pages one-by-one via zfs_write ->
update_pages in the case of multi-page putpages call

MFC after:	16 days
2013-11-29 15:39:39 +00:00
Andriy Gapon
6c5b7fffce zfs: add dmu_write_pages variant for freebsd
The freebsd variant of dmu_write_pages is hidden under _KERNEL
to avoid needlessly pulling in vm_page_t declaration.
Besides, this function seems to be useless for ZFS userland counterpart.

MFC after:	15 days
2013-11-29 15:34:43 +00:00
Andriy Gapon
fdbcc95a47 zfs: make zfs_map_page / zfs_unmap_page public
MFC after:	15 days
2013-11-29 15:33:40 +00:00
Andriy Gapon
998a42756c drop ZUT_OBJ, zfs unit testing driver never materialzied in freebsd
MFC after:	5 days
2013-11-29 15:32:53 +00:00
Andriy Gapon
ac79eedf85 zfs mappedread_sf: assert that a page is never partially valid
ZFS never partially validates or invalidates a page.
The higher level VM should not do that either.
mappedread_sf correct operation depends on a page being either fully
valid or invalid.

MFC after:	7 days
2013-11-29 12:19:52 +00:00
Andriy Gapon
be3d0087dc MFV r258665: 4347 ZPL can use dmu_tx_assign(TXG_WAIT)
illumos/illumos-gate@e722410c49

MFC after:	9 days
X-MFC after:	r258632
2013-11-28 19:44:36 +00:00
Andriy Gapon
456a87bb3b MFV r258371,r258372: 4101 metaslab_debug should allow for fine-grained control
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4104 ::spa_space no longer works
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab

illumos/illumos-gate@0713e232b7

Note that some tunables have been removed and some new tunables have
been added.  Of particular note, FreeBSD-only knob
vfs.zfs.space_map_last_hope is removed as it was a nop for some time now
(after one of the previous merges from upstream).

MFC after:	11 days
Sponsored by:	HybridCluster [merge]
2013-11-28 19:37:22 +00:00
Andriy Gapon
02727275c5 opensolaris compat: add taskq_wait emulation
MFC after:	10 days
2013-11-28 19:17:11 +00:00
Andriy Gapon
7bc07f0575 fix a serious bug in r258632: offset parameter must be set in zio
In illumos all ioctl zio-s are "global" at the moment.  That is they act
on a whole disk, e.g. a cache flush command, and thus do not need either
offset or size parameters.
FreeBSD, on the other hand, has support for TRIM command and that
command requires proper offset and size parameters.
Without this fix all TRIM commands act on the start of any disk or
partition used by ZFS destroying any data there.

Pointyhat to:	avg
Tested by:	sbruno
MFC after:	3 days
X-MFC with:	r258632
Sponsored by:	HybridCluster
2013-11-28 08:48:49 +00:00
Andriy Gapon
2ac1eeec44 fix debug.zfs_flags sysctl description in r258638
Pointyhat to:	avg
MFC after:	3 days
2013-11-26 10:57:09 +00:00
Andriy Gapon
78affb8591 expose zfs_flags as debug.zfs_flags r/w tunable and sysctl
This knob is purposefully hidden under debug.

MFC after:	5 days
Sponsored by:	HybridCluster
2013-11-26 10:46:43 +00:00
Andriy Gapon
3761ac95f7 MFV r258376: 3964 L2ARC should always compress metadata buffers
illumos/illumos-gate@e4be62a2b7

MFC after:	10 days
Sponsored by:	HybridCluster [merge]
2013-11-26 10:14:23 +00:00
Andriy Gapon
fd51e905e2 MFV r255256: 3954 metaslabs continue to load even after hitting zfs_mg_alloc_failure limit
4080 zpool clear fails to clear pool
4081 need zfs_mg_noalloc_threshold

illumos/illumos-gate@22e30981d8

MFC after:	10 days
Sponsored by:	HybridCluster [merge]
2013-11-26 10:02:02 +00:00
Andriy Gapon
2a4704ab01 MFV r255255: 4045 zfs write throttle & i/o scheduler performance work
illumos/illumos-gate@69962b5647

Please note the following changes:
- zio_ioctl has lost its priority parameter and now TRIM is executed
  with 'now' priority
- some knobs are gone and some new knobs are added; not all of them are
  exposed as tunables / sysctls yet

MFC after:	10 days
Sponsored by:	HybridCluster [merge]
2013-11-26 09:57:14 +00:00
Andriy Gapon
fb8171c240 MFV r247578: 3581 spa_zio_taskq[ZIO_TYPE_FREE][ZIO_TASKQ_ISSUE]->tq_lock is piping hot
illumos/illumos-gate@ec94d32216

MFC after:	9 days
Sponsored by:	HybridCluster [merge]
2013-11-26 09:45:48 +00:00
Andriy Gapon
34140e78ab 734 taskq_dispatch_prealloc() desired
943 zio_interrupt ends up calling taskq_dispatch with TQ_SLEEP
illumos/illumos-gate@5aeb94743e

Essentially FreeBSD taskqueues already operate in a mode that
was added to Illumos with taskq_dispatch_ent change.
We even exposed the superior FreeBSD interface as taskq_dispatch_safe.
Now we just rename taskq_dispatch_safe to taskq_dispatch_ent and
struct struct ostask to taskq_ent_t, so that code differences will be
minimal.

After this change sys/cddl/compat/opensolaris/sys/taskq.h header is no
longer needed.

Note that this commit is not an MFV because the upstream change was not
individually committed to the vendor area.

MFC after:	8 days
2013-11-26 09:26:18 +00:00
Andriy Gapon
2dbdedbc46 opensolaris taskq: some cosmetic changes
- drop trailing whitespace
- remove redundant "extern" from function declarations
- remove unused macro

MFC after:	1 week
2013-11-26 09:10:01 +00:00
Andriy Gapon
a776a1c1c5 sdt: add support for solaris/illumos style DTRACE_PROBE macros
The new macros are implemented in terms of SDT_PROBE_DEFINE and SDT_PROBE.
Probes defined in this way will appear under SDT provider named "sdt".
Parameter types are exposed via SDT_PROBE_ARGTYPE.
This is something that illumos does not have by default.

This kind of SDT probes is already present in ZFS code, so those probes
will now be available if KDTRACE_HOOKS options is enabled.

A potential future illumos compatibility enhancement is to encode a provider
name as a prefix in a probe name.

Reviewed by:	markj
MFC after:	3 weeks
X-MFC after:	r258622
2013-11-26 08:49:53 +00:00
Andriy Gapon
d9fae5ab88 dtrace sdt: remove the ugly sname parameter of SDT_PROBE_DEFINE
In its stead use the Solaris / illumos approach of emulating '-' (dash)
in probe names with '__' (two consecutive underscores).

Reviewed by:	markj
MFC after:	3 weeks
2013-11-26 08:46:27 +00:00
Pawel Jakub Dawidek
1cef014007 When append-only, immutable or read-only flag is set don't allow for
hard links creation. This matches UFS behaviour.

Reported by:	Oleg Ginzburg <olevole@olevole.ru>
MFC after:	1 month
2013-11-25 21:17:14 +00:00
Attilio Rao
54366c0bd7 - For kernel compiled only with KDTRACE_HOOKS and not any lock debugging
option, unbreak the lock tracing release semantic by embedding
  calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
  version of the releasing functions for mutex, rwlock and sxlock.
  Failing to do so skips the lockstat_probe_func invokation for
  unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
  kernel compiled without lock debugging options, potentially every
  consumer must be compiled including opt_kdtrace.h.
  Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
  dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
  is linked there and it is only used as a compile-time stub [0].

[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested.  As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while.  Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].

Sponsored by:	EMC / Isilon storage division
Discussed with:	rstone
[0] Reported by:	rstone
[1] Discussed with:	philip
2013-11-25 07:38:45 +00:00
Andriy Gapon
a7236350c3 MFV r258378: 4089 NULL pointer dereference in arc_read()
illumos/illumos-gate@57815f6b95

Tested by:	adrian
MFC after:	4 days
2013-11-20 11:52:32 +00:00
Andriy Gapon
c5f4a0a2eb MFV r258377: 4088 use after free in arc_release()
illumos/illumos-gate@ccc22e1304

MFC after:	5 days
2013-11-20 11:47:50 +00:00
Justin Hibbits
de950c79f3 Fix the function search space.
Submitted by:	Howard Su
2013-11-20 01:33:13 +00:00
Andriy Gapon
3fd7f7bef7 zfs page_busy: fix the boundaries of the cleared range
This is a fix for a regression introduced in r246293.

vm_page_clear_dirty expects the range to have DEV_BSIZE aligned boundaries,
otherwise it extends them.  Thus it can happen that the whole page is
marked clean while actually having some small dirty region(s).
This commit makes the range properly aligned and ensures that only
the clean data is marked as such.

It would interesting to evaluate how much benefit clearing with DEV_BSIZE
granularity produces.  Perhaps instead we should clear the whole page
when it is completely overwritten and don't bother clearing any bits
if only a portion a page is written.

Reported by:	George Hartzell <hartzell@alerce.com>,
		Richard Todd <rmtodd@servalan.servalan.com>
Tested by:	George Hartzell <hartzell@alerce.com>,
Reviewed by:	kib
MFC after:	5 days
2013-11-19 18:43:47 +00:00
Alexander Motin
c5068af559 Reenable vfs.zfs.zio.use_uma for amd64, disabled at r209261.
On machines with seveal CPUs and enough RAM this can easily twice improve
ZFS performance or twice reduce CPU usage.  It was disabled three years
ago due to memory and KVA exhaustion reports, but our VM subsystem got
improved a lot since that time, hopefully enough to make another try.
2013-11-19 11:19:07 +00:00
Alan Somers
1f9e80bcdb opensolaris/uts/common/dtrace/fasttrap.c
Fix several problems that can cause panics on kldload and kldunload.

	* kproc_create(fasttrap_pid_cleanup_cb, ...) gets called before
	  fasttrap_provs.fth_table gets allocated.  This can lead to a panic
	  on module load, because fasttrap_pid_cleanup_cb references
	  fasttrap_provs.fth_table.  Move kproc_create down after the point
	  that fasttrap_provs.fth_table gets allocated, and modify the error
	  handling accordingly.

	* dtrace_fasttrap_{fork,exec,exit} weren't getting NULLed until
	  after fasttrap_provs.fth_table got freed.  That caused panics on
	  module unload because fasttrap_exec_exit calls
	  fasttrap_provider_retire, which references
	  fasttrap_provs.fth_table.  NULL those function pointers earlier.

	* There wasn't any code to destroy the
	  fasttrap_{tpoints,provs,procs}.fth_table mutexes on module unload,
	  leading to a resource leak when WITNESS is enabled.  Destroy those
	  mutexes during fasttrap_unload().

Reviewed by:	markj
Approved by:	ken (mentor)
Sponsored by:	Spectra Logic
MFC after:	4 weeks
2013-11-18 16:51:56 +00:00
Steven Hartland
8dfd07b976 Fix ZFS deadlock when sending a snapshot which is mounted.
MFC after:	1 week
Sponsored by:	Multiplay
2013-11-18 11:28:19 +00:00
Mark Johnston
dd580326fe The fasttrap ioctl used to create probes takes a variable-sized argument.
It was not being correctly copied into the kernel on FreeBSD, and as a
result, probes with multiple probe sites were not being created properly.
To fix this, change the ioctl definition so that the fasttrap ioctl handler
is responsible for copying in userland data.

Submitted by:	Prashanth Kumar <pra_udupi@yahoo.co.in>
MFC after:	1 month
2013-11-18 03:24:50 +00:00
Alexander Motin
e5056f9882 Introduce allocation cache to store LZ4 compression contexts without kicking
VM subsystem twice for every written record.

Tests on 24-core system show double reduction of CPU time spent on copying
single large well-compressed file.

This patch is not really needed on illumos (while not harm either) since
their memory allocator by default uses caching for all requests up to 128K.

Reviewed by:	Saso Kiselkov <skiselkov.ml@gmail.com>
2013-11-14 15:54:54 +00:00
Mark Johnston
a4cbcb127c Use suword32 and suword64 instead of copyout(9). This fixes a bug in the
emulation of the call instruction caused by reversing the uaddr and kaddr
arguments when copying data out to userland: the suword* functions take the
uaddr as the first argument whereas copyout(9) takes the kaddr as the first
argument. This also partially undoes the fixes from r257143.

Submitted by:	Prashanth Kumar <pra_udupi@yahoo.co.in> (original version)
MFC after:	1 month
2013-11-05 06:13:46 +00:00
Mark Johnston
57170f49f2 Remove references to an unused fasttrap probe hook, and remove the
corresponding x86 trap type. Userland DTrace probes are currently handled
by the other fasttrap hooks (dtrace_pid_probe_ptr and
dtrace_return_probe_ptr).

Discussed with:	rpaulo
2013-10-31 02:35:00 +00:00
Mark Johnston
9c06d5a051 Do some cleanup of the SDT code. In particular,
* Remove the unused sdt cdev.
* Don't bother keeping a list of probes in struct sdt_prov; it's not needed.
* Invoke sdt_load and sdt_unload from the module handler instead of
  registering separate SYSINITs.
* Keep to within 80 columns.
* Check for errors from dtrace_unregister().
2013-10-26 06:23:51 +00:00
Mark Johnston
165de3f338 Fix a couple of bugs in the fasttrap emulation of a "push %rbp" instruction:
the code was trying to save the stack pointer rather than the frame pointer,
and the arguments to copyout(9) were reversed, so nothing ended up being
saved on the stack. This would cause process crashes when the pid provider
was being used to instrument calls of a function starting with this
instruction.

Reported by:	symbolics@gmx.com
Tested by:	symbolics@gmx.com (earlier version)
MFC after:	2 weeks
2013-10-26 03:21:54 +00:00
Justin Hibbits
594ce9ad6f ELF PowerPC64 ABI puts the LR save word at 16 byte offset, not 8. 2013-10-25 00:17:12 +00:00
Steven Hartland
c28078e903 Improve ZFS N-way mirror read performance by using load and locality
information.

The existing algorithm selects a preferred leaf vdev based on offset of the zio
request modulo the number of members in the mirror. It assumes the devices are
of equal performance and that spreading the requests randomly over both drives
will be sufficient to saturate them. In practice this results in the leaf vdevs
being under utilized.

The new algorithm takes into the following additional factors:
* Load of the vdevs (number outstanding I/O requests)
* The locality of last queued I/O vs the new I/O request.

Within the locality calculation additional knowledge about the underlying vdev
is considered such as; is the device backing the vdev a rotating media device.

This results in performance increases across the board as well as significant
increases for predominantly streaming loads and for configurations which don't
have evenly performing devices.

The following are results from a setup with 3 Way Mirror with 2 x HD's and
1 x SSD from a basic test running multiple parrallel dd's.

With pre-fetch disabled (vfs.zfs.prefetch_disable=1):

== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 161 seconds @ 95 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 297 seconds @ 51 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 54 seconds @ 284 MB/s

With pre-fetch enabled (vfs.zfs.prefetch_disable=0):

== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 91 seconds @ 168 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 108 seconds @ 142 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 48 seconds @ 320 MB/s

In addition to the performance changes the code was also restructured, with
the help of Justin Gibbs, to provide a more logical flow which also ensures
vdevs loads are only calculated from the set of valid candidates.

The following additional sysctls where added to allow the administrator
to tune the behaviour of the load algorithm:
* vfs.zfs.vdev.mirror.rotating_inc
* vfs.zfs.vdev.mirror.rotating_seek_inc
* vfs.zfs.vdev.mirror.rotating_seek_offset
* vfs.zfs.vdev.mirror.non_rotating_inc
* vfs.zfs.vdev.mirror.non_rotating_seek_inc

These changes where based on work started by the zfsonlinux developers:
https://github.com/zfsonlinux/zfs/pull/1487

Reviewed by:	gibbs, mav, will
MFC after:	2 weeks
Sponsored by:	Multiplay
2013-10-23 09:54:58 +00:00
Steven Hartland
70c3432663 Use the vdev's ashift to calculate the supported min block size passed to
zio_compress_data(..) when compressing l2arc buffers.

This eliminates l2arc I/O errors, which resulted in very poor performance on
vdev's configured with block size greater than 512b due to compression
assuming a smaller min block size than the vdev supports.

MFC after:	2 days
2013-10-22 13:31:36 +00:00
Alexander Motin
40ea77a036 Merge GEOM direct dispatch changes from the projects/camlock branch.
When safety requirements are met, it allows to avoid passing I/O requests
to GEOM g_up/g_down thread, executing them directly in the caller context.
That allows to avoid CPU bottlenecks in g_up/g_down threads, plus avoid
several context switches per I/O.

The defined now safety requirements are:
 - caller should not hold any locks and should be reenterable;
 - callee should not depend on GEOM dual-threaded concurency semantics;
 - on the way down, if request is unmapped while callee doesn't support it,
   the context should be sleepable;
 - kernel thread stack usage should be below 50%.

To keep compatibility with GEOM classes not meeting above requirements
new provider and consumer flags added:
 - G_CF_DIRECT_SEND -- consumer code meets caller requirements (request);
 - G_CF_DIRECT_RECEIVE -- consumer code meets callee requirements (done);
 - G_PF_DIRECT_SEND -- provider code meets caller requirements (done);
 - G_PF_DIRECT_RECEIVE -- provider code meets callee requirements (request).
Capable GEOM class can set them, allowing direct dispatch in cases where
it is safe.  If any of requirements are not met, request is queued to
g_up or g_down thread same as before.

Such GEOM classes were reviewed and updated to support direct dispatch:
CONCAT, DEV, DISK, GATE, MD, MIRROR, MULTIPATH, NOP, PART, RAID, STRIPE,
VFS, ZERO, ZFS::VDEV, ZFS::ZVOL, all classes based on g_slice KPI (LABEL,
MAP, FLASHMAP, etc).

To declare direct completion capability disk(9) KPI got new flag equivalent
to G_PF_DIRECT_SEND -- DISKFLAG_DIRECT_COMPLETION.  da(4) and ada(4) disk
drivers got it set now thanks to earlier CAM locking work.

This change more then twice increases peak block storage performance on
systems with manu CPUs, together with earlier CAM locking changes reaching
more then 1 million IOPS (512 byte raw reads from 16 SATA SSDs on 4 HBAs to
256 user-level threads).

Sponsored by:	iXsystems, Inc.
MFC after:	2 months
2013-10-22 08:22:19 +00:00
Mark Johnston
7e75d58610 When fetching function arguments out of a frame on amd64, explicitly select
the register based on the argument index rather than relying on the fields
in struct reg to be in the right order. This assumption is incorrect on
FreeBSD and generally led to bogus argument values for the sixth argument
of PID and USDT probes; the first five are passed directly to dtrace_probe()
via the fasttrap trap handler and so were correctly handled.

MFC after:	2 weeks
2013-10-21 04:15:55 +00:00
Mark Johnston
e572bc11ec Add a function, memstr, which can be used to convert a buffer of
null-separated strings to a single string. This can be used to print the
full arguments of a process using execsnoop (from the DTrace toolkit) or
with the following one-liner:

dtrace -n 'syscall::execve:return {trace(curpsinfo->pr_psargs);}'

Note that this relies on the process arguments being cached via the struct
proc, which means that it will not work for argvs longer than
kern.ps_arg_cache_limit. However, the following rather non-portable
script can be used to extract any argv at exec time:

fbt::kern_execve:entry
{
    printf("%s", memstr(args[1]->begin_argv, ' ',
        args[1]->begin_envv - args[1]->begin_argv));
}

The debug.dtrace.memstr_max sysctl limits the maximum argument size to
memstr(). Thanks to Brendan Gregg for helpful comments on freebsd-dtrace.

Tested by:	Fabian Keil (earlier version)
MFC after:	2 weeks
2013-10-16 01:39:26 +00:00
Justin Hibbits
30b318b92f Add fasttrap for PowerPC. This is the last piece of the dtrace/ppc puzzle.
It's incomplete, it doesn't contain full instruction emulation, but it should be
sufficient for most cases.

MFC after:	1 month
2013-10-15 15:00:29 +00:00
Andriy Gapon
5d8fac897e MFV r255257: 4082 zfs receive gets EFBIG from dmu_tx_hold_free()
illumos change 14172:be36a38bac3d:
  illumos ZFS issues:
    4082 zfs receive gets EFBIG from dmu_tx_hold_free()

Please note that this change is slightly different from r255257, because
it is merged out of order with other (larger) upstream changes.

PR:		kern/182570
Reported by:	Keith White <kwhite@site.uottawa.ca>
Tested by:	Keith White <kwhite@site.uottawa.ca>
Approved by:	re (glebius)
MFC after:	1 week
X-MFC after:	r254753
2013-10-10 09:53:46 +00:00
Mark Johnston
cb7320ce7b Initialize and free the DTrace taskqueue in the dtrace module load/unload
handlers rather than in the dtrace device open/close methods. The current
approach can cause a panic if the device is closed which the taskqueue
thread is active, or if a kernel module containing a provider is unloaded
while retained enablings are present and the dtrace device isn't opened.

Submitted by:	gibbs (original version)
Reviewed by:	gibbs
Approved by:	re (glebius)
MFC after:	2 weeks
2013-10-08 12:56:46 +00:00
Xin LI
6eb151f212 Improve lzjb decompress performance by reorganizing the code
to tighten the copy loop.

Submitted by:	Denis Ahrens <denis h3q com>
MFC after:	2 weeks
Approved by:	re (gjb)
2013-10-08 01:38:24 +00:00
Justin T. Gibbs
69d1b777e8 Optimize the block size used on ZFS cache devices as is already done
for data and log devices.

Reported by:	Dmitryy Makarov
Submitted by:	smh
Reviewed by:	gibbs
Approved by:	re (delphij)
MFC after:	2 weeks
2013-09-21 03:52:08 +00:00
Xin LI
253aa02fc3 MFV r254750:
Add support of Illumos dumps on zvol over RAID-Z.

Note that this only adds the features.  FreeBSD would
still need more work to support dumping on zvols.

Illumos ZFS issues:
  2932 support crash dumps to raidz, etc. pools

MFC after:	1 month
Approved by:	re (ZFS blanket)
2013-09-21 00:17:26 +00:00
Davide Italiano
a25a7e386a Fixup cross-device rename checks in ZFS. Add a check for the case
where 'fdvp' is a directory, 'tvp' is an already existing directory
and they have different mount points.

Reported by:	avg, pjd
Reviewed by:	pjd
Approved by:	re (rodrigc)
2013-09-20 23:22:00 +00:00
Xin LI
e8de677c74 MFV r247844 (illumos-gate 13975:ef6409bc370f)
Illumos ZFS issues:
  3582 zfs_delay() should support a variable resolution
  3584 DTrace sdt probes for ZFS txg states

Provide a compatibility shim for Solaris's cv_timedwait_hires
to help aid future porting.

Approved by:	re (ZFS blanket)
2013-09-10 01:46:47 +00:00
Davide Italiano
d56b4cd4ac - Use make_dev_credf(MAKEDEV_REF) instead of the race-prone make_dev()+
dev_ref() in the clone handlers that still use it.
- Don't set SI_CHEAPCLONE flag, it's not used anywhere neither in devfs
(for anything real)

Reviewed by:	kib
2013-09-07 13:45:44 +00:00
Pawel Jakub Dawidek
ab568de789 Handle cases where capability rights are not provided.
Reported by:	kib
2013-09-05 11:58:12 +00:00
Pawel Jakub Dawidek
7e473ea146 Add sysctl/tunables for various metaslab variables. 2013-09-05 00:53:01 +00:00
Pawel Jakub Dawidek
7008be5bd7 Change the cap_rights_t type from uint64_t to a structure that we can extend
in the future in a backward compatible (API and ABI) way.

The cap_rights_t represents capability rights. We used to use one bit to
represent one right, but we are running out of spare bits. Currently the new
structure provides place for 114 rights (so 50 more than the previous
cap_rights_t), but it is possible to grow the structure to hold at least 285
rights, although we can make it even larger if 285 rights won't be enough.

The structure definition looks like this:

	struct cap_rights {
		uint64_t	cr_rights[CAP_RIGHTS_VERSION + 2];
	};

The initial CAP_RIGHTS_VERSION is 0.

The top two bits in the first element of the cr_rights[] array contain total
number of elements in the array - 2. This means if those two bits are equal to
0, we have 2 array elements.

The top two bits in all remaining array elements should be 0.
The next five bits in all array elements contain array index. Only one bit is
used and bit position in this five-bits range defines array index. This means
there can be at most five array elements in the future.

To define new right the CAPRIGHT() macro must be used. The macro takes two
arguments - an array index and a bit to set, eg.

	#define	CAP_PDKILL	CAPRIGHT(1, 0x0000000000000800ULL)

We still support aliases that combine few rights, but the rights have to belong
to the same array element, eg:

	#define	CAP_LOOKUP	CAPRIGHT(0, 0x0000000000000400ULL)
	#define	CAP_FCHMOD	CAPRIGHT(0, 0x0000000000002000ULL)

	#define	CAP_FCHMODAT	(CAP_FCHMOD | CAP_LOOKUP)

There is new API to manage the new cap_rights_t structure:

	cap_rights_t *cap_rights_init(cap_rights_t *rights, ...);
	void cap_rights_set(cap_rights_t *rights, ...);
	void cap_rights_clear(cap_rights_t *rights, ...);
	bool cap_rights_is_set(const cap_rights_t *rights, ...);

	bool cap_rights_is_valid(const cap_rights_t *rights);
	void cap_rights_merge(cap_rights_t *dst, const cap_rights_t *src);
	void cap_rights_remove(cap_rights_t *dst, const cap_rights_t *src);
	bool cap_rights_contains(const cap_rights_t *big, const cap_rights_t *little);

Capability rights to the cap_rights_init(), cap_rights_set(),
cap_rights_clear() and cap_rights_is_set() functions are provided by
separating them with commas, eg:

	cap_rights_t rights;

	cap_rights_init(&rights, CAP_READ, CAP_WRITE, CAP_FSTAT);

There is no need to terminate the list of rights, as those functions are
actually macros that take care of the termination, eg:

	#define	cap_rights_set(rights, ...)				\
		__cap_rights_set((rights), __VA_ARGS__, 0ULL)
	void __cap_rights_set(cap_rights_t *rights, ...);

Thanks to using one bit as an array index we can assert in those functions that
there are no two rights belonging to different array elements provided
together. For example this is illegal and will be detected, because CAP_LOOKUP
belongs to element 0 and CAP_PDKILL to element 1:

	cap_rights_init(&rights, CAP_LOOKUP | CAP_PDKILL);

Providing several rights that belongs to the same array's element this way is
correct, but is not advised. It should only be used for aliases definition.

This commit also breaks compatibility with some existing Capsicum system calls,
but I see no other way to do that. This should be fine as Capsicum is still
experimental and this change is not going to 9.x.

Sponsored by:	The FreeBSD Foundation
2013-09-05 00:09:56 +00:00
Justin Hibbits
7fb93a40c2 Whitespace cleanup. 2013-09-02 23:22:05 +00:00
Justin Hibbits
f0bd82a11b Fixes for DTrace on PowerPC:
- Implement dtrace_getarg()
- Sync fbt with x86, and fix a typo.
- Pull in the time synchronization code from amd64.
2013-08-31 16:30:20 +00:00
Xin LI
1c1075ed93 Previously, both zfs_rename and zfs_link does a check on whether
the passed vnode belongs to the same mount point (v_vfsp or also
known as v_mount in FreeBSD).  This check prevents the code from
proceeding further on vnodes that do not belong to ZFS, for
instance, on UFS or NULLFS.

The recent change (merged as r254585) on upstream changes the
check of v_vfsp to instead check the znode's z_zfsvfs.  On Illumos
this would work because when the vnode comes from lofs, the
VOP_REALVP() would give the right vnode, this is not true on
FreeBSD where our VOP_REALVP is a no-op, and as such tdvp is
not guaranteed to be a ZFS vnode, and will later trigger a
failed assertion when verifying the vnode.

This changeset modifies our local shims (zfs_freebsd_rename and
zfs_freebsd_link) to check if v_mount matches before proceeding
further.

Reported by:		many
Diagnostic work by:	avg
2013-08-28 00:39:47 +00:00
Mark Johnston
29f4e216f2 Rename the kld_unload event handler to kld_unload_try, and add a new
kld_unload event handler which gets invoked after a linker file has been
successfully unloaded. The kld_unload and kld_load event handlers are now
invoked with the shared linker lock held, while kld_unload_try is invoked
with the lock exclusively held.

Convert hwpmc(4) to use these event handlers instead of having
kern_kldload() and kern_kldunload() invoke hwpmc(4) hooks whenever files are
loaded or unloaded. This has no functional effect, but simplifes the linker
code somewhat.

Reviewed by:	jhb
2013-08-24 21:13:38 +00:00
Xin LI
439024135c MFV r254749:
Don't hold dd_lock for long by breaking it when not doing dsl_dir
accounting.  It is not necessary to hold the lock while manipulating
the parent's accounting, because there is no interface for userland
to see a consistent picture of both parent and child at the same
time anyway.

Illumos ZFS issues:
  4046 dsl_dataset_t ds_dir->dd_lock is highly contended
2013-08-24 00:42:37 +00:00
Xin LI
00e37ef129 MFV r254747:
Fix a panic from dbuf_free_range() from dmu_free_object() while
doing zfs receive.  This is a regression from FreeBSD r253821.

Illumos ZFS issues:
  4047 panic from dbuf_free_range() from dmu_free_object() while
       doing zfs receive
2013-08-24 00:19:26 +00:00
Xin LI
3f0164abf3 MFV r254422:
Illumos DTrace issues:
  3089 want ::typedef
  3094 libctf should support removing a dynamic type
  3095 libctf does not validate arrays correctly
  3096 libctf does not validate function types correctly
2013-08-23 23:21:24 +00:00
Andriy Gapon
2073a41a42 zfs: do not reject any operations on a pool just because it's a boot pool
Unlike the upstream FreeBSD supports booting to all kinds of pools.

Requested by:	many
Tested by:	sbruno
MFC after:	12 days
2013-08-23 14:43:32 +00:00
Andriy Gapon
17a9f2d4db fbt: drop a local write-only variable
Discovered with:	gcc46
MFC after:	4 days
2013-08-23 14:41:27 +00:00
Andriy Gapon
05869c0ea7 zfs: inline and remove zfs_vnode_lock
It didn't serve any useful purpose, but obscured file and line information
useful for debugging.

MFC after:	5 days
X-MFC with:	r254445
2013-08-23 14:40:09 +00:00
Konstantin Belousov
5944de8ecd Remove the deprecated VM_ALLOC_RETRY flag for the vm_page_grab(9).
The flag was mandatory since r209792, where vm_page_grab(9) was
changed to only support the alloc retry semantic.

Suggested and reviewed by:	alc
Sponsored by:	The FreeBSD Foundation
2013-08-22 07:39:53 +00:00
Justin Hibbits
7ccb72b31f Make dtrace_copy() actually work on PowerPC. Although unused currently,
it may be used in the future by dtrace.
2013-08-22 02:54:20 +00:00
Kenneth D. Merry
7da1a731c6 Expand the use of stat(2) flags to allow storing some Windows/DOS
and CIFS file attributes as BSD stat(2) flags.

This work is intended to be compatible with ZFS, the Solaris CIFS
server's interaction with ZFS, somewhat compatible with MacOS X,
and of course compatible with Windows.

The Windows attributes that are implemented were chosen based on
the attributes that ZFS already supports.

The summary of the flags is as follows:

UF_SYSTEM:	Command line name: "system" or "usystem"
		ZFS name: XAT_SYSTEM, ZFS_SYSTEM
		Windows: FILE_ATTRIBUTE_SYSTEM

		This flag means that the file is used by the
		operating system.  FreeBSD does not enforce any
		special handling when this flag is set.

UF_SPARSE:	Command line name: "sparse" or "usparse"
		ZFS name: XAT_SPARSE, ZFS_SPARSE
		Windows: FILE_ATTRIBUTE_SPARSE_FILE

		This flag means that the file is sparse.  Although
		ZFS may modify this in some situations, there is
		not generally any special handling for this flag.

UF_OFFLINE:	Command line name: "offline" or "uoffline"
		ZFS name: XAT_OFFLINE, ZFS_OFFLINE
		Windows: FILE_ATTRIBUTE_OFFLINE

		This flag means that the file has been moved to
		offline storage.  FreeBSD does not have any special
		handling for this flag.

UF_REPARSE:	Command line name: "reparse" or "ureparse"
		ZFS name: XAT_REPARSE, ZFS_REPARSE
		Windows: FILE_ATTRIBUTE_REPARSE_POINT

		This flag means that the file is a Windows reparse
		point.  ZFS has special handling code for reparse
		points, but we don't currently have the other
		supporting infrastructure for them.

UF_HIDDEN:	Command line name: "hidden" or "uhidden"
		ZFS name: XAT_HIDDEN, ZFS_HIDDEN
		Windows: FILE_ATTRIBUTE_HIDDEN

		This flag means that the file may be excluded from
		a directory listing if the application honors it.
		FreeBSD has no special handling for this flag.

		The name and bit definition for UF_HIDDEN are
		identical to the definition in MacOS X.

UF_READONLY:	Command line name: "urdonly", "rdonly", "readonly"
		ZFS name: XAT_READONLY, ZFS_READONLY
		Windows: FILE_ATTRIBUTE_READONLY

		This flag means that the file may not written or
		appended, but its attributes may be changed.

		ZFS currently enforces this flag, but Illumos
		developers have discussed disabling enforcement.

		The behavior of this flag is different than MacOS X.
		MacOS X uses UF_IMMUTABLE to represent the DOS
		readonly permission, but that flag has a stronger
		meaning than the semantics of DOS readonly permissions.

UF_ARCHIVE:	Command line name: "uarch", "uarchive"
		ZFS_NAME: XAT_ARCHIVE, ZFS_ARCHIVE
		Windows name: FILE_ATTRIBUTE_ARCHIVE

		The UF_ARCHIVED flag means that the file has changed and
		needs to be archived.  The meaning is same as
		the Windows FILE_ATTRIBUTE_ARCHIVE attribute, and
		the ZFS XAT_ARCHIVE and ZFS_ARCHIVE attribute.

		msdosfs and ZFS have special handling for this flag.
		i.e. they will set it when the file changes.

sys/param.h:		Bump __FreeBSD_version to 1000047 for the
			addition of new stat(2) flags.

chflags.1:		Document the new command line flag names
			(e.g. "system", "hidden") available to the
			user.

ls.1:			Reference chflags(1) for a list of file flags
			and their meanings.

strtofflags.c:		Implement the mapping between the new
			command line flag names and new stat(2)
			flags.

chflags.2:		Document all of the new stat(2) flags, and
			explain the intended behavior in a little
			more detail.  Explain how they map to
			Windows file attributes.

			Different filesystems behave differently
			with respect to flags, so warn the
			application developer to take care when
			using them.

zfs_vnops.c:		Add support for getting and setting the
			UF_ARCHIVE, UF_READONLY, UF_SYSTEM, UF_HIDDEN,
			UF_REPARSE, UF_OFFLINE, and UF_SPARSE flags.

			All of these flags are implemented using
			attributes that ZFS already supports, so
			the on-disk format has not changed.

			ZFS currently doesn't allow setting the
			UF_REPARSE flag, and we don't really have
			the other infrastructure to support reparse
			points.

msdosfs_denode.c,
msdosfs_vnops.c:	Add support for getting and setting
			UF_HIDDEN, UF_SYSTEM and UF_READONLY
			in MSDOSFS.

			It supported SF_ARCHIVED, but this has been
			changed to be UF_ARCHIVE, which has the same
			semantics as the DOS archive attribute instead
			of inverse semantics like SF_ARCHIVED.

			After discussion with Bruce Evans, change
			several things in the msdosfs behavior:

			Use UF_READONLY to indicate whether a file
			is writeable instead of file permissions, but
			don't actually enforce it.

			Refuse to change attributes on the root
			directory, because it is special in FAT
			filesystems, but allow most other attribute
			changes on directories.

			Don't set the archive attribute on a directory
			when its modification time is updated.
			Windows and DOS don't set the archive attribute
			in that scenario, so we are now bug-for-bug
			compatible.

smbfs_node.c,
smbfs_vnops.c:		Add support for UF_HIDDEN, UF_SYSTEM,
			UF_READONLY and UF_ARCHIVE in SMBFS.

			This is similar to changes that Apple has
			made in their version of SMBFS (as of
			smb-583.8, posted on opensource.apple.com),
			but not quite the same.

			We map SMB_FA_READONLY to UF_READONLY,
			because UF_READONLY is intended to match
			the semantics of the DOS readonly flag.
			The MacOS X code maps both UF_IMMUTABLE
			and SF_IMMUTABLE to SMB_FA_READONLY, but
			the immutable flags have stronger meaning
			than the DOS readonly bit.

stat.h:			Add definitions for UF_SYSTEM, UF_SPARSE,
			UF_OFFLINE, UF_REPARSE, UF_ARCHIVE, UF_READONLY
			and UF_HIDDEN.

			The definition of UF_HIDDEN is the same as
			the MacOS X definition.

			Add commented-out definitions of
			UF_COMPRESSED and UF_TRACKED.  They are
			defined in MacOS X (as of 10.8.2), but we
			do not implement them (yet).

ufs_vnops.c:		Add support for getting and setting
			UF_ARCHIVE, UF_HIDDEN, UF_OFFLINE, UF_READONLY,
			UF_REPARSE, UF_SPARSE, and UF_SYSTEM in UFS.
			Alphabetize the flags that are supported.

			These new flags are only stored, UFS does
			not take any action if the flag is set.

Sponsored by:	Spectra Logic
Reviewed by:	bde (earlier version)
2013-08-21 23:04:48 +00:00
Justin T. Gibbs
5119608387 Add kstat entries for ZFS compression statistics.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio_compress.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio_compress.c:
	Add module lifetime functions to allocate and teardown
	state data.

	Report:
		- Compression attempts.
		- Buffers found to be empty.
		- Compression calls that are skipped because
		  the data length is already less than or
		  equal to the minimum block length.
		- Compression attempts that fail to yield a 12.5%
		  compression ratio.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c:
	Add calls to the zio_compress.c module's init and fini
	functions.

Sponosred by:	Spectra Logic Corporation
MFC after:	2 weeks
2013-08-21 19:40:43 +00:00
Justin T. Gibbs
439d30d121 Enhance the ZFS vdev layer to maintain both a logical and a physical
minimum allocation size for devices.  Use this information to
automatically increase ZFS's minimum allocation size for new top-level
vdevs to a value that more closely matches the optimum device
allocation size.

Use GEOM's stripesize attribute, if set, as the physical sector
size of the GEOM.

Calculate the minimum blocksize of each metaslab class.  Use the
calculated value instead of SPA_MINBLOCKSIZE (512b) when determining
the likelyhood of compression yeilding a reduction in physical space
usage.

Report devices with sub-optimal block size configuration in "zpool
status".  Also properly fail attempts to attach devices with a
logical block size greater than 8kB, since this will cause corruption
to ZFS's label area.

Sponsored by:	Spectra Logic Corporaion
MFC after:	2 weeks

Background
==========
Many modern devices use physical allocation units that are much
larger than the minimum logical allocation size accessible by
external commands.  Two prevalent examples of this are 512e disk
drives (512b logical sector, 4K physical sector) and flash devices
(512b logical sector, 4K or larger allocation block size, and 128k
or larger erase block size).  Operations that modify less than the
physical sector size result in a costly read-modify-write or garbage
collection sequence on these devices.

Simply exporting the true physical sector of the device to ZFS would
yield optimal performance, but has two serious drawbacks:

1) Existing pools created with devices that have different logical
   and physical block sizes, but were configured to use the logical
   block size (e.g. because the OS version used for pool construction
   reported the logical block size instead of the physical block
   size) will suddenly find that the vdev allocation size has
   increased.  This can be easily tolerated for active members of
   the array, but ZFS would prevent replacement of a vdev with
   another identical device because it now appears that the smaller
   allocation size required by the pool is not supported by the new
   device.

2) The device's physical block size may be too large to be supported
   by ZFS.  The optimal allocation size for the vdev may be quite
   large.  For example, a RAID controller may export a vdev that
   requires read-modify-write cycles unless accessed using 64k
   aligned/sized requests.  ZFS currently has an 8k minimum block
   size limit.

Reporting both the logical and physical allocation sizes for vdevs
solves these problems.  A device may be used so long as the logical
block size is compatible with the configuration.  By comparing the
logical and physical block sizes, new configurations can be optimized
and administrators can be notified of any existing pools that are
sub-optimal.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa.h:
	Add the SPA_ASHIFT constant.  ZFS currently has a hard upper
	limit of 13 (8k) for ashift and this constant is used to
	both document and enforce this limit.

sys/cddl/contrib/opensolaris/uts/common/sys/fs/zfs.h:
	Add the VDEV_AUX_ASHIFT_TOO_BIG error code.

	Add fields for exporting the configured, logical, and
	physical ashift to the vdev_stat_t structure.

	Add VDEV_STAT_VALID() macro which can be used to verify the
	presence of required vdev_stat_t fields in nvlist data.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c:
	Provide a SYSCTL_PROC handler for "max_auto_ashift".  Since
	the limit is only referenced long after boot when a create
	operation occurs, there's no compelling need for it to be
	a boot time configurable tunable.  This also allows the
	validation code for the max_auto_ashift value to be contained
	within the sysctl handler.

	Populate the new fields in the vdev_stat_t structure.

	Fail vdev opens if the vdev reports an ashift larger than
	SPA_MAXASHIFT.

	Propogate vdev_logical_ashift and vdev_physical_ashift between
	child and parent vdevs as is done for vdev_ashift.

	In vdev_open(), restore code that fails opens for devices
	where vdev_ashift grows.  This can only happen now if the
	device's logical ashift grows, which means it really isn't
	safe to use the device.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_file.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_mirror.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_missing.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_root.c:
	Update the vdev_open() API so that both logical (what was
	just ashift before) and physical ashift are reported.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h:
	Add two new fields, vdev_physical_ashift and vdev_logical_ashift,
	to vdev_t.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_config.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:
	Add vdev_ashift_optimize().  Call it anytime a new top-level
	vdev is allocated.

cddl/contrib/opensolaris/cmd/zpool/zpool_main.c:
	Add text for the VDEV_AUX_ASHIFT_TOO_BIG error.

	For each sub-optimally configured leaf vdev, report configured
	and native block sizes.

cddl/contrib/opensolaris/cmd/zpool/zpool_main.c:
cddl/contrib/opensolaris/lib/libzfs/common/libzfs.h:
cddl/contrib/opensolaris/lib/libzfs/common/libzfs_status.c:
	Introduce a new zpool status: ZPOOL_STATUS_NON_NATIVE_ASHIFT.
	This status is reported on healthy pools containing vdevs
	configured to use a block size smaller than their reported
	physical block size.

cddl/contrib/opensolaris/lib/libzfs/common/libzfs_status.c:
	Update find_vdev_problem() and supporting functions to
	provide the full vdev_stat_t structure to problem checking
	routines, and to allow decent into replacing vdevs.

	Add a vdev_non_native_ashift() validator which is used on
	the full vdev tree to check for ZPOOL_STATUS_NON_NATIVE_ASHIFT.

cddl/contrib/opensolaris/lib/libzpool/common/kernel.c:
cddl/contrib/opensolaris/lib/libzpool/common/sys/zfs_context.h:
	Enhance sysctl userland stubs now that a SYSCTL_PROC handler
	is used in vdev.c.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab_impl.h:
	When the group membership of a metaslab class changes (i.e.
	when a vdev is added or removed from a pool), walk the group
	list to determine the smallest block size currently available
	and record this in the metaslab class.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c:
	Add the metaslab_class_get_minblocksize() accessor.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio_compress.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio_compress.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:
	In zio_compress_data(), take the minimum blocksize as an
	input parameter instead of assuming SPA_MINBLOCKSIZE.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:
	In l2arc_compress_buf(), pass SPA_MINBLOCKSIZE as the minimum
	blocksize of the device.  The l2arc code performs has it's own
	code for deciding if compression is worth while, so this
	effectively disables zio_compress_data() from second guessing
	the original decision.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:
	In zio_write_bp_init(), use the minimum blocksize of the
	normal metaslab class when compressing data.
2013-08-21 04:10:24 +00:00
Xin LI
2640fb93f5 MFV r254421:
Illumos ZFS issues:
  3996 want a libzfs_core API to rollback to latest snapshot
2013-08-21 00:04:31 +00:00
Xin LI
c21d9cfe3d MFV r254220:
Illumos ZFS issues:
  4039 zfs_rename()/zfs_link() needs stronger test for XDEV
2013-08-20 22:31:13 +00:00
Justin Hibbits
cc117e2773 Fix some ppc64 dtrace bugs, and enable systrace_freebsd32 for ppc64. 2013-08-19 05:10:46 +00:00
Mark Johnston
7bc992c037 Add a "translated type" argument to SDT_PROBE_ARGTYPE() and add some macros
which allow one to define SDT probes that specify translated types. The idea
is to make it easy to write SDT probe definitions that can work across
multiple operating systems. In particular, this makes it possible to port
illumos SDT probes to FreeBSD without changing their argument types, so long
as the appropriate translators are defined. Then DTrace scripts written for
Solaris/illumos will work on FreeBSD without any changes.

MFC after:	1 week
2013-08-17 22:02:26 +00:00
Pawel Jakub Dawidek
2c40899ecc Remove redundant variable. 2013-08-17 14:09:46 +00:00
Mark Johnston
12ede07ab8 Use kld_{load,unload} instead of mod_{load,unload} for the linker file load
and unload event handlers added in r254266.

Reported by:	jhb
X-MFC with:	r254266
2013-08-14 00:42:21 +00:00
Mark Johnston
8776669b53 FreeBSD's DTrace implementation has a few problems with respect to handling
probes declared in a kernel module when that module is unloaded. In
particular,

* Unloading a module with active SDT probes will cause a panic. [1]
* A module's (FBT/SDT) probes aren't destroyed when the module is unloaded;
  trying to use them after the fact will generally cause a panic.

This change fixes both problems by porting the DTrace module load/unload
handlers from illumos and registering them with the corresponding
EVENTHANDLER(9) handlers. This allows the DTrace framework to destroy all
probes defined in a module when that module is unloaded, and to prevent a
module unload from proceeding if some of its probes are active. The latter
problem has already been fixed for FBT probes by checking lf->nenabled in
kern_kldunload(), but moving the check into the DTrace framework generalizes
it to all kernel providers and also fixes a race in the current
implementation (since a probe may be activated between the check and the
call to linker_file_unload()).

Additionally, the SDT implementation has been reworked to define SDT
providers/probes/argtypes in linker sets rather than using SYSINIT/SYSUNINIT
to create and destroy SDT probes when a module is loaded or unloaded. This
simplifies things quite a bit since it means that pretty much all of the SDT
code can live in sdt.ko, and since it becomes easier to integrate SDT with
the DTrace framework. Furthermore, this allows FreeBSD to be quite flexible
in that SDT providers spanning multiple modules can be created on the fly
when a module is loaded; at the moment it looks like illumos' SDT
implementation requires all SDT probes to be statically defined in a single
kernel table.

PR:		166927, 166926, 166928
Reported by:	davide [1]
Reviewed by:	avg, trociny (earlier version)
MFC after:	1 month
2013-08-13 03:10:39 +00:00
Rui Paulo
e009490afc fasttrap_fork(): unlock the processes before removing the tracepoints.
In the future, we'll need to come up with new proc_*() functions that accept
locked processes. For now, this prevents postgresql + DTrace from crashing the
system.

MFC after:	1 month
2013-08-11 00:57:01 +00:00
Attilio Rao
c7aebda8a1 The soft and hard busy mechanism rely on the vm object lock to work.
Unify the 2 concept into a real, minimal, sxlock where the shared
acquisition represent the soft busy and the exclusive acquisition
represent the hard busy.
The old VPO_WANTED mechanism becames the hard-path for this new lock
and it becomes per-page rather than per-object.
The vm_object lock becames an interlock for this functionality:
it can be held in both read or write mode.
However, if the vm_object lock is held in read mode while acquiring
or releasing the busy state, the thread owner cannot make any
assumption on the busy state unless it is also busying it.

Also:
- Add a new flag to directly shared busy pages while vm_page_alloc
  and vm_page_grab are being executed.  This will be very helpful
  once these functions happen under a read object lock.
- Move the swapping sleep into its own per-object flag

The KPI is heavilly changed this is why the version is bumped.
It is very likely that some VM ports users will need to change
their own code.

Sponsored by:	EMC / Isilon storage division
Discussed with:	alc
Reviewed by:	jeff, kib
Tested by:	gavin, bapt (older version)
Tested by:	pho, scottl
2013-08-09 11:11:11 +00:00
Xin LI
43667c1f68 MFV r254079:
Illumos ZFS issues:
  3957 ztest should update the cachefile before killing itself
  3958 multiple scans can lead to partial resilvering
  3959 ddt entries are not always resilvered
  3960 dsl_scan can skip over dedup-ed blocks if
       physical birth != logical birth
  3961 freed gang blocks are not resilvered and can cause pool to suspend
  3962 ztest should print out zfs debug buffer before exiting
2013-08-08 23:38:31 +00:00
Xin LI
9d2f243aa6 MFV r254071:
Fix a regression introduced by fix for Illumos bug #3834.  Quote from
Matthew Ahrens on the Illumos issue:

ztest fails this assertion because ztest_dmu_read_write() does
        dmu_tx_hold_free(tx, bigobj, bigoff, bigsize);
and then
    dmu_object_set_checksum(os, bigobj,
        (enum zio_checksum)ztest_random_dsl_prop(ZFS_PROP_CHECKSUM), tx);

If the region to free is past the end of the file, the DMU assumes that there
will be nothing to do for this object.  However, ztest does set_checksum(),
which must modify the dnode.  The fix is for ztest to also call

    dmu_tx_hold_bonus(tx, bigobj);

so we can account for the dirty data associated with setting the checksum

Illumos ZFS issues:
  3955 ztest failure: assertion refcount_count(&tx->tx_space_written)
         + delta <= tx->tx_space_towrite
2013-08-07 22:21:00 +00:00
Xin LI
4f7b34578b MFV r254070:
Merge vendor bugfix for ZFS test suite that triggers false positives.

Illumos ZFS issues:
  3949 ztest fault injection should avoid resilvering devices
  3950 ztest: deadman fires when we're doing a scan
  3951 ztest hang when running dedup test
  3952 ztest: ztest_reguid test and ztest_fault_inject don't place nice together
2013-08-07 21:16:14 +00:00
Jeff Roberson
5df87b21d3 Replace kernel virtual address space allocation with vmem. This provides
transparent layering and better fragmentation.

 - Normalize functions that allocate memory to use kmem_*
 - Those that allocate address space are named kva_*
 - Those that operate on maps are named kmap_*
 - Implement recursive allocation handling for kmem_arena in vmem.

Reviewed by:	alc
Tested by:	pho
Sponsored by:	EMC / Isilon Storage Division
2013-08-07 06:21:20 +00:00
Xin LI
c668ff330e MFV r254011:
This change have no effect to FreeBSD but integrated for
completeness.

Illumos ZFS issues:
  348 ZFS should handle DKIOCGMEDIAINFOEXT failure
2013-08-06 21:36:01 +00:00
Andriy Gapon
c319ea15f4 opensolaris code: translate INVARIANTS to DEBUG and ZFS_DEBUG
Do this by forcing inclusion of
sys/cddl/compat/opensolaris/sys/debug_compat.h
via -include option into all source files from OpenSolaris.
Note that this -include option must always be after -include opt_global.h.

Additionally, remove forced definition of DEBUG for some modules and fix
their build without DEBUG.

Also, meaning of DEBUG was overloaded to enable WITNESS support for some
OpenSolaris (primarily ZFS) locks.  Now this overloading is removed and
that use of DEBUG is replaced with a new option OPENSOLARIS_WITNESS.

MFC after:	17 days
2013-08-06 15:51:56 +00:00
Alexander Motin
d9aca4ed74 Block reporting of ZFS features for suspended pools.
Before executing any subcommand, zpool tool fetches pools configuration from
the kernel.  Before features support was added, kernel was regenerating that
configuration based on data always present in memory.  Unfortunately, pool
features list and activity counters are not such. They are stored in ZAP,
that normally resides in ARC, but under heavy memory pressure may be swapped
out.  If pool is suspended at this point, there is no way to recover it back
since any zpool command will stuck.

This change has one predictable flaw: `zpool upgrade` always wish to upgrade
suspended pools, but fortunately it can't do it due to the suspension.
2013-08-06 14:41:41 +00:00
Alexander Motin
f8dcf872c4 Disable r252840 when ZFS TRIM is enabled (vfs.zfs.trim.enabled=1) and really
disable TRIM otherwise.

r252840 (illumos bug 3836) is based on assumption that zio_free_sync() has
no lock dependencies and should complete immediately. Unfortunately, with our
TRIM implementation that is not true due to ZIO_STAGE_VDEV_IO_START added
to the ZIO_FREE_PIPELINE, which, while not really accessing devices, still
acquires SCL_ZIO lock for read to be sure devices won't disappear.

When TRIM is disabled, this patch enables direct free execution from r252840
and removes ZIO_STAGE_VDEV_IO_START and ZIO_STAGE_VDEV_IO_ASSESS stages from
the pipeline to avoid lock acquisition.  Otherwise it queues free request as
it was before r252840.
2013-08-06 14:30:28 +00:00
Alexander Motin
526bb4af8a Make zpool clear to reopen also reconnected cache and spare devices.
Since `zpool status` reports about such kinds of errors, it is strange
that they are not cleared by `zpool clear`.
2013-08-06 14:23:33 +00:00
Alexander Motin
ad727e8d64 Make ZFS to use separate thread to handle SPA_ASYNC_REMOVE async events.
Existing async thread is running only on successfull spa_sync() completion,
that is impossible in case of pool loosing required (last) disk(s).  That
indefinite delay of SPA_ASYNC_REMOVE processing made ZFS to not close the
lost disks, preventing GEOM/CAM from destroying devices and reusing names
on later disk reattach.

In earlier version of the patch I've tried to just run existing thread
immediately, unrelated to spa_sync() completion, but that exposed number
of situations where it could stuck due to locks held by stuck spa_sync(),
that are required for other kinds of async events.

Experiments with OpenIndiana snapshot confirmed that they also have this
issue with lost disks reattach.
2013-08-06 14:20:41 +00:00
Attilio Rao
be99683637 Revert r253939:
We cannot busy a page before doing pagefaults.
Infact, it can deadlock against vnode lock, as it tries to vget().
Other functions, right now, have an opposite lock ordering, like
vm_object_sync(), which acquires the vnode lock first and then
sleeps on the busy mechanism.

Before this patch is reinserted we need to break this ordering.

Sponsored by:	EMC / Isilon storage division
Reported by:	kib
2013-08-05 08:55:35 +00:00
Attilio Rao
3b6714cacb The page hold mechanism is fast but it has couple of fallouts:
- It does not let pages respect the LRU policy
- It bloats the active/inactive queues of few pages

Try to avoid it as much as possible with the long-term target to
completely remove it.
Use the soft-busy mechanism to protect page content accesses during
short-term operations (like uiomove_fromphys()).

After this change only vm_fault_quick_hold_pages() is still using the
hold mechanism for page content access.
There is an additional complexity there as the quick path cannot
immediately access the page object to busy the page and the slow path
cannot however busy more than one page a time (to avoid deadlocks).

Fixing such primitive can bring to complete removal of the page hold
mechanism.

Sponsored by:	EMC / Isilon storage division
Discussed with:	alc
Reviewed by:	jeff
Tested by:	pho
2013-08-04 21:07:24 +00:00
Steven Hartland
e44e975c1b zfs_ioc_rename should not leave the value of zc_name passed in via zc altered
on return.

MFC after:	1 week
2013-08-04 11:38:08 +00:00
Xin LI
bd3d1456a5 MFV r253783:
Skip eviction step of processing free records when doing ZFS
receive to avoid the expensive search operation of non-existent
dbufs in dn_dbufs.

Illumos ZFS issues:
  3834 incremental replication of 'holey' file systems is slow

MFC after:      2 weeks
2013-07-30 21:35:02 +00:00
Xin LI
1c4ead73c6 MFV r253782:
To quote Illumos issue #3888:

When 'zfs recv -F' is used with an incremental recv it rolls
back any changes made since the last snapshot in case new
changes were made to the file system while the recv is in
progress (without -F the recv would fail when it does it's
final check to commit the recv-ed data as the recv-ed data
conflicts with the newly written data).

However, if there is a snapshot taken after the recv began
rolling back to the 'latest' snapshot will not help and the
recv will still fail. 'zfs recv -F' should be extended to
destroy any snapshots created since the source snapshot when
finishing the recv (effectively rolling back through all
snapshots, instead of just to the latest snapshot).

Illumos ZFS issues:
  3888 zfs recv -F should destroy any snapshots created since the
       incremental source

MFC after:	2 weeks
2013-07-30 21:20:12 +00:00
Xin LI
d637247e1f MFV r253781 + r253871:
Illumos ZFS issues:
  3894 zfs should not allow snapshot of inconsistent dataset

MFC after:	2 weeks
2013-07-30 21:02:09 +00:00
Xin LI
44e362e207 MFV r253780:
To quote Illumos #3875:

The problem here is that if we ever end up in the error
path, we drop the locks protecting access to the zfsvfs_t
prior to forcibly unmounting the filesystem. Because z_os
is NULL, any thread that had already picked up the zfsvfs_t
and was sitting in ZFS_ENTER() when we dropped our locks
in zfs_resume_fs() will now acquire the lock, attempt to
use z_os, and panic.

Illumos ZFS issues:
  3875 panic in zfs_root() after failed rollback

MFC after:	2 weeks
2013-07-30 20:37:32 +00:00
Alexander Motin
ec4d2e0d96 Allow three IOCTLs to be used on suspended pool, restoring state that
existed before IOCTL code refactoring merged change 4445fffb from illumos
at r248571.

This change allows `zpool clear` to be used again to recover suspended pool.
It seems the only was supposed by the code to restore pool operation after
reconnecting lost disks that were required for data completeness.  There
are still cases where `zpool clear` command can just safely stuck due to
deadlocks inside ZFS kernel part, but probably that is better then having
no chances to recover at all.
2013-07-30 14:50:44 +00:00
Andriy Gapon
0f09691df8 dtrace disassembler: take the latest/last CDDL code from OpenSolaris
OpenSolaris version is:
13108:33bb8a0301ab
6762020 Disassembly support for Intel Advanced Vector Extensions (AVX)

This corresponds to Illumos-gate (github) version
ab47273fedff893c8ae22ec39ffc666d4fa6fc8b

MFC after:	3 weeks
2013-07-29 16:56:38 +00:00
Alexander Motin
698cd997d6 Partially close race between calls of orphan() method from GEOM and close()
method from ZFS core, that reliably causes use-after-free panic if SSD vdev
detached during inititial erase.
2013-07-28 20:07:34 +00:00
Alexander Motin
ffacde9be5 Following r222950, revert unintentional change cls -> class in argument name
in r245264.  Aside from non-uniformity, that again confused C++ compilers.
2013-07-25 08:41:22 +00:00
Andriy Gapon
f66c1f6482 zfs module: perform cleanup during shutdown in addition to module unload
- move init and fini code into separate functions (like it is done upstream)
- invoke fini code via shutdown_post_sync event hook

This should make zfs close its underlying devices during shutdown,
which may be important for their drivers.

MFC after:	20 days
2013-07-24 09:59:16 +00:00
Andriy Gapon
886dbd270f zfs: move vnode creation from zfs_znode_cache_constructor to zfs_znode_alloc
All other places where a znode is allocated do not need z_vnode at all.
These are:
- zfs_create_share_dir
- zfs_create_fs

This chnage ensures two things:
- VN_LOCK_ASHARE is not erroneously called for VFIFO vnodes
- vn_lock is called on a fully constructed vnode with correct v_ops

The change also allows to make zfs_znode_cache_constructor a normal
kmem_cache constructor again (as it is in upstream).
This allows to avoid a problem where zfs_znode_cache_destructor
may be called on un-constructed znodes.

MFC after:	17 days
2013-07-24 09:15:59 +00:00
Xin LI
c92bc5e996 Manually merge part of vendor import r238583 from Illumos.
Illumos changeset: 13680:2bd022a765e2
Illumos ZFS issue:

    2671 zpool import should not fail if vdev ashift has increased

MFC after:	3 days
2013-07-18 00:22:42 +00:00
Andriy Gapon
37b8b2d4d8 dtrace/fasttrap: install hook functions only after all data is
initialized

Sponsored by:	HybridCluster
MFC after:	 7 days
2013-07-09 09:05:00 +00:00
Andriy Gapon
9c1f50af0a zfs: try to properly handle i/o errors in mappedread_sf
Unconditionally freeing a page is not good, especially if it is the page
that was wired by the caller.  The checks are picked up from
kern_sendfile.

MFC after:	3 weeks
2013-07-09 08:47:11 +00:00
Andriy Gapon
78ed7a7855 zfs: load zpool.cache after a root fs is mounted
MFC after:	3 weeks
2013-07-09 08:37:42 +00:00
Mark Johnston
46d27dbb38 Hide references to mod_lock. In FreeBSD it is always acquired with the
provider lock held, so its use has no effect.
2013-07-05 22:42:10 +00:00
Martin Matuska
12df7d65b0 MFV r252839:
Quoting illumos issue #3836:
  Currently zio_free() always puts the zio on a list for subsequent
  processing by zio_free_sync().  This is only necessary for frees that
  might need to issue reads (gang and dedup blocks).

  By processing the majority of the frees as we encounter them, we reduce
  the amount of time that the spa_sync() thread spends burning CPU and
  not doing any i/o, thus increasing the overall write throughput of the
  system.

Illumos ZFS issues:
  3836 zio_free() can be processed immediately in the common case

MFC after:	1 week
2013-07-05 21:29:59 +00:00
Mark Johnston
0022f867b4 Be sure to destory the fasttrap cleanup mutex when unloading the fasttrap
module. This should be MFCed with r250953.
2013-07-01 23:12:59 +00:00
Robert Millan
2592710c47 Enable kernel-specific code for FreeBSD also on other systems that use
the kernel of FreeBSD.

Reviewed by:	pjd
2013-06-30 23:14:55 +00:00
Steven Hartland
3666c4917b style(9) fixes
MFC after:	2 days
2013-06-29 23:39:38 +00:00
Steven Hartland
baa0b41221 Remove invalid ASSERT which causes a panic on zfs renames when run with ASSERTS.
Removal was missed in merge of illumos 3464 (r248571)

MFC after:	2 days
2013-06-29 23:15:45 +00:00
Martin Matuska
f82ca5238a Unbreak "zfs jail" and "zfs unjail" (broken since r248571)
I missed to register zfs_ioc_jail and zfs_ioc_unjail as legacy ioctl's
with the new zfs_ioctl_register_legacy() function.

These operations do not modify pools or datasets so there is no need to
log them to pool history.

Reported by:	Alexander Leidinger <ale@FreeBSD.org> and others on current@
MFC after:	3 days
2013-06-29 16:45:37 +00:00
Gavin Atkinson
af582854d8 Don't try to re-insert an already present but invalid page.
This could happen if a thread doing a page-in loses a ZFS range lock
race to a thread writing to the same range

This fixes "panic: vm_page_alloc: pindex already allocated" in
http://docs.FreeBSD.org/cgi/mid.cgi?1372165971.96049.42.camel

Submitted by:	avg
MFC after:	1 week
2013-06-28 07:51:12 +00:00
Mark Johnston
837610eb04 The dtmalloc provider uses the short description of a malloc type as the
function name of its corresponding DTrace probes. These descriptions may
contain whitespace, but probe names cannot, so just replace any whitespace
with underscores when creating probes.

MFC after:	1 week
2013-06-28 03:14:40 +00:00
Xin LI
e33806a54a MFV r252215:
Restore a previous behavior before r251646, where when destructing
ZFS snapshot, the ioctl would return ENOENT when it hit any of
them in the errlist (the new behavior was only return ENOENT when
all returns error).

Illumos ZFS issues:
  3829 fix for 3740 changed behavior of zfs destroy/hold/release ioctl

MFC after:	1 week
2013-06-25 22:14:32 +00:00
Steven Hartland
43e695497c Switch ZFS mutex_owner macro to use sx_xholder as its now exported via sx.h
MFC after:	1 week
2013-06-21 15:55:03 +00:00
Steven Hartland
9446debe6b Fix intermittent ZFS lock panic when kernel is compiled with debugging caused
by access of uninitialized smlock in mmutex_init.

MFC after:	1 week
2013-06-21 15:47:10 +00:00
Steven Hartland
5f921c5911 Fixed import of destroyed ZFS pools failing due to vdev_geom incorrectly
preventing config loads from devices associated with destroyed pools.

Reviewed by:	avg
MFC after:	1 week
2013-06-21 12:02:09 +00:00
Xin LI
9625321547 MFV r251644:
Poor ZFS send / receive performance due to snapshot
hold / release processing (by smh@)

Illumos ZFS issues:
  3740 Poor ZFS send / receive performance due to snapshot
       hold / release processing

MFC after:      2 weeks
2013-06-12 07:07:06 +00:00
Xin LI
ed8fd1989f MFV r251626:
ZFS event processing should work on R/O root filesystems

Illumos ZFS issues:
  3749 zfs event processing should work on R/O root filesystems

MFC after:      2 weeks
2013-06-11 19:35:44 +00:00
Xin LI
3b245f3ee1 MFV r251624:
txg commit callbacks don't work

Illumos ZFS issues:
  3747 txg commit callbacks don't work

MFC after:      2 weeks
2013-06-11 19:29:31 +00:00
Xin LI
3f3a9cac29 MFV r251622:
ZFS shouldn't ignore errors unmounting snapshots

Illumos ZFS issues:
  3744 zfs shouldn't ignore errors unmounting snapshots

MFC after:      2 weeks
2013-06-11 19:22:20 +00:00
Xin LI
57e06a1a63 MFV r251621:
ZFS needs a refcount audit

Illumos ZFS issues:
  3741 zfs needs a refcount audit

MFC after:      2 weeks
2013-06-11 19:16:14 +00:00
Xin LI
a91afe8a8d MFV r251620:
ZFS comments need cleaner, more consistent style

Illumos ZFS issues:
  3741 zfs comments need cleaner, more consistent style

MFC after:      2 weeks
2013-06-11 19:12:06 +00:00
Xin LI
4acaabea05 MFV r251619:
ZFS needs better comments.

Illumos ZFS issues:
  3741 zfs needs better comments

MFC after:      2 weeks
2013-06-11 19:02:36 +00:00
Xin LI
9e43a32a5c MFV r251519:
* Illumos ZFS issue #3805 arc shouldn't cache freed blocks

Quote from the Illumos issue:

    ZFS should proactively evict freed blocks from the cache.

    Even though these freed blocks will never be used again, and thus
    will eventually be evicted, this causes us to use memory
    inefficiently for 2 reasons:

    1. A block that is freed has no chance of being accessed again, but
       will be kept in memory preferentially to a block that was accessed
       before it (and is thus older) but has not been freed and thus has
       at least some chance of being accessed again.

    2. We partition the ARC into several buckets:
       user data that has been accessed only once (MRU)
       metadata that has been accessed only once (MRU)
       user data that has been accessed more than once (MFU)
       metadata that has been accessed more than once (MFU)

    The user data vs metadata split is somewhat arbitrary, and the
    primary control on how much memory is used to cache data vs metadata
    is to simply try to keep the proportion the same as it has been in the
    past (each bucket "evicts against" itself).  The secondary control is
    to evict data before evicting metadata.

    Because of this bucketing, we may end up with one bucket mostly
    containing freed blocks that are very old, while another bucket has
    more recently accessed, still-allocated blocks.  Data in the useful
    bucket (with still-allocated blocks) may be evicted in preference to
    data in the useless bucket (with old, freed blocks).

    On dcenter, we saw that the MFU metadata bucket was 230MB, while the
    MFU data bucket was 27GB and the MRU metadata bucket was 256GB.
    However, the vast majority of data in the MRU metadata bucket (256GB)
    was freed blocks, and thus useless.  Meanwhile, the MFU metadata bucket
    (230MB) was constantly evicting useful blocks that will be soon needed.

    The problem of cache segmentation is a larger problem that needs more
    investigation.  However, if we stop caching freed blocks, it should
    reduce the impact of this more fundamental issue.

MFC after:	2 weeks
2013-06-08 09:11:20 +00:00
Xin LI
ca8a27d4b1 MFV r251474:
* Illumos zfs issue #3137 L2ARC compression

Whether or not to compress buffers entering the L2ARC is
controlled by "compression" setting on the dataset, when
compression is not "off", L2ARC compression is enabled.

The compress method is always LZ4 for L2ARC when enabled
because it works best for the scenario.

MFC after:	2 weeks
2013-06-06 23:21:41 +00:00
Mark Johnston
f263e440d4 SDT probes can directly pass up to five arguments as arguments to
dtrace_probe(). Arguments beyond these five must be obtained in an
architecture-specific way; this can be done through the getargval provider
method, and through dtrace_getarg() if getargval isn't overridden.

This change fixes two off-by-one bugs in the way these arguments are fetched
in FreeBSD's DTrace implementation. First, the SDT provider must set the
aframes parameter to 1 when creating a probe. The aframes parameter controls
the number of frames that dtrace_getarg() will step over in order to find
the frame containing the extra arguments. On FreeBSD, dtrace_getarg() is
called in SDT probe context via

dtrace_probe()->dtrace_dif_emulate()->dtrace_dif_variable->dtrace_getarg()

so aframes must be 3 since the arguments are in dtrace_probe()'s frame; it
was previously being called with a value of 2 instead. illumos uses a
different aframes value for SDT probes, but this is because illumos SDT
probes fire by triggering the #UD fault handler rather than calling
dtrace_probe() directly.

The second bug has to do with the way arguments are grabbed out
dtrace_probe()'s frame on amd64. The code currently jumps over the first
stack argument and retrieves the rest of them using a pointer into the
stack. This works on i386 because all of dtrace_probe()'s arguments will be
on the stack and the first argument is the probe ID, which should be
ignored. However, it is incorrect to ignore the first stack argument on
amd64, so we correct the pointer used to access the arguments.

MFC after:	2 weeks
2013-06-02 01:05:36 +00:00
Mark Johnston
18161786c6 Port the SDT test now that it's possible to create SDT probes that take
seven arguments.

The original test uses Solaris' uadmin system call to trigger the test
probe; this change adds a sysctl to the dtrace_test module and gets the test
program to trigger the test probe via the sysctl handler.

The test is currently failing on amd64 because of some bugs in the way that
probe arguments beyond the first five are obtained - these bugs will be
fixed in a separate change.
2013-06-02 00:33:36 +00:00
Mark Johnston
427bc75e19 The fasttrap provider cleans up probes asynchronously when a process with
USDT probes exits. This was previously done with a callout; however, it is
possible to sleep while holding the DTrace mutexes, so a panic will occur
on INVARIANTS kernels if the callout handler can't immediately acquire one
of these mutexes. This panic will be frequently triggered on systems where
a USDT-enabled program (perl, for instance) is often run.

This revision changes the fasttrap cleanup mechanism so that a dedicated
thread is used instead of a callout. The old behaviour is otherwise
preserved.

Reviewed by:	rpaulo
MFC after:	1 month
2013-05-24 03:29:32 +00:00
Mark Johnston
09e6105ff4 Bring back part of r249367 by adding DTrace's temporal option, which allows
users to guarantee that the output of DTrace scripts will be time-ordered.
This option is enabled by adding the line

  #pragma D option temporal

to the beginning of a script, or by adding '-x temporal' to the arguments of
dtrace(1).

This change fixes a bug in the original port of the temporal option. This
bug was causing some assertions to fail, so they had been disabled; in this
revision the assertions are working properly and are enabled.

The DTrace version number has been bumped from 1.9.0 to 1.9.1 to reflect
the language change that's being introduced.

This change corresponds to part of illumos-gate commit e5803b76927480:
  3021 option for time-ordered output from dtrace(1M)

Reviewed by:	pfg
Obtained from:	illumos
MFC after:	1 month
2013-05-12 16:26:33 +00:00
Davide Italiano
a28d25df31 In case ZFS doesn't use UMA for buffers there's no need to waste memory
creating zones that will remain empty.

Reviewed by:	pjd
2013-05-01 17:34:44 +00:00
Steven Hartland
562a9d583b Changed ZFS TRIM sysctl from vfs.zfs.trim_disable -> vfs.zfs.trim.enabled
Enabled ZFS TRIM by default

Reviewed by:	pjd (mentor)
Approved by:	pjd (mentor)
MFC after:	2 weeks
2013-04-26 11:24:20 +00:00
Martin Matuska
95b6497f5e MFV r249857:
Merge vendor bugfix for a possible deadlock related to async destroy
and improve write performance by introducing a new lock protecting
tx_open_txg.

Illumos ZFS issues:
  3642 dsl_scan_active() should not issue I/O to determine if async
       destroying is active
  3643 txg_delay should not hold the tc_lock

MFC after:	1 week
2013-04-24 21:21:03 +00:00
Martin Matuska
90eafc0bb8 The zfs synctask code restructuring introduced a new bug that makes it
impossible to set quota and reservation on pools lower than version 22.
Problem has been reported and a solution discussed with vendor.

Illumos ZFS issues:
  3739 cannot set zfs quota or reservation on pool version < 22

Reviewed by:	Matthew Ahrens <mahrens@delphix.com>
Reported by:	Steve Wills <swills@FreeBSD.org>
MFC after:	3 days
2013-04-23 06:28:35 +00:00
Pedro F. Giffuni
03836978be DTrace: Revert r249367
The following change from illumos brought caused DTrace to
pause in an interactive environment:

3026 libdtrace should set LD_NOLAZYLOAD=1 to help the pid provider

This was not detected during testing because it doesn't
affect scripts.

We shouldn't be changing the environment, especially since the
LD_NOLAZYLOAD option doesn't apply to our (GNU) ld.
Unfortunately the change from upstream was made in such a way
that it is very difficult to separate this change from the
others so, at least for now, it's better to just revert
everything.

Reference:
https://www.illumos.org/issues/3026

Reported by:	Navdeep Parhar and Mark Johnston
2013-04-17 02:20:17 +00:00
Pedro F. Giffuni
ddd5b8e9b4 DTrace: option for time-ordered output
Merge changes from illumos:

3021 option for time-ordered output from dtrace(1M)
3022 DTrace: keys should not affect the sort order when sorting by value
3023 it should be possible to dereference dynamic variables
3024 D integer narrowing needs some work
3025 register leak in D code generation
3026 libdtrace should set LD_NOLAZYLOAD=1 to help the pid provider

This brings yet another feature implemented in upstream DTrace.
A complete description is available here:
http://dtrace.org/blogs/ahl/2012/07/28/my-new-dtrace-favorite/

This change bumps the DT_VERS_* number to 1.9.1 in
accordance to what is done in illumos.

This change was somewhat complicated because upstream is mixed many
changes in an individual commit and some of the tests don't really
apply to us.

There are also appear to be differences in timestamping with Solaris
so we had to workaround some assertions making sure no regression
happened.

Special thanks to Fabian Keil for changes and testing.

Illumos Revisions:	13758:23432da34147

Reference:
https://www.illumos.org/issues/3021
https://www.illumos.org/issues/3022
https://www.illumos.org/issues/3023
https://www.illumos.org/issues/3024
https://www.illumos.org/issues/3025
https://www.illumos.org/issues/1694

Tested by:	Fabian Keil
Obtained from:	Illumos
MFC after:	1 months
2013-04-11 16:24:36 +00:00
Martin Matuska
2cb0c5e424 MFV r249354:
Merge bugfixes accepted and integrated by vendor. Underlying problems
have been reported by us and fixed in r240942 and r249196.

Illumos ZFS issues:
  3645 dmu_send_impl: possibilty of pool hold leak
  3692 Panic on zfs receive of a recursive deduplicated stream

MFC after:	8 days
2013-04-11 07:40:30 +00:00
Martin Matuska
86161c3eeb Cast to (void *)(uintptr_t) on copyout and copyin of zfs_iocparm_t.zfs_cmd
MFC after:	9 days
2013-04-10 07:01:17 +00:00
Martin Matuska
83b4af1142 ZFS expects a copyout of zfs_cmd_t on an ioctl error. Our sys_ioctl()
doesn't copyout in this case.

To solve this issue a new struct zfs_iocparm_t is introduced consisting of:
- zfs_ioctl_version (future backwards compatibility purposes)
- user space pointer to zfs_cmd_t (copyin and copyout)
- size of zfs_cmd_t (verification purposes)

The copyin and copyout of zfs_cmd_t is now done the illumos (vendor) way
what makes porting of new changes easier and ensures correct behavior if
returning an error.

MFC after:	10 days
2013-04-09 22:27:44 +00:00
Martin Matuska
a548fef5dc MFV r249186:
Do not list read-only pools in zpool.cache
Reduce diff against vendor in unused vdev_disk.c

Illumos ZFS issues:
  3639 zpool.cache should skip over readonly pools
  3640 want automatic devid updates

MFC after:	1 week
2013-04-06 17:24:00 +00:00
Martin Matuska
95e7edacfe MFV r248660:
Merge vendor change - modify time processing in deadman thread.

Illumos ZFS issues:
  3618 ::zio dcmd does not show timestamp data

MFC after:	3 weeks
2013-04-06 17:15:47 +00:00
Martin Matuska
367437755d Provide a fix for kernel panic if receiving recursive deduplicated streams.
Problem reported to vendor.

Illumos ZFS issues:
  3692 Panic on zfs receive of a recursive deduplicated stream

MFC after:	2 weeks
2013-04-06 11:54:41 +00:00
Martin Matuska
f1b5c26470 MFV r248217:
Merge change from vendor to reduce diff only.
ZFS dtrace probes are not supported on FreeBSD yet.

Illumos ZFS issues:
  3598 want to dtrace when errors are generated in zfs

MFC after:	3 weeks
2013-04-06 10:39:38 +00:00
Martin Matuska
bae7bccf39 MFV r242816:
Import vendor change to reduce diff, no effect on FreeBSD.

Illumos ZFS issues:
  3517 importing pool with autoreplace=on and "hole" vdevs crashes syseventd
2013-04-06 08:21:37 +00:00
Andriy Gapon
9ff9b984c9 spa_open_common: fix argument to zvol_create_minors
Prior to r248571 spa_open was always called with a bare pool name,
but now it is called with a dataset name instead (spa_lookup handles
that).
So, when a ZFS root is mounted spa_open is called with a name of a root
dataset, which can very well be different from the pool name.
But zvol_create_minors should be called with the pool name, because it
performs a recursive traversal of all datasets under the name to find
all those that are volumes.

MFC after:	7 days
2013-04-03 11:06:26 +00:00
Martin Matuska
03863a70e1 Fix possible pool hold leak in dmu_send_impl()
Problem reported to vendor:
  https://www.illumos.org/issues/3645

Reported by:	Andriy Gapon <avg@FreeBSD.org>
MFC after:	15 days
2013-04-03 09:52:30 +00:00
Martin Matuska
41451f4a0e Do not check against uninitialized rc and comment out vendor code
MFC after:	16 days
2013-04-02 08:15:39 +00:00
Pedro F. Giffuni
9f4c7ba460 Dtrace: enablings on defunct providers prevent providers from unregistering
Merge change from illumos:

1368 enablings on defunct providers prevent providers from unregistering

We try to address some underlying differences between the Solaris
and FreeBSD implementations: dtrace_attach() / dtrace_detach() are
currently unimplemented in FreeBSD but the new code from illumos
makes use of taskq so some adaptations were made to dtrace_open()
and dtrace_close() to handle them appropriately.

Illumos Revision:	r13430:8e6add739e38

Reference:
https://www.illumos.org/issues/1368

Reviewed by:	gnn
Tested by:	Fabian Keil
Obtained from:	Illumos
MFC after:	3 weeks
2013-04-01 19:13:46 +00:00
Martin Matuska
20547d41f8 Call dmu_snapshot_list_next() in zvol.c with dsl_pool_config lock held
Submitted by:	Andriy Gapon <avg@FreeBSD.org>
MFC after:	17 days
2013-04-01 16:14:57 +00:00
Pedro F. Giffuni
f5678b698a Dtrace: dtrace.c erroneously checks for memory alignment on amd64.
Merge change from illumos:

3511 dtrace.c erroneously checks for memory alignment on amd64

Illumos Revision:	c93cc65

Reference:
https://www.illumos.org/issues/3511

Obtained from:	Illumos
MFC after:	3 weeks
2013-03-26 20:17:08 +00:00
Pedro F. Giffuni
5472787377 Dtrace: Add SUN MDB-like type-aware print() action.
Merge change from illumos:

1694 Add type-aware print() action

This is a very nice feature implemented in upstream Dtrace.
A complete description is available here:
http://dtrace.org/blogs/eschrock/2011/10/26/your-mdb-fell-into-my-dtrace/

This change bumps the DT_VERS_* number to 1.9.0 in
accordance to what is done in illumos.

While here also include some minor cleanups to ease further merging
and appease clang with a fix by Fabian Keil.

Illumos Revisions:	13501:c3a7090dbc16
			13483:f413e6c5d297

Reference:
https://www.illumos.org/issues/1560
https://www.illumos.org/issues/1694

Tested by:	Fabian Keil
Obtained from:	Illumos
MFC after:	1 month
2013-03-25 20:38:09 +00:00
Pedro F. Giffuni
730cecb05a Dtrace: add toupper()/tolower() and enhancements to lltostr().
Merge changes from illumos:

1451 DTrace needs toupper()/tolower() subroutines
1457 lltostr() D subroutine should take an optional base

This change bumps the DT_VERS_* number to 1.8.1 in
accordance to what is done in illumos.

The test suite we currently include is outdated and
doesnt support some updates in tst.subr.d which had to
be left out for now.

Illumos Revisions:	r13458 5e394d8db762
			r13459 c3454574dd1a

Reference:
https://www.illumos.org/issues/1451
https://www.illumos.org/issues/1457

Tested by:	Fabian Keil
Obtained from:	Illumos
MFC after:	1 month
2013-03-25 15:40:57 +00:00
Pedro F. Giffuni
f2e66d30b8 Dtrace: add optional size argument to tracemem().
Merge change from illumos:

1455 DTrace tracemem() should take an optional size argument

Our local enhancements to dt_print_bytes were equivalent to
those in illumos but we made it match the illumos version
to ease further code merges.

For now leave out tst.smallsize.d and tst.smallsize.d.out
since those don't seem to work cleanly on FreeBSD.

This change bumps the DT_VERS_* number to 1.7.1 in accordance
to what is done in illumos.

Illumos Revision:	13457:571b0355c2e3

Reference:
https://www.illumos.org/issues/1455

Tested by:	Fabian Keil
Obtained from:	Illumos
MFC after:	1 month
2013-03-24 19:12:08 +00:00
Will Andrews
58567a1b4e ZFS: Fix a panic while unmounting a busy filesystem.
This particular scenario was easily reproduced using a NFS export.  When the
first 'zfs unmount' occurred, it returned EBUSY via this path, while
vflush() had flushed references on the filesystem's root vnode, which in
turn caused its v_interlock to be destroyed.  The next time 'zfs unmount'
was called, vflush() tried to obtain this lock, which caused this panic.

Since vflush() on FreeBSD is a definitive call, there is no need to check
vfsp->vfs_count after it completes.  Simply #ifdef sun this check.

Submitted by:	avg
Reviewed by:	avg
Approved by:	ken (mentor)
MFC after:	1 month
2013-03-23 16:34:56 +00:00
Andriy Gapon
aaf2546b67 fbt_getargdesc: correctly handle types for return probes
MFC after:	6 days
2013-03-23 08:52:50 +00:00
Andriy Gapon
a47016e9a9 fbt_typoff_init: fix an off by one in determining required memory size
This issue would be silent most of the time, but if the requested memory
is a multiple of a page size, then accessing one element beyond the end
would lead to a kernel page fault.
Otherwise, the unlucky last type would just be inaccessible.

Reported by:	glebius
Tested by:	glebius
MFC after:	6 days
2013-03-23 08:48:44 +00:00
Steven Hartland
def84b9736 Fix for building libzpool under i386.
Reviewed by:	pjd (mentor)
Approved by:	pjd (mentor)
MFC after:	2 weeks
2013-03-21 23:06:11 +00:00
Steven Hartland
2b114ad2a4 Add missing descriptions for ZFS sysctls
Reviewed by:	pjd (mentor)
Approved by:	pjd (mentor)
MFC after:	2 weeks
2013-03-21 11:25:21 +00:00
Steven Hartland
adea827b21 Optimisation of TRIM processing.
Previously TRIM processing was very bursty. This was made worse by the fact
that TRIM requests on SSD's are typically much slower than reads or writes.
This often resulted in stalls while large numbers of TRIM's where processed.

In addition due to the way the TRIM thread was only woken by writes, deletes
could stall in the queue for extensive periods of time.

This patch adds a number of controls to how often the TRIM thread for each
SPA processes its outstanding delete requests.
vfs.zfs.trim.timeout: Delay TRIMs by up to this many seconds
vfs.zfs.trim.txg_delay: Delay TRIMs by up to this many TXGs (reduced to 32)
vfs.zfs.vdev.trim_max_bytes: Maximum pending TRIM bytes for a vdev
vfs.zfs.vdev.trim_max_pending: Maximum pending TRIM segments for a vdev
vfs.zfs.trim.max_interval: Maximum interval between TRIM queue processing
(seconds)

Given the most common TRIM implementation is ATA TRIM the current defaults
are targeted at that.

Reviewed by:	pjd (mentor)
Approved by:	pjd (mentor)
MFC after:	2 weeks
2013-03-21 11:02:08 +00:00
Steven Hartland
6ad46cec23 Names the ZFS TRIM thread
Reviewed by:	pjd (mentor)
Approved by:	pjd (mentor)
MFC after:	2 weeks
2013-03-21 10:41:30 +00:00
Steven Hartland
89e5b43079 TRIM cache devices based on time instead of TXGs.
Currently, the trim module uses the same algorithm for data and cache
devices when deciding to issue TRIM requests, based on how far in the
past the TXG is.

Unfortunately, this is not ideal for cache devices, because the L2ARC
doesn't use the concept of TXGs at all. In fact, when using a pool for
reading only, the L2ARC is written but the TXG counter doesn't
increase, and so no new TRIM requests are issued to the cache device.

This patch fixes the issue by using time instead of the TXG number as
the criteria for trimming on cache devices. The basic delay principle
stays the same, but parameters are expressed in seconds instead of
TXGs. The new parameters are named trim_l2arc_limit and
trim_l2arc_batch, and both default to 30 second.

Reviewed by:	pjd (mentor)
Approved by:	pjd (mentor)
Obtained from:	17122c31ac
MFC after:	2 weeks
2013-03-21 10:29:05 +00:00
Steven Hartland
78ad0c1c80 Improve TXG handling in the TRIM module.
This patch adds some improvements to the way the trim module considers
TXGs:

 - Free ZIOs are registered with the TXG from the ZIO itself, not the
   current SPA syncing TXG (which may be out of date);
 - L2ARC are registered with a zero TXG number, as L2ARC has no concept
   of TXGs;
 - The TXG limit for issuing TRIMs is now computed from the last synced
   TXG, not the currently syncing TXG. Indeed, under extremely unlikely
   race conditions, there is a risk we could trim blocks which have been
   freed in a TXG that has not finished syncing, resulting in potential
   data corruption in case of a crash.

Reviewed by:	pjd (mentor)
Approved by:	pjd (mentor)
Obtained from:	5b46ad40d9
MFC after:	2 weeks
2013-03-21 10:16:10 +00:00
Steven Hartland
e07e3a3792 Don't register repair writes in the trim map.
The trim map inflight writes tree assumes non-conflicting writes, i.e.
that there will never be two simultaneous write I/Os to the same range
on the same vdev. This seemed like a sane assumption; however, in
actual testing, it appears that repair I/Os can very well conflict
with "normal" writes.

I'm not quite sure if these conflicting writes are supposed to happen
or not, but in the mean time, let's ignore repair writes for now. This
should be safe considering that, by definition, we never repair blocks
that are freed.

Reviewed by:	pjd (mentor)
Approved by:	pjd (mentor)
Obtained from:	Source: 6a3cebaf7c
2013-03-21 10:02:32 +00:00
Steven Hartland
e05aad2d33 Add TRIM support for L2ARC.
This adds TRIM support to cache vdevs. When ARC buffers are removed
from the L2ARC in arc_hdr_destroy(), arc_release() or l2arc_evict(),
the size previously occupied by the buffer gets scheduled for TRIMming.
As always, actual TRIMs are only issued to the L2ARC after
txg_trim_limit.

Reviewed by:	pjd (mentor)
Approved by:	pjd (mentor)
Obtained from:	31aae37399
MFC after:	2 weeks
2013-03-21 09:34:41 +00:00
Martin Matuska
192d547574 Release hold on pool before calling zvol_create_minor() 2013-03-20 09:56:20 +00:00
Martin Matuska
a0abc0d302 Run zvol_create_minors() only if in non-error case 2013-03-19 22:27:15 +00:00
Martin Matuska
e56718d734 Run zvol_create_minors() on snapshot creation 2013-03-19 22:14:50 +00:00
Martin Matuska
07091d8f14 MFV r247580:
Merge synctask code restructuring from vendor.

Modify forward and backward compatibility to support new change.

Illumos ZFS issues:
  3464 zfs synctask code needs restructuring

Sponsored by:	Hybrid Logic Ltd.
2013-03-19 12:51:18 +00:00
Martin Matuska
520268fb97 MFC @248493 2013-03-19 11:09:15 +00:00
Martin Matuska
87a5cb4650 Plug memory leak in dsl_check_snap_cb()
This was unnoticed because the function is very rarely used.

MFC after:	3 days
2013-03-19 07:47:51 +00:00
Martin Matuska
a602517b63 Add missing zvol_create_mirrors() on zfs_ioc_create() 2013-03-18 20:22:40 +00:00
John Baldwin
3cf3b9f097 Partially revert r195702. Deferring stops is now implemented via a set of
calls to toggle TDF_SBDRY rather than passing PBDRY to individual sleep
calls.
- Remove the stop_allowed parameters from cursig() and issignal().
  issignal() checks TDF_SBDRY directly.
- Remove the PBDRY and SLEEPQ_STOP_ON_BDRY flags.
2013-03-18 17:23:58 +00:00
Martin Matuska
876a84e867 MFC @248461 2013-03-18 09:39:51 +00:00
Martin Matuska
6f4accc2de Move common zfs ioctl compatibility functions (userland) into libzfs_compat.c
Introduce additional constants for zfs ioctl versions
2013-03-18 09:32:29 +00:00
Justin Hibbits
80a5635c8b Add FBT for PowerPC DTrace. Also, clean up the DTrace assembly code,
much of which is not necessary for PowerPC.

The FBT module can likely be factored into 3 separate files: common,
intel, and powerpc, rather than duplicating most of the code between
the x86 and PowerPC flavors.

All DTrace modules for PowerPC will be MFC'd together once Fasttrap is
completed.
2013-03-18 05:30:18 +00:00
Martin Matuska
af2e40ccd1 Merge libzfs_core part of r239388
Illumos ZFS issues:
  3085 zfs diff panics, then panics in a loop on booting

References:
  https://www.illumos.org/issues/3085
2013-03-17 18:49:11 +00:00
Martin Matuska
70b0720877 Fix accidentially changed ioc variable for old v15 compatibility 2013-03-17 17:28:06 +00:00
Martin Matuska
6cf922c88b Fix typo in sysctl description
Reported by:	Jeremy Chadwick
MFC after:	3 days
2013-03-17 15:53:27 +00:00
Martin Matuska
e2b4467975 libzfs_core:
- provide complete backwards compatibility (old utility, new kernel)
  - add zfs_cmd_t compatibility mapping in both directions
  - determine ioctl address in zfs_ioctl_compat.c
2013-03-17 10:57:04 +00:00
Martin Matuska
4f33cfb284 Initialize "error" variable where illumos does. 2013-03-16 20:28:38 +00:00
Martin Matuska
a03fbc7ecf MFC @248093 2013-03-09 11:57:51 +00:00
Attilio Rao
89f6b8632c Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical but few notes are reported:
* The KPI changes as follow:
  - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
  - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
  - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
  - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
    (in order to avoid visibility of implementation details)
  - The read-mode operations are added:
    VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
    VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (forcing requiring
  sys/mutex.h in consumers directly to cater its inlining functions
  using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h
  consumers now must include also sys/rwlock.h.
* zfs requires a quite convoluted fix to include FreeBSD rwlocks into
  the compat layer because the name clash between FreeBSD and solaris
  versions must be avoided.
  At this purpose zfs redefines the vm_object locking functions
  directly, isolating the FreeBSD components in specific compat stubs.

The KPI results heavilly broken by this commit.  Thirdy part ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jeff
Reviewed by:	pjd (ZFS specific review)
Discussed with:	alc
Tested by:	pho
2013-03-09 02:32:23 +00:00
Attilio Rao
c934116100 Merge from vmc-playground:
Introduce a new KPI that verifies if the page cache is empty for a
specified vm_object.  This KPI does not make assumptions about the
locking in order to be used also for building assertions at init and
destroy time.
It is mostly used to hide implementation details of the page cache.

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jeff
Reviewed by:	alc (vm_radix based version)
Tested by:	flo, pho, jhb, davide
2013-03-09 02:05:29 +00:00
Martin Matuska
91edf37414 Comment out unfeasible illumos copyin code and restore previous behavior. 2013-03-07 23:45:16 +00:00
Martin Matuska
b3b6851c78 Add missing init functions
Reduce diff to illumos
2013-03-06 11:33:25 +00:00
Xin LI
2f79ac7f21 Diff reduction with Illumos 2013-03-06 01:21:56 +00:00
Xin LI
227d24fc59 Use adx2 instead of adx in the second vsprintf, this fixes a
panic.
2013-03-05 22:58:53 +00:00
Martin Matuska
400c4069a5 MFV r247845:
Import ZFS bpobj bugfix from vendor.

Illumos ZFS issues:
  3603 panic from bpobj_enqueue_subobj()
  3604 zdb should print bpobjs more verbosely

References:
  https://www.illumos.org/issues/3603
  https://www.illumos.org/issues/3604

MFC after:	1 week
2013-03-05 18:54:41 +00:00
Martin Matuska
dce1a726f2 WiP merge of libzfs_core (MFV r238590, r238592)
not yet working, ioctl handling needs to be changed
2013-03-05 08:09:53 +00:00
Justin T. Gibbs
7e2a739f03 Fix assertion failure when using userland DTrace probes from
the pid provider on a kernel compiled with INVARIANTS.

sys/cddl/contrib/opensolaris/uts/intel/dtrace/fasttrap_isa.c:
	In fasttrap_probe_pid(), attempts to write to the
	address space of the thread that fired the probe
	must be performed with the process of the thread
	held.  Use _PHOLD() to ensure this is the case.

	In fasttrap_probe_pid(), use proc_write_regs() instead
	of calling set_regs() directly.  proc_write_regs()
	performs invariant checks to verify the calling
	environment of set_regs().  PROC_LOCK()/UNLOCK() around
	the call to proc_write_regs() so that it's invariants
	are satisfied.

Sponsored by:	Spectra Logic Corporation
Reviewed by:	gnn, rpaulo
MFC after:	1 week
2013-03-04 22:07:36 +00:00
Rui Paulo
e0dffa2de2 Remove the extra parenthesis from the cv_init() macro. They are not
necessary because we already use parenthesis in zfs_cv_init().

This fixes a long standing bug where there would be an extra ")" at the
end of the string. This extra parenthesis would show up in the WCHAN of
the process (top, stty status, etc.).
2013-03-03 06:42:36 +00:00
Pawel Jakub Dawidek
2609222ab4 Merge Capsicum overhaul:
- Capability is no longer separate descriptor type. Now every descriptor
  has set of its own capability rights.

- The cap_new(2) system call is left, but it is no longer documented and
  should not be used in new code.

- The new syscall cap_rights_limit(2) should be used instead of
  cap_new(2), which limits capability rights of the given descriptor
  without creating a new one.

- The cap_getrights(2) syscall is renamed to cap_rights_get(2).

- If CAP_IOCTL capability right is present we can further reduce allowed
  ioctls list with the new cap_ioctls_limit(2) syscall. List of allowed
  ioctls can be retrived with cap_ioctls_get(2) syscall.

- If CAP_FCNTL capability right is present we can further reduce fcntls
  that can be used with the new cap_fcntls_limit(2) syscall and retrive
  them with cap_fcntls_get(2).

- To support ioctl and fcntl white-listing the filedesc structure was
  heavly modified.

- The audit subsystem, kdump and procstat tools were updated to
  recognize new syscalls.

- Capability rights were revised and eventhough I tried hard to provide
  backward API and ABI compatibility there are some incompatible changes
  that are described in detail below:

	CAP_CREATE old behaviour:
	- Allow for openat(2)+O_CREAT.
	- Allow for linkat(2).
	- Allow for symlinkat(2).
	CAP_CREATE new behaviour:
	- Allow for openat(2)+O_CREAT.

	Added CAP_LINKAT:
	- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
	- Allow to be target for renameat(2).

	Added CAP_SYMLINKAT:
	- Allow for symlinkat(2).

	Removed CAP_DELETE. Old behaviour:
	- Allow for unlinkat(2) when removing non-directory object.
	- Allow to be source for renameat(2).

	Removed CAP_RMDIR. Old behaviour:
	- Allow for unlinkat(2) when removing directory.

	Added CAP_RENAMEAT:
	- Required for source directory for the renameat(2) syscall.

	Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
	- Allow for unlinkat(2) on any object.
	- Required if target of renameat(2) exists and will be removed by this
	  call.

	Removed CAP_MAPEXEC.

	CAP_MMAP old behaviour:
	- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
	  PROT_WRITE.
	CAP_MMAP new behaviour:
	- Allow for mmap(2)+PROT_NONE.

	Added CAP_MMAP_R:
	- Allow for mmap(PROT_READ).
	Added CAP_MMAP_W:
	- Allow for mmap(PROT_WRITE).
	Added CAP_MMAP_X:
	- Allow for mmap(PROT_EXEC).
	Added CAP_MMAP_RW:
	- Allow for mmap(PROT_READ | PROT_WRITE).
	Added CAP_MMAP_RX:
	- Allow for mmap(PROT_READ | PROT_EXEC).
	Added CAP_MMAP_WX:
	- Allow for mmap(PROT_WRITE | PROT_EXEC).
	Added CAP_MMAP_RWX:
	- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).

	Renamed CAP_MKDIR to CAP_MKDIRAT.
	Renamed CAP_MKFIFO to CAP_MKFIFOAT.
	Renamed CAP_MKNODE to CAP_MKNODEAT.

	CAP_READ old behaviour:
	- Allow pread(2).
	- Disallow read(2), readv(2) (if there is no CAP_SEEK).
	CAP_READ new behaviour:
	- Allow read(2), readv(2).
	- Disallow pread(2) (CAP_SEEK was also required).

	CAP_WRITE old behaviour:
	- Allow pwrite(2).
	- Disallow write(2), writev(2) (if there is no CAP_SEEK).
	CAP_WRITE new behaviour:
	- Allow write(2), writev(2).
	- Disallow pwrite(2) (CAP_SEEK was also required).

	Added convinient defines:

	#define	CAP_PREAD		(CAP_SEEK | CAP_READ)
	#define	CAP_PWRITE		(CAP_SEEK | CAP_WRITE)
	#define	CAP_MMAP_R		(CAP_MMAP | CAP_SEEK | CAP_READ)
	#define	CAP_MMAP_W		(CAP_MMAP | CAP_SEEK | CAP_WRITE)
	#define	CAP_MMAP_X		(CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
	#define	CAP_MMAP_RW		(CAP_MMAP_R | CAP_MMAP_W)
	#define	CAP_MMAP_RX		(CAP_MMAP_R | CAP_MMAP_X)
	#define	CAP_MMAP_WX		(CAP_MMAP_W | CAP_MMAP_X)
	#define	CAP_MMAP_RWX		(CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
	#define	CAP_RECV		CAP_READ
	#define	CAP_SEND		CAP_WRITE

	#define	CAP_SOCK_CLIENT \
		(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
		 CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
	#define	CAP_SOCK_SERVER \
		(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
		 CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
		 CAP_SETSOCKOPT | CAP_SHUTDOWN)

	Added defines for backward API compatibility:

	#define	CAP_MAPEXEC		CAP_MMAP_X
	#define	CAP_DELETE		CAP_UNLINKAT
	#define	CAP_MKDIR		CAP_MKDIRAT
	#define	CAP_RMDIR		CAP_UNLINKAT
	#define	CAP_MKFIFO		CAP_MKFIFOAT
	#define	CAP_MKNOD		CAP_MKNODAT
	#define	CAP_SOCK_ALL		(CAP_SOCK_CLIENT | CAP_SOCK_SERVER)

Sponsored by:	The FreeBSD Foundation
Reviewed by:	Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with:	rwatson, benl, jonathan
ABI compatibility discussed with:	kib
2013-03-02 00:53:12 +00:00
Xin LI
5c737b11df MFV r247575:
Import a fix tighten assertion on SPA versions from vendor (Illumos).

Illumos ZFS issue:

  3543 Feature flags causes assertion in spa.c to miss certain cases

MFC after:	2 weeks
2013-03-01 22:20:13 +00:00
Martin Matuska
4abd59512a MFV r247316:
Merge new read-only zfs properties from vendor (illumos)

Illumos ZFS issues:
  3588 provide zfs properties for logical (uncompressed) space used and
       referenced

References:
  https://www.illumos.org/issues/3588

MFC after:	2 weeks
2013-03-01 21:58:51 +00:00
Martin Matuska
bb508e7732 Fix the zfs_ioctl compat layer to support zfs_cmd size change introduced
in r247265 (ZFS deadman thread). Both new utilities now support the old
kernel and new kernel properly detects old utilities.

For future backwards compatibility, the vfs.zfs.version.ioctl read-only
sysctl has been introduced. With this sysctl zfs utilities will be able
to detect the ioctl interface version of the currently loaded zfs module.

As a side effect, the zfs utilities between r247265 and this revision don't
support the old kernel module. If you are using HEAD newer or equal than
r247265, install the new kernel module (or whole kernel) first.

MFC after:	10 days
2013-03-01 09:42:58 +00:00
Martin Matuska
24245e76ea MFV 247176, 247178, 247315:
Import metaslab_sync() speedup from vendor (illumos).

Illumos ZFS issues:
  3552 condensing one space map burns 3 seconds of CPU in spa_sync() thread
  3564 spa_sync() spends 5-10% of its time in metaslab_sync() (when not
       condensing)
  3578 transferring the freed map to the defer map should be constant time
  3579 ztest trips assertion in metaslab_weight()

References:
  https://www.illumos.org/issues/3552
  https://www.illumos.org/issues/3564
  https://www.illumos.org/issues/3578
  https://www.illumos.org/issues/3579

MFC after:	2 weeks
2013-02-27 14:45:23 +00:00
Martin Matuska
e4428d63a8 Be more verbose on ZFS deadman I/O panic
Patch suggested upstream.

Suggested by:	Olivier Cinquin
MFC after:	12 days
2013-02-26 20:41:27 +00:00
Martin Matuska
e70664bafc MFV v242732:
Merge the ZFS I/O deadman thread from vendor (illumos).
This feature panics the system on hanging ZFS I/O, helps debugging
and resumes failed service.

The panic behavior can be controlled with the loader-only tunables:
vfs.zfs.deadman_enabled (enable or disable panic on stalled ZFS I/O)
vfs.zfs.deadman_synctime (expiration time for stalled ZFS I/O)

By default, ZFS I/O deadman is enabled by default on amd64 and i386
excluding virtual guest machines.

Illumos ZFS issues:
  3246 ZFS I/O deadman thread

References:
  https://www.illumos.org/issues/3246

MFC after:	2 weeks
2013-02-25 12:33:31 +00:00
Martin Matuska
781c0f87d3 MFV r246653:
Import vendor change to avoid "unitialized variable" warnings.

Illumos ZFS issues:
  3522 zfs module should not allow uninitialized variables

References:
  https://www.illumos.org/issues/3522
2013-02-23 11:21:05 +00:00
Justin T. Gibbs
6a8f90edf5 Avoid panic when tearing down the DTrace pid provider for a
process that has crashed.

sys/cddl/contrib/opensolaris/uts/common/dtrace/fasttrap.c:
	In fasttrap_pid_disable(), we cannot PHOLD the proc
	structure for a process that no longer exists, but
	we still have other, fasttrap specific, state that
	must be cleaned up for probes that existed in the
	dead process.  Instead of returning early if the
	process related to our probes isn't found,
	conditionalize the locking and carry on with a NULL
	proc pointer.  The rest of the fasttrap code already
	understands that a NULL proc is possible and does
	the right things in this case.

Sponsored by:	Spectra Logic Corporation
Reviewed by:	rpaulo, gnn
MFC after:	1 week
2013-02-20 17:55:17 +00:00
Xin LI
e469d5a70f Eliminate real_LZ4_uncompress. It's unused and does not perform sufficient
check against input stream (i.e. it could read beyond specified input
buffer).
2013-02-14 21:02:18 +00:00
Martin Matuska
7c695febc9 Change vfs.zfs.write_to_degraded from CTLFLAG_RW to CTLFLAG_RWTUN
Suggested by:	pjd
2013-02-13 23:11:25 +00:00
Xin LI
314caba11c Restore De Bruijn algorithm for sparc64 where the compiler rely on a
library function for __builtin_c?z.

Tested by:	Michael Moll <kvedulv kvedulv de>
2013-02-13 17:30:54 +00:00
Martin Matuska
6a33bbc041 Merge zfs_ioctl.c code that should have been merged together with ZFS v28.
Fixes several problems if working with read-only pools.

Changed code originaly introduced in onnv-gate 13061:bda0decf867b
Contains changes up to illumos-gate 13700:4bc0783f6064

PR:		kern/175897
Suggested by:	avg

MFC after:	2 weeks
2013-02-11 21:10:55 +00:00
Martin Matuska
9689178c3f MFV r246633:
Import vendor bugfixes regarding SA rounding, header size and layout.
This was already partially fixed by avg.

Illumos ZFS issues:
  3512 rounding discrepancy in sa_find_sizes()
  3513 mismatch between SA header size and layout

References:
  https://www.illumos.org/issues/3512
  https://www.illumos.org/issues/3513

MFC after:	2 weeks
2013-02-11 14:29:38 +00:00
Martin Matuska
bb03847418 MFV r246394:
Add tunable to allow block allocation on degraded vdevs.

Illumos ZFS issues:
  3507 Tunable to allow block allocation even on degraded vdevs

References:
  https://www.illumos.org/issues/3507

MFC after:	2 weeks
2013-02-11 13:59:57 +00:00
Martin Matuska
ff20578569 MFV r246392:
Import vendor ZFS bugfix fixing a possible deadlock in arc_read().

Illumos ZFS issues:
  3498 panic in arc_read(): !refcount_is_zero(&pbuf->b_hdr->b_refcnt)

References:
  https://www.illumos.org/issues/3498

MFC after:	2 weeks
2013-02-11 12:42:11 +00:00
Martin Matuska
8a2dc7faae MFV r246390:
Import minor type change in refcount.h header from vendor (illumos).

MFC after:	2 weeks
2013-02-11 07:48:57 +00:00
Martin Matuska
fd9778c236 MFV r246388:
Import vendor bugfixes

Illumos ZFS issues:
  3422 zpool create/syseventd race yield non-importable pool
  3425 first write to a new zvol can fail with EFBIG

References:
  https://www.illumos.org/issues/3422
  https://www.illumos.org/issues/3425

MFC after:	2 weeks
2013-02-10 19:32:55 +00:00
Xin LI
ef17620fc8 MFV r245512:
* Illumos zfs issue #3035 [1] LZ4 compression support in ZFS.

LZ4 is a new high-speed BSD-licensed compression algorithm created
by Yann Collet that delivers very high compression and decompression
performance compared to lzjb (>50% faster on compression, >80% faster
on decompression and around 3x faster on compression of incompressible
data), while giving better compression ratio [1].

This version of LZ4 corresponds to upstream's [2] revision 85.

Please note that for obvious reasons this is not backward read
compatible.  This means once a pool have LZ4 compressed data, these
data can no longer be read by older ZFS implementations.

Local changes:

 - On-stack hash table disabled and using kernel slab allocator
   instead, at this time.  This requires larger kernel thread stack
   for zio workers.  This may change in the future should we adjusted
   the zio workers' thread stack size.
 - likely and unlikely will be undefined if they are already defined,
   this is required for i386 XEN build.
 - Removed De Bruijn sequence based __builtin_ctz family of builtins
   in favor of the latter.  Both GCC and clang supports these builtins.
 - Changed the way the LZ4 code detects endianness.
 - Manual pages modifications to mention the feature based on Illumos
   counterpart.
 - Boot loader changes to make it support LZ4 decompression.

[1] https://www.illumos.org/issues/3035
[2] http://code.google.com/p/lz4/source/list

Obtained from:	Illumos (13921:9d721847e469)
Tested on:	FreeBSD/amd64
MFC after:	1 month
2013-02-09 06:39:28 +00:00
Sergey Kandaurov
3c56b4f165 Fix warning: comparison of unsigned expression < 0 is always false.
Reported by:	clang
2013-02-08 09:54:53 +00:00
Andriy Gapon
0dcab786b8 zfs_vget, zfs_fhtovp: properly handle the z_shares_dir object
A special gfs vnode corresponds to that object.
A regular zfs vnode must not be returned.

This should be upstreamed.

Reported by:	pluknet
Submitted by:	rmacklem
Tested by:	pluknet
MFC after:	10 days
2013-02-08 07:49:54 +00:00
Andriy Gapon
e2bb19dce5 zfs: update comments about zfid_long_t to match the FreeBSD definitions
MFC after:	1 week
2013-02-08 07:44:15 +00:00
Andriy Gapon
c7d346f269 zfs: fix, improve and re-organize page_lookup and page_unlock
Now they are split into two pairs: page_hold/page_unhold for mappedread
and page_busy/page_unbusy for update_pages.

For mappedread we simply hold a page that is to be used as a source if it
is resident and valid (and not busy).  This is sufficient since we are
only doing page -> user buffer copying.  There is no page <-> backing
storage I/O involved.

update_pages is now better split to properly handle the putpages case
(page -> arc) and the regular write case (arc -> page).

For the latter we use complete protocol of marking an object with
paging-in-progress and marking a page with io_start (busy count).
Also, in this case we remove the write bit from all page mappings and
clear dirty bits of the pages, the former is needed to ensure that the
latter does the right thing.
Additionally we update a page if it is cached instead of just freeing it
as was done before.  This needs to be verified.

A minor detail: ZFS-backed pages should always be either fully valid
or fully invalid.  Assert this and use simpler API that does not deal
with sub-page blocks.

Reviewed by:	kib
MFC after:	26 days
2013-02-03 18:42:20 +00:00
Justin Hibbits
7e7a9efdb5 Fix the PowerPC DTrace copy functions. The kernel doesn't hold the same view to
the user map, so use the md copy in/out functions provided by the kernel.

MFC with:	r242723
2013-02-03 00:19:34 +00:00
Andriy Gapon
5583e07188 solaris compat: remove KM_ZERO
- there is no such flag in Solaris and derivatives
- the flag was added in an unrelated change
- the flag is not used

The proper way to allocate zeroed out memory is to use kmem_zalloc.

MFC after:	3 days
2013-02-02 11:41:05 +00:00
Andriy Gapon
13235aaa89 zfs: add MODULE_VERSION for zfsctrl
This should allow the kernel linker to easily detect a situation
when the module is present both in a kernel and in a preloaded file
(zfs.ko).

Reviewed by:	jhb
MFC after:	5 days
2013-02-02 11:35:18 +00:00
Andriy Gapon
ea84c62f93 spa_generate_rootconf: add support for old vdev labels
It seems that old ZFS versions (v15) completely omit "vdev_children"
property when there is a single child.

Reported by:	jase
Tested by:	jase
MFC after:	1 week
2013-01-26 10:34:17 +00:00
Xin LI
5c74885e99 MFV r245510:
improve the comment in txg.c

Obtained from:	Illumos (13910:f3454e0a097c)
MFC after:	2 weeks
2013-01-16 22:59:50 +00:00
Konstantin Belousov
614b9f9130 For zfs vnodes, use the standard inode number based hash algorithm.
Reviewed and tested by:	peter
Sponsored by:	The FreeBSD Foundation
MFC after:	5 days
2013-01-14 05:45:33 +00:00
Xin LI
290a1ba9a4 The current ZFS code expects ddt_zap_count to always succeed by asserting
the underlying zap_count() to return no errors.  However, it is possible
that the pool reaches to such a state where zap_count would return error,
leading to panics when a pool is imported.

This commit changes the ddt_zap_count to return error returned from
zap_count and handle the error appropriately.  With this change, it's now
possible to let zpool rollback damaged transaction groups and import the
pool.

Obtained from:	ZFS on Linux github (e8fd45a0f9)
MFC after:	1 month
2013-01-10 19:26:56 +00:00
Andriy Gapon
f71fbb1d12 zfs: solaris doesn't have KM_ZERO, kmem_zalloc should be used instead
To do:		remove KM_ZERO declaration
Pointyhat to:	avg (for mindlessly using the pseudo-flag)
MFC after:	instantly (to fix stable/8 build)
2012-12-23 19:58:41 +00:00
Ryan Stone
6969ef6656 Correct a series of errors in the hand-rolled locking for drace_debug.c:
- Use spinlock_enter()/spinlock_exit() to prevent a thread holding a
  debug lock from being preempted to prevent other threads waiting
  on that lock from starvation.

- Handle the possibility of CPU migration in between the fetch of curcpu
  and the call to spinlock_enter() by saving curcpu in a local variable.

- Use memory barriers to prevent reordering of loads and stores of the
  data protected by the lock outside of the critical section

- Eliminate false sharing of the locks by moving them into the structures
  that they protect and aligning them to a cacheline boundary.

- Record the owning thread in the lock to make debugging future problems
  easier.

Reviewed by:	rpaulo (initial version)
MFC after:	2 weeks
2012-12-23 15:50:37 +00:00
Steven Hartland
5780c4a723 Added vfs.zfs.vdev.trim_on_init sysctl which allows full vdev trim on
initialisation to be enabled (1) / disabled (0) defaults to enabled.

This is useful for devices which have a slow trim speed and are either
new or have otherwise already been wiped e.g. secure erase.

PR:     kern/173116
Submitted by:   Steven Hartland
Approved by:    pjd (mentor)
2012-12-13 17:39:07 +00:00
Steven Hartland
c440a359ca Upgrades trim free request sizes before inserting them into to free map,
making range consolidation much more effective particularly for small
deletes.

This reduces memory used by the free map as well as reducing the number
of bio requests down to geom required to process all deletes.

In tests this achieved a factor of 10 reduction of trim ranges / geom
call downs.

While I'm here correct the description of zio_vdev_io_start.

PR:		kern/173254
Submitted by:	Steven Hartland
Approved by:	pjd (mentor)
2012-12-13 17:06:38 +00:00
Steven Hartland
7150222c0a Renamed zfs trim stats removing duplicate zio_trim identifier from the name
Added description option to kstats.
Added descriptions for zio_trim kstats

PR:		kern/173113
Submitted by:	Steven Hartland
Reviewed by:	pjd
Approved by:	pjd
MFC after:	2 weeks
2012-12-12 16:14:14 +00:00
Xin LI
2740382ebd Use SA_ZPL_CRTIME instead of SA_ZPL_CTIME for creation time.
Submitted by:	phil.stone at gmx.com
MFC after:	2 weeks
2012-12-03 04:25:37 +00:00
Andriy Gapon
289b3b96ac zfs_getpages: make use of vm_page_readahead_finish
Suggested by:	kib
MFC after:	5 days
2012-12-01 18:13:53 +00:00
Andriy Gapon
992ffc58ae gfs_file_inactive: replace bad code with ugly code
Also, make it explicit that V_XATTRDIR is not properly supported in gfs
code yet.

The bad code was plain incorrect: (a) it spoiled handling of v_usecount
reaching zero and (b) it leaked v_holdcnt.

The ugly code employs potentially unsafe locking tricks.

Ideally we should separate vnode lifecycle and gfs node lifecycle.
A gfs node should have its own reference count where its child nodes
should be accounted.

PR:		kern/151111
Reviewed by:	kib
MFC after:	13 days
2012-12-01 18:12:55 +00:00
Martin Matuska
7faa32552f MFV r243395:
Introduce a new dataset aclmode setting "restricted" to protect ACL's
being destroyed or corrupted by a drive-by chmod.

illumos-gate 13889:a67716f16746
3254 add support in zfs for aclmode=restricted

References:
https://www.illumos.org/issues/3254

MFC after:	2 weeks
2012-11-26 12:24:39 +00:00
Martin Matuska
53e5858c68 Add loader(8) tunable to enable/disable nopwrite functionality:
vfs.zfs.nopwrite_enabled

MFC after:	2 weeks
2012-11-25 16:54:43 +00:00
Martin Matuska
dd801aa546 MFV r243013 and r243267:
Import the zio nop-write improvement from Illumos. To reduce I/O,
nop-write omits overwriting data if the checksum (cryptographically
secure) of new data matches the checksum of existing data.
It also saves space if snapshots are in use.

It currently works only on datasets with enabled compression, disabled
deduplication and sha256 checksums.

IllumOS 13887:196932ec9e6a and 13888:7204b3392a58
3236 zio nop-write

References:
https://www.illumos.org/issues/3236

MFC after:	2 weeks
2012-11-25 16:32:07 +00:00
Andriy Gapon
3a0e1b57bb zfs_freebsd_reclaim: remove a stray variable
... which leaked from a subsequent local change.
Unfortunately I noticed that only after commit.

MFC after:	5 weeks
X-MFC with:	r243520
2012-11-25 15:46:29 +00:00
Andriy Gapon
4ff1c77d22 zfs: overhaul zfs-vfs glue for vnode life-cycle management
* There is no need for the delayed destruction of znodes via taskqueue,
now that we do not need to fear recursion from getnewvnode into
zfs_inactive and zfs_freebsd_reclaim, thus making znode/vnode state
machine a bit simpler.

* More complete porting of zfs_inactive from Solaris VFS model to FreeBSD
vop_inactive and vop_reclaim model.  All destructive actions are done
in zfs_freebsd_reclaim.
This allows to simplify zfs_zget logic.

* Allow zfs_zget to return a doomed vnode if the current thread already
has an exclusive lock on the vnode.

* Clean up Solaris-isms like bailing out of reclaim/inactive on certain
values of v_usecount (aka v_count) or directly messing with this counter.

* Do not clear z_vnode while znode is still accessible.
z_vnode should be cleared only after zfs_znode_dmu_fini.
Otherwise zfs_zget may get an effectively half-deconstructed znode.
This allows to simplify zfs_zget logic further.

The above changes fix at least two known/reported problems:

o An indefinite wait in the following code path:
vgone -> VOP_RECLAIM -> zfs_freebsd_reclaim -> vnode_destroy_vobject ->
put_pages -> zfs_write -> zil_commit -> zfs_zget
This happened because vgone marks a vnode as VI_DOOMED before calling
VOP_RECLAIM, but zfs_zget would not return a doomed vnode under any
circumstances.
The fix in this change is not complete as it won't fix a deadlock between
two threads doing VOP_RECLAIM where one thread is in zil_commit trying to
zfs_zget a znode/vnode being reclaimed by the other thread, which would be
blocked trying to enter zil_commit.  This type of deadlock has not been
reported as of now.

o An indefinite wait in the unmount path caused by a znode "falling through
the cracks" in inactive+reclaim.  This would happen if the znode is unlinked
while its vnode is still active.

To Do: pass locking flags parameter to zfs_zget, so that the zfs-vfs
glue code doesn't have to re-lock a vnode but could ask for proper locking
from the very start.  This would also allow for the higher level code to
obtain a doomed vnode when it is expected/requested.  Or to avoid blocking
when it is not allowed (see zil_commit example above).

ffs_vgetf seems like a good source of inspiration.

Tested by:	Willem Jan Withagen <wjw@digiware.nl>
MFC after:	6 weeks
2012-11-25 15:33:26 +00:00
Andriy Gapon
7ca5310ea3 zfs_fhtovp: there is no reason to amend lock flags with LK_RETRY here
MFC after:	12 days
2012-11-25 15:07:27 +00:00
Andriy Gapon
7192f62bcc add zfs_bmap to aid vnode_pager_haspage
... otherwise zfs_getpages would mostly be called with one page at a time.

It is expected that ZFS VOP_BMAP is only called from vnode_pager_haspage.
Since ZFS files can have variable block sizes and also because we don't
really know if any given blocks are consecutive, we can not really report
any additional blocks behind or ahead of a given block.  Since physical
block numbers do not make sense for ZFS, we do not do any real translation
and thus pass back blk = lblk.  The net effect is that vnode_pager_haspage
knows that the block exists and that the pages backed by the block can be
accessed.  vnode_pager_haspage may be wrong about the exact count of the
pages backed by the block, because of a variable block size, which
vnode_pager_haspage doesn't really know - it only knows max block size in
a filesystem.  So pages from multiple blocks can be passed to zfs_getpages,
but that is expected and correctly handled.

vnode_pager should not call zfs_bmap for any other reason, because ZFS
implements VOP_PUTPAGES and thus vnode_pager_generic_getpages is not used.

vfs_cluster code vfs_bio code should not be called for ZFS, because ZFS does
not use buffer cache layer.

Also, ZFS does not use vn_bmap_seekhole, it has its prviate mechanism for
working with holes.

The above list should cover all the current calls to VOP_BMAP.

Reviewed by:	kib
MFC after:	6 weeks
2012-11-25 15:01:12 +00:00
Andriy Gapon
b609e5f891 zfs_getpages: optimize for large block sizes
MFC after:	6 weeks
2012-11-25 14:53:26 +00:00
Martin Matuska
2f06dfc9a3 MFV r243012:
Illumos 13886:e3261d03efbf

3349 zpool upgrade -V bumps the on disk version number, but leaves
     the in core version

References:
https://www.illumos.org/issues/3349

MFC after:	1 week
2012-11-25 10:53:42 +00:00
Martin Matuska
2b8d4033cc MFV r242735:
Illumos 13879:4eac7a87eff2:
3329 spa_sync() spends 10-20% of its time in spa_free_sync_cb()
3330 space_seg_t should have its own kmem_cache
3331 deferred frees should happen after sync_pass 1
3335 make SYNC_PASS_* constants tunable

New loader-only tunables:
vfs.zfs.sync_pass_deferred_free
vfs.zfs.sync_pass_dont_compress
vfs.zfs.sync_pass_rewrite

References:
https://www.illumos.org/issues/3329
https://www.illumos.org/issues/3330
https://www.illumos.org/issues/3331
https://www.illumos.org/issues/3335

MFC after:	2 weeks
2012-11-25 09:06:32 +00:00
Andriy Gapon
328998eac1 zfs roopool: add support for multi-vdev configurations
Tested by:	madpilot
MFC after:	10 days
2012-11-24 13:23:15 +00:00
Andriy Gapon
e1fccde2c9 spa_import_rootpool: initialize ub_version before calling spa_config_parse
... because the latter makes some decision based on the version.
This is especially important for raidz vdevs.
This is similar to what spa_load does.

This is not an issue for upstream because they do not seem to support
using raidz as a root pool.

Reported by:	Andrei Lavreniyuk <andy.lavr@gmail.com>
Tested by:	Andrei Lavreniyuk <andy.lavr@gmail.com>
MFC after:	6 days
2012-11-24 13:16:49 +00:00
Andriy Gapon
cfca00a2fb spa_import_rootpool: do not call spa_history_log_version
The call is a NOP, because pool version in spa_ubsync.ub_version is not
initialized and thus appears to be zero.
If the version is properly set then the call leads to a NULL pointer
dereference because the spa object is still under-constructed.

The same change was independently made in the upstream as a part of
a larger change (4445fffbbb1ea25fd0e9ea68b9380dd7a6709025).

MFC after:	6 days
2012-11-24 13:14:53 +00:00
Andriy Gapon
59e407dfbf opensolaris compat: terminate cmn_err mesages with a new line
MFC after:	6 days
2012-11-24 13:10:36 +00:00
Andriy Gapon
c4f59a3c09 zfs: create devices/geoms from zvols after receiveing them
PR:		kern/167066
Tested by:	Andreas Nilsson <andrnils@gmail.com>
MFC after:	13 days
2012-11-24 13:07:31 +00:00
Andriy Gapon
dbe922173c zfs_remove: assert that delete_now case is never true on FreeBSD
That case is specific to Solaris VFS and it would violate pretty
fundamental contracts of FreeBSD VFS.

Discussed with:	pjd
MFC after:	12 days
2012-11-19 11:30:08 +00:00
Andriy Gapon
7b069f7fee zfs_remove: set VV_NOSYNC flag if a node is unlinked
Suggested by:	kib
MFC after:	12 days
2012-11-19 11:25:20 +00:00
Andriy Gapon
62875a6b10 spa_import_rootpool: fall back to use configuration from zpool.cache...
if we fail to generate a proper root pool configuration based on disk
probing.  Currently we can not properly generate the configuration for
multi-vdev pools.  Make that explicit.

Reported by:	madpilot, Bartosz Stec <bartosz.stec@it4pro.pl>
Tested by:	madpilot, Bartosz Stec <bartosz.stec@it4pro.pl>
MFC after:	4 days
2012-11-18 11:47:25 +00:00
Konstantin Belousov
f13b5a0f01 Add the wait6(2) system call. It takes POSIX waitid()-like process
designator to select a process which is waited for. The system call
optionally returns siginfo_t which would be otherwise provided to
SIGCHLD handler, as well as extended structure accounting for child
and cumulative grandchild resource usage.

Allow to get the current rusage information for non-exited processes
as well, similar to Solaris.

The explicit WEXITED flag is required to wait for exited processes,
allowing for more fine-grained control of the events the waiter is
interested in.

Fix the handling of siginfo for WNOWAIT option for all wait*(2)
family, by not removing the queued signal state.

PR:	standards/170346
Submitted by:	"Jukka A. Ukkonen" <jau@iki.fi>
MFC after:	1 month
2012-11-13 12:52:31 +00:00
Andriy Gapon
7631b580ff zfs_ioc_destroy_snaps_nvl: remove disk device entries for zvol snapshots
... before trying to destroy the zvol snapshots themselves.

PR:		kern/173442
Reported by:	Petri Helenius <petri@helenius.fi>,
		mm
Obtained from:	Brian Behlendorf <behlendorf1@llnl.gov>,
		Illumos Bug #3170
Tested by:	Petri Helenius <petri@helenius.fi>
MFC after:	10 days
2012-11-10 12:22:26 +00:00
Xin LI
a9b09a3f3c MFV r242729 (mm):
Illumos r13840:97fd5cdf328a:

3145 single-copy arc
3212 ztest: race condition between vdev_online() and spa_vdev_remove()

Illumos r13849:3468a95b27cd:

3258 ztest's use of file descriptors is unstable
2012-11-10 01:52:52 +00:00
Attilio Rao
bc2258da88 Complete MPSAFE VFS interface and remove MNTK_MPSAFE flag.
Porters should refer to __FreeBSD_version 1000021 for this change as
it may have happened at the same timeframe.
2012-11-09 18:02:25 +00:00
Justin Hibbits
c757049235 Implement DTrace for PowerPC. This includes both 32-bit and 64-bit.
There is one known issue:  Some probes will display an error message along the
lines of:  "Invalid address (0)"

I tested this with both a simple dtrace probe and dtruss on a few different
binaries on 32-bit.  I only compiled 64-bit, did not run it, but I don't expect
problems without the modules loaded.  Volunteers are welcome.

MFC after:	1 month
2012-11-07 23:45:09 +00:00
Andriy Gapon
2c6024ec1b zfs_dirlook: bailout early if directory is unlinked
Otherwise we could fail with an incorrect error if e.g. parent
object id is removed too or we can even return a wrong vnode if
parent object has been already re-used.

Discussed with:	pjd
Also see:	http://article.gmane.org/gmane.os.freebsd.devel.file-systems/13863
MFC after:	26 days
2012-11-04 14:50:08 +00:00
Andriy Gapon
5c997cc429 zfsctl_snapdir_lookup: obtain a snapname in the remount case
... which is triggered if somebody did regular umount on a snapshot mount.

Reviewed by:	Matthew Ahrens <mahrens@delphix.com>
MFC after:	20 days
2012-11-04 14:43:15 +00:00
Andriy Gapon
88c8884a71 zfs: set MNTK_EXTENDED_SHARED flag
Discussed with:	kib
MFC after:	20 days
2012-11-04 14:36:11 +00:00
Andriy Gapon
f8abf4a1e4 opensolaris compat: clear VI_MOUNT before returning if mount_snapshot fails
To do: investigate if it would be possible to use normal vfs_domount here.

Reviewed by:	kib
MFC after:	19 days
2012-11-04 14:27:31 +00:00