Commit Graph

1408 Commits

Author SHA1 Message Date
Alexander Motin
558dcd4e42 MFV r289187: 6251 add tunable to disable free_bpobj processing
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Albert Lee <trisk@omniti.com>
Reviewed by: Xin Li <delphij@freebsd.org>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: George Wilson <george.wilson@delphix.com>

illumos/illumos-gate@139510fb6e
2015-10-12 15:44:44 +00:00
Alexander Motin
72b6ad9bb5 MFV r289185: 6250 zvol_dump_init() can hold txg open
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Albert Lee <trisk@omniti.com>
Reviewed by: Xin Li <delphij@freebsd.org>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: George Wilson <george.wilson@delphix.com>

illumos/illumos-gate@b10bba7246
2015-10-12 15:39:03 +00:00
Alexander Motin
ec5a8cf7c0 Restore original array_rd_sz semantics.
Before r278702 prefetch was blocked for I/Os > 1MB, after -- >= 1MB.
1MB I/Os are used for bulk operations in CTL (XCOPY, VERIFY), and disabling
prefetch for them reduced the performance.

This is temporary local patch, that should be replaced when upstreamed.

Discussed with:	mahrens
MFC after:	3 days
2015-10-03 11:05:58 +00:00
Mark Johnston
f7c3db2537 MFV r288408:
6266 harden dtrace_difo_chunksize() with respect to malicious DIF

illumos/illumos-gate@395c7a3dcf

Reviewed by: Alex Wilson <alex.wilson@joyent.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Bryan Cantrill <bryan@joyent.com>

MFC after:	1 week
2015-09-30 05:24:22 +00:00
Andriy Gapon
a26cc6c081 sdt: static-ize couple of variables
MFC after:	11 days
2015-09-29 12:14:22 +00:00
Andriy Gapon
ab8d248801 sdt module does not seem to actually use any symbol from opensolaris module
MFC after:	11 days
2015-09-29 12:13:31 +00:00
Andriy Gapon
3bd9b9a600 std: it is important that func name is never an empty string
otherwise DTRACE_ANCHORED() returns false and that makes stack()
insert a bogus frame at the top.
For example:
dtrace -n 'test:dtrace_test::sdttest { stack(); }

This change is not really a solution, but just a work-around.
The real solution is to record the probe's call site and to use
that for resolving a function name.

PR:		195222
MFC after:	22 days
2015-09-29 12:02:23 +00:00
Andriy Gapon
09999d92b1 sdt: start checking version field when parsing probe definitions
This is an extra safety measure.

MFC after:	21 days
2015-09-29 11:58:21 +00:00
Andriy Gapon
c9d71814d5 dtrace_getarg: remove stray return statement on amd64, powerpc
MFC after:	10 days
2015-09-29 11:55:26 +00:00
Andriy Gapon
9b977fcea2 define aok in libnvpair which is linked to all zfs libraries that need aok
This removes the circular dependency of libnvpair on libzfs / libzpool.

PR:		199811
Obtained from:	bapt
MFC after:	23 days
2015-09-28 15:25:36 +00:00
Xin LI
8012d6910c MFV r288063: make dataset property de-registration operation O(1)
A change to a property on a dataset must be propagated to its descendants
in case that property is inherited. For datasets whose information is
not currently loaded into memory (e.g. a snapshot that isn't currently
mounted), there is nothing to do; the property change will take effect
the next time that dataset is loaded. To handle updates to datasets that
are in-core, ZFS registers a callback entry for each property of each
loaded dataset with the dsl directory that holds that dataset. There
is a dsl directory associated with each live dataset that references
both the live dataset and any snapshots of the live dataset. A property
change is effected by doing a traversal of the tree of dsl directories
for a pool, starting at the directory sourcing the change, and invoking
these callbacks.

The current implementation both registers and de-registers properties
individually for each loaded dataset. While registration for a property is
O(1) (insert into a list), de-registration is O(n) (search list and then
remove). The 'n' for de-registration, however, is not limited to the size
(number of snapshots + 1) of the dsl directory. The eviction portion
of the life cycle for the in core state of datasets is asynchronous,
which allows multiple copies of the dataset information to be in-core
at once. Only one of these copies is active at any time with the rest
going through tear down processing, but all copies contribute to the
cost of performing a dsl_prop_unregister().

One way to create multiple, in-flight copies of dataset information
is by performing "zfs list" operations from multiple threads
concurrently. In-core dataset information is loaded on demand and then
evicted when reference counts drops to zero. For datasets that are not
mounted, there is no persistent reference count to keep them resident.
So, a list operation will load them, compute the information required to
do the list operation, and then evict them. When performing this operation
from multiple threads it is possible that some of the in-core dataset
information will be reused, but also possible to lose the race and load
the dataset again, even while the same information is being torn down.

Compounding the performance issue further is a change made for illumos
issue 5056 which made dataset eviction single threaded. In environments
using automation to manage ZFS datasets, it is now possible to create
enough of a backlog of dataset evictions to consume excessive amounts
of kernel memory and to bog down the system.

The fix employed here is to make property de-registration O(1). With this
change in place, it is hoped that a single thread is more than sufficient
to handle eviction processing. If it isn't, the problem can be solved
by increasing the number of threads devoted to the eviction taskq.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dataset.c
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dir.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_prop.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_dataset.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_dir.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_prop.h:
    Associate dsl property callback records with both the
    dsl directory and the dsl dataset that is registering the
    callback. Both connections are protected by the dsl directory's
    "dd_lock".

    When linking callbacks into a dsl directory, group them by
    the property type. This helps reduce the space penalty for the
    double association (the property name pointer is stored once
    per dsl_dir instead of in each record) and reduces the number of
    strcmp() calls required to do callback processing when updating
    a single property. Property types are stored in a linked list
    since currently ZFS registers a maximum of 10 property types
    for each dataset.

    Note that the property buckets/records associated with a dsl
    directory are created on demand, but only freed when the dsl
    directory is freed. Given the static nature of property types
    and their small number, there is no benefit to freeing the few
    bytes of memory used to represent the property record earlier.
    When a property record becomes empty, the dsl directory is either
    going to become unreferenced a little later in this thread of
    execution, or there is a high chance that another dataset is
    going to be loaded that would recreate the bucket anyway.

    Replace dsl_prop_unregister() with dsl_prop_unregister_all().
    All callers of dsl_prop_unregister() are trying to remove
    all property registrations for a given dsl dataset anyway. By
    changing the API, we can avoid doing any lookups of callbacks
    by property type and just traverse the list of all callbacks
    for the dataset and free each one.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c:
    Replace use of dsl_prop_unregister() with the new
    dsl_prop_unregister_all() API.

illumos/illumos-gate@03bad06fbb
    Author: Justin Gibbs <gibbs@scsiguy.com>
    Reviewed by: Matthew Ahrens <mahrens@delphix.com>
    Reviewed by: Prakash Surya <prakash.surya@delphix.com>
    Approved by: Dan McDonald <danmcd@omniti.com>

Illumos issue:
    6171 dsl_prop_unregister() slows down dataset eviction
    https://www.illumos.org/issues/6171

MFC after:	2 weeks
2015-09-25 01:05:44 +00:00
Andriy Gapon
e88445a48b MFV r287817: 6220 memleak in l2arc on debug build
c546f36aa8
https://www.illumos.org/issues/6220
  5408 introduced a memleak in l2arc, namely the member b_thawed gets leaked when
  an arc_hdr is realloced from full to l2only.

Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: George Wilson <george@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Arne Jansen <sensille@gmx.net>
2015-09-21 12:23:01 +00:00
Xin LI
b4f6099b9f MFV r287623: 5997 FRU field not set during pool creation and never
updated

ZFS already supports storing the vdev FRU in a vdev property.  There
is code in libzfs to work with this property, and there is code in
the zfs-retire FMA module that looks for that information.  But there
is no code actually setting or updating the FRU.

To address this, ZFS is changed to send a handful of new events
whenever a vdev is added, attached, cleared, or onlined, as well
as when a pool is created or imported.

Note that syseventd is not currently available on FreeBSD and thus
some work is needed to actually support the new ZFS events (e.g. in
zfsd) to actually use this capability, this changeset is mostly a
diff reduction from upstream.

illumos/illumos-gate@1437283407

Illumos issues:

    5997 FRU field not set during pool creation and never updated
    https://www.illumos.org/issues/5997
2015-09-13 07:15:14 +00:00
Xin LI
011ecb128f Note r286552 as merged and reduce diff against upstream. 2015-09-13 06:49:42 +00:00
Xin LI
653809335f MFV r287699: 6214 zpools going south
In r286570 (MFV of r277426) an unprotected write to b_flags to
set the compression mode was introduced.  This would open a race
window where data is partially decompressed, modified, checksummed
and written to the pool, resulting in pool corruption due to the
partial decompression.

Prevent this by reintroducing b_compress

illumos/illumos-gate@d4cd038c92

Illumos issues:

    6214 zpools going south
    https://www.illumos.org/issues/6214
2015-09-12 09:56:23 +00:00
Xin LI
3e691a57db MFV r287684: 6091 avl_add doesn't assert on non-debug builds
Use assfail() from libuutil instead of ASSERT() in userland
AVL avl_add.

illumos/illumos-gate@faa2b6be2f

Illumos issues:

    6091 avl_add doesn't assert on non-debug builds
    https://www.illumos.org/issues/6091
2015-09-12 08:50:43 +00:00
Xin LI
8c4f41ff34 MFV r287624: 5987 zfs prefetch code needs work
Rewrite the ZFS prefetch code to detect only forward, sequential
streams.

The following kstats have been added:

    kstat.zfs.misc.arcstats.sync_wait_for_async

	How many sync reads have waited for async read
	to complete. (less is better)

    kstat.zfs.misc.arcstats.demand_hit_predictive_prefetch

	How many demand read didn't have to wait for I/O
	because of predictive prefetch.  (more is better)

zfetch kstats have been similified to hits, misses, and max_streams,
with max_streams representing times when we were not able to create
new stream because we already have the maximum number of sequences
for a file.

The sysctl variable/loader tunable vfs.zfs.zfetch.block_cap have been
replaced by vfs.zfs.zfetch.max_distance, which controls maximum bytes
to prefetch per stream.

illumos/illumos-gate@cf6106c8a0

Illumos ZFS issues:

    5987 zfs prefetch code needs work
    https://www.illumos.org/issues/5987
2015-09-12 08:35:51 +00:00
Mark Johnston
1e954a7c63 Remove the arg0 field from struct amd64_frame. Its existence was a bug,
since on amd64 the first argument to a function is generally not on the
stack.

Revert an old DTrace bug fix to some code that assumed that
sizeof(struct amd64_frame) == 16.

Reviewed by:	jhb, kib
Sponsored by:	EMC / Isilon Storage Division
Differential Revision:	https://reviews.freebsd.org/D3255
2015-09-11 03:31:22 +00:00
Mark Johnston
e8baaa998c MFV r283513:
5930 fasttrap_pid_enable() panics when prfind() fails in forking process
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Bryan Cantrill <bryan@joyent.com>

illumos/illumos-gate@9df7e4e12e
2015-09-11 03:06:34 +00:00
Mark Johnston
2275da185c MFV r283512:
3599 dtrace_dynvar tail calls can blow stack
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Gordon Ross <gordon.ross@nexenta.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Bryan Cantrill <bryan@joyent.com>

illumos/illumos-gate@d47448f09a
2015-09-11 03:04:24 +00:00
Xin LI
28ffe927c2 Expose an interface to determine if an ACE is inherited.
Submitted by:	sef
Reviewed by:	trasz
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D3540
2015-09-04 00:14:20 +00:00
Allan Jude
de84a5132c Apply the noline attribute to vdev_queue_max_async_writes
This makes it possible to analyze the performance of the new ZFS
write throttle with dtrace

PR:		200316
Submitted by:	Lacey Powers <lacey.leanne@gmail.com>
Reviewed by:	avg, smh, delphij (no objection)
Approved by:	bapt (mentor)
MFC after:	1 month
Sponsored by:	ScaleEngine Inc.
Differential Revision:	https://reviews.freebsd.org/D3472
2015-08-31 23:10:42 +00:00
Xin LI
9053fe148b Fix a buffer overrun which may lead to data corruption, introduced in
r286951 by reinstating changes in r274628.

In l2arc_compress_buf(), we allocate a buffer to stash away the compressed
data in 'cdata', allocated of l2hdr->b_asize bytes.

We then ask zio_compress_data() to compress the buffer, b_l1hdr.b_tmp_cdata,
which is of l2hdr->b_asize bytes, and have the compressed size (or original
size, if compress didn't gain enough) stored in csize.

To pad the buffer to fit the optimal write size, we round up the compressed
size to L2 device's vdev_ashift.

Illumos code rounds up the size by at most SPA_MINBLOCKSIZE.  Because we
know csize <= b_asize, and b_asize is integer multiple of SPA_MINBLOCKSIZE,
we are guaranteed that the rounded up csize would be <= b_asize. However,
this is not necessarily true when we round up to 1 << vdev_ashift, because
it could be larger than SPA_MINBLOCKSIZE.

So, in the worst case scenario, we are overwriting at most

	(1 << vdev_ashift - SPA_MINBLOCKSIZE)

bytes of memory next to the compressed data buffer.

Andriy's original change in r274628 reorganized the code a little bit,
by moving the padding to after we determined that the compression was
beneficial.  At which point, we would check rounded size against the
allocated buffer size, and the buffer overrun would not be possible.
2015-08-29 09:22:32 +00:00
Xin LI
253d699d3c In r286705 (Illumos 5960/a2cdcdd), a separate thread is created with curproc
as parent.  In the case of a send or receive, the curproc would be the
userland application that issues the ioctl.  This would trigger an assertion
failure introduced in Solaris compatibility shims in r196458 when kernel is
compiled with INVARIANTS.

Fix this by using p0 (proc0 or kernel) as the parent thread when creating
the kernel threads.
2015-08-29 08:16:57 +00:00
Andriy Gapon
9f2d1b28df MFV (partial) r286889: 5692 expose the number of hole blocks in a file
FreeBSD porting notes:
- only kernel-side changes are merged
- the new ioctl is not actually implemented yet
- thus, the goal is to synchronize DMU code

illumos/illumos-gate@2bcf0248e9

https://www.illumos.org/issues/5692
we would like to expose the number of hole (sparse) blocks in a file.
this can be useful to for example if you want to fill in the holes with
some data; knowing the number of holes in advances allows you to report
progress on hole filling. We could use SEEK_HOLE to do that but it would
be O(n) where n is the number of holes present in the file.

Author: Max Grossman <max.grossman@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
2015-08-24 09:48:50 +00:00
Andriy Gapon
21fc429242 spa_import_rootpool: prevent lock and resource leak
The lock leak could lead to a deadlock later.

PR:		198563
Submitted by:	Fabian Keil <fk@fabiankeil.de>
MFC after:	1 week
2015-08-24 08:44:44 +00:00
Andriy Gapon
082fcc9ed2 account for ashift when gathering buffers to be written to l2arc device
The change that introduced the L2ARC compression support also introduced
a bug where the on-disk size of the selected buffers could end up larger
than the target size if the ashift is greater than 9.  This was because
the buffer selection could did not take into account the fact that
on-disk size could be larger than the in-memory buffer size due to
the alignment requirements.

At the moment b_asize is a misnomer as it does not always represent the
allocated size: if a buffer is compressed, then the compressed size is
properly rounded (on FreeBSD), but if the compression fails or it is not
applied, then the original size is kept and it could be smaller than what
ashift requires.

For the same reasons arcstat_l2_asize and the reported used space
on the cache device could be smaller than the actual allocated size
if ashift > 9.  That problem is not fixed by this change.

This change only ensures that l2ad_hand is not advanced by more
than target_sz.  Otherwise we would overwrite active (unevicted)
L2ARC buffers.  That problem is manifested as growing l2_cksum_bad
and l2_io_error counters.

This change also changes 'p' prefix to 'a' prefix in a few places
where variables represent allocated rather than physical size.

The resolved problem could also result in the reported allocated size
being greater than the cache device's capacity, because of the
overwritten buffers (more than one buffer claiming the same disk
space).

This change is already in ZFS-on-Linux:
zfsonlinux/zfs@ef56b0780c

PR:		198242
PR:		195746 (possibly related)
Reviewed by:	mahrens (https://reviews.csiden.org/r/229/)
Tested by:	gkontos@aicom.gr (most recently)
MFC after:	15 days
X-MFC note:	patch does not apply as is at the moment
Relnotes:	yes
Sponsored by:	ClusterHQ
Differential Revision:	https://reviews.freebsd.org/D2764
Reviewed by:	noone (@FreeBSD.org)
2015-08-24 08:10:52 +00:00
Andriy Gapon
243f5e3085 try to fix lor between z_teardown_lock and spa_namespace_lock
The lock order reversal and a resulting deadlock were introduced
in r285021 / D2865.  The problem is that zfs_register_callbacks() calls
dsl_prop_get_integer() that has to acquire spa_namespace_lock.
At the same time, spa_config_sync() is called with spa_namespace_lock
held and then it performs ZFS vnode operations that acquire
z_teardown_lock in the reader mode.

So, fix the problem by using dsl_prop_get_int_ds() instead of
dsl_prop_get_integer().  The former does not need to look up
the pool and the dataset by name.

Reported by:	many
Reviewed by:	delphij
Tested by:	delphij, Jens Schweikhardt <schweikh@schweikhardt.net>
MFC after:	5 days
X-MFC with:	r285021
2015-08-21 08:17:44 +00:00
Andriy Gapon
b985dac5ff fix a mismerge in r286539 (MFV 286538: 5562 ZFS sa_handle's violate...)
PR:		202358
X-MFC with:	r286539
X-MFC attn:	mav
2015-08-21 08:04:56 +00:00
Alexander Motin
602015fd15 Restore part of r274628, reverted at r286776.
Submitted by:	avg
2015-08-20 07:41:33 +00:00
Mariusz Zaborski
347a39b4a6 Add support for the arrays in nvlist library.
- Add
  nvlist_{add,get,take,move,exists,free}_{number,bool,string,nvlist,
  descriptor} functions.
- Add support for (un)packing arrays.
- Add the nvl_array_next field to the nvlist structure.
  If an array is added by the nvlist_{move,add}_nvlist_array function
  this field will contains next element in the array.
- Add the nitems field to the nvpair and nvpair_header structure.
  This field contains number of elements in the array.
- Add special flag (NV_FLAG_IN_ARRAY) which is set if nvlist is a part of
  an array.
- Add special type (NV_TYPE_NVLIST_ARRAY_NEXT).This type is used only
  on packing/unpacking.
- Add new API for traversing arrays (nvlist_get_array_next).
- Add the nvlist_get_pararr function which combines the
  nvlist_get_array_next and nvlist_get_parent functions. If nvlist is in
  the array it will return next element from array. If nvlist is last
  element in array or it isn't in array it will return his
  container (parent). This function should simplify traveling over nvlist.
- Add tests for new features.
- Add documentation for new functions.
- Add my copyright.
- Regenerate the sys/cddl/compat/opensolaris/sys/nvpair.h file.

PR:		191083
Reviewed by:	allanjude (doc)
Approved by:	pjd (mentor)
2015-08-15 06:34:49 +00:00
Alexander Motin
d8928f479b Remove some random accumulated diff from Illumos.
Submitted by:	avg (partially)
2015-08-14 13:43:12 +00:00
Alexander Motin
6cb8dbf791 2618 arc.c mistypes in the comments
Reviewed by: Jason King <jason.brian.king@gmail.com>
Reviewed by: Josef Sipek <jeffpc@josefsipek.net>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Bart Coddens <bart.coddens@gmail.com>

illumos/illumos-gate@fc98fea58e
2015-08-14 13:10:30 +00:00
Alexander Motin
997d864ce0 Fix r286766 build with debug. 2015-08-14 11:47:53 +00:00
Alexander Motin
ab4d08c3d3 Fix minor mismerge sometimes earlier. 2015-08-14 09:48:23 +00:00
Alexander Motin
5ba12a280a MFV r286765: 5817 change type of arcs_size from uint64_t to refcount_t
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Prakash Surya <prakash.surya@delphix.com>

illumos/illumos-gate@2fd872a734

As a way to make it more difficult to introduce bugs into the ARC, and to
make it easier to diagnose issues when bugs do creep in, it would be
beneficial to change the type of the arc_state_t's arcs_size field to be
a refcount_t instead of a uint64_t. This would allow us to make stricter
checks when incrementing and decrementing the value with debugging enabled,
but still fallback to simple, fast atomic operations when debugging is
disabled.
2015-08-14 09:39:23 +00:00
Alexander Motin
ab4930d98c MFV r285025: 6033 arc_adjust() should search MFU lists for oldest buffer
when adjusting MFU size.

illumos/illumos-gate@31c46cf23c

https://www.illumos.org/issues/6033
  When we're looking for the list containing oldest buffer we never
  actually look at the MFU lists even when we try to evict from MFU.
  looks like a copy paste error, the fix is here:

Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Xin Li <delphij@delphij.net>
Reviewed by: Prakash Surya <me@prakashsurya.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Alek Pinchuk <alek@nexenta.com>
Obtained from:  illumos
2015-08-14 09:33:46 +00:00
Alexander Motin
e0360e14d2 MFV r277431: 5497 lock contention on arcs_mtx
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Prakash Surya <prakash.surya@delphix.com>

illumos/illumos-gate@244781f10d

This patch attempts to reduce lock contention on the current arc_state_t
mutexes. These mutexes are used liberally to protect the number of LRU
lists within the ARC (e.g. ARC_mru, ARC_mfu, etc). The granularity at
which these locks are acquired has been shown to greatly affect the
performance of highly concurrent, cached workloads.
2015-08-14 09:31:07 +00:00
Alexander Motin
267b62ec43 Revert part of r205231, introducing multiple ARC state locks.
This local implementation will be replaced by one from Illumos to reduce
code divergence and make further merges easier.
2015-08-14 09:25:54 +00:00
Alexander Motin
49114ce463 MFV 286711: 6096 ZFS_SMB_ACL_RENAME needs to cleanup better
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Gordon Ross <gordon.w.ross@gmail.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>

illumos/illumos-gate@8f5190a540
2015-08-13 00:13:55 +00:00
Alexander Motin
3b1f51e911 MFV 286709:
6093 zfsctl_shares_lookup should only VN_RELE() on zfs_zget() success

Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Dan McDonald <danmcd@omniti.com>

illumos/illumos-gate@0f92170f1e
2015-08-13 00:10:36 +00:00
Alexander Motin
0d0def87fe MFV 286707: 5959 clean up per-dataset feature count code
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@ca0cc3918a

A ZFS feature flags (large blocks) tracks its refcounts as the number of
datasets that have ever used the feature. Several features of this type
are planned to be added (new checksum functions). This code should be made
common infrastructure rather than duplicating the code for each feature.
2015-08-12 23:59:17 +00:00
Alexander Motin
b696497df0 MFV r286704: 5960 zfs recv should prefetch indirect blocks
5925 zfs receive -o origin=

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Author: Paul Dagnelie <pcd@delphix.com>

While running 'zfs recv' we noticed that every 128th 8K block required a
read. We were seeing that restore_write() was calling dmu_tx_hold_write()
and the indirect block was not cached. We should prefetch upcoming indirect
blocks to avoid having to go to disk and blocking the restore_write().

Allow an incremental send stream to be received as a clone, even if the
stream does not mark it as a clone.
2015-08-12 22:41:06 +00:00
Alexander Motin
d0687a01d7 MFV r284763: 5981 Deadlock in dmu_objset_find_dp
illumos/illumos-gate@1d3f896f54

https://www.illumos.org/issues/5981
  When dmu_objset_find_dp gets called with a read lock held, it fans out
  the work to the task queue. Each task in turn acquires its own read
  lock before calling the callback. If during this process anyone tries
  to a acquire a write lock, it will stall all read lock requests.Thus
  the tasks will never finish, the read lock of the caller will never
  get freed and the write lock never acquired.  deadlock.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Arne Jansen <jansen@webgods.de>
2015-08-12 19:10:29 +00:00
Alexander Motin
101a6d4eac MFV r284762: 5269 zpool import slow
illumos/illumos-gate@12380e1e70

https://www.illumos.org/issues/5269
  When importing a pool (at boot or with zpool import) with many
  filesystem, the process can take minutes. It doesn't matter whether
  the pool has been exported cleanly or uncleanly.  The problem is that
  each dataset has its own log chain. On import, all datasets have to be
  checked if there are logs to replay.  The idea is to speed up this
  process by paralellizing it.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Arne Jansen <jansen@webgods.de>
2015-08-12 18:47:30 +00:00
Alexander Motin
ebf527de10 MFV r286682: 5765 add support for estimating send stream size with
lzc_send_space when source is a bookmark

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Albert Lee <trisk@nexenta.com>
Author: Max Grossman <max.grossman@delphix.com>

illumos/illumos-gate@643da460c8
2015-08-12 18:23:08 +00:00
Alexander Motin
2d41b1006f MFV r286224: 5695 dmu_sync'ed holes do not retain birth time
illumos/illumos-gate@70163ac57e

https://www.illumos.org/issues/5695
  In dmu_sync_ready(), a hole block pointer will have it's logical size
  explicitly set as it's necessary for replay purposes. To "undo" this,
  dmu_sync_done() will zero out any hole that it finds. This becomes a
  problem when using the "hole_birth" feature, as this will also wipe out
  any birth time that might have happened to be set on the hole.
  ...
  As a fix, the logic to zero out a hole is only applied to old style
  holes with a birth time of zero. Holes created with the "hole_birth"
  feature enabled will have a non-zero birth time, and will be skipped
  (thus preserving the ltime, type, and level information as well).
  In addition, zdb was updated to also print the ltime, type, and level
  information for these new style holes. Previously, only the logical
  birth time would be printed.

Author: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
2015-08-12 17:21:41 +00:00
Alexander Motin
ef00c83db1 Fix set of sign extension bugs in r286625. 2015-08-12 08:36:58 +00:00
Alexander Motin
b3fc966389 Fix assertion panic caused by combination of r286598 and TRIM. 2015-08-11 19:15:55 +00:00
Alexander Motin
3caed89878 Fix r286625 build on i386. 2015-08-11 12:38:01 +00:00
Alexander Motin
a3b3a9752c Fix minor mismerge in r286574. 2015-08-11 12:22:16 +00:00
Alexander Motin
c350858a50 MFV r277425:
5376 arc_kmem_reap_now() should not result in clearing arc_no_grow
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@2ec99e3e98
2015-08-11 10:39:19 +00:00
Alexander Motin
6be7d38913 Remove extra lock, that IMO only creates potential problems now. 2015-08-11 09:18:51 +00:00
Alexander Motin
1af86496cb MFV 286604: 5812 assertion failed in zrl_tryenter(): zr_owner==NULL
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@8df173054c
2015-08-10 21:36:51 +00:00
Alexander Motin
799f47828d MFV 286602: 5810 zdb should print details of bpobj
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@732885fca0
2015-08-10 21:32:40 +00:00
Alexander Motin
c70c15ffa9 MFV 286599: 5808 spa_check_logs is not necessary on readonly pools
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@23367a2f2c
2015-08-10 21:19:42 +00:00
Alexander Motin
f7bf11ab59 MFV 286597: 5701 zpool list reports incorrect "alloc" value for cache devices
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Prakash Surya <prakash.surya@delphix.com>

illumos/illumos-gate@a52fc310ba
2015-08-10 21:13:59 +00:00
Alexander Motin
2e92f38b63 Local addition and mismerge fix for r286579. 2015-08-10 20:34:46 +00:00
Alexander Motin
de8b7ceff1 MFV 286588: 5820 verify failed in zio_done(): BP_EQUAL(bp, io_bp_orig)
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Matthew Ahrens <mahrens@delphix.com>

illumod/illumos-gate@34e8acef00
2015-08-10 19:38:07 +00:00
Alexander Motin
57f7c5acf5 MFV 286586: 5746 more checksumming in zfs send
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Albert Lee <trisk@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@98110f08fa
2015-08-10 19:32:58 +00:00
Alexander Motin
9b4b955150 MFV r277430: 5313 Allow I/Os to be aggregated across ZIO priority classes
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Will Andrews <willa@SpectraLogic.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Justin T. Gibbs <justing@spectralogic.com>

illumos/illumos-gate@fe319232d2
2015-08-10 12:39:10 +00:00
Alexander Motin
498b9d6c63 Fix r286574 build in user-space. 2015-08-10 12:25:26 +00:00
Alexander Motin
0702ce1a52 Fix r286570 build with debug. 2015-08-10 11:52:54 +00:00
Alexander Motin
83a6947e11 MFV r277428: 5056 ZFS deadlock on db_mtx and dn_holds
Reviewed by: Will Andrews <willa@spectralogic.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Justin Gibbs <justing@spectralogic.com>

illumos/illumos-gate@bc9014e6a8
2015-08-10 11:30:07 +00:00
Alexander Motin
f13e9e1470 MFV r277427: 5445 Add more visibility via arcstats; specifically
arc_state_t stats and differentiate between "data" and "metadata"

Reviewed by: Basil Crow <basil.crow@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Bayard Bell <bayard.bell@nexenta.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>

illumos/illumos-gate@4076b1bf41
2015-08-10 10:59:58 +00:00
Alexander Motin
c908dc6f4b MFV r277426: 5408 managing ZFS cache devices requires lots of RAM
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Don Brady <dev.fs.zfs@gmail.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Chris Williamson <Chris.Williamson@delphix.com>

illumos/illumos-gate@89c86e3229

Currently, every buffer cached in the L2ARC is accompanied by a 240-byte
header in memory, leading to very high memory consumption when using very
large cache devices. These changes significantly reduce this overhead.

Currently:

L1-only header = 176 bytes
L1 + L2 or L2-only header = 176 bytes + 32 byte checksum + 32 byte l2hdr
    = 240 bytes

Memory-optimized:

L1-only header = 176 bytes
L1 + L2 header = 176 bytes + 32 byte checksum = 208 bytes
L2-only header = 96 bytes + 32 byte checksum = 128 bytes

So overall:

          Trunk  Optimized
        +-----------------+
L1-only | 176 B  | 176 B  | (same)
        +-----------------+
L1 & L2 | 240 B  | 208 B  | (saved 32 bytes)
        +-----------------+
L2-only | 240 B  | 128 B  | (saved 116 bytes)
        +-----------------+

For an average blocksize of 8KB, this means that for the L2ARC, the ratio
of metadata to data has gone down from about 2.92% to 1.56%.  For a
'storage optimized' EC2 instance with 1600GB of SSD and 60GB of RAM, this
means that we expect a completely full L2ARC to use (1600 GB * 0.0156) /
60GB = 41% of the available memory, down from 78%.
2015-08-10 10:34:23 +00:00
Alexander Motin
1838fc920f MFV 286555: Avoid 128K kmem allocations in mzap_upgrade()
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Approved by: Rich Lowe <richlowe@richlowe.net>

illumos/illumos-gate@be3e2ab906
2015-08-09 20:41:44 +00:00
Alexander Motin
475e2caa3d MFV 286553: 5769 Cast 'zfs bad bloc' to ULL for x86
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Richard PALO <richard@NetBSD.org>
Approved by: Dan McDonald <danmcd@omniti.com>

illumos/illumos-gate@8c76e0763b
2015-08-09 20:32:30 +00:00
Alexander Motin
a5a2e782be MFV 286550: 5694 traverse_prefetcher does not prefetch enough
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: George Wilson <george.wilson@delphix.com>

illumos/illumos-gate@34d7ce052c
2015-08-09 20:23:35 +00:00
Alexander Motin
927211755b MFV 286548:
5693 ztest fails in dbuf_verify: buf[i] == 0, due to dedup and bp_override

Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@7f7ace3700
2015-08-09 20:08:38 +00:00
Alexander Motin
4ff9527edc MFV 286546:
5661 ZFS: "compression = on" should use lz4 if feature is enabled

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed by: Xin LI <delphij@freebsd.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Justin T. Gibbs <justing@spectralogic.com>

illumos/illumos-gate@db1741f555
2015-08-09 20:02:16 +00:00
Alexander Motin
9834135947 MFV 286544:
5630 stale bonus buffer in recycled dnode_t leads to data corruption

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Justin T. Gibbs <justing@spectralogic.com>
2015-08-09 19:35:39 +00:00
Alexander Motin
07470009be MFV 286542: 5592 NULL pointer dereference in dsl_prop_notify_all_cb()
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Robert Mustacchi <rm@joyent.com>

illumos/illumos-gate@9d47dec048
2015-08-09 19:29:10 +00:00
Alexander Motin
1d367fea94 MFV 286540: 5531 NULL pointer dereference in dsl_prop_get_ds()
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Justin T. Gibbs <justing@spectralogic.com>

illumos/illumos-gate@e57a022b8f
2015-08-09 19:26:21 +00:00
Alexander Motin
c44a37d0b7 MFV 286538:
5562 ZFS sa_handle's violate kmem invariants, debug kernels panic on boot

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Robert Mustacchi <rm@fingolfin.org>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Rich Lowe <richlowe@richlowe.net>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Justin T. Gibbs <justing@spectralogic.com>

illumos/illumos-gate@0fda3cc5c1
2015-08-09 19:17:59 +00:00
Mark Johnston
1c9a705223 Remove a couple of unused fields from the FBT probe struct. 2015-08-03 17:39:36 +00:00
Steven Hartland
ebbc56ecd6 Fix KSTACK_PAGES check in ZFS module
The check introduced by r285946 failed to add the dependency on
opt_kstack_pages.h which meant the default value for the platform instead
of the customised options KSTACK_PAGES=X was being tested.

Also wrap in #ifdef __FreeBSD__ for portability.

MFC after:	3 days
Sponsored by:	Multiplay
2015-08-03 09:34:09 +00:00
Mark Johnston
48fcd357c4 Avoid dereferencing curthread->td_proc->p_cred in DTrace probe context.
When a process is exiting, there is a narrow window where p_cred may be
NULL while its threads are still executing. Specifically, the last thread
to exit a process sets the process state to PRS_ZOMBIE with the proc
spinlock held and then calls thread_exit(). thread_exit() drops the spin
lock, permitting the process to be reaped and thus causing its cred struct
to be released. However, the exiting thread may still cause DTrace probes
to fire by calling sched_throw(), resulting in a double fault if such a
probe enabling attempts to access the GID or UID DIF variables.

The thread's cred reference is not susceptible to this race since it is not
released until after the thread has exited.

MFC after:	1 week
2015-08-02 00:11:56 +00:00
Steven Hartland
1635369e99 Add warning about low KSTACK_PAGES for ZFS use
As ZFS requires a more kernel stack pages than is the default on some
architectures e.g. i386, warn if KSTACK_PAGES is less than
ZFS_MIN_KSTACK_PAGES (which is 4 at the time of writing).

MFC after:	3 days
Sponsored by:	Multiplay
2015-07-28 11:19:38 +00:00
Mark Johnston
d258fd1d98 Remove checks for a NULL return value from M_WAITOK allocations. 2015-07-21 23:44:36 +00:00
Ed Schouten
5a170c1b0e Add an API for easily creating userspace threads in kernelspace.
This change refactors the existing create_thread() function to be more
generic. It replaces almost all of its arguments by a callback that can
be used to extract the thread ID and copy it out to the right place, but
also to perform additional initialization steps, such as setting the
trapframe. This also makes the difference between thr_new() and
thr_create() more clear in my opinion.

This function is going to be used by the CloudABI compatibility layer.

It looks like the OpenSolaris compatibility framework already provides a
function called thread_create(). Rename this function to
do_thread_create() and use a macro to deal with the namespacing
conflict. A similar approach is already used for thread_exit().

MFC after:	1 month
2015-07-20 10:20:04 +00:00
Mark Johnston
32cd0147fa Implement the lockstat provider using SDT(9) instead of the custom provider
in lockstat.ko. This means that lockstat probes now have typed arguments and
will utilize SDT probe hot-patching support when it arrives.

Reviewed by:	gnn
Differential Revision:	https://reviews.freebsd.org/D2993
2015-07-19 22:14:09 +00:00
Mark Johnston
efe8b26b82 Modify lockstat_nsecs() to just return unless lockstat probes are actually
enabled. The cost of a timecounter read can be quite significant, and the
problem became more apparent after r284297, since that change resulted in
a call to lockstat_nsecs() for each acquisition of an rwlock read lock.

PR:		201642
Reviewed by:	avg
Tested by:	Jason Unovitch
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D3073
2015-07-18 00:22:00 +00:00
Konstantin Belousov
888e282ab4 When checking for the valid value of the frame pointer, verify that it
belongs to the kernel stack address range for the thread.  Right now,
code checks that new frame is not farther then KSTACK_PAGES pages from
the current frame, which allows the address to point past the top of
the stack.

Reviewed by:	andrew, emaste, markj
Differential revision:	https://reviews.freebsd.org/D3108
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-07-16 19:40:18 +00:00
Mateusz Guzik
752fc07d33 vfs: implement v_holdcnt/v_usecount manipulation using atomic ops
Transitions 0->1 and 1->0 (which decide e.g. on putting the vnode on the free
list) of either counter are still guarded with vnode interlock.

Reviewed by:	kib (earlier version)
Tested by:	pho
2015-07-16 13:57:05 +00:00
Mateusz Guzik
8a08cec166 Create a dedicated function for ensuring that cdir and rdir are populated.
Previously several places were doing it on its own, partially
incorrectly (e.g. without the filedesc locked) or even actively harmful
by populating jdir or assigning rootvnode without vrefing it.

Reviewed by:	kib
2015-07-11 16:22:48 +00:00
Mariusz Zaborski
306a82f8f4 Rename zfs nvpair files to not colidate with our nvlist.
PR:		201356
Approved by:	pjd (mentor)
2015-07-09 21:53:40 +00:00
Mateusz Guzik
f131759f54 fd: make 'rights' a manadatory argument to fget* functions 2015-07-05 19:05:16 +00:00
Konstantin Belousov
6fdfd88220 Use single instance of the identical INKERNEL() and PMC_IN_KERNEL()
macros on amd64 and i386.  Move the definition to machine/param.h.
kgdb defines INKERNEL() too, the conflict is resolved by renaming kgdb
version to PINKERNEL().

On i386, correct the lowest kernel address.  After the shared page was
introduced, USRSTACK no longer points to the last user address + 1 [*]

Submitted by:	Oliver Pinter [*]
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-07-02 14:37:21 +00:00
Andriy Gapon
74f75cb1bd zfs_mount(MS_REMOUNT): protect zfs_(un)register_callbacks calls
We now take z_teardown_lock as a writer to ensure that there is no I/O
while the filesystem state is in a flux.  Also, zfs_suspend_fs() ->
zfsvfs_teardown() call zfs_unregister_callbacks() and zfs_resume_fs() ->
zfsvfs_setup() call zfs_unregister_callbacks().  Previously there was no
synchronization between those calls and the calls in the re-mounting
case.  That could lead to concurrent execution and a crash.

PR:		180060
Differential Revision:	https://reviews.freebsd.org/D2865
Suggested by:	mahrens
Reviewed by:	delphij, pho, mahrens, will
MFC after:	13 days
Sponsored by:	ClusterHQ
2015-07-02 08:32:02 +00:00
Ruslan Bukin
b78ee15e9f First cut of DTrace for AArch64.
Reviewed by:	andrew, emaste
Sponsored by:	ARM Limited
Differential Revision:	https://reviews.freebsd.org/D2738
2015-07-01 15:51:11 +00:00
Ruslan Bukin
0ff41755cd Add a central location for exclusion checks. We check
here if function is excluded from FBT instrumentation.

Reviewed by:	andrew, emaste, markj
Differential Revision:	https://reviews.freebsd.org/D2899
2015-07-01 14:09:59 +00:00
Andriy Gapon
bc97daa07e MFV r284412: 5911 ZFS "hangs" while deleting file
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@46e1baa6cf

https://www.illumos.org/issues/5911
Sometimes ZFS appears to hang while deleting a file. It is actually
making slow progress at the file deletion, but other operations
(administrative and writes via the data path) "hang" until the file
removal completes, which can take a long time if the file has many
blocks. The deletion (or most of it) happens in a single txg, and the
sync thread spends most of its time reading indirect blocks via this
stack trace:
	swtch+0x141()
	cv_wait+0x70()
	zio_wait+0x5b()
	dbuf_read+0x2c0()
	free_children+0x50()
	free_children+0x12a()
	free_children+0x12a()
	free_children+0x12a()
	dnode_sync_free_range_impl+0xdf()
	dnode_sync_free_range+0x52()
	range_tree_vacate+0x65()
	dnode_sync+0x1d8()
	dmu_objset_sync_dnodes+0x77()
	dmu_objset_sync+0x19f()
	dsl_dataset_sync+0x51()
	dsl_pool_sync+0x9a()
	spa_sync+0x2ff()
	txg_sync_thread+0x21f()
	thread_start+8()
One way to reproduce the problem is if we are over the arc_meta_limit,
e.g. because lots of indirect blocks are pinned because we have L0
dbufs under them.  It could be that most of the L1 indirects are cached,
in which case when dmu_free_long_range_impl() calls dmu_tx_hold_free(),
it will complete very quickly. This allows dmu_free_long_range_impl() to
put many (perhaps all of its) transactions in the same TXG. However,
dmu_free_long_range_impl() calls dnode_evict_dbufs (and
dnode_free_range()), which removes the L0 dbufs, thus reducing the hold
count on the L1 indirect blocks above it, allowing them to be evicted.
Because we are over the arc_meta_limit(), these L1 blocks will be
evicted ASAP. Thus when we get to syncing context, the L1 indirects are
no longer cached and must be read in.

Obtained from:	illumos
MFC after:	15 days
2015-06-19 06:58:05 +00:00
Andriy Gapon
ab50c99d40 illums compat: use flsl/flsll for highbit/highbit64
Do that only when when fast inline versions are available.
At the moment that can be the case only in the kernel and not for all
platforms.

The original code uses the binary search and that's kept as a fallback.
This is a micro optimization.

Differential Revision:	https://reviews.freebsd.org/D2839
Reviewed by:	delphij, mahrens, mav
MFC after:	17 days
2015-06-19 06:41:53 +00:00
Gleb Smirnoff
093ebe1d28 o Un-inline vm_pager_get_pages(), vm_pager_get_pages_async().
o Provide an extensive set of assertions for input array of pages.
o Remove now duplicate assertions from different pagers.

Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2015-06-17 22:44:27 +00:00
Andriy Gapon
783379a942 Revert r284511 because it caused build failures on many platforms
The problem is that when inline versions of flsl and flsll are not
available, then libkern.h must be included for their declarations
in kernel sources.
The fix would be trivial, but I would like to figure out first if
it even makes sense to use the libkern provided implementations.

Reported by:	bz
Pointyhat to:	avg
2015-06-17 17:16:06 +00:00
Andriy Gapon
6470c31911 l2arc: pass correct size to trim requests
b_size is a logical size of a buffer in memory, b_asize is its physical
size that accounts for possible compression.
Currently the latter is the best approximation for the allocated, on-disk
size.

L2ARC TRIM support was committed a few weeks before L2ARC compression
was imported, so originally the code was correct, because b_size was
the size.

Further thoughts.  Given that the cache device is being overwritten
in a circular fashion it is not clear if a TRIM per each evicted L2ARC
buffer has any benefits.
Maybe it would be sufficient to issue a single trim request for the whole
device when it is loaded, e.g. after a bootup, or when it is unloaded, e.g.
before a shutdown.  At least as long as L2ARC is not persistent across
reboots.

Discussed with:	smh
MFC after:	19 says
2015-06-17 12:28:13 +00:00
Andriy Gapon
1fa1d4a651 illumos compat: use flsl/flsll for highbit/highbit64
This is a micro optimization.
The upstream code uses the binary search.

Differential Revision:	https://reviews.freebsd.org/D2839
Reviewed by:	delphij, mav
MFC after:	15 days
2015-06-17 12:05:04 +00:00
Andriy Gapon
bab89d0897 MFV r284036: 5961 Fix stack overflow in zfs_create_fs
illumos/illumos-gate@c701fde691

Author:		glebius
MFC after:	11 days
2015-06-12 11:10:49 +00:00
Andriy Gapon
ff7e06fbf4 MFV r284030: 5818 zfs {ref}compressratio is incorrect with 4k sector size
illumos/illumos-gate@81cd5c555f

Author:	Matthew Ahrens <mahrens@delphix.com>
MFC after:	17 days
2015-06-12 10:57:05 +00:00