Commit Graph

940 Commits

Author SHA1 Message Date
Chunwei Chen
21a96fb635 Fix "BUG: Bad page state" caused by writeback flag
Commit d958324 fixed the deadlock between page lock and range lock by
unlocking the page lock before acquiring the range lock. However,
this created a new issue #3075.

The problem is that if we can't set the write back bit before releasing
the page lock.  Then other processes will be unaware that the page is
under active write back.  They may therefore truncate the page,
invalidate the page, or not honor the sync semantics.

To workaround this problem we re-dirty the page before dropping the
page lock.  While this doesn't prevent the page from being truncated
it does ensure it won't be invalidated.  Then the range lock and the
page lock are reacquired in the correct deadlock-free order.

Once both locks are safely held the page state can be rechecked.  If
all is well and the page is in the expect state the dirty bit can be
removed, the write back bit set, and the page removed from the skip
count.  If not the page will be handled as appropriate.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3075
2015-07-29 07:38:15 -07:00
Brian Behlendorf
1229323d5f Align thread priority with Linux defaults
Under Linux filesystem threads responsible for handling I/O are
normally created with the maximum priority.  Non-I/O filesystem
processes run with the default priority.  ZFS should adopt the
same priority scheme under Linux to maintain good performance
and so that it will complete fairly when other Linux filesystems
are active.  The priorities have been updated to the following:

$ ps -eLo rtprio,cls,pid,pri,nice,cmd | egrep 'z_|spl_|zvol|arc|dbu|meta'
     -  TS 10743  19 -20 [spl_kmem_cache]
     -  TS 10744  19 -20 [spl_system_task]
     -  TS 10745  19 -20 [spl_dynamic_tas]
     -  TS 10764  19   0 [dbu_evict]
     -  TS 10765  19   0 [arc_prune]
     -  TS 10766  19   0 [arc_reclaim]
     -  TS 10767  19   0 [arc_user_evicts]
     -  TS 10768  19   0 [l2arc_feed]
     -  TS 10769  39   0 [z_unmount]
     -  TS 10770  39 -20 [zvol]
     -  TS 11011  39 -20 [z_null_iss]
     -  TS 11012  39 -20 [z_null_int]
     -  TS 11013  39 -20 [z_rd_iss]
     -  TS 11014  39 -20 [z_rd_int_0]
     -  TS 11022  38 -19 [z_wr_iss]
     -  TS 11023  39 -20 [z_wr_iss_h]
     -  TS 11024  39 -20 [z_wr_int_0]
     -  TS 11032  39 -20 [z_wr_int_h]
     -  TS 11033  39 -20 [z_fr_iss_0]
     -  TS 11041  39 -20 [z_fr_int]
     -  TS 11042  39 -20 [z_cl_iss]
     -  TS 11043  39 -20 [z_cl_int]
     -  TS 11044  39 -20 [z_ioctl_iss]
     -  TS 11045  39 -20 [z_ioctl_int]
     -  TS 11046  39 -20 [metaslab_group_]
     -  TS 11050  19   0 [z_iput]
     -  TS 11121  38 -19 [z_wr_iss]

Note that under Linux the meaning of a processes priority is inverted
with respect to illumos.  High values on Linux indicate a _low_ priority
while high value on illumos indicate a _high_ priority.

In order to preserve the logical meaning of the minclsyspri and
maxclsyspri macros when they are used by the illumos wrapper functions
their values have been inverted.  This way when changes are merged
from upstream illumos we won't need to remember to invert the macro.
It could also lead to confusion.

This patch depends on https://github.com/zfsonlinux/spl/pull/466.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #3607
2015-07-28 13:36:47 -07:00
Brian Behlendorf
c97d30691c Check for NULL in dmu_free_long_range_impl()
A NULL should never be passed as the dnode_t pointer to the function
dmu_free_long_range_impl().  Regardless, because we have a reported
occurrence of this let's add some error handling to catch this.
Better to report a reasonable error to caller than panic the system.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3445
2015-07-28 13:30:53 -07:00
Brian Behlendorf
96c080cb9c Minor style cleanup
Address minor differences in style between upstream and ZoL.  This
patch contains no functional differences and is solely designed to
minimize the delta from upstream.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3533
2015-07-23 09:42:54 -07:00
Brian Behlendorf
3056818343 Remove double counting HDR_L2ONLY_SIZE
Commit d962d5d didn't quite properly resolve the HDR_L2ONLY_SIZE
accounting.  Accounting is now performed only in the constructor
and destructor which is a nice simplification.  It should have
been removed the from create and destroy functions.  This brings
up back in sync with upstream.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3533
2015-07-23 09:42:44 -07:00
Brian Behlendorf
8c8af9d807 Add hdr_recl() reclaim callback
Originally removed because it wasn't required under Linux.  However,
there may still be some utility in signaling the arc reclaim thread
under Linux via reclaim.  This should already have happened by other
means but it's not harmless and reduces another point of divergence
with upstream.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3533
2015-07-23 09:42:40 -07:00
Brian Behlendorf
728d6ae91e Reinstate zfs_arc_p_min_shift
Commit f521ce1 removed the minimum value for "arc_p" allowing it to
drop to zero or grow to "arc_c".  This was done to improve specific
workload which constantly dirties new "metadata" but also frequently
touches a "small" amount of mfu data (e.g. mkdir's).

This change may still be desirable but it needs to be re-investigated.
in the context of the recent ARC changes from upstream.  Therefore
this code is being restored to facilitate benchmarking.  By setting
"zfs_arc_p_min_shift=64" we easily compare the performance.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3533
2015-07-23 09:42:32 -07:00
Prakash Surya
36da08ef9b Illumos 5817 - change type of arcs_size from uint64_t to refcount_t
5817 change type of arcs_size from uint64_t to refcount_t
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/5817
  https://github.com/illumos/illumos-gate/commit/2fd872a

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3533
2015-07-23 09:42:28 -07:00
Prakash Surya
500445c046 Illumos 5445 - Add more visibility via arcstats
5445 Add more visibility via arcstats; specifically arc_state_t
stats and differentiate between "data" and "metadata"
Reviewed by: Basil Crow <basil.crow@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Bayard Bell <bayard.bell@nexenta.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/5445
  https://github.com/illumos/illumos-gate/commit/4076b1b

Porting Notes:

This patch is an improved version of cc7f677 which was previously
merged in ZoL.  This patch incorporates the additional improvements
which were made upstream.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3533
2015-07-23 09:42:06 -07:00
Matthew Ahrens
ca67b33aba Illumos 5376 - arc_kmem_reap_now() should not result in clearing arc_no_grow
5376 arc_kmem_reap_now() should not result in clearing arc_no_grow
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5376
  https://github.com/illumos/illumos-gate/commit/2ec99e3

Porting Notes:

The good news is that many of the recent changes made upstream to the
ARC tackled issues previously observed by ZoL with similar solutions.
The bad news is those solution weren't identical to the ones we applied.
This patch is designed to split the difference and apply as much of the
upstream work as possible.

* The arc_available_memory() function was removed previous in ZoL but
due to the upstream changes it makes sense to add it back.  This function
has been customized for Linux so that it can be used to determine a low
memory.  This provides the same basic functionality as the illumos version
allowing us to minimize changes through the rest of the code base.  The
exact mechanism used to detect a low memory state remains unchanged so
this change isn't a significant as it might first appear.

* This patch includes the long standing fix for arc_shrink() which was
originally proposed in #2167.  Since there were related changes to this
function it made sense to include that work.

* The arc_init() function has been re-factored.  As before it sets sane
default values for the ARC but then calls arc_tuning_update() to apply
user specific tuning made via module options.  The arc_tuning_update()
function is then called periodically by the arc_reclaim_thread() to
apply changes to the tunings made during normal operation.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3616
Closes #2167
2015-07-23 09:41:28 -07:00
Brian Behlendorf
53b1d9794e Add logic to try and recover an inode with an invalid mode
When an inode is detected with invalid mode bits the safe thing to
do is panic the system.  This indicates a problem with the contents
of a dnode and it should never be possible.  This is the default
behavior.

Unfortunately, due to flaws in the system attribute (SA) implementation
(on all platforms) it was possible that ZFS could create a damaged dnode.
This was a rare issue which only impacted dnodes which used a spill
block.  Normally only symlinks and files with ACLs would require a
spill block.  However, if the dataset had the xattr=sa property set
and extended attributes were used this problem could occur.

As of the 0.6.4 tag the root cause of this issue has been fixed.  For
pools which are exhibiting this damage the 'zfs_recover=1' module option
may be set.  This will cause ZFS to interpret the dnode with invalid
mode bits as a normal file.  This may allow the files to be accessed
for recovery purposes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3548
2015-07-17 15:33:35 -07:00
Turbo Fredriksson
47a4a6fd5f Support parallel build trees (VPATH builds)
Build products from an out of tree build should be written
relative to the build directory.  Sources should be referred
to by their locations in the source directory.

This is accomplished by adding the 'src' and 'obj' variables
for the module Makefile.am, using relative paths to reference
source files, and by setting VPATH when source files are not
co-located with the Makefile.  This enables the following:

  $ mkdir build
  $ cd build
  $ ../configure \
    --with-spl=$HOME/src/git/spl/ \
    --with-spl-obj=$HOME/src/git/spl/build
  $ make -s

This change also has the advantage of resolving the following
warning which is generated by modern versions of automake.

  Makefile.am:00: warning: source file 'xxx' is in a subdirectory,
  Makefile.am:00: but option 'subdir-objects' is disabled

Signed-off-by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1082
2015-07-17 13:42:51 -07:00
Brian Behlendorf
2a53e2dacc Update inode under range lock
After a successful write the inode must be updated under the range
lock.  If it is updated after dropping the lock there exists a race
where the znode and inode wile disagree about the file size.  This
could result in narrow window of time where read(2) is able to access
data beyond what fstat(2) reports as the file size.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #3601
2015-07-17 09:18:22 -07:00
Brian Behlendorf
bd29109f1a Linux 4.2 compat: follow_link() / put_link()
As of Linux 4.2 the kernel has completely retired the nameidata
structure.  One of the few remaining consumers of this interface
were the follow_link() and put_link() callbacks.

This patch adds the required checks to configure to detect the
interface change and updates the functions accordingly.  Migrating
to the simple_follow_link() interface was considered but was decided
against ironically due to the increased complexity.

It also should be noted that the kernel follow_link() and put_link()
interfaces changes several times after 4.1 and but before 4.2.  This
means there is a narrow range of kernel commits which never appear
in an official tag of the Linux kernel which ZoL will not build.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #3596
2015-07-17 09:18:16 -07:00
Brian Behlendorf
7eb333fbdd Linux 4.2 compat: remove bio->bi_cnt access
Linux 4.2 commit torvalds/linux@dac5621 renamed bio->bi_cnt to
bio->__bi_cnt.  Because this value is only used once in a block of
debug code it simplest just to remove the PANIC.  To my knowledge
this debugging has never been hit or proved useful so this is no
great loss.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #3596
2015-07-17 09:16:08 -07:00
Matthew Ahrens
905edb405d Illumos 5347 - idle pool may run itself out of space
5347 idle pool may run itself out of space
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://github.com/illumos/illumos-gate/commit/231aab8
  https://github.com/illumos/illumos-gate/commit/4a92375 3642
  https://www.illumos.org/issues/5347
  https://github.com/zfsonlinux/zfs/commit/89b1cd6 (partial commit & fix)
  https://github.com/zfsonlinux/zfs/commit/fbeddd6 Illumos 4390
  https://github.com/zfsonlinux/zfs/commit/2696dfa Illumos 3642, 3643

Porting notes:
This is completing the partial fix from FreeBSD

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3586
2015-07-14 10:35:21 -07:00
Alexander Eremin
1cddb8c9ff Illumos 5610 - zfs clone from different source and target pools produces coredump
5610 zfs clone from different source and target pools produces coredump
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://github.com/illumos/illumos-gate/commit/03b1c29
  https://www.illumos.org/issues/5610
  https://www.illumos.org/issues/5824
  https://github.com/zfsonlinux/zfs/issues/2911
  https://github.com/zfsonlinux/zfs/commit/9063f65

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3584
2015-07-14 10:27:46 -07:00
Richard Yao
0de7c552b6 Failure of userland copy should return EFAULT
Many key internal functions pass system return codes that are safe to
return to userland. In the case of ddi_copyin(9F), an error passes -1
and the documentation states very clearly that drivers should pass
EFAULT to userland when this happens.

http://illumos.org/man/9F/ddi_copyin

This does not happen in the ZFS source code. I believe it should be
changed to pass EFAULT. I caught this when writing man pages for the
libzfs_core API.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3575
2015-07-14 10:20:35 -07:00
Boris Protopopov
b39c22b73c Translate sync zio to sync bio
Translate zio requests with ZIO_PRIORITY_SYNC_READ and
ZIO_PRIORITY_SYNC_WRITE into synchronous bio requests by setting
READ_SYNC and WRITE_SYNC flags. Specifically, WRITE_SYNC flag turns
out to have a pronounced effect when writing to an SSD-based SLOG.

When WRITE_SYNC is not set (WRITE is set instead), the block trace
for a SLOG device looks as follows:
...
130,96   0        3     0.008968390     0  C   W 830464 + 136 [0]
130,96   0        4     0.011999161     0  C   W 830720 + 136 [0]
130,96   0        5     0.023955549     0  C   W 831744 + 136 [0]
130,96   0        6     0.024337663 19775  A   W 832000 + 136 <- (130,97) 829952
130,96   0        7     0.024338823 19775  Q   W 832000 + 136 [z_wr_iss/6]
130,96   0        8     0.024340523 19775  G   W 832000 + 136 [z_wr_iss/6]
130,96   0        9     0.024343187 19775  P   N [z_wr_iss/6]
130,96   0       10     0.024344120 19775  I   W 832000 + 136 [z_wr_iss/6]
130,96   0       11     0.026784405     0 UT   N [swapper] 1
130,96   0       12     0.026805339   202  U   N [kblockd/0] 1
130,96   0       13     0.026807199   202  D   W 832000 + 136 [kblockd/0]
130,96   0       14     0.026966948     0  C   W 832000 + 136 [0]
130,96   3        1     0.000449358 19788  A   W 829952 + 136 <- (130,97) 827904
130,96   3        2     0.000450951 19788  Q   W 829952 + 136 [z_wr_iss/19]
130,96   3        3     0.000453212 19788  G   W 829952 + 136 [z_wr_iss/19]
130,96   3        4     0.000455956 19788  P   N [z_wr_iss/19]
130,96   3        5     0.000457076 19788  I   W 829952 + 136 [z_wr_iss/19]
130,96   3        6     0.002786349     0 UT   N [swapper] 1
...

Here the 130,197 is the partition created on the log device when adding it
to the pool, whereas the base device is 130,96. As one can see, the writes
to the SLOG are not marked synchronous (the S is missing next to W), and
the queue unplugs occur based on the timer (UT event) resulting in slightly
over 2 msec latency of writes. This results in a sub-par performance of
single stream synchronous writes (limited by latency of the SLOG).

When the WRITE_SYNC is set, a similar trace looks as follows:
...
130,96   4        1     0.000000000 70714  A  WS 4280576 + 136 <- (130,97) 4278528
130,96   4        2     0.000000832 70714  Q  WS 4280576 + 136 [(null)]
130,96   4        3     0.000002109 70714  G  WS 4280576 + 136 [(null)]
130,96   4        4     0.000003394 70714  P   N [(null)]
130,96   4        5     0.000003846 70714  I  WS 4280576 + 136 [(null)]
130,96   4        6     0.000004854 70714  D  WS 4280576 + 136 [(null)]
130,96   5        1     0.000354487 70713  A  WS 4280832 + 136 <- (130,97) 4278784
130,96   5        2     0.000355072 70713  Q  WS 4280832 + 136 [(null)]
130,96   5        3     0.000356383 70713  G  WS 4280832 + 136 [(null)]
130,96   5        4     0.000357635 70713  P   N [(null)]
130,96   5        5     0.000358088 70713  I  WS 4280832 + 136 [(null)]
130,96   5        6     0.000359191 70713  D  WS 4280832 + 136 [(null)]
130,96   0       76     0.000159539     0  C  WS 4280576 + 136 [0]
130,96  16       85     0.000742108 70718  A  WS 4281088 + 136 <- (130,97) 4279040
130,96  16       86     0.000743197 70718  Q  WS 4281088 + 136 [z_wr_iss/15]
130,96  16       87     0.000744450 70718  G  WS 4281088 + 136 [z_wr_iss/15]
130,96  16       88     0.000745817 70718  P   N [z_wr_iss/15]
130,96  16       89     0.000746705 70718  I  WS 4281088 + 136 [z_wr_iss/15]
130,96  16       90     0.000747848 70718  D  WS 4281088 + 136 [z_wr_iss/15]
130,96   0       77     0.000604063     0  C  WS 4280832 + 136 [0]
130,96   0       78     0.000899858     0  C  WS 4281088 + 136 [0]

As one can see, all the writes are synchronous (WS), and I/O completions
(e.g. from issue I to completion C) take 160-250 usec, or about 10x faster.

Since WRITE_SYNC or READ_SYNC flags are among several factors that are
considered when processing bio requests, it seems prudent to mark all the
zio requests of synchronous priority with the READ/WRITE_SYNC flags to make
them eligible for consideration as such by the Linux block I/O layer.

Signed-off-by: Boris Protopopov <boris.protopopov@actifio.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3529
2015-07-13 14:28:50 -07:00
Brian Behlendorf
2b7b78fa5d Fix switch-bool warning
As of gcc version 5.1.1 a new warning has been added to detect the
use of a boolean in a switch statement (-Wswitch-bool).  Resolve the
warning by explicitly casting the value to an integer type.

  zfs-0.6.4/module/zfs/zvol.c: In function 'zvol_request':
  error: switch condition has boolean value [-Werror=switch-bool]

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-07-13 13:03:01 -07:00
Justin T. Gibbs
99197f034e Illumos 5661 - ZFS: "compression = on" should use lz4 if feature is enabled
5661 ZFS: "compression = on" should use lz4 if feature is enabled
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed by: Xin LI <delphij@freebsd.org>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://github.com/illumos/illumos-gate/commit/db1741f
  https://www.illumos.org/issues/5661

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3571
2015-07-10 12:11:45 -07:00
Tim Chase
1cd777340b Prevent reclaim in metaslab preload threads
Reclaim during metaslab preloading can cause deadlocks involving znode
z_lock and ARC buffer header ht_lock.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3532.
2015-07-06 09:36:13 -07:00
Alexander Motin
e16b3fcc61 Illumos 5008 - lock contention (rrw_exit) while running a read only load
5008 lock contention (rrw_exit) while running a read only load
Reviewed by: Matthew Ahrens <matthew.ahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Richard Yao <ryao@gentoo.org>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Garrett D'Amore <garrett@damore.org>

Porting notes:

This patch ported perfectly cleanly to ZoL.  During testing 100% cached
small-block reads, extreme contention was noticed on rrl->rr_lock from
rrw_exit() due to the frequent entering and leaving ZPL.  Illumos picked
up this patch from FreeBSD and it also helps under Linux.

On a 1-minute 4K cached read test with 10 fio processes pinned to a single
socket on a 4-socket (10 thread per socket) NUMA system, contentions on
rrl->rr_lock were reduced from 508799 to 43085.

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3555
2015-07-06 09:34:13 -07:00
Matthew Ahrens
4bda3bd0e7 Illumos 5911 - ZFS "hangs" while deleting file
5911 ZFS "hangs" while deleting file
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

References:
  https://www.illumos.org/issues/5911
  https://github.com/illumos/illumos-gate/commit/46e1baa

Porting notes:

Resolved ISO C90 forbids mixed declarations and code wanting in
the dnode_free_range() function.

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3554
2015-07-06 09:31:42 -07:00
Arne Jansen
5e8cd5d17f Illumos 5981 - Deadlock in dmu_objset_find_dp
5981 Deadlock in dmu_objset_find_dp
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/5981
  https://github.com/illumos/illumos-gate/commit/1d3f896

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3553
2015-07-06 09:31:35 -07:00
Andriy Gapon
71e2fe41be Illumos 5946, 5945
5946 zfs_ioc_space_snaps must check that firstsnap and lastsnap refer to snapshots
5945 zfs_ioc_send_space must ensure that fromsnap refers to a snapshot
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>

References:
  https://www.illumos.org/issues/5946
  https://www.illumos.org/issues/5945
  https://github.com/illumos/illumos-gate/commit/24218be

Ported-by: Andriy Gapon <avg@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3552
2015-07-06 09:31:30 -07:00
Andriy Gapon
b6640117f0 Illumos 5870 - dmu_recv_end_check() leaks origin_head hold if error happens in drc_force branch
5870 dmu_recv_end_check() leaks origin_head hold if error happens in drc_force branch
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5870
  https://github.com/illumos/illumos-gate/commit/beddaa9

Ported-by: Andriy Gapon <avg@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3551
2015-07-06 09:22:18 -07:00
Andriy Gapon
fec417097b Illumos 5909 - ensure that shared snap names don't become too long after promotion
5909 ensure that shared snap names don't become too long after promotion
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5909
  https://github.com/illumos/illumos-gate/commit/cb5842f

Ported-by: Andriy Gapon <avg@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3550
2015-07-06 09:21:30 -07:00
Andriy Gapon
cf50a2b08f Illumos 5912 - full stream can not be force-received into a dataset if it has a snapshot
5912 full stream can not be force-received into a dataset if it has a snapshot
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5912
  https://github.com/illumos/illumos-gate/commit/5bae108

Ported-by: Andriy Gapon <avg@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3549
2015-07-06 09:20:18 -07:00
Alek Pinchuk
a7b10a9319 Illumos 6033 - arc_adjust() should search MFU lists
6033 arc_adjust() should search MFU lists for oldest
buffer when adjusting MFU size
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Xin Li <delphij@delphij.net>
Reviewed by: Prakash Surya <me@prakashsurya.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>

References:
  https://www.illumos.org/issues/6033
  https://github.com/illumos/illumos-gate/commit/31c46cf

Ported-by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3545
2015-07-01 11:09:15 -07:00
Matthew Ahrens
804e050457 Illumos 5175 - implement dmu_read_uio_dbuf() to improve cached read performance
5175 implement dmu_read_uio_dbuf() to improve cached read performance
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  https://www.illumos.org/issues/5175
  https://github.com/illumos/illumos-gate/commit/f8554bb

Porting notes:

This patch doesn't include the changes for the COMSTAR (Common
Multiprotocol SCSI Target) - since it's not available for ZoL.

http://thegreyblog.blogspot.co.at/2010/02/setting-up-solaris-comstar-and.html

Ported by: kernelOfTruth <kerneloftruth@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3392
2015-06-29 14:33:23 -07:00
Matthew Ahrens
c52fca13a0 Illumos 5368 - ARC should cache more metadata
5368 ARC should cache more metadata
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5368
  https://github.com/illumos/illumos-gate/commit/3a5286a

Porting Notes:

The vast majority of this patch was already merged in the context
of the 06358ea changes.  This is just a small hunk which was missed.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-06-25 08:58:17 -07:00
George Wilson
669dedb33f Illumos 5163 - arc should reap range_seg_cache
5163 arc should reap range_seg_cache
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5163
  https://github.com/illumos/illumos-gate/commit/83803b5

Porting Notes:

Added umem_cache_reap_now() wrapped to suppress unused variable
warning for user space build in arc_kmem_reap_now().

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-06-25 08:58:16 -07:00
Brian Behlendorf
aa9af22cdf Update all default taskq settings
Over the years the default values for the taskqs used on Linux have
differed slightly from illumos.  In the vast majority of cases this
was done to avoid creating an obnoxious number of idle threads which
would pollute the process listing.

With the addition of support for dynamic taskqs all multi-threaded
queues should be created as dynamic taskqs.  This allows us to get
the best of both worlds.

* The illumos default values for the I/O pipeline can be restored.
These values are known to work well for most workloads.  The only
exception is the zio write interrupt taskq which is changed to
ZTI_P(12, 8).  At least under Linux more threads has been shown
to improve performance, see commit 7e55f4e.

* Reduces the number of idle threads on the system when it's not
under heavy load.  The maximum number of threads will only be
created when they are required.

* Remove the vdev_file_taskq and rely on the system_taskq instead
which is now dynamic and may have up to 64-threads.  Again this
brings us back inline with upstream.

* Tasks dispatched with taskq_dispatch_ent() are allowed to use
dynamic taskqs.  The Linux taskq implementation supports this.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #3507
2015-06-25 08:58:16 -07:00
Andriy Gapon
ef56b0780c Account for ashift when gathering buffers to be written to l2arc device
If we don't account for that, then we might end up overwriting disk
area of buffers that have not been evicted yet, because l2arc_evict
operates in terms of disk addresses.

The discrepancy between the write size calculation and the actual
increment to l2ad_hand was introduced in commit 3a17a7a9.

The change that introduced l2ad_hand alignment was almost correct
as the write size was accumulated as a sum of rounded buffer sizes.
See commit illumos/illumos-gate@e14bb32.

Also, we now consistently use asize / a_sz for the allocated size and
psize / p_sz for the physical size.  The latter accounts for a
possible size reduction because of the compression, whereas the
former accounts for a possible subsequent size expansion because of
the alignment requirements.

The code still assumes that either underlying storage subsystems or
hardware is able to do read-modify-write when an L2ARC buffer size is
not a multiple of a disk's block size.  This is true for 4KB sector disks
that provide 512B sector emulation, but may not be true in general.
In other words, we currently do not have any code to make sure that
an L2ARC buffer, whether compressed or not, which is used for physical
I/O has a suitable size.

Note that currently the cache device utilization is calculated based
on the physical size, not the allocated size.  The same applies to
l2_asize kstat. That is wrong, but this commit does not fix that.
The accounting problem was introduced partially in commit 3a17a7a9
and partially in 3038a2b (accounting became consistent but in favour
of the wrong size).

Porting Notes:

Reworked to be C90 compatible and the 'write_psize' variable was
removed because it is now unused.

References:
  https://reviews.csiden.org/r/229/
  https://reviews.freebsd.org/D2764

Ported-by: kernelOfTruth <kerneloftruth@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3400
Closes #3433
Closes #3451
2015-06-25 08:57:16 -07:00
Prakash Surya
d962d5dad9 Illumos 5701 - zpool list reports incorrect "alloc" value for cache devices
5701 zpool list reports incorrect "alloc" value for cache devices
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5701
  https://github.com/illumos/illumos-gate/commit/a52fc31

Porting Notes:

arc_space_return(HDR_L2ONLY_SIZE, ARC_SPACE_L2HDRS);
correctly placed at arc_hdr_l2hdr_destroy(arc_buf_hdr_t *hdr).

Ported by: kernelOfTruth kerneloftruth@gmail.com
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-06-25 08:51:44 -07:00
Richard Yao
72540ea314 zfsdev_getminor() should check for invalid file handles
Unit testing at ClusterHQ found that passing an invalid file handle to
zfs_ioc_hold results in a NULL pointer dereference on a system without
assertions:

IP: [<ffffffffa0218aa0>] zfsdev_getminor+0x10/0x20 [zfs]
Call Trace:
[<ffffffffa021b4b0>] zfs_onexit_fd_hold+0x20/0x40 [zfs]
[<ffffffffa0214043>] zfs_ioc_hold+0x93/0xd0 [zfs]
[<ffffffffa0215890>] zfsdev_ioctl+0x200/0x500 [zfs]

An assertion would have caught this had they been enabled, but this is
something that the kernel module should handle without failing.  We
resolve this by searching the linked list to ensure that the file
handle's private_data points to a valid zfsdev_state_t.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Andriy Gapon <avg@FreeBSD.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3506
2015-06-22 17:02:13 -07:00
Etienne Dechamps
99b14de421 Make metaslab_aliquot a module parameter.
This seems generally useful. metaslab_aliquot is the ZFS allocation
granularity, which is roughly equivalent to what is called the stripe
size in traditional RAID arrays. It seems relevant to performance
tuning.

Signed-off-by: Etienne Dechamps <etienne@edechamps.fr>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-06-22 14:19:38 -07:00
Etienne Dechamps
e8fe6684a5 Document metaslab_aliquot.
Signed-off-by: Etienne Dechamps <etienne@edechamps.fr>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-06-22 14:19:31 -07:00
Etienne Dechamps
bb3250d07e Allocate disk space fairly in the presence of vdevs of unequal size.
The metaslab allocator device selection algorithm contains a bias
mechanism whose goal is to achieve roughly equal disk space usage across
all top-level vdevs.

It seems that the initial rationale for this code was to allow newly
added (empty) vdevs to "come up to speed" faster in an attempt to make
the pool quickly converge to a steady state where all vdevs are equally
utilized.

While the code seems to work reasonably well for this use case, there
is another scenario in which this algorithm fails miserably: the case
where top-level vdevs don't have the same sizes (capacities). ZFS
allows this, and it is a good feature to have, so that users who simply
want to build a pool with the disks they happen to have lying around can
do so even if the disks have heteregenous sizes.

Here's a script that simulates a pool with two vdevs, with one 4X larger
than the other:

    dd if=/dev/zero of=/tmp/d1 bs=1 count=1 seek=134217728
    dd if=/dev/zero of=/tmp/d2 bs=1 count=1 seek=536870912
    zpool create testspace /tmp/d1 /tmp/d2
    dd if=/dev/zero of=/testspace/foobar bs=1M count=256
    zpool iostat -v testspace

Before this commit, the script would output the following:

                   capacity
    pool        alloc   free
    ----------  -----  -----
    testspace    252M   375M
      /tmp/d1    104M  18.5M
      /tmp/d2    148M   356M
    ----------  -----  -----

This demonstrates that the current code handles this situation very
poorly: d1 shows 85% usage despite the pool itself being only 40% full.
d1 is quite saturated at this point, and is slowing down the entire pool
due to saturation, fragmentation and the like.

In contrast, here's the result with the code in this commit:

                   capacity
    pool        alloc   free
    ----------  -----  -----
    testspace    252M   375M
      /tmp/d1   56.7M  66.3M
      /tmp/d2    195M   309M
   ----------  -----  ------

This looks much better. d1 is 46% used, which is close to the overall
pool utilization (40%). The code still doesn't result in perfectly
balanced allocation, probably because of the way mg_bias is applied
which does not guarantee perfect accuracy, but this is still much better
than before.

Signed-off-by: Etienne Dechamps <etienne@edechamps.fr>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3389
2015-06-22 14:18:29 -07:00
Brian Behlendorf
218b4e0a76 Add zfs_sb_prune_aliases() function
For kernels which do not implement a per-suberblock shrinker,
those older than Linux 3.1, the shrink_dcache_parent() function
was used to attempt to reclaim dentries.  This was found not be
entirely reliable and could lead to performance issues on older
kernels running meta-data heavy workloads.

To address this issue a zfs_sb_prune_aliases() function has been
added to implement this functionality.  It relies on traversing
the list of znodes for a filesystem and adding them to a private
list with a reference held.  The private list can then be safely
walked outside the z_znodes_lock to prune dentires and drop the
last reference so the inode can be freed.

This provides the same synchronous behavior as the per-filesystem
shrinker and has the advantage of depending on only long standing
interfaces.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #3501
2015-06-22 10:22:49 -07:00
Brian Behlendorf
4c6a700910 Increase the number of iput taskq threads
The number of threads in the iput taskq has been increased to speed
up the number of iputs which can be handled.  This has been observed
to improve the  meta data reclaim regardless of zfs_sb_prune()
implementation in use.

The taskq has also been renamed z_iput to for consistency with the
rest of the I/O pipeline taskqs which are all named z_*.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
2015-06-22 10:22:10 -07:00
Matus Kral
57ae840077 Linux 4.1 compat: use read_iter() / write_iter()
Linux 3.15 commit torvalds/linux@293bc98 introduced two new methods.
The ->read_iter() and ->write_iter() methods were designed to replace
the ->aio_read() and ->aio_write() interfaces.  Both interfaces were
preserved for several kernel releases in order to migrate all existing
consumers to the new interfaces.  But as of Linux 4.1 the legacy
interface has been retired and the ZFS code must be updated to use
the new interfaces.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3352
2015-06-18 12:06:59 -07:00
Tim Chase
90947b2357 3.12 compat, NUMA-aware per-superblock shrinker
Kernels >= 3.12 have a NUMA-aware superblock shrinker which is used in
ZoL by zfs_sb_prune().  This patch calls the shrinker for each on-line
NUMA node in order that memory be freed for each one.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3495
2015-06-17 10:43:13 -07:00
Brian Behlendorf
8e70975f90 Wait interruptibly in prefetch thread
The Linux kernel watchdog will automatically dump a backtrace for
any process while sleeps for over 120s in an uninterruptible state.

The solution is for the prefetch thread to sleep in an interruptible
state.  The way the existing code was written this is safe because
when woken it will always reevaluate its conditional.  As a general
rule it is preferable to sleep in an interruptible when possible.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3450
Closes #3402
2015-06-16 16:18:11 -07:00
Brian Behlendorf
b64ccd6c52 Rename cv_wait_interruptible() to cv_wait_sig()
This is the counterpart to zfsonlinux/spl@2345368 which replaces the
cv_wait_interruptible() function with cv_wait_sig().  There is no
functional change to patch merely brings the function names in to
sync to maximize portability.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3450
Issue #3402
2015-06-11 10:50:47 -07:00
Tim Chase
121b3cae74 Increase arc_c_min to allow safe operation of arc_adapt()
ZoL had lowered the minimum ARC size to 4MiB to better accommodate tiny
systems such as the raspberry pi, however, as of addition of large block
support, the arc_adapt() function depends on arc_c being >= 32MiB (2 *
SPA_MAXBLOCKSIZE).

This patch raises the minimum ARC size to 32MiB and adds a VERIFY test
to arc_adapt() for future-proofing.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-06-11 10:27:25 -07:00
Brian Behlendorf
f604673836 Make arc_prune() asynchronous
As described in the comment above arc_adapt_thread() it is critical
that the arc_adapt_thread() function never sleep while holding a hash
lock.  This behavior was possible in the Linux implementation because
the arc_prune() logic was implemented to be synchronous.  Under
illumos the analogous dnlc_reduce_cache() function is asynchronous.

To address this the arc_do_user_prune() function is has been reworked
in to two new functions as follows:

* arc_prune_async() is an asynchronous implementation which dispatches
the prune callback to be run by the system taskq.  This makes it
suitable to use in the context of the arc_adapt_thread().

* arc_prune() is a synchronous implementation which depends on the
arc_prune_async() implementation but blocks until the outstanding
callbacks complete.  This is used in arc_kmem_reap_now() where it
is safe, and expected, that memory will be freed.

This patch additionally adds the zfs_arc_meta_strategy module option
while allows the meta reclaim strategy to be configured.  It defaults
to a balanced strategy which has been proved to work well under Linux
but the illumos meta-only strategy can be enabled.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-06-11 10:27:25 -07:00
Brian Behlendorf
c5528b9ba6 Use taskq_wait_outstanding() function
Replace taskq_wait() with taskq_wait_oustanding().  This way callers
will only block until previously submitted tasks have been completed.
This was the previous behavior of task_wait() prior to the introduction
of taskq_wait_outstanding() so this isn't really a functionalty change
for these callers.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-06-11 10:27:25 -07:00
Prakash Surya
ca0bf58d65 Illumos 5497 - lock contention on arcs_mtx
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Dan McDonald <danmcd@omniti.com>

Porting notes and other significant code changes:

The illumos 5368 patch (ARC should cache more metadata), which
was never picked up by ZoL, is mostly reverted by this patch.

Since ZoL relies on the kernel asynchronously calling the shrinker to
actually reap memory, the shrinker wakes up arc_reclaim_waiters_cv every
time it runs.

The arc_adapt_thread() function no longer calls arc_do_user_evicts()
since the newly-added arc_user_evicts_thread() calls it periodically.

Notable conflicting ZoL commits which conflicted with this patch or
whose effects are either duplicated or un-done by this patch:

    302f753 - Integrate ARC more tightly with Linux
    39e055c - Adjust arc_p based on "bytes" in arc_shrink
    f521ce1 - Allow "arc_p" to drop to zero or grow to "arc_c"
    77765b5 - Remove "arc_meta_used" from arc_adjust calculation
    94520ca - Prune metadata from ghost lists in arc_adjust_meta

Trace support for multilist_insert() and multilist_remove() has been
added and produces the following output:

    fio-12498 [077] .... 112936.448324: zfs_multilist__insert: ml { offset 240 numsublists 80 sublistidx 63 }
    fio-12498 [077] .... 112936.448347: zfs_multilist__remove: ml { offset 240 numsublists 80 sublistidx 29 }

The following arcstats have been removed:

    recycle_miss - Used by arcstat.py and arc_summary.py, both of which
    have been updated appropriately.

    l2_writes_hdr_miss

The following arcstats have been added:

    evict_not_enough - Number of times arc_evict_state() was unable to
    evict enough buffers to reach its target amount.

    evict_l2_skip - Number of times arc_evict_hdr() skipped eviction
    because it was being written to the l2arc.

    l2_writes_lock_retry - Replaces l2_writes_hdr_miss.  Number of times
    l2arc_write_done() failed to acquire hash_lock (and re-tries).

    arc_meta_min - Shows the value of the zfs_arc_meta_min module
    parameter (see below).

The "index" column of the "dbuf" kstat has been removed since it doesn't
have a direct analog in the new multilist scheme.  Additional multilist-
related stats could be added in the future but would likely require
extensions to the mulilist API.

The following module parameters have been added:

    zfs_arc_evict_batch_limit - Number of ARC headers to free per sub-list
    before moving on to the next sub-list.

    zfs_arc_meta_min - Enforce a floor on the amount of metadata in
    the ARC.

    zfs_arc_num_sublists_per_state - Number of multilist sub-lists per
    ARC state.

    zfs_arc_overflow_shift - Controls amount by which the ARC must exceed
    the target size to be considered "overflowing".

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov
2015-06-11 10:27:25 -07:00