2008-11-20 20:01:55 +00:00
/*
 * CDDL HEADER START
 *
 * The contents of this file are subject to the terms of the
 * Common Development and Distribution License (the "License").
 * You may not use this file except in compliance with the License.
 *
 * You can obtain a copy of the license at usr/src/OPENSOLARIS.LICENSE
 * or http://www.opensolaris.org/os/licensing.
 * See the License for the specific language governing permissions
 * and limitations under the License.
 *
 * When distributing Covered Code, include this CDDL HEADER in each
 * file and include the License file at usr/src/OPENSOLARIS.LICENSE.
 * If applicable, add the following below this CDDL HEADER, with the
 * fields enclosed by brackets "[]" replaced with your own identifying
 * information: Portions Copyright [yyyy] [name of copyright owner]
 *
 * CDDL HEADER END
 */
/*
 * Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
 * Copyright (c) 2011, 2015 by Delphix. All rights reserved.
 * Copyright 2015 Nexenta Systems, Inc. All rights reserved.
 * Copyright (c) 2014 Spectra Logic Corporation, All rights reserved.
 */

#include <sys/zfs_context.h>
#include <sys/spa_impl.h>
#include <sys/zio.h>
#include <sys/zio_checksum.h>
#include <sys/zio_compress.h>
#include <sys/dmu.h>
#include <sys/dmu_tx.h>
#include <sys/zap.h>
#include <sys/zil.h>
#include <sys/vdev_impl.h>
#include <sys/vdev_file.h>
SIMD implementation of vdev_raidz generate and reconstruct routines
This is a new implementation of RAIDZ1/2/3 routines using x86_64
scalar, SSE, and AVX2 instruction sets. Included are 3 parity
generation routines (P, PQ, and PQR) and 7 reconstruction routines,
for all RAIDZ levels. On module load, a quick benchmark of supported
routines will select the fastest for each operation and they will
be used at runtime. Original implementation is still present and
can be selected via module parameter.
Patch contains:
- specialized gen/rec routines for all RAIDZ levels,
- new scalar raidz implementation (unrolled),
- two x86_64 SIMD implementations (SSE and AVX2 instruction sets),
- fastest routines selected on module load (benchmark).
- cmd/raidz_test - verify and benchmark all implementations
- added raidz_test to the ZFS Test Suite
New zfs module parameters:
- zfs_vdev_raidz_impl (str): selects the implementation to use. On
module load, the parameter will only accept the first 3 options, and
the other implementations can be set once module is finished
loading. Possible values for this option are:
"fastest" - use the fastest math available
"original" - use the original raidz code
"scalar" - new scalar impl
"sse" - new SSE impl if available
"avx2" - new AVX2 impl if available
See contents of `/sys/module/zfs/parameters/zfs_vdev_raidz_impl` to
get the list of supported values. If an implementation is not supported
on the system, it will not be shown. Currently selected option is
enclosed in `[]`.
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4328
2016-04-25 08:04:31 +00:00
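The commit message above says the currently selected implementation is enclosed in `[]` in the parameter file. A minimal user-space sketch of picking that entry out of the listing follows. It assumes the values are separated by spaces, for example "fastest [original] scalar sse avx2", which the message does not state; raidz_selected_impl() is a hypothetical helper and not part of ZFS.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the bracketed (currently selected) entry of a parameter listing
 * into out. The space-separated layout of the listing is an assumption.
 * Returns 0 on success, or -1 if no non-empty "[name]" is present or
 * out is too small to hold the name plus its terminator.
 */
int
raidz_selected_impl(const char *listing, char *out, size_t outlen)
{
	const char *lb, *rb;
	size_t len;

	assert(listing != NULL && out != NULL);

	/* Locate the opening bracket, then the first closing one after it. */
	lb = strchr(listing, '[');
	if (lb == NULL)
		return (-1);
	rb = strchr(lb + 1, ']');
	if (rb == NULL)
		return (-1);

	len = (size_t)(rb - (lb + 1));
	if (len == 0 || len >= outlen)
		return (-1);

	memcpy(out, lb + 1, len);
	out[len] = '\0';
	return (0);
}
```

A caller would read the contents of the parameter file into a buffer and pass it as listing; a return of -1 means no selection could be identified.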
#include <sys/vdev_raidz.h>
#include <sys/metaslab.h>
#include <sys/uberblock_impl.h>
#include <sys/txg.h>
#include <sys/avl.h>
#include <sys/unique.h>
#include <sys/dsl_pool.h>
#include <sys/dsl_dir.h>
#include <sys/dsl_prop.h>
#include <sys/fm/util.h>
#include <sys/dsl_scan.h>
#include <sys/fs/zfs.h>
#include <sys/metaslab_impl.h>
#include <sys/arc.h>
#include <sys/ddt.h>
Add visibility into arc_read
This change is an attempt to add visibility into the arc_read calls
occurring on a system, in real time. To do this, a list was added to the
in memory SPA data structure for a pool, with each element on the list
corresponding to a call to arc_read. These entries are then exported
through the kstat interface, which can then be interpreted in userspace.
For each arc_read call, the following information is exported:
* A unique identifier (uint64_t)
* The time the entry was added to the list (hrtime_t)
(*not* wall clock time; relative to the other entries on the list)
* The objset ID (uint64_t)
* The object number (uint64_t)
* The indirection level (uint64_t)
* The block ID (uint64_t)
* The name of the function originating the arc_read call (char[24])
* The arc_flags from the arc_read call (uint32_t)
* The PID of the reading thread (pid_t)
* The command or name of thread originating read (char[16])
From this exported information one can see, in real time, exactly what
is being read, what function is generating the read, and whether or not
the read was found to be already cached.
There is still some work to be done, but this should serve as a good
starting point.
Specifically, dbuf_read's are not accounted for in the currently
exported information. Thus, a follow up patch should probably be added
to export these calls that never call into arc_read (they only hit the
dbuf hash table). In addition, it might be nice to create a utility
similar to "arcstat.py" to digest the exported information and display
it in a more readable format. Or perhaps, log the information and allow
for it to be "replayed" at a later time.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-09-06 23:09:05 +00:00
#include <sys/kstat.h>
#include "zfs_prop.h"
#include "zfeature_common.h"

/*
 * SPA locking
 *
 * There are four basic locks for managing spa_t structures:
 *
 * spa_namespace_lock (global mutex)
 *
 *	This lock must be acquired to do any of the following:
 *
 *		- Look up a spa_t by name
 *		- Add or remove a spa_t from the namespace
 *		- Increase spa_refcount from zero
 *		- Check if spa_refcount is zero
 *		- Rename a spa_t
 *		- Add/remove/attach/detach devices
 *		- Held for the duration of create/destroy/import/export
 *
 *	It does not need to handle recursion. A create or destroy may
 *	reference objects (files or zvols) in other pools, but by
 *	definition they must have an existing reference, and will never need
 *	to look up a spa_t by name.
 *
 * spa_refcount (per-spa refcount_t protected by mutex)
 *
 *	This reference count keeps track of any active users of the spa_t. The
 *	spa_t cannot be destroyed or freed while this is non-zero. Internally,
 *	the refcount is never really 'zero' - opening a pool implicitly keeps
 *	some references in the DMU. Internally we check against spa_minref, but
 *	present the image of a zero/non-zero value to consumers.
 *
 * spa_config_lock[] (per-spa array of rwlocks)
 *
 *	This protects the spa_t from config changes, and must be held in
 *	the following circumstances:
 *
 *		- RW_READER to perform I/O to the spa
 *		- RW_WRITER to change the vdev config
 *
 * The locking order is fairly straightforward:
 *
 *		spa_namespace_lock	->	spa_refcount
 *
 *	The namespace lock must be acquired to increase the refcount from 0
 *	or to check if it is zero.
 *
 *		spa_refcount		->	spa_config_lock[]
 *
 *	There must be at least one valid reference on the spa_t to acquire
 *	the config lock.
 *
 *		spa_namespace_lock	->	spa_config_lock[]
 *
 *	The namespace lock must always be taken before the config lock.
 *
 *
 * The spa_namespace_lock can be acquired directly and is globally visible.
 *
 * The namespace is manipulated using the following functions, all of which
 * require the spa_namespace_lock to be held.
 *
 *	spa_lookup()		Lookup a spa_t by name.
 *
 *	spa_add()		Create a new spa_t in the namespace.
 *
 *	spa_remove()		Remove a spa_t from the namespace. This also
 *				frees up any memory associated with the spa_t.
 *
 *	spa_next()		Returns the next spa_t in the system, or the
 *				first if NULL is passed.
 *
 *	spa_evict_all()		Shutdown and remove all spa_t structures in
 *				the system.
 *
 *	spa_guid_exists()	Determine whether a pool/device guid exists.
 *
 * The spa_refcount is manipulated using the following functions:
 *
 *	spa_open_ref()		Adds a reference to the given spa_t. Must be
 *				called with spa_namespace_lock held if the
 *				refcount is currently zero.
 *
 *	spa_close()		Remove a reference from the spa_t. This will
 *				not free the spa_t or remove it from the
 *				namespace. No locking is required.
 *
 *	spa_refcount_zero()	Returns true if the refcount is currently
 *				zero. Must be called with spa_namespace_lock
 *				held.
 *
 * The spa_config_lock[] is an array of rwlocks, ordered as follows:
 * SCL_CONFIG > SCL_STATE > SCL_ALLOC > SCL_ZIO > SCL_FREE > SCL_VDEV.
 * spa_config_lock[] is manipulated with spa_config_{enter,exit,held}().
 *
 * To read the configuration, it suffices to hold one of these locks as reader.
 * To modify the configuration, you must hold all locks as writer. To modify
 * vdev state without altering the vdev tree's topology (e.g. online/offline),
 * you must hold SCL_STATE and SCL_ZIO as writer.
 *
 * We use these distinct config locks to avoid recursive lock entry.
 * For example, spa_sync() (which holds SCL_CONFIG as reader) induces
 * block allocations (SCL_ALLOC), which may require reading space maps
 * from disk (dmu_read() -> zio_read() -> SCL_ZIO).
 *
 * The spa config locks cannot be normal rwlocks because we need the
 * ability to hand off ownership. For example, SCL_ZIO is acquired
 * by the issuing thread and later released by an interrupt thread.
 * They do, however, obey the usual write-wanted semantics to prevent
 * writer (i.e. system administrator) starvation.
 *
 * The lock acquisition rules are as follows:
 *
 * SCL_CONFIG
 *	Protects changes to the vdev tree topology, such as vdev
 *	add/remove/attach/detach. Protects the dirty config list
 *	(spa_config_dirty_list) and the set of spares and l2arc devices.
 *
 * SCL_STATE
 *	Protects changes to pool state and vdev state, such as vdev
 *	online/offline/fault/degrade/clear. Protects the dirty state list
 *	(spa_state_dirty_list) and global pool state (spa_state).
 *
 * SCL_ALLOC
 *	Protects changes to metaslab groups and classes.
 *	Held as reader by metaslab_alloc() and metaslab_claim().
 *
 * SCL_ZIO
 *	Held by bp-level zios (those which have no io_vd upon entry)
 *	to prevent changes to the vdev tree. The bp-level zio implicitly
 *	protects all of its vdev child zios, which do not hold SCL_ZIO.
 *
 * SCL_FREE
 *	Protects changes to metaslab groups and classes.
 *	Held as reader by metaslab_free(). SCL_FREE is distinct from
 *	SCL_ALLOC, and lower than SCL_ZIO, so that we can safely free
 *	blocks in zio_done() while another i/o that holds either
 *	SCL_ALLOC or SCL_ZIO is waiting for this i/o to complete.
 *
 * SCL_VDEV
 *	Held as reader to prevent changes to the vdev tree during trivial
 *	inquiries such as bp_get_dsize(). SCL_VDEV is distinct from the
 *	other locks, and lower than all of them, to ensure that it's safe
 *	to acquire regardless of caller context.
 *
 * In addition, the following rules apply:
 *
 * (a)	spa_props_lock protects pool properties, spa_config and spa_config_list.
 *	The lock ordering is SCL_CONFIG > spa_props_lock.
 *
 * (b)	I/O operations on leaf vdevs. For any zio operation that takes
 *	an explicit vdev_t argument -- such as zio_ioctl(), zio_read_phys(),
 *	or zio_write_phys() -- the caller must ensure that the config
 *	cannot change in the interim, and that the vdev cannot be reopened.
 *	SCL_STATE as reader suffices for both.
 *
 * The vdev configuration is protected by spa_vdev_enter() / spa_vdev_exit().
 *
 *	spa_vdev_enter()	Acquire the namespace lock and the config lock
 *				for writing.
 *
 *	spa_vdev_exit()		Release the config lock, wait for all I/O
 *				to complete, sync the updated configs to the
 *				cache, and release the namespace lock.
 *
 * vdev state is protected by spa_vdev_state_enter() / spa_vdev_state_exit().
 * Like spa_vdev_enter/exit, these are convenience wrappers -- the actual
 * locking is, as always, based on spa_namespace_lock and spa_config_lock[].
 *
 * spa_rename() is also implemented within this file since it requires
 * manipulation of the namespace.
 */
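The ordering rules in the comment above (spa_namespace_lock before spa_refcount before spa_config_lock[]) can be modelled as a small rank check. This is only an illustrative user-space sketch: the toy_ names are hypothetical, no real locks are taken, and ZFS itself does not implement its locking this way.

```c
#include <assert.h>

/* Ranks follow the documented order: lower ranks are acquired first. */
typedef enum {
	TOY_NAMESPACE = 0,	/* models spa_namespace_lock */
	TOY_REFCOUNT = 1,	/* models spa_refcount */
	TOY_CONFIG = 2,		/* models spa_config_lock[] */
	TOY_NLOCKS
} toy_lock_t;

/*
 * held is a bitmask of the locks already owned by the caller.
 * Returns the updated bitmask, or -1 if taking lock now would mean
 * acquiring it after a lock of equal or higher rank.
 */
int
toy_acquire(int held, toy_lock_t lock)
{
	int bit;

	assert(lock < TOY_NLOCKS);
	bit = 1 << lock;

	/* held >= bit holds exactly when an equal or higher rank is owned. */
	if (held >= bit)
		return (-1);
	return (held | bit);
}
```

Taking TOY_NAMESPACE and then TOY_CONFIG is accepted, while asking for TOY_REFCOUNT or TOY_NAMESPACE once TOY_CONFIG is held returns -1.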

static avl_tree_t spa_namespace_avl;
kmutex_t spa_namespace_lock;
static kcondvar_t spa_namespace_cv;
int spa_max_replication_override = SPA_DVAS_PER_BP;

static kmutex_t spa_spare_lock;
static avl_tree_t spa_spare_avl;
static kmutex_t spa_l2cache_lock;
static avl_tree_t spa_l2cache_avl;

kmem_cache_t *spa_buffer_pool;
int spa_mode_global;

Swap DTRACE_PROBE* with Linux tracepoints
This patch leverages Linux tracepoints from within the ZFS on Linux
code base. It also refactors the debug code to bring it back in sync
with Illumos.
The information exported via tracepoints can be used for a variety of
reasons (e.g. debugging, tuning, general exploration/understanding,
etc). It is advantageous to use Linux tracepoints as the mechanism to
export this kind of information (as opposed to something else) for a
number of reasons:
* A number of external tools can make use of our tracepoints
"automatically" (e.g. perf, systemtap)
* Tracepoints are designed to be extremely cheap when disabled
* It's one of the "accepted" ways to export this kind of
information; many other kernel subsystems use tracepoints too.
Unfortunately, though, there are a few caveats as well:
* Linux tracepoints appear to only be available to GPL licensed
modules due to the way certain kernel functions are exported.
Thus, to actually make use of the tracepoints introduced by this
patch, one might have to patch and re-compile the kernel;
exporting the necessary functions to non-GPL modules.
* Prior to upstream kernel version v3.14-rc6-30-g66cc69e, Linux
tracepoints are not available for unsigned kernel modules
(tracepoints will get disabled due to the module's 'F' taint).
Thus, one either has to sign the zfs kernel module prior to
loading it, or use a kernel versioned v3.14-rc6-30-g66cc69e or
newer.
Assuming the above two requirements are satisfied, let's look at an
example of how this patch can be used and what information it exposes
(all commands run as 'root'):
# list all zfs tracepoints available
$ ls /sys/kernel/debug/tracing/events/zfs
enable filter zfs_arc__delete
zfs_arc__evict zfs_arc__hit zfs_arc__miss
zfs_l2arc__evict zfs_l2arc__hit zfs_l2arc__iodone
zfs_l2arc__miss zfs_l2arc__read zfs_l2arc__write
zfs_new_state__mfu zfs_new_state__mru
# enable all zfs tracepoints, clear the tracepoint ring buffer
$ echo 1 > /sys/kernel/debug/tracing/events/zfs/enable
$ echo 0 > /sys/kernel/debug/tracing/trace
# import zpool called 'tank', inspect tracepoint data (each line was
# truncated, they're too long for a commit message otherwise)
$ zpool import tank
$ cat /sys/kernel/debug/tracing/trace | head -n35
# tracer: nop
#
# entries-in-buffer/entries-written: 1219/1219 #P:8
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
lt-zpool-30132 [003] .... 91344.200050: zfs_arc__miss: hdr...
z_rd_int/0-30156 [003] .... 91344.200611: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.201173: zfs_arc__miss: hdr...
z_rd_int/1-30157 [003] .... 91344.201756: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.201795: zfs_arc__miss: hdr...
z_rd_int/2-30158 [003] .... 91344.202099: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.202126: zfs_arc__hit: hdr ...
lt-zpool-30132 [003] .... 91344.202130: zfs_arc__hit: hdr ...
lt-zpool-30132 [003] .... 91344.202134: zfs_arc__hit: hdr ...
lt-zpool-30132 [003] .... 91344.202146: zfs_arc__miss: hdr...
z_rd_int/3-30159 [003] .... 91344.202457: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.202484: zfs_arc__miss: hdr...
z_rd_int/4-30160 [003] .... 91344.202866: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.202891: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.203034: zfs_arc__miss: hdr...
z_rd_iss/1-30149 [001] .... 91344.203749: zfs_new_state__mru...
lt-zpool-30132 [001] .... 91344.203789: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.203878: zfs_arc__miss: hdr...
z_rd_iss/3-30151 [001] .... 91344.204315: zfs_new_state__mru...
lt-zpool-30132 [001] .... 91344.204332: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.204337: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.204352: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.204356: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.204360: zfs_arc__hit: hdr ...
To highlight the kind of detailed information that is being exported
using this infrastructure, I've taken the first tracepoint line from the
output above and reformatted it such that it fits in 80 columns:
lt-zpool-30132 [003] .... 91344.200050: zfs_arc__miss:
hdr {
dva 0x1:0x40082
birth 15491
cksum0 0x163edbff3a
flags 0x640
datacnt 1
type 1
size 2048
spa 3133524293419867460
state_type 0
access 0
mru_hits 0
mru_ghost_hits 0
mfu_hits 0
mfu_ghost_hits 0
l2_hits 0
refcount 1
} bp {
dva0 0x1:0x40082
dva1 0x1:0x3000e5
dva2 0x1:0x5a006e
cksum 0x163edbff3a:0x75af30b3dd6:0x1499263ff5f2b:0x288bd118815e00
lsize 2048
} zb {
objset 0
object 0
level -1
blkid 0
}
For the specific tracepoint shown here, 'zfs_arc__miss', data is
exported detailing the arc_buf_hdr_t (hdr), blkptr_t (bp), and
zbookmark_t (zb) that caused the ARC miss (down to the exact DVA!).
This kind of precise and detailed information can be extremely valuable
when trying to answer certain kinds of questions.
For anybody unfamiliar but looking to build on this, I found the XFS
source code along with the following three web links to be extremely
helpful:
* http://lwn.net/Articles/379903/
* http://lwn.net/Articles/381064/
* http://lwn.net/Articles/383362/
I should also note the more "boring" aspects of this patch:
* The ZFS_LINUX_COMPILE_IFELSE autoconf macro was modified to
support a sixth parameter. This parameter is used to populate the
contents of the new conftest.h file. If no sixth parameter is
provided, conftest.h will be empty.
* The ZFS_LINUX_TRY_COMPILE_HEADER autoconf macro was introduced.
This macro is nearly identical to the ZFS_LINUX_TRY_COMPILE macro,
except it has support for a fifth option that is then passed as
the sixth parameter to ZFS_LINUX_COMPILE_IFELSE.
These autoconf changes were needed to test the availability of the Linux
tracepoint macros. Due to the odd nature of the Linux tracepoint macro
API, a separate ".h" must be created (the path and filename is used
internally by the kernel's define_trace.h file).
* The HAVE_DECLARE_EVENT_CLASS autoconf macro was introduced. This
is to determine if we can safely enable the Linux tracepoint
functionality. We need to selectively disable the tracepoint code
due to the kernel exporting certain functions as GPL only. Without
this check, the build process will fail at link time.
In addition, the SET_ERROR macro was modified into a tracepoint as well.
To do this, the 'sdt.h' file was moved into the 'include/sys' directory
and now contains a userspace portion and a kernel space portion. The
dprintf and zfs_dbgmsg* interfaces are now implemented as tracepoints as
well.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-06-13 17:54:48 +00:00
#ifdef ZFS_DEBUG
/* Everything except dprintf and spa is on by default in debug builds */
int zfs_flags = ~(ZFS_DEBUG_DPRINTF | ZFS_DEBUG_SPA);
#else
int zfs_flags = 0;
#endif
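The debug-build default above starts from an all-ones mask and clears two bits. A minimal sketch of building and testing such a mask follows. The SKETCH_ bit positions are placeholders chosen for illustration, not the values from the ZFS headers, and both functions are hypothetical.

```c
#include <assert.h>

/* Placeholder bit positions; the real values live in the ZFS headers. */
#define	SKETCH_DEBUG_DPRINTF		(1 << 0)
#define	SKETCH_DEBUG_DBUF_VERIFY	(1 << 1)
#define	SKETCH_DEBUG_SPA		(1 << 5)

/* Debug builds enable every bit except DPRINTF and SPA; others enable none. */
int
sketch_default_flags(int debug_build)
{
	if (debug_build)
		return (~(SKETCH_DEBUG_DPRINTF | SKETCH_DEBUG_SPA));
	return (0);
}

/* Returns 1 if every bit of mask is set in flags, 0 otherwise. */
int
sketch_flag_enabled(int flags, int mask)
{
	assert(mask != 0);
	return ((flags & mask) == mask);
}
```

With these placeholders, a debug build reports SKETCH_DEBUG_DBUF_VERIFY as enabled and both SKETCH_DEBUG_DPRINTF and SKETCH_DEBUG_SPA as disabled.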

/*
 * zfs_recover can be set to nonzero to attempt to recover from
 * otherwise-fatal errors, typically caused by on-disk corruption. When
 * set, calls to zfs_panic_recover() will turn into warning messages.
 * This should only be used as a last resort, as it typically results
 * in leaked space, or worse.
 */
int zfs_recover = B_FALSE;
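As a rough illustration of the behavior described in the comment above, the sketch below downgrades an otherwise-fatal error to a warning when a recover flag is set. sketch_panic_recover() is hypothetical: it returns an action code instead of panicking so that it can run in user space, and it is not the zfs_panic_recover() implementation.

```c
#include <assert.h>
#include <stdio.h>

typedef enum {
	SKETCH_CONTINUE,	/* error was reported as a warning */
	SKETCH_PANIC		/* caller should treat the error as fatal */
} sketch_action_t;

/*
 * recover plays the role of the zfs_recover tunable. When it is nonzero
 * the message is logged as a warning and execution may continue;
 * otherwise the error is reported as fatal.
 */
sketch_action_t
sketch_panic_recover(int recover, const char *msg)
{
	assert(msg != NULL);

	if (recover) {
		fprintf(stderr, "WARNING: %s\n", msg);
		return (SKETCH_CONTINUE);
	}
	fprintf(stderr, "PANIC: %s\n", msg);
	return (SKETCH_PANIC);
}
```

Returning a code keeps the decision testable; a kernel implementation would stop the system in the fatal case rather than return.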

/*
 * If destroy encounters an EIO while reading metadata (e.g. indirect
 * blocks), space referenced by the missing metadata can not be freed.
 * Normally this causes the background destroy to become "stalled", as
 * it is unable to make forward progress. While in this stalled state,
 * all remaining space to free from the error-encountering filesystem is
 * "temporarily leaked". Set this flag to cause it to ignore the EIO,
 * permanently leak the space from indirect blocks that can not be read,
 * and continue to free everything else that it can.
 *
 * The default, "stalling" behavior is useful if the storage partially
 * fails (i.e. some but not all i/os fail), and then later recovers. In
 * this case, we will be able to continue pool operations while it is
 * partially failed, and when it recovers, we can continue to free the
 * space, with no leaks. However, note that this case is actually
 * fairly rare.
 *
 * Typically pools either (a) fail completely (but perhaps temporarily,
 * e.g. a top-level vdev going offline), or (b) have localized,
 * permanent errors (e.g. disk returns the wrong data due to bit flip or
 * firmware bug). In case (a), this setting does not matter because the
 * pool will be suspended and the sync thread will not be able to make
 * forward progress regardless. In case (b), because the error is
 * permanent, the best we can do is leak the minimum amount of space,
 * which is what setting this flag will do. Therefore, it is reasonable
 * for this flag to normally be set, but we chose the more conservative
 * approach of not setting it, so that there is no possibility of
 * leaking space in the "partial temporary" failure case.
 */
int zfs_free_leak_on_eio = B_FALSE;

/*
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithm (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping on a condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
 * Expiration time in milliseconds. This value has two meanings. First it is
 * used to determine when the spa_deadman() logic should fire. By default the
 * spa_deadman() will fire if spa_sync() has not completed in 1000 seconds.
 * Secondly, the value determines if an I/O is considered "hung". Any I/O that
 * has not completed in zfs_deadman_synctime_ms is considered "hung" resulting
 * in a system panic.
 */
unsigned long zfs_deadman_synctime_ms = 1000000ULL;

/*
 * By default the deadman is enabled.
 */
int zfs_deadman_enabled = 1;

/*
 * The worst case is single-sector max-parity RAID-Z blocks, in which
 * case the space requirement is exactly (VDEV_RAIDZ_MAXPARITY + 1)
 * times the size; so just assume that. Add to this the fact that
 * we can have up to 3 DVAs per bp, and one more factor of 2 because
 * the block may be dittoed with up to 3 DVAs by ddt_sync(). All together,
 * the worst case is:
 *     (VDEV_RAIDZ_MAXPARITY + 1) * SPA_DVAS_PER_BP * 2 == 24
 */
int spa_asize_inflation = 24;

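The worst-case arithmetic in the comment above can be checked with a small standalone sketch. The names prefixed with ex_ are hypothetical and exist only for this illustration; the constants are assumed to mirror VDEV_RAIDZ_MAXPARITY (3) and SPA_DVAS_PER_BP (3), which is what the stated result of 24 implies.

```c
#include <stdint.h>

/* Illustrative stand-ins; these are assumptions, not the real headers. */
enum {
	EX_RAIDZ_MAXPARITY = 3,	/* stands in for VDEV_RAIDZ_MAXPARITY */
	EX_DVAS_PER_BP = 3,	/* stands in for SPA_DVAS_PER_BP */
	EX_DITTO_FACTOR = 2	/* extra factor for ddt_sync() dittoing */
};

/* Worst-case multiplier applied to a logical size. */
static int
ex_asize_inflation(void)
{
	return ((EX_RAIDZ_MAXPARITY + 1) * EX_DVAS_PER_BP * EX_DITTO_FACTOR);
}

/* Worst-case allocated size for lsize logical bytes. */
static uint64_t
ex_worst_case_asize(uint64_t lsize)
{
	return (lsize * (uint64_t)ex_asize_inflation());
}
```

Under these assumptions, ex_worst_case_asize(4096) reserves 24 times the logical size of a 4 KiB write.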
/*
 * Normally, we don't allow the last 3.125% (1/(2^spa_slop_shift)) of space in
 * the pool to be consumed. This ensures that we don't run the pool
 * completely out of space, due to unaccounted changes (e.g. to the MOS).
 * It also limits the worst-case time to allocate space. If we have
 * less than this amount of free space, most ZPL operations (e.g. write,
 * create) will return ENOSPC.
 *
 * Certain operations (e.g. file removal, most administrative actions) can
 * use half the slop space. They will only return ENOSPC if less than half
 * the slop space is free. Typically, once the pool has less than the slop
 * space free, the user will use these operations to free up space in the pool.
 * These are the operations that call dsl_pool_adjustedsize() with the netfree
 * argument set to TRUE.
 *
 * A very restricted set of operations are always permitted, regardless of
 * the amount of free space. These are the operations that call
 * dsl_sync_task(ZFS_SPACE_CHECK_NONE), e.g. "zfs destroy". If these
 * operations result in a net increase in the amount of space used,
 * it is possible to run the pool completely out of space, causing it to
 * be permanently read-only.
 *
 * See also the comments in zfs_space_check_t.
 */
int spa_slop_shift = 5;

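As a rough illustration of the slop-space rule described above, the following sketch computes the reserve from a pool size and decides whether an operation would see ENOSPC. It is not the logic of dsl_pool_adjustedsize(); the ex_ helpers and their parameters are hypothetical, and only the shift of 5 and the "half the slop for netfree operations" rule come from the comment.

```c
#include <stdint.h>

/* Assumed to mirror the default spa_slop_shift shown above. */
static const int ex_slop_shift = 5;

/* Slop space is pool_size / 2^ex_slop_shift (1/32 of the pool by default). */
static uint64_t
ex_slop_space(uint64_t pool_size)
{
	return (pool_size >> ex_slop_shift);
}

/*
 * Returns 1 if an operation would fail with ENOSPC under the rule above.
 * Operations that free space on net (netfree != 0) may use half the slop.
 */
static int
ex_would_enospc(uint64_t pool_size, uint64_t free_space, int netfree)
{
	uint64_t slop = ex_slop_space(pool_size);
	uint64_t reserve = netfree ? slop / 2 : slop;

	return (free_space < reserve);
}
```

For a 1 TiB pool this reserves 32 GiB; with 16 GiB free, an ordinary write would be refused while a netfree operation such as a file removal would still be allowed.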
/*
 * ==========================================================================
 * SPA config locking
 * ==========================================================================
 */
static void
spa_config_lock_init(spa_t *spa)
{
	int i;

	for (i = 0; i < SCL_LOCKS; i++) {
		spa_config_lock_t *scl = &spa->spa_config_lock[i];
		mutex_init(&scl->scl_lock, NULL, MUTEX_DEFAULT, NULL);
		cv_init(&scl->scl_cv, NULL, CV_DEFAULT, NULL);
		refcount_create_untracked(&scl->scl_count);
		scl->scl_writer = NULL;
		scl->scl_write_wanted = 0;
	}
}

static void
spa_config_lock_destroy(spa_t *spa)
{
	int i;

	for (i = 0; i < SCL_LOCKS; i++) {
		spa_config_lock_t *scl = &spa->spa_config_lock[i];
		mutex_destroy(&scl->scl_lock);
		cv_destroy(&scl->scl_cv);
		refcount_destroy(&scl->scl_count);
		ASSERT(scl->scl_writer == NULL);
		ASSERT(scl->scl_write_wanted == 0);
	}
}

int
spa_config_tryenter(spa_t *spa, int locks, void *tag, krw_t rw)
{
	int i;

	for (i = 0; i < SCL_LOCKS; i++) {
		spa_config_lock_t *scl = &spa->spa_config_lock[i];
		if (!(locks & (1 << i)))
			continue;
		mutex_enter(&scl->scl_lock);
		if (rw == RW_READER) {
			if (scl->scl_writer || scl->scl_write_wanted) {
				mutex_exit(&scl->scl_lock);
				spa_config_exit(spa, locks & ((1 << i) - 1),
				    tag);
				return (0);
			}
		} else {
			ASSERT(scl->scl_writer != curthread);
			if (!refcount_is_zero(&scl->scl_count)) {
				mutex_exit(&scl->scl_lock);
				spa_config_exit(spa, locks & ((1 << i) - 1),
				    tag);
				return (0);
			}
			scl->scl_writer = curthread;
		}
		(void) refcount_add(&scl->scl_count, tag);
		mutex_exit(&scl->scl_lock);
	}
	return (1);
}

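The expression locks & ((1 << i) - 1) in spa_config_tryenter() selects only the requested locks below bit i, that is, the ones already acquired before the attempt failed. The following user-space sketch models that rollback with plain bitmasks instead of real mutexes. EX_NLOCKS, busy_mask, and ex_tryenter() are hypothetical names, and the lock count of 7 is an assumption made for the example.

```c
#define	EX_NLOCKS	7	/* hypothetical lock count for this sketch */

/*
 * Try to take every lock named in 'locks'. A set bit in 'busy_mask' marks
 * a lock that cannot be taken. Returns the mask of locks held on return:
 * 'locks' on success, 0 after a rollback.
 */
static int
ex_tryenter(int busy_mask, int locks)
{
	int held = 0;
	int i;

	for (i = 0; i < EX_NLOCKS; i++) {
		if (!(locks & (1 << i)))
			continue;
		if (busy_mask & (1 << i)) {
			/* Drop only the requested locks below bit i. */
			held &= ~(locks & ((1 << i) - 1));
			return (held);
		}
		held |= (1 << i);
	}
	return (held);
}
```

With locks 0x0D (bits 0, 2, 3) and bit 3 busy, the mask computed at the failure point is 0x05, so both previously taken locks are released and nothing stays held.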
void
spa_config_enter(spa_t *spa, int locks, void *tag, krw_t rw)
{
	int wlocks_held = 0;
	int i;

	ASSERT3U(SCL_LOCKS, <, sizeof (wlocks_held) * NBBY);

	for (i = 0; i < SCL_LOCKS; i++) {
		spa_config_lock_t *scl = &spa->spa_config_lock[i];
		if (scl->scl_writer == curthread)
			wlocks_held |= (1 << i);
		if (!(locks & (1 << i)))
			continue;
		mutex_enter(&scl->scl_lock);
		if (rw == RW_READER) {
			while (scl->scl_writer || scl->scl_write_wanted) {
				cv_wait(&scl->scl_cv, &scl->scl_lock);
			}
		} else {
			ASSERT(scl->scl_writer != curthread);
			while (!refcount_is_zero(&scl->scl_count)) {
				scl->scl_write_wanted++;
				cv_wait(&scl->scl_cv, &scl->scl_lock);
				scl->scl_write_wanted--;
			}
			scl->scl_writer = curthread;
		}
		(void) refcount_add(&scl->scl_count, tag);
		mutex_exit(&scl->scl_lock);
	}
	ASSERT(wlocks_held <= locks);
}

void
spa_config_exit(spa_t *spa, int locks, void *tag)
{
	int i;

	for (i = SCL_LOCKS - 1; i >= 0; i--) {
		spa_config_lock_t *scl = &spa->spa_config_lock[i];
		if (!(locks & (1 << i)))
			continue;
		mutex_enter(&scl->scl_lock);
		ASSERT(!refcount_is_zero(&scl->scl_count));
		if (refcount_remove(&scl->scl_count, tag) == 0) {
			ASSERT(scl->scl_writer == NULL ||
			    scl->scl_writer == curthread);
			scl->scl_writer = NULL;	/* OK in either case */
			cv_broadcast(&scl->scl_cv);
		}
		mutex_exit(&scl->scl_lock);
	}
}

int
spa_config_held(spa_t *spa, int locks, krw_t rw)
{
	int i, locks_held = 0;

	for (i = 0; i < SCL_LOCKS; i++) {
		spa_config_lock_t *scl = &spa->spa_config_lock[i];
		if (!(locks & (1 << i)))
			continue;
		if ((rw == RW_READER && !refcount_is_zero(&scl->scl_count)) ||
		    (rw == RW_WRITER && scl->scl_writer == curthread))
			locks_held |= 1 << i;
	}

	return (locks_held);
}

/*
 * ==========================================================================
 * SPA namespace functions
 * ==========================================================================
 */

/*
 * Lookup the named spa_t in the AVL tree. The spa_namespace_lock must be held.
 * Returns NULL if no matching spa_t is found.
 */
spa_t *
spa_lookup(const char *name)
{
	static spa_t search;	/* spa_t is large; don't allocate on stack */
	spa_t *spa;
	avl_index_t where;
	char *cp;

	ASSERT(MUTEX_HELD(&spa_namespace_lock));

	(void) strlcpy(search.spa_name, name, sizeof (search.spa_name));

	/*
	 * If it's a full dataset name, figure out the pool name and
	 * just use that.
	 */
	cp = strpbrk(search.spa_name, "/@#");
	if (cp != NULL)
		*cp = '\0';

	spa = avl_find(&spa_namespace_avl, &search, &where);

	return (spa);
}

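The pool-name extraction in spa_lookup() can be tried outside the kernel with a short sketch. ex_pool_name() is a hypothetical helper written for this illustration; it uses strncpy() plus explicit termination in place of strlcpy(), which is not available in every C library, and the same strpbrk() delimiter set ("/@#") as the function above.

```c
#include <string.h>

/*
 * Copy 'name' into 'buf' and cut it at the first '/', '@' or '#', leaving
 * only the pool component. Returns 'buf' for convenience.
 */
static char *
ex_pool_name(const char *name, char *buf, size_t buflen)
{
	char *cp;

	if (buflen == 0)
		return (buf);

	/* Bounded copy; always NUL-terminate, truncating if needed. */
	(void) strncpy(buf, name, buflen - 1);
	buf[buflen - 1] = '\0';

	/* A dataset, snapshot, or bookmark name starts with the pool name. */
	cp = strpbrk(buf, "/@#");
	if (cp != NULL)
		*cp = '\0';

	return (buf);
}
```

Given "tank/home@snap", "tank@snap", or "tank#mark", the helper leaves "tank" in the buffer; a bare pool name is returned unchanged.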
/*
 * Fires when spa_sync has not completed within zfs_deadman_synctime_ms.
 * If the zfs_deadman_enabled flag is set then it inspects all vdev queues
 * looking for potentially hung I/Os.
 */
void
spa_deadman(void *arg)
{
	spa_t *spa = arg;

	zfs_dbgmsg("slow spa_sync: started %llu seconds ago, calls %llu",
	    (gethrtime() - spa->spa_sync_starttime) / NANOSEC,
	    ++spa->spa_deadman_calls);
	if (zfs_deadman_enabled)
		vdev_deadman(spa->spa_root_vdev);

	spa->spa_deadman_tqid = taskq_dispatch_delay(system_taskq,
	    spa_deadman, spa, TQ_SLEEP, ddi_get_lbolt() +
	    NSEC_TO_TICK(spa->spa_deadman_synctime));
}

/*
 * Create an uninitialized spa_t with the given name. Requires
 * spa_namespace_lock. The caller must ensure that the spa_t doesn't already
 * exist by calling spa_lookup() first.
 */
spa_t *
spa_add(const char *name, nvlist_t *config, const char *altroot)
{
	spa_t *spa;
	spa_config_dirent_t *dp;
	int t;
	int i;

	ASSERT(MUTEX_HELD(&spa_namespace_lock));

	spa = kmem_zalloc(sizeof (spa_t), KM_SLEEP);

	mutex_init(&spa->spa_async_lock, NULL, MUTEX_DEFAULT, NULL);
	mutex_init(&spa->spa_errlist_lock, NULL, MUTEX_DEFAULT, NULL);
	mutex_init(&spa->spa_errlog_lock, NULL, MUTEX_DEFAULT, NULL);
	mutex_init(&spa->spa_evicting_os_lock, NULL, MUTEX_DEFAULT, NULL);
	mutex_init(&spa->spa_history_lock, NULL, MUTEX_DEFAULT, NULL);
	mutex_init(&spa->spa_proc_lock, NULL, MUTEX_DEFAULT, NULL);
	mutex_init(&spa->spa_props_lock, NULL, MUTEX_DEFAULT, NULL);
	mutex_init(&spa->spa_scrub_lock, NULL, MUTEX_DEFAULT, NULL);
	mutex_init(&spa->spa_suspend_lock, NULL, MUTEX_DEFAULT, NULL);
	mutex_init(&spa->spa_vdev_top_lock, NULL, MUTEX_DEFAULT, NULL);
	mutex_init(&spa->spa_feat_stats_lock, NULL, MUTEX_DEFAULT, NULL);

	cv_init(&spa->spa_async_cv, NULL, CV_DEFAULT, NULL);
	cv_init(&spa->spa_evicting_os_cv, NULL, CV_DEFAULT, NULL);
	cv_init(&spa->spa_proc_cv, NULL, CV_DEFAULT, NULL);
	cv_init(&spa->spa_scrub_io_cv, NULL, CV_DEFAULT, NULL);
	cv_init(&spa->spa_suspend_cv, NULL, CV_DEFAULT, NULL);

	for (t = 0; t < TXG_SIZE; t++)
		bplist_create(&spa->spa_free_bplist[t]);

	(void) strlcpy(spa->spa_name, name, sizeof (spa->spa_name));
	spa->spa_state = POOL_STATE_UNINITIALIZED;
	spa->spa_freeze_txg = UINT64_MAX;
	spa->spa_final_txg = UINT64_MAX;
	spa->spa_load_max_txg = UINT64_MAX;
	spa->spa_proc = &p0;
	spa->spa_proc_state = SPA_PROC_NONE;

Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping on a condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The last four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tunable larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
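The 32-bit overflow concern in the notes above can be made concrete with a short sketch. The helper names are hypothetical, and the 32-bit unsigned long is modelled by truncating to uint32_t; the port's actual initialization in arc_init() may differ in detail.

```c
#include <assert.h>
#include <stdint.h>

/* Illumos default described above: 2^32 bytes (4 GiB). */
#define	ILLUMOS_DIRTY_DATA_MAX_MAX	(4ULL * 1024 * 1024 * 1024)

/*
 * Hypothetical helper: cap expressed as a percentage of physical RAM.
 * The arithmetic is done in 64 bits, which is ample for realistic RAM
 * sizes.
 */
static uint64_t
dirty_data_max_max(uint64_t physmem_bytes, unsigned int percent)
{
	assert(percent <= 100);
	return (physmem_bytes * percent / 100);
}

/* Model storing a value in a 32-bit unsigned long by truncating it. */
static uint32_t
store_in_32bit_ulong(uint64_t value)
{
	return ((uint32_t)value);
}
```

In this model the Illumos default of 2^32 truncates to 0, and 25% of 8 GiB (2 GiB) fits, but 25% of 32 GiB (8 GiB) does not. That is the sense in which the percentage-based default reduces the overflow problem without removing it.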
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 03:01:20 +00:00
|
|
|
spa->spa_deadman_synctime = MSEC2NSEC(zfs_deadman_synctime_ms);
|
2013-04-29 22:49:23 +00:00
|
|
|
|
2008-11-20 20:01:55 +00:00
|
|
|
refcount_create(&spa->spa_refcount);
|
2008-12-03 20:09:06 +00:00
|
|
|
spa_config_lock_init(spa);
|
Add visibility in to arc_read
This change is an attempt to add visibility into the arc_read calls
occurring on a system, in real time. To do this, a list was added to the
in memory SPA data structure for a pool, with each element on the list
corresponding to a call to arc_read. These entries are then exported
through the kstat interface, which can then be interpreted in userspace.
For each arc_read call, the following information is exported:
* A unique identifier (uint64_t)
* The time the entry was added to the list (hrtime_t)
(*not* wall clock time; relative to the other entries on the list)
* The objset ID (uint64_t)
* The object number (uint64_t)
* The indirection level (uint64_t)
* The block ID (uint64_t)
* The name of the function originating the arc_read call (char[24])
* The arc_flags from the arc_read call (uint32_t)
* The PID of the reading thread (pid_t)
* The command or name of thread originating read (char[16])
From this exported information one can see, in real time, exactly what
is being read, what function is generating the read, and whether or not
the read was found to be already cached.
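As a rough illustration, the per-call record listed above could be laid out as follows. The struct name, field names, and the init helper are hypothetical; only the field types and buffer sizes come from the list, and hrtime_t is replaced by a userspace stand-in.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>	/* pid_t */

/* Stand-in for the kernel's hrtime_t; an assumption for this sketch. */
typedef long long hrtime_t;

/* Hypothetical layout; names are illustrative, types follow the list. */
typedef struct arc_read_entry {
	uint64_t	are_uid;	/* unique identifier */
	hrtime_t	are_start;	/* time added to the list */
	uint64_t	are_objset;	/* objset ID */
	uint64_t	are_object;	/* object number */
	uint64_t	are_level;	/* indirection level */
	uint64_t	are_blkid;	/* block ID */
	char		are_origin[24];	/* originating function name */
	uint32_t	are_aflags;	/* arc_flags from the call */
	pid_t		are_pid;	/* PID of the reading thread */
	char		are_comm[16];	/* command or thread name */
} arc_read_entry_t;

/* Zero an entry and fill in its identifier and the two name buffers. */
static void
arc_read_entry_init(arc_read_entry_t *e, uint64_t uid, const char *origin,
    const char *comm)
{
	assert(e != NULL && origin != NULL && comm != NULL);
	memset(e, 0, sizeof (*e));
	e->are_uid = uid;
	/* snprintf() truncates and always NUL-terminates fixed buffers. */
	(void) snprintf(e->are_origin, sizeof (e->are_origin), "%s", origin);
	(void) snprintf(e->are_comm, sizeof (e->are_comm), "%s", comm);
}
```

The fixed-size character arrays mean that long function or command names are truncated rather than rejected.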
There is still some work to be done, but this should serve as a good
starting point.
Specifically, dbuf_read's are not accounted for in the currently
exported information. Thus, a follow up patch should probably be added
to export these calls that never call into arc_read (they only hit the
dbuf hash table). In addition, it might be nice to create a utility
similar to "arcstat.py" to digest the exported information and display
it in a more readable format. Or perhaps, log the information and allow
for it to be "replayed" at a later time.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-09-06 23:09:05 +00:00
|
|
|
spa_stats_init(spa);
|
2008-11-20 20:01:55 +00:00
|
|
|
|
|
|
|
avl_add(&spa_namespace_avl, spa);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Set the alternate root, if there is one.
|
|
|
|
*/
|
2015-04-26 04:25:45 +00:00
|
|
|
if (altroot)
|
2008-11-20 20:01:55 +00:00
|
|
|
spa->spa_root = spa_strdup(altroot);
|
|
|
|
|
2008-12-03 20:09:06 +00:00
|
|
|
/*
|
|
|
|
* Every pool starts with the default cachefile
|
|
|
|
*/
|
|
|
|
list_create(&spa->spa_config_list, sizeof (spa_config_dirent_t),
|
|
|
|
offsetof(spa_config_dirent_t, scd_link));
|
|
|
|
|
2014-11-21 00:09:39 +00:00
|
|
|
dp = kmem_zalloc(sizeof (spa_config_dirent_t), KM_SLEEP);
|
2010-05-28 20:45:14 +00:00
|
|
|
dp->scd_path = altroot ? NULL : spa_strdup(spa_config_path);
|
2008-12-03 20:09:06 +00:00
|
|
|
list_insert_head(&spa->spa_config_list, dp);
|
|
|
|
|
2010-08-26 21:24:34 +00:00
|
|
|
VERIFY(nvlist_alloc(&spa->spa_load_info, NV_UNIQUE_NAME,
|
2014-11-21 00:09:39 +00:00
|
|
|
KM_SLEEP) == 0);
|
2010-08-26 21:24:34 +00:00
|
|
|
|
2012-12-13 23:24:15 +00:00
|
|
|
if (config != NULL) {
|
|
|
|
nvlist_t *features;
|
|
|
|
|
|
|
|
if (nvlist_lookup_nvlist(config, ZPOOL_CONFIG_FEATURES_FOR_READ,
|
|
|
|
&features) == 0) {
|
|
|
|
VERIFY(nvlist_dup(features, &spa->spa_label_features,
|
|
|
|
0) == 0);
|
|
|
|
}
|
|
|
|
|
2010-05-28 20:45:14 +00:00
|
|
|
VERIFY(nvlist_dup(config, &spa->spa_config, 0) == 0);
|
2012-12-13 23:24:15 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
if (spa->spa_label_features == NULL) {
|
|
|
|
VERIFY(nvlist_alloc(&spa->spa_label_features, NV_UNIQUE_NAME,
|
2014-11-21 00:09:39 +00:00
|
|
|
KM_SLEEP) == 0);
|
2012-12-13 23:24:15 +00:00
|
|
|
}
|
2010-05-28 20:45:14 +00:00
|
|
|
|
2013-09-04 12:00:57 +00:00
|
|
|
spa->spa_debug = ((zfs_flags & ZFS_DEBUG_SPA) != 0);
|
|
|
|
|
2015-05-20 04:14:01 +00:00
|
|
|
spa->spa_min_ashift = INT_MAX;
|
|
|
|
spa->spa_max_ashift = 0;
|
|
|
|
|
2013-12-09 18:37:51 +00:00
|
|
|
/*
|
|
|
|
* As a pool is being created, treat all features as disabled by
|
|
|
|
* setting SPA_FEATURE_DISABLED for all entries in the feature
|
|
|
|
* refcount cache.
|
|
|
|
*/
|
|
|
|
for (i = 0; i < SPA_FEATURES; i++) {
|
|
|
|
spa->spa_feat_refcount_cache[i] = SPA_FEATURE_DISABLED;
|
|
|
|
}
|
|
|
|
|
2008-11-20 20:01:55 +00:00
|
|
|
return (spa);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Removes a spa_t from the namespace, freeing up any memory used. Requires
|
|
|
|
* spa_namespace_lock. This is called only after the spa_t has been closed and
|
|
|
|
* deactivated.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
spa_remove(spa_t *spa)
|
|
|
|
{
|
2008-12-03 20:09:06 +00:00
|
|
|
spa_config_dirent_t *dp;
|
2010-08-26 16:52:39 +00:00
|
|
|
int t;
|
2008-12-03 20:09:06 +00:00
|
|
|
|
2008-11-20 20:01:55 +00:00
|
|
|
ASSERT(MUTEX_HELD(&spa_namespace_lock));
|
|
|
|
ASSERT(spa->spa_state == POOL_STATE_UNINITIALIZED);
|
2015-04-02 03:44:32 +00:00
|
|
|
ASSERT3U(refcount_count(&spa->spa_refcount), ==, 0);
|
2008-11-20 20:01:55 +00:00
|
|
|
|
2010-05-28 20:45:14 +00:00
|
|
|
nvlist_free(spa->spa_config_splitting);
|
|
|
|
|
2008-11-20 20:01:55 +00:00
|
|
|
avl_remove(&spa_namespace_avl, spa);
|
|
|
|
cv_broadcast(&spa_namespace_cv);
|
|
|
|
|
2015-04-26 04:25:45 +00:00
|
|
|
if (spa->spa_root)
|
2008-11-20 20:01:55 +00:00
|
|
|
spa_strfree(spa->spa_root);
|
|
|
|
|
2008-12-03 20:09:06 +00:00
|
|
|
while ((dp = list_head(&spa->spa_config_list)) != NULL) {
|
|
|
|
list_remove(&spa->spa_config_list, dp);
|
|
|
|
if (dp->scd_path != NULL)
|
|
|
|
spa_strfree(dp->scd_path);
|
|
|
|
kmem_free(dp, sizeof (spa_config_dirent_t));
|
|
|
|
}
|
2008-11-20 20:01:55 +00:00
|
|
|
|
2008-12-03 20:09:06 +00:00
|
|
|
list_destroy(&spa->spa_config_list);
|
2008-11-20 20:01:55 +00:00
|
|
|
|
2012-12-13 23:24:15 +00:00
|
|
|
nvlist_free(spa->spa_label_features);
|
2010-08-26 21:24:34 +00:00
|
|
|
nvlist_free(spa->spa_load_info);
|
2015-02-26 20:24:11 +00:00
|
|
|
nvlist_free(spa->spa_feat_stats);
|
2008-11-20 20:01:55 +00:00
|
|
|
spa_config_set(spa, NULL);
|
|
|
|
|
|
|
|
refcount_destroy(&spa->spa_refcount);
|
|
|
|
|
2013-09-06 23:09:05 +00:00
|
|
|
spa_stats_destroy(spa);
|
2008-12-03 20:09:06 +00:00
|
|
|
spa_config_lock_destroy(spa);
|
2008-11-20 20:01:55 +00:00
|
|
|
|
2010-08-26 16:52:39 +00:00
|
|
|
for (t = 0; t < TXG_SIZE; t++)
|
2010-05-28 20:45:14 +00:00
|
|
|
bplist_destroy(&spa->spa_free_bplist[t]);
|
|
|
|
|
2008-11-20 20:01:55 +00:00
|
|
|
cv_destroy(&spa->spa_async_cv);
|
2015-04-02 03:44:32 +00:00
|
|
|
cv_destroy(&spa->spa_evicting_os_cv);
|
2010-05-28 20:45:14 +00:00
|
|
|
cv_destroy(&spa->spa_proc_cv);
|
2008-11-20 20:01:55 +00:00
|
|
|
cv_destroy(&spa->spa_scrub_io_cv);
|
2008-12-03 20:09:06 +00:00
|
|
|
cv_destroy(&spa->spa_suspend_cv);
|
2008-11-20 20:01:55 +00:00
|
|
|
|
|
|
|
mutex_destroy(&spa->spa_async_lock);
|
|
|
|
mutex_destroy(&spa->spa_errlist_lock);
|
2010-05-28 20:45:14 +00:00
|
|
|
mutex_destroy(&spa->spa_errlog_lock);
|
2015-04-02 03:44:32 +00:00
|
|
|
mutex_destroy(&spa->spa_evicting_os_lock);
|
2008-11-20 20:01:55 +00:00
|
|
|
mutex_destroy(&spa->spa_history_lock);
|
2010-05-28 20:45:14 +00:00
|
|
|
mutex_destroy(&spa->spa_proc_lock);
|
2008-11-20 20:01:55 +00:00
|
|
|
mutex_destroy(&spa->spa_props_lock);
|
2010-05-28 20:45:14 +00:00
|
|
|
mutex_destroy(&spa->spa_scrub_lock);
|
2008-12-03 20:09:06 +00:00
|
|
|
mutex_destroy(&spa->spa_suspend_lock);
|
2010-05-28 20:45:14 +00:00
|
|
|
mutex_destroy(&spa->spa_vdev_top_lock);
|
2015-04-23 19:32:59 +00:00
|
|
|
mutex_destroy(&spa->spa_feat_stats_lock);
|
2008-11-20 20:01:55 +00:00
|
|
|
|
|
|
|
kmem_free(spa, sizeof (spa_t));
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Given a pool, return the next pool in the namespace, or NULL if there is
|
|
|
|
* none. If 'prev' is NULL, return the first pool.
|
|
|
|
*/
|
|
|
|
spa_t *
|
|
|
|
spa_next(spa_t *prev)
|
|
|
|
{
|
|
|
|
ASSERT(MUTEX_HELD(&spa_namespace_lock));
|
|
|
|
|
|
|
|
if (prev)
|
|
|
|
return (AVL_NEXT(&spa_namespace_avl, prev));
|
|
|
|
else
|
|
|
|
return (avl_first(&spa_namespace_avl));
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* ==========================================================================
|
|
|
|
* SPA refcount functions
|
|
|
|
* ==========================================================================
|
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Add a reference to the given spa_t. Must have at least one reference, or
|
|
|
|
* have the namespace lock held.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
spa_open_ref(spa_t *spa, void *tag)
|
|
|
|
{
|
2008-12-03 20:09:06 +00:00
|
|
|
ASSERT(refcount_count(&spa->spa_refcount) >= spa->spa_minref ||
|
2008-11-20 20:01:55 +00:00
|
|
|
MUTEX_HELD(&spa_namespace_lock));
|
|
|
|
(void) refcount_add(&spa->spa_refcount, tag);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Remove a reference to the given spa_t. Must have at least one reference, or
|
|
|
|
* have the namespace lock held.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
spa_close(spa_t *spa, void *tag)
|
|
|
|
{
|
2008-12-03 20:09:06 +00:00
|
|
|
ASSERT(refcount_count(&spa->spa_refcount) > spa->spa_minref ||
|
2008-11-20 20:01:55 +00:00
|
|
|
MUTEX_HELD(&spa_namespace_lock));
|
|
|
|
(void) refcount_remove(&spa->spa_refcount, tag);
|
|
|
|
}
|
|
|
|
|
2015-04-02 03:44:32 +00:00
|
|
|
/*
|
|
|
|
* Remove a reference to the given spa_t held by a dsl dir that is
|
|
|
|
* being asynchronously released. Async releases occur from a taskq
|
|
|
|
* performing eviction of dsl datasets and dirs. The namespace lock
|
|
|
|
* isn't held and the hold by the object being evicted may contribute to
|
|
|
|
* spa_minref (e.g. dataset or directory released during pool export),
|
|
|
|
* so the asserts in spa_close() do not apply.
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
spa_async_close(spa_t *spa, void *tag)
|
|
|
|
{
|
|
|
|
(void) refcount_remove(&spa->spa_refcount, tag);
|
|
|
|
}
|
|
|
|
|
2008-11-20 20:01:55 +00:00
|
|
|
/*
|
|
|
|
* Check to see if the spa refcount is zero. Must be called with
|
2008-12-03 20:09:06 +00:00
|
|
|
* spa_namespace_lock held. We really compare against spa_minref, which is the
|
2008-11-20 20:01:55 +00:00
|
|
|
* number of references acquired when opening a pool.
|
|
|
|
*/
|
|
|
|
boolean_t
|
|
|
|
spa_refcount_zero(spa_t *spa)
|
|
|
|
{
|
|
|
|
ASSERT(MUTEX_HELD(&spa_namespace_lock));
|
|
|
|
|
2008-12-03 20:09:06 +00:00
|
|
|
return (refcount_count(&spa->spa_refcount) == spa->spa_minref);
|
2008-11-20 20:01:55 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* ==========================================================================
|
|
|
|
* SPA spare and l2cache tracking
|
|
|
|
* ==========================================================================
|
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Hot spares and cache devices are tracked using the same code below,
|
|
|
|
* for 'auxiliary' devices.
|
|
|
|
*/
|
|
|
|
|
|
|
|
typedef struct spa_aux {
|
|
|
|
uint64_t aux_guid;
|
|
|
|
uint64_t aux_pool;
|
|
|
|
avl_node_t aux_avl;
|
|
|
|
int aux_count;
|
|
|
|
} spa_aux_t;
|
|
|
|
|
2016-08-27 18:12:53 +00:00
|
|
|
static inline int
|
2008-11-20 20:01:55 +00:00
|
|
|
spa_aux_compare(const void *a, const void *b)
|
|
|
|
{
|
2016-08-27 18:12:53 +00:00
|
|
|
const spa_aux_t *sa = (const spa_aux_t *)a;
|
|
|
|
const spa_aux_t *sb = (const spa_aux_t *)b;
|
2008-11-20 20:01:55 +00:00
|
|
|
|
2016-08-27 18:12:53 +00:00
|
|
|
return (AVL_CMP(sa->aux_guid, sb->aux_guid));
|
2008-11-20 20:01:55 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
spa_aux_add(vdev_t *vd, avl_tree_t *avl)
|
|
|
|
{
|
|
|
|
avl_index_t where;
|
|
|
|
spa_aux_t search;
|
|
|
|
spa_aux_t *aux;
|
|
|
|
|
|
|
|
search.aux_guid = vd->vdev_guid;
|
|
|
|
if ((aux = avl_find(avl, &search, &where)) != NULL) {
|
|
|
|
aux->aux_count++;
|
|
|
|
} else {
|
2014-11-21 00:09:39 +00:00
|
|
|
aux = kmem_zalloc(sizeof (spa_aux_t), KM_SLEEP);
|
2008-11-20 20:01:55 +00:00
|
|
|
aux->aux_guid = vd->vdev_guid;
|
|
|
|
aux->aux_count = 1;
|
|
|
|
avl_insert(avl, aux, where);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
spa_aux_remove(vdev_t *vd, avl_tree_t *avl)
|
|
|
|
{
|
|
|
|
spa_aux_t search;
|
|
|
|
spa_aux_t *aux;
|
|
|
|
avl_index_t where;
|
|
|
|
|
|
|
|
search.aux_guid = vd->vdev_guid;
|
|
|
|
aux = avl_find(avl, &search, &where);
|
|
|
|
|
|
|
|
ASSERT(aux != NULL);
|
|
|
|
|
|
|
|
if (--aux->aux_count == 0) {
|
|
|
|
avl_remove(avl, aux);
|
|
|
|
kmem_free(aux, sizeof (spa_aux_t));
|
|
|
|
} else if (aux->aux_pool == spa_guid(vd->vdev_spa)) {
|
|
|
|
aux->aux_pool = 0ULL;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
boolean_t
|
2008-12-03 20:09:06 +00:00
|
|
|
spa_aux_exists(uint64_t guid, uint64_t *pool, int *refcnt, avl_tree_t *avl)
|
2008-11-20 20:01:55 +00:00
|
|
|
{
|
|
|
|
spa_aux_t search, *found;
|
|
|
|
|
|
|
|
search.aux_guid = guid;
|
2008-12-03 20:09:06 +00:00
|
|
|
found = avl_find(avl, &search, NULL);
|
2008-11-20 20:01:55 +00:00
|
|
|
|
|
|
|
if (pool) {
|
|
|
|
if (found)
|
|
|
|
*pool = found->aux_pool;
|
|
|
|
else
|
|
|
|
*pool = 0ULL;
|
|
|
|
}
|
|
|
|
|
2008-12-03 20:09:06 +00:00
|
|
|
if (refcnt) {
|
|
|
|
if (found)
|
|
|
|
*refcnt = found->aux_count;
|
|
|
|
else
|
|
|
|
*refcnt = 0;
|
|
|
|
}
|
|
|
|
|
2008-11-20 20:01:55 +00:00
|
|
|
return (found != NULL);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
spa_aux_activate(vdev_t *vd, avl_tree_t *avl)
|
|
|
|
{
|
|
|
|
spa_aux_t search, *found;
|
|
|
|
avl_index_t where;
|
|
|
|
|
|
|
|
search.aux_guid = vd->vdev_guid;
|
|
|
|
found = avl_find(avl, &search, &where);
|
|
|
|
ASSERT(found != NULL);
|
|
|
|
ASSERT(found->aux_pool == 0ULL);
|
|
|
|
|
|
|
|
found->aux_pool = spa_guid(vd->vdev_spa);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Spares are tracked globally due to the following constraints:
|
|
|
|
*
|
|
|
|
* - A spare may be part of multiple pools.
|
|
|
|
* - A spare may be added to a pool even if it's actively in use within
|
|
|
|
* another pool.
|
|
|
|
* - A spare in use in any pool can only be the source of a replacement if
|
|
|
|
* the target is a spare in the same pool.
|
|
|
|
*
|
|
|
|
* We keep track of all spares on the system through the use of a reference
|
|
|
|
* counted AVL tree. When a vdev is added as a spare, or used as a replacement
|
|
|
|
* spare, then we bump the reference count in the AVL tree. In addition, we set
|
|
|
|
* the 'vdev_isspare' member to indicate that the device is a spare (active or
|
|
|
|
* inactive). When a spare is made active (used to replace a device in the
|
|
|
|
* pool), we also keep track of which pool it's been made a part of.
|
|
|
|
*
|
|
|
|
* The 'spa_spare_lock' protects the AVL tree. These functions are normally
|
|
|
|
* called under the spa_namespace lock as part of vdev reconfiguration. The
|
|
|
|
* separate spare lock exists for the status query path, which does not need to
|
|
|
|
* be completely consistent with respect to other vdev configuration changes.
|
|
|
|
*/
|
|
|
|
|
|
|
|
static int
|
|
|
|
spa_spare_compare(const void *a, const void *b)
|
|
|
|
{
|
|
|
|
return (spa_aux_compare(a, b));
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
spa_spare_add(vdev_t *vd)
|
|
|
|
{
|
|
|
|
mutex_enter(&spa_spare_lock);
|
|
|
|
ASSERT(!vd->vdev_isspare);
|
|
|
|
spa_aux_add(vd, &spa_spare_avl);
|
|
|
|
vd->vdev_isspare = B_TRUE;
|
|
|
|
mutex_exit(&spa_spare_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
spa_spare_remove(vdev_t *vd)
|
|
|
|
{
|
|
|
|
mutex_enter(&spa_spare_lock);
|
|
|
|
ASSERT(vd->vdev_isspare);
|
|
|
|
spa_aux_remove(vd, &spa_spare_avl);
|
|
|
|
vd->vdev_isspare = B_FALSE;
|
|
|
|
mutex_exit(&spa_spare_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
boolean_t
|
2008-12-03 20:09:06 +00:00
|
|
|
spa_spare_exists(uint64_t guid, uint64_t *pool, int *refcnt)
|
2008-11-20 20:01:55 +00:00
|
|
|
{
|
|
|
|
boolean_t found;
|
|
|
|
|
|
|
|
mutex_enter(&spa_spare_lock);
|
2008-12-03 20:09:06 +00:00
|
|
|
found = spa_aux_exists(guid, pool, refcnt, &spa_spare_avl);
|
2008-11-20 20:01:55 +00:00
|
|
|
mutex_exit(&spa_spare_lock);
|
|
|
|
|
|
|
|
return (found);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
spa_spare_activate(vdev_t *vd)
|
|
|
|
{
|
|
|
|
mutex_enter(&spa_spare_lock);
|
|
|
|
ASSERT(vd->vdev_isspare);
|
|
|
|
spa_aux_activate(vd, &spa_spare_avl);
|
|
|
|
mutex_exit(&spa_spare_lock);
|
|
|
|
}
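The reference-counted tracking used for spares above can be sketched in isolation. This is a simplified illustration, not the code in this file: it uses a small fixed array in place of the AVL tree, omits locking, and all names (aux_table_t, aux_add(), and so on) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	AUX_MAX	16	/* arbitrary capacity for this sketch */

typedef struct aux_entry {
	uint64_t guid;	/* device GUID (lookup key) */
	uint64_t pool;	/* GUID of the pool using the device, 0 if none */
	int count;	/* number of references to the device */
} aux_entry_t;

typedef struct aux_table {
	aux_entry_t entries[AUX_MAX];
	int used;
} aux_table_t;

/* Linear search stands in for the AVL lookup. */
static aux_entry_t *
aux_find(aux_table_t *t, uint64_t guid)
{
	for (int i = 0; i < t->used; i++) {
		if (t->entries[i].guid == guid)
			return (&t->entries[i]);
	}
	return (NULL);
}

/* Add a reference, creating the entry on first use. */
static void
aux_add(aux_table_t *t, uint64_t guid)
{
	aux_entry_t *e = aux_find(t, guid);

	if (e != NULL) {
		e->count++;
		return;
	}
	assert(t->used < AUX_MAX);
	e = &t->entries[t->used++];
	e->guid = guid;
	e->pool = 0;
	e->count = 1;
}

/* Drop a reference held by `pool`; discard the entry with the last one. */
static void
aux_remove(aux_table_t *t, uint64_t guid, uint64_t pool)
{
	aux_entry_t *e = aux_find(t, guid);

	assert(e != NULL);
	if (--e->count == 0)
		*e = t->entries[--t->used];	/* swap-remove */
	else if (e->pool == pool)
		e->pool = 0;	/* the active pool released the device */
}

/* Record which pool is actively using the device. */
static void
aux_activate(aux_table_t *t, uint64_t guid, uint64_t pool)
{
	aux_entry_t *e = aux_find(t, guid);

	assert(e != NULL && e->pool == 0);
	e->pool = pool;
}
```

The sketch keeps the same three operations as the code above: a count that is bumped per pool, an "active pool" field that is cleared when that pool lets go, and removal of the entry once the count reaches zero.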
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Level 2 ARC devices are tracked globally for the same reasons as spares.
|
|
|
|
* Cache devices currently only support one pool per cache device, and so
|
|
|
|
* for these devices the aux reference count is currently unused beyond 1.
|
|
|
|
*/
|
|
|
|
|
|
|
|
static int
|
|
|
|
spa_l2cache_compare(const void *a, const void *b)
|
|
|
|
{
|
|
|
|
return (spa_aux_compare(a, b));
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
spa_l2cache_add(vdev_t *vd)
|
|
|
|
{
|
|
|
|
mutex_enter(&spa_l2cache_lock);
|
|
|
|
ASSERT(!vd->vdev_isl2cache);
|
|
|
|
spa_aux_add(vd, &spa_l2cache_avl);
|
|
|
|
vd->vdev_isl2cache = B_TRUE;
|
|
|
|
mutex_exit(&spa_l2cache_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
spa_l2cache_remove(vdev_t *vd)
|
|
|
|
{
|
|
|
|
mutex_enter(&spa_l2cache_lock);
|
|
|
|
ASSERT(vd->vdev_isl2cache);
|
|
|
|
spa_aux_remove(vd, &spa_l2cache_avl);
|
|
|
|
vd->vdev_isl2cache = B_FALSE;
|
|
|
|
mutex_exit(&spa_l2cache_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
boolean_t
|
|
|
|
spa_l2cache_exists(uint64_t guid, uint64_t *pool)
|
|
|
|
{
|
|
|
|
boolean_t found;
|
|
|
|
|
|
|
|
mutex_enter(&spa_l2cache_lock);
|
2008-12-03 20:09:06 +00:00
|
|
|
found = spa_aux_exists(guid, pool, NULL, &spa_l2cache_avl);
|
2008-11-20 20:01:55 +00:00
|
|
|
mutex_exit(&spa_l2cache_lock);
|
|
|
|
|
|
|
|
return (found);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
spa_l2cache_activate(vdev_t *vd)
|
|
|
|
{
|
|
|
|
mutex_enter(&spa_l2cache_lock);
|
|
|
|
ASSERT(vd->vdev_isl2cache);
|
|
|
|
spa_aux_activate(vd, &spa_l2cache_avl);
|
|
|
|
mutex_exit(&spa_l2cache_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* ==========================================================================
|
|
|
|
* SPA vdev locking
|
|
|
|
* ==========================================================================
|
|
|
|
*/
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Lock the given spa_t for the purpose of adding or removing a vdev.
|
|
|
|
* Grabs the global spa_namespace_lock plus the spa config lock for writing.
|
|
|
|
* It returns the next transaction group for the spa_t.
|
|
|
|
*/
|
|
|
|
uint64_t
|
|
|
|
spa_vdev_enter(spa_t *spa)
|
|
|
|
{
|
2010-05-28 20:45:14 +00:00
|
|
|
mutex_enter(&spa->spa_vdev_top_lock);
|
2008-11-20 20:01:55 +00:00
|
|
|
mutex_enter(&spa_namespace_lock);
|
2010-05-28 20:45:14 +00:00
|
|
|
return (spa_vdev_config_enter(spa));
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Internal implementation for spa_vdev_enter(). Used when a vdev
|
|
|
|
* operation requires multiple syncs (e.g. removing a device) while
|
|
|
|
* keeping the spa_namespace_lock held.
|
|
|
|
*/
|
|
|
|
uint64_t
|
|
|
|
spa_vdev_config_enter(spa_t *spa)
|
|
|
|
{
|
|
|
|
ASSERT(MUTEX_HELD(&spa_namespace_lock));
|
2008-11-20 20:01:55 +00:00
|
|
|
|
2008-12-03 20:09:06 +00:00
|
|
|
spa_config_enter(spa, SCL_ALL, spa, RW_WRITER);
|
2008-11-20 20:01:55 +00:00
|
|
|
|
|
|
|
return (spa_last_synced_txg(spa) + 1);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2010-05-28 20:45:14 +00:00
|
|
|
* Used in combination with spa_vdev_config_enter() to allow the syncing
|
|
|
|
* of multiple transactions without releasing the spa_namespace_lock.
|
2008-11-20 20:01:55 +00:00
|
|
|
*/
|
2010-05-28 20:45:14 +00:00
|
|
|
void
|
|
|
|
spa_vdev_config_exit(spa_t *spa, vdev_t *vd, uint64_t txg, int error, char *tag)
|
2008-11-20 20:01:55 +00:00
|
|
|
{
|
|
|
|
int config_changed = B_FALSE;
|
|
|
|
|
2010-08-26 16:52:39 +00:00
|
|
|
ASSERT(MUTEX_HELD(&spa_namespace_lock));
|
2008-11-20 20:01:55 +00:00
|
|
|
ASSERT(txg > spa_last_synced_txg(spa));
|
|
|
|
|
2008-12-03 20:09:06 +00:00
|
|
|
spa->spa_pending_vdev = NULL;
|
|
|
|
|
2008-11-20 20:01:55 +00:00
|
|
|
/*
|
|
|
|
* Reassess the DTLs.
|
|
|
|
*/
|
|
|
|
vdev_dtl_reassess(spa->spa_root_vdev, 0, 0, B_FALSE);
|
|
|
|
|
2008-12-03 20:09:06 +00:00
|
|
|
if (error == 0 && !list_is_empty(&spa->spa_config_dirty_list)) {
|
2008-11-20 20:01:55 +00:00
|
|
|
config_changed = B_TRUE;
|
2010-05-28 20:45:14 +00:00
|
|
|
spa->spa_config_generation++;
|
2008-11-20 20:01:55 +00:00
|
|
|
}
|
|
|
|
|
2010-05-28 20:45:14 +00:00
|
|
|
/*
|
|
|
|
* Verify the metaslab classes.
|
|
|
|
*/
|
|
|
|
ASSERT(metaslab_class_validate(spa_normal_class(spa)) == 0);
|
|
|
|
ASSERT(metaslab_class_validate(spa_log_class(spa)) == 0);
|
|
|
|
|
2008-12-03 20:09:06 +00:00
|
|
|
spa_config_exit(spa, SCL_ALL, spa);
|
2008-11-20 20:01:55 +00:00
|
|
|
|
2010-05-28 20:45:14 +00:00
|
|
|
/*
|
|
|
|
* Panic the system if the specified tag requires it. This
|
|
|
|
* is useful for ensuring that configurations are updated
|
|
|
|
* transactionally.
|
|
|
|
*/
|
|
|
|
if (zio_injection_enabled)
|
|
|
|
zio_handle_panic_injection(spa, tag, 0);
|
|
|
|
|
2008-11-20 20:01:55 +00:00
|
|
|
/*
|
|
|
|
* Note: this txg_wait_synced() is important because it ensures
|
|
|
|
* that there won't be more than one config change per txg.
|
|
|
|
* This allows us to use the txg as the generation number.
|
|
|
|
*/
|
|
|
|
if (error == 0)
|
|
|
|
txg_wait_synced(spa->spa_dsl_pool, txg);
|
|
|
|
|
|
|
|
if (vd != NULL) {
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_map's to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increased. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-01 21:25:53 +00:00
|
|
|
ASSERT(!vd->vdev_detached || vd->vdev_dtl_sm == NULL);
|
2009-01-15 21:59:39 +00:00
|
|
|
spa_config_enter(spa, SCL_ALL, spa, RW_WRITER);
|
2008-11-20 20:01:55 +00:00
|
|
|
vdev_free(vd);
|
2009-01-15 21:59:39 +00:00
|
|
|
spa_config_exit(spa, SCL_ALL, spa);
|
2008-11-20 20:01:55 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* If the config changed, update the config cache.
|
|
|
|
*/
|
|
|
|
if (config_changed)
|
2008-12-03 20:09:06 +00:00
|
|
|
spa_config_sync(spa, B_FALSE, B_TRUE);
|
2010-05-28 20:45:14 +00:00
|
|
|
}
|
2008-11-20 20:01:55 +00:00
|
|
|
|
2010-05-28 20:45:14 +00:00
|
|
|
/*
|
|
|
|
* Unlock the spa_t after adding or removing a vdev. Besides undoing the
|
|
|
|
* locking of spa_vdev_enter(), we also want to make sure the transactions have
|
|
|
|
* synced to disk, and then update the global configuration cache with the new
|
|
|
|
* information.
|
|
|
|
*/
|
|
|
|
int
|
|
|
|
spa_vdev_exit(spa_t *spa, vdev_t *vd, uint64_t txg, int error)
|
|
|
|
{
|
|
|
|
spa_vdev_config_exit(spa, vd, txg, error, FTAG);
|
2008-11-20 20:01:55 +00:00
|
|
|
mutex_exit(&spa_namespace_lock);
|
2010-05-28 20:45:14 +00:00
|
|
|
mutex_exit(&spa->spa_vdev_top_lock);
|
2008-11-20 20:01:55 +00:00
|
|
|
|
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
2008-12-03 20:09:06 +00:00
|
|
|
/*
|
|
|
|
* Lock the given spa_t for the purpose of changing vdev state.
|
|
|
|
*/
|
|
|
|
void
|
2010-05-28 20:45:14 +00:00
|
|
|
spa_vdev_state_enter(spa_t *spa, int oplocks)
|
2008-12-03 20:09:06 +00:00
|
|
|
{
|
2010-05-28 20:45:14 +00:00
|
|
|
int locks = SCL_STATE_ALL | oplocks;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Root pools may need to read from the underlying devfs filesystem
|
|
|
|
* when opening up a vdev. Unfortunately if we're holding the
|
|
|
|
* SCL_ZIO lock it will result in a deadlock when we try to issue
|
|
|
|
* the read from the root filesystem. Instead we "prefetch"
|
|
|
|
* the associated vnodes that we need prior to opening the
|
|
|
|
* underlying devices and cache them so that we can prevent
|
|
|
|
* any I/O when we are doing the actual open.
|
|
|
|
*/
|
|
|
|
if (spa_is_root(spa)) {
|
|
|
|
int low = locks & ~(SCL_ZIO - 1);
|
|
|
|
int high = locks & ~low;
|
|
|
|
|
|
|
|
spa_config_enter(spa, high, spa, RW_WRITER);
|
|
|
|
vdev_hold(spa->spa_root_vdev);
|
|
|
|
spa_config_enter(spa, low, spa, RW_WRITER);
|
|
|
|
} else {
|
|
|
|
spa_config_enter(spa, locks, spa, RW_WRITER);
|
|
|
|
}
|
|
|
|
spa->spa_vdev_locks = locks;
|
2008-12-03 20:09:06 +00:00
|
|
|
}
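The root-pool branch of spa_vdev_state_enter() splits the requested lock mask at SCL_ZIO so that vdev_hold() can run before the SCL_ZIO lock is taken. A minimal sketch of that bit arithmetic follows. The EX_SCL_* constants and ex_split_locks() are hypothetical stand-ins chosen for illustration, not the definitions from the spa headers.

```c
#include <assert.h>

/*
 * Hypothetical lock bits, for illustration only. The real SCL_*
 * values are defined in the spa headers and may differ.
 */
enum {
	EX_SCL_STATE = 0x02,
	EX_SCL_L2ARC = 0x04,
	EX_SCL_ZIO = 0x10
};

/*
 * Split a lock mask the same way the root-pool path does:
 * "second" receives SCL_ZIO and every higher bit, "first" receives
 * every bit below SCL_ZIO. The caller would take "first", hold the
 * vdevs, and only then take "second".
 */
static void
ex_split_locks(int locks, int *first, int *second)
{
	int low = locks & ~(EX_SCL_ZIO - 1);
	int high = locks & ~low;

	assert((low & high) == 0);	/* the two halves never overlap */
	*first = high;
	*second = low;
}
```

Under these assumed values, a mask of EX_SCL_STATE | EX_SCL_L2ARC | EX_SCL_ZIO (0x16) splits into 0x06 for the first acquisition and 0x10 for the second.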
|
|
|
|
|
|
|
|
int
|
|
|
|
spa_vdev_state_exit(spa_t *spa, vdev_t *vd, int error)
|
|
|
|
{
|
2010-05-28 20:45:14 +00:00
|
|
|
boolean_t config_changed = B_FALSE;
|
|
|
|
|
|
|
|
if (vd != NULL || error == 0)
|
|
|
|
vdev_dtl_reassess(vd ? vd->vdev_top : spa->spa_root_vdev,
|
|
|
|
0, 0, B_FALSE);
|
|
|
|
|
|
|
|
if (vd != NULL) {
|
2008-12-03 20:09:06 +00:00
|
|
|
vdev_state_dirty(vd->vdev_top);
|
2010-05-28 20:45:14 +00:00
|
|
|
config_changed = B_TRUE;
|
|
|
|
spa->spa_config_generation++;
|
|
|
|
}
|
2008-12-03 20:09:06 +00:00
|
|
|
|
2010-05-28 20:45:14 +00:00
|
|
|
if (spa_is_root(spa))
|
|
|
|
vdev_rele(spa->spa_root_vdev);
|
|
|
|
|
|
|
|
ASSERT3U(spa->spa_vdev_locks, >=, SCL_STATE_ALL);
|
|
|
|
spa_config_exit(spa, spa->spa_vdev_locks, spa);
|
2008-12-03 20:09:06 +00:00
|
|
|
|
2009-01-15 21:59:39 +00:00
|
|
|
/*
|
|
|
|
* If anything changed, wait for it to sync. This ensures that,
|
|
|
|
* from the system administrator's perspective, zpool(1M) commands
|
|
|
|
* are synchronous. This is important for things like zpool offline:
|
|
|
|
* when the command completes, you expect no further I/O from ZFS.
|
|
|
|
*/
|
|
|
|
if (vd != NULL)
|
|
|
|
txg_wait_synced(spa->spa_dsl_pool, 0);
|
|
|
|
|
2010-05-28 20:45:14 +00:00
|
|
|
/*
|
|
|
|
* If the config changed, update the config cache.
|
|
|
|
*/
|
|
|
|
if (config_changed) {
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
spa_config_sync(spa, B_FALSE, B_TRUE);
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
}
|
|
|
|
|
2008-12-03 20:09:06 +00:00
|
|
|
return (error);
|
|
|
|
}
|
|
|
|
|
2008-11-20 20:01:55 +00:00
|
|
|
/*
|
|
|
|
* ==========================================================================
|
|
|
|
* Miscellaneous functions
|
|
|
|
* ==========================================================================
|
|
|
|
*/
|
|
|
|
|
2012-12-13 23:24:15 +00:00
|
|
|
void
|
2013-12-09 18:37:51 +00:00
|
|
|
spa_activate_mos_feature(spa_t *spa, const char *feature, dmu_tx_t *tx)
|
2012-12-13 23:24:15 +00:00
|
|
|
{
|
2013-10-08 17:13:05 +00:00
|
|
|
if (!nvlist_exists(spa->spa_label_features, feature)) {
|
|
|
|
fnvlist_add_boolean(spa->spa_label_features, feature);
|
2013-12-09 18:37:51 +00:00
|
|
|
/*
|
|
|
|
* When we are creating the pool (tx_txg==TXG_INITIAL), we can't
|
|
|
|
* dirty the vdev config because lock SCL_CONFIG is not held.
|
|
|
|
* Thankfully, in this case we don't need to dirty the config
|
|
|
|
* because it will be written out anyway when we finish
|
|
|
|
* creating the pool.
|
|
|
|
*/
|
|
|
|
if (tx->tx_txg != TXG_INITIAL)
|
|
|
|
vdev_config_dirty(spa->spa_root_vdev);
|
2013-10-08 17:13:05 +00:00
|
|
|
}
|
2012-12-13 23:24:15 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
spa_deactivate_mos_feature(spa_t *spa, const char *feature)
|
|
|
|
{
|
2013-10-08 17:13:05 +00:00
|
|
|
if (nvlist_remove_all(spa->spa_label_features, feature) == 0)
|
|
|
|
vdev_config_dirty(spa->spa_root_vdev);
|
2012-12-13 23:24:15 +00:00
|
|
|
}
|
|
|
|
|
2008-11-20 20:01:55 +00:00
|
|
|
/*
|
|
|
|
* Rename a spa_t.
|
|
|
|
*/
|
|
|
|
int
|
|
|
|
spa_rename(const char *name, const char *newname)
|
|
|
|
{
|
|
|
|
spa_t *spa;
|
|
|
|
int err;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Look up the spa_t and grab the config lock for writing. We need to
|
|
|
|
* actually open the pool so that we can sync out the necessary labels.
|
|
|
|
* It's OK to call spa_open() with the namespace lock held because we
|
|
|
|
* allow recursive calls for other reasons.
|
|
|
|
*/
|
|
|
|
mutex_enter(&spa_namespace_lock);
|
|
|
|
if ((err = spa_open(name, &spa, FTAG)) != 0) {
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
return (err);
|
|
|
|
}
|
|
|
|
|
2008-12-03 20:09:06 +00:00
|
|
|
spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
|
2008-11-20 20:01:55 +00:00
|
|
|
|
|
|
|
avl_remove(&spa_namespace_avl, spa);
|
2008-12-03 20:09:06 +00:00
|
|
|
(void) strlcpy(spa->spa_name, newname, sizeof (spa->spa_name));
|
2008-11-20 20:01:55 +00:00
|
|
|
avl_add(&spa_namespace_avl, spa);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Sync all labels to disk with the new names by marking the root vdev
|
|
|
|
* dirty and waiting for it to sync. It will pick up the new pool name
|
|
|
|
* during the sync.
|
|
|
|
*/
|
|
|
|
vdev_config_dirty(spa->spa_root_vdev);
|
|
|
|
|
2008-12-03 20:09:06 +00:00
|
|
|
spa_config_exit(spa, SCL_ALL, FTAG);
|
2008-11-20 20:01:55 +00:00
|
|
|
|
|
|
|
txg_wait_synced(spa->spa_dsl_pool, 0);
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Sync the updated config cache.
|
|
|
|
*/
|
2008-12-03 20:09:06 +00:00
|
|
|
spa_config_sync(spa, B_FALSE, B_TRUE);
|
2008-11-20 20:01:55 +00:00
|
|
|
|
|
|
|
spa_close(spa, FTAG);
|
|
|
|
|
|
|
|
mutex_exit(&spa_namespace_lock);
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
2010-08-26 21:24:34 +00:00
|
|
|
* Return the spa_t associated with given pool_guid, if it exists. If
|
|
|
|
* device_guid is non-zero, determine whether the pool exists *and* contains
|
|
|
|
* a device with the specified device_guid.
|
2008-11-20 20:01:55 +00:00
|
|
|
*/
|
2010-08-26 21:24:34 +00:00
|
|
|
spa_t *
|
|
|
|
spa_by_guid(uint64_t pool_guid, uint64_t device_guid)
|
2008-11-20 20:01:55 +00:00
|
|
|
{
|
|
|
|
spa_t *spa;
|
|
|
|
avl_tree_t *t = &spa_namespace_avl;
|
|
|
|
|
|
|
|
ASSERT(MUTEX_HELD(&spa_namespace_lock));
|
|
|
|
|
|
|
|
for (spa = avl_first(t); spa != NULL; spa = AVL_NEXT(t, spa)) {
|
|
|
|
if (spa->spa_state == POOL_STATE_UNINITIALIZED)
|
|
|
|
continue;
|
|
|
|
if (spa->spa_root_vdev == NULL)
|
|
|
|
continue;
|
|
|
|
if (spa_guid(spa) == pool_guid) {
|
|
|
|
if (device_guid == 0)
|
|
|
|
break;
|
|
|
|
|
|
|
|
if (vdev_lookup_by_guid(spa->spa_root_vdev,
|
|
|
|
device_guid) != NULL)
|
|
|
|
break;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Check any devices we may be in the process of adding.
|
|
|
|
*/
|
|
|
|
if (spa->spa_pending_vdev) {
|
|
|
|
if (vdev_lookup_by_guid(spa->spa_pending_vdev,
|
|
|
|
device_guid) != NULL)
|
|
|
|
break;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2010-08-26 21:24:34 +00:00
|
|
|
return (spa);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Determine whether a pool with the given pool_guid exists.
|
|
|
|
*/
|
|
|
|
boolean_t
|
|
|
|
spa_guid_exists(uint64_t pool_guid, uint64_t device_guid)
|
|
|
|
{
|
|
|
|
return (spa_by_guid(pool_guid, device_guid) != NULL);
|
2008-11-20 20:01:55 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
char *
|
|
|
|
spa_strdup(const char *s)
|
|
|
|
{
|
|
|
|
size_t len;
|
|
|
|
char *new;
|
|
|
|
|
|
|
|
len = strlen(s);
|
2014-11-21 00:09:39 +00:00
|
|
|
new = kmem_alloc(len + 1, KM_SLEEP);
|
2008-11-20 20:01:55 +00:00
|
|
|
bcopy(s, new, len);
|
|
|
|
new[len] = '\0';
|
|
|
|
|
|
|
|
return (new);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
spa_strfree(char *s)
|
|
|
|
{
|
|
|
|
kmem_free(s, strlen(s) + 1);
|
|
|
|
}
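spa_strfree() recomputes the allocation size from strlen(s), so the pair is only safe when the string's length is unchanged between spa_strdup() and spa_strfree(). The userland sketch below illustrates that contract. ex_strdup() and ex_strfree_size() are hypothetical names, and malloc() stands in for kmem_alloc() as an assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for spa_strdup(): copy s into a buffer of strlen(s) + 1 bytes. */
static char *
ex_strdup(const char *s)
{
	size_t len = strlen(s);
	char *copy = malloc(len + 1);

	assert(copy != NULL);	/* unlike KM_SLEEP allocations, malloc can fail */
	memcpy(copy, s, len);
	copy[len] = '\0';
	return (copy);
}

/*
 * The size a spa_strfree()-style release would hand to the allocator.
 * It is derived from the current contents, so it matches the original
 * allocation only if the string has not been shortened in the meantime.
 */
static size_t
ex_strfree_size(const char *s)
{
	return (strlen(s) + 1);
}
```

If a caller truncates the copy in place (for example by writing a '\0' into the middle), the recomputed size no longer matches what was allocated, which a sized free such as kmem_free() would not tolerate.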
|
|
|
|
|
|
|
|
uint64_t
|
|
|
|
spa_get_random(uint64_t range)
|
|
|
|
{
|
|
|
|
uint64_t r;
|
|
|
|
|
|
|
|
ASSERT(range != 0);
|
|
|
|
|
|
|
|
(void) random_get_pseudo_bytes((void *)&r, sizeof (uint64_t));
|
|
|
|
|
|
|
|
return (r % range);
|
|
|
|
}
|
|
|
|
|
2010-05-28 20:45:14 +00:00
|
|
|
uint64_t
|
|
|
|
spa_generate_guid(spa_t *spa)
|
2008-11-20 20:01:55 +00:00
|
|
|
{
|
2010-05-28 20:45:14 +00:00
|
|
|
uint64_t guid = spa_get_random(-1ULL);
|
2008-11-20 20:01:55 +00:00
|
|
|
|
2010-05-28 20:45:14 +00:00
|
|
|
if (spa != NULL) {
|
|
|
|
while (guid == 0 || spa_guid_exists(spa_guid(spa), guid))
|
|
|
|
guid = spa_get_random(-1ULL);
|
|
|
|
} else {
|
|
|
|
while (guid == 0 || spa_guid_exists(guid, 0))
|
|
|
|
guid = spa_get_random(-1ULL);
|
2008-11-20 20:01:55 +00:00
|
|
|
}
|
|
|
|
|
2010-05-28 20:45:14 +00:00
|
|
|
return (guid);
|
|
|
|
}
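The retry loop in spa_generate_guid() can be exercised in isolation by injecting the random source and the existence check. This is only a sketch: ex_generate_guid(), the callback types, and the demo stubs are hypothetical and exist purely to make the loop testable outside the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t (*ex_random_fn)(void);
typedef int (*ex_exists_fn)(uint64_t guid);

/* Draw candidates until one is non-zero and not already in use. */
static uint64_t
ex_generate_guid(ex_random_fn next_random, ex_exists_fn exists)
{
	uint64_t guid = next_random();

	while (guid == 0 || exists(guid))
		guid = next_random();

	return (guid);
}

/* Demo stub: a fixed "random" sequence of 0, 5, 7, repeating. */
static uint64_t
ex_demo_random(void)
{
	static const uint64_t seq[] = { 0, 5, 7 };
	static size_t next = 0;

	return (seq[next++ % 3]);
}

/* Demo stub: pretend GUID 5 is already taken. */
static int
ex_demo_exists(uint64_t guid)
{
	return (guid == 5);
}
```

With these stubs the first draw (0) is rejected as invalid, the second (5) is rejected as a collision, and the third (7) is returned.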
|
|
|
|
|
|
|
|
void
|
2013-12-09 18:37:51 +00:00
|
|
|
snprintf_blkptr(char *buf, size_t buflen, const blkptr_t *bp)
|
2010-05-28 20:45:14 +00:00
|
|
|
{
|
2012-12-13 23:24:15 +00:00
|
|
|
char type[256];
|
2010-05-28 20:45:14 +00:00
|
|
|
char *checksum = NULL;
|
|
|
|
char *compress = NULL;
|
2008-11-20 20:01:55 +00:00
|
|
|
|
2010-05-28 20:45:14 +00:00
|
|
|
if (bp != NULL) {
|
2012-12-13 23:24:15 +00:00
|
|
|
if (BP_GET_TYPE(bp) & DMU_OT_NEWTYPE) {
|
|
|
|
dmu_object_byteswap_t bswap =
|
|
|
|
DMU_OT_BYTESWAP(BP_GET_TYPE(bp));
|
|
|
|
(void) snprintf(type, sizeof (type), "bswap %s %s",
|
|
|
|
DMU_OT_IS_METADATA(BP_GET_TYPE(bp)) ?
|
|
|
|
"metadata" : "data",
|
|
|
|
dmu_ot_byteswap[bswap].ob_name);
|
|
|
|
} else {
|
|
|
|
(void) strlcpy(type, dmu_ot[BP_GET_TYPE(bp)].ot_name,
|
|
|
|
sizeof (type));
|
|
|
|
}
|
2014-06-05 21:19:08 +00:00
|
|
|
if (!BP_IS_EMBEDDED(bp)) {
|
|
|
|
checksum =
|
|
|
|
zio_checksum_table[BP_GET_CHECKSUM(bp)].ci_name;
|
|
|
|
}
|
2010-05-28 20:45:14 +00:00
|
|
|
compress = zio_compress_table[BP_GET_COMPRESS(bp)].ci_name;
|
2008-11-20 20:01:55 +00:00
|
|
|
}
|
|
|
|
|
2013-12-09 18:37:51 +00:00
|
|
|
SNPRINTF_BLKPTR(snprintf, ' ', buf, buflen, bp, type, checksum,
|
|
|
|
compress);
|
2008-11-20 20:01:55 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
spa_freeze(spa_t *spa)
|
|
|
|
{
|
|
|
|
uint64_t freeze_txg = 0;
|
|
|
|
|
2008-12-03 20:09:06 +00:00
|
|
|
spa_config_enter(spa, SCL_ALL, FTAG, RW_WRITER);
|
2008-11-20 20:01:55 +00:00
|
|
|
if (spa->spa_freeze_txg == UINT64_MAX) {
|
|
|
|
freeze_txg = spa_last_synced_txg(spa) + TXG_SIZE;
|
|
|
|
spa->spa_freeze_txg = freeze_txg;
|
|
|
|
}
|
2008-12-03 20:09:06 +00:00
|
|
|
spa_config_exit(spa, SCL_ALL, FTAG);
|
2008-11-20 20:01:55 +00:00
|
|
|
if (freeze_txg != 0)
|
|
|
|
txg_wait_synced(spa_get_dsl(spa), freeze_txg);
|
|
|
|
}
|
|
|
|
|
Swap DTRACE_PROBE* with Linux tracepoints
This patch leverages Linux tracepoints from within the ZFS on Linux
code base. It also refactors the debug code to bring it back in sync
with Illumos.
The information exported via tracepoints can be used for a variety of
reasons (e.g. debugging, tuning, general exploration/understanding,
etc). It is advantageous to use Linux tracepoints as the mechanism to
export this kind of information (as opposed to something else) for a
number of reasons:
* A number of external tools can make use of our tracepoints
"automatically" (e.g. perf, systemtap)
* Tracepoints are designed to be extremely cheap when disabled
* It's one of the "accepted" ways to export this kind of
information; many other kernel subsystems use tracepoints too.
Unfortunately, though, there are a few caveats as well:
* Linux tracepoints appear to only be available to GPL licensed
modules due to the way certain kernel functions are exported.
Thus, to actually make use of the tracepoints introduced by this
patch, one might have to patch and re-compile the kernel;
exporting the necessary functions to non-GPL modules.
* Prior to upstream kernel version v3.14-rc6-30-g66cc69e, Linux
tracepoints are not available for unsigned kernel modules
(tracepoints will get disabled due to the module's 'F' taint).
Thus, one either has to sign the zfs kernel module prior to
loading it, or use a kernel versioned v3.14-rc6-30-g66cc69e or
newer.
Assuming the above two requirements are satisfied, let's look at an
example of how this patch can be used and what information it exposes
(all commands run as 'root'):
# list all zfs tracepoints available
$ ls /sys/kernel/debug/tracing/events/zfs
enable filter zfs_arc__delete
zfs_arc__evict zfs_arc__hit zfs_arc__miss
zfs_l2arc__evict zfs_l2arc__hit zfs_l2arc__iodone
zfs_l2arc__miss zfs_l2arc__read zfs_l2arc__write
zfs_new_state__mfu zfs_new_state__mru
# enable all zfs tracepoints, clear the tracepoint ring buffer
$ echo 1 > /sys/kernel/debug/tracing/events/zfs/enable
$ echo 0 > /sys/kernel/debug/tracing/trace
# import zpool called 'tank', inspect tracepoint data (each line was
# truncated, they're too long for a commit message otherwise)
$ zpool import tank
$ cat /sys/kernel/debug/tracing/trace | head -n35
# tracer: nop
#
# entries-in-buffer/entries-written: 1219/1219 #P:8
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
lt-zpool-30132 [003] .... 91344.200050: zfs_arc__miss: hdr...
z_rd_int/0-30156 [003] .... 91344.200611: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.201173: zfs_arc__miss: hdr...
z_rd_int/1-30157 [003] .... 91344.201756: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.201795: zfs_arc__miss: hdr...
z_rd_int/2-30158 [003] .... 91344.202099: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.202126: zfs_arc__hit: hdr ...
lt-zpool-30132 [003] .... 91344.202130: zfs_arc__hit: hdr ...
lt-zpool-30132 [003] .... 91344.202134: zfs_arc__hit: hdr ...
lt-zpool-30132 [003] .... 91344.202146: zfs_arc__miss: hdr...
z_rd_int/3-30159 [003] .... 91344.202457: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.202484: zfs_arc__miss: hdr...
z_rd_int/4-30160 [003] .... 91344.202866: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.202891: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.203034: zfs_arc__miss: hdr...
z_rd_iss/1-30149 [001] .... 91344.203749: zfs_new_state__mru...
lt-zpool-30132 [001] .... 91344.203789: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.203878: zfs_arc__miss: hdr...
z_rd_iss/3-30151 [001] .... 91344.204315: zfs_new_state__mru...
lt-zpool-30132 [001] .... 91344.204332: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.204337: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.204352: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.204356: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.204360: zfs_arc__hit: hdr ...
To highlight the kind of detailed information that is being exported
using this infrastructure, I've taken the first tracepoint line from the
output above and reformatted it such that it fits in 80 columns:
lt-zpool-30132 [003] .... 91344.200050: zfs_arc__miss:
hdr {
dva 0x1:0x40082
birth 15491
cksum0 0x163edbff3a
flags 0x640
datacnt 1
type 1
size 2048
spa 3133524293419867460
state_type 0
access 0
mru_hits 0
mru_ghost_hits 0
mfu_hits 0
mfu_ghost_hits 0
l2_hits 0
refcount 1
} bp {
dva0 0x1:0x40082
dva1 0x1:0x3000e5
dva2 0x1:0x5a006e
cksum 0x163edbff3a:0x75af30b3dd6:0x1499263ff5f2b:0x288bd118815e00
lsize 2048
} zb {
objset 0
object 0
level -1
blkid 0
}
For the specific tracepoint shown here, 'zfs_arc__miss', data is
exported detailing the arc_buf_hdr_t (hdr), blkptr_t (bp), and
zbookmark_t (zb) that caused the ARC miss (down to the exact DVA!).
This kind of precise and detailed information can be extremely valuable
when trying to answer certain kinds of questions.
For anybody unfamiliar but looking to build on this, I found the XFS
source code along with the following three web links to be extremely
helpful:
* http://lwn.net/Articles/379903/
* http://lwn.net/Articles/381064/
* http://lwn.net/Articles/383362/
I should also note the more "boring" aspects of this patch:
* The ZFS_LINUX_COMPILE_IFELSE autoconf macro was modified to
support a sixth parameter. This parameter is used to populate the
contents of the new conftest.h file. If no sixth parameter is
provided, conftest.h will be empty.
* The ZFS_LINUX_TRY_COMPILE_HEADER autoconf macro was introduced.
This macro is nearly identical to the ZFS_LINUX_TRY_COMPILE macro,
except it has support for a fifth option that is then passed as
the sixth parameter to ZFS_LINUX_COMPILE_IFELSE.
These autoconf changes were needed to test the availability of the Linux
tracepoint macros. Due to the odd nature of the Linux tracepoint macro
API, a separate ".h" must be created (the path and filename is used
internally by the kernel's define_trace.h file).
* The HAVE_DECLARE_EVENT_CLASS autoconf macro was introduced. This
is to determine if we can safely enable the Linux tracepoint
functionality. We need to selectively disable the tracepoint code
due to the kernel exporting certain functions as GPL only. Without
this check, the build process will fail at link time.
In addition, the SET_ERROR macro was modified into a tracepoint as well.
To do this, the 'sdt.h' file was moved into the 'include/sys' directory
and now contains a userspace portion and a kernel space portion. The
dprintf and zfs_dbgmsg* interfaces are now implemented as tracepoints as
well.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-06-13 17:54:48 +00:00
|
|
|
void
|
|
|
|
zfs_panic_recover(const char *fmt, ...)
|
|
|
|
{
|
|
|
|
va_list adx;
|
|
|
|
|
|
|
|
va_start(adx, fmt);
|
|
|
|
vcmn_err(zfs_recover ? CE_WARN : CE_PANIC, fmt, adx);
|
|
|
|
va_end(adx);
|
|
|
|
}
|
|
|
|
|
2010-05-28 20:45:14 +00:00
|
|
|
/*
|
|
|
|
* This is a stripped-down version of strtoull, suitable only for converting
|
2013-06-11 17:12:34 +00:00
|
|
|
* lowercase hexadecimal numbers that don't overflow.
|
2010-05-28 20:45:14 +00:00
|
|
|
*/
|
|
|
|
uint64_t
|
|
|
|
strtonum(const char *str, char **nptr)
|
|
|
|
{
|
|
|
|
uint64_t val = 0;
|
|
|
|
char c;
|
|
|
|
int digit;
|
|
|
|
|
|
|
|
while ((c = *str) != '\0') {
|
|
|
|
if (c >= '0' && c <= '9')
|
|
|
|
digit = c - '0';
|
|
|
|
else if (c >= 'a' && c <= 'f')
|
|
|
|
digit = 10 + c - 'a';
|
|
|
|
else
|
|
|
|
break;
|
|
|
|
|
|
|
|
val *= 16;
|
|
|
|
val += digit;
|
|
|
|
|
|
|
|
str++;
|
|
|
|
}
|
|
|
|
|
|
|
|
if (nptr)
|
|
|
|
*nptr = (char *)str;
|
|
|
|
|
|
|
|
return (val);
|
|
|
|
}
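The behaviour described in the comment above strtonum() (lowercase hex only, stop at the first other character, no overflow check) can be checked with a userland copy. The name ex_strtonum() is hypothetical and is used to avoid clashing with any libc function of the same name. The body mirrors the loop above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userland copy of the parser: lowercase hex digits only, no overflow check. */
static uint64_t
ex_strtonum(const char *str, char **nptr)
{
	uint64_t val = 0;
	char c;
	int digit;

	while ((c = *str) != '\0') {
		if (c >= '0' && c <= '9')
			digit = c - '0';
		else if (c >= 'a' && c <= 'f')
			digit = 10 + c - 'a';
		else
			break;	/* first non-hex character ends the number */

		val = val * 16 + (uint64_t)digit;
		str++;
	}

	if (nptr != NULL)
		*nptr = (char *)str;	/* points at the unparsed remainder */

	return (val);
}
```

Parsing "1a2b/child" yields 0x1a2b with the end pointer left on the '/'. An uppercase input such as "FF" yields 0 and leaves the end pointer on the first 'F', because uppercase digits are not recognised.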
|
|
|
|
|
2008-11-20 20:01:55 +00:00
|
|
|
/*
|
|
|
|
* ==========================================================================
|
|
|
|
* Accessor functions
|
|
|
|
* ==========================================================================
|
|
|
|
*/
|
|
|
|
|
2008-12-03 20:09:06 +00:00
|
|
|
boolean_t
|
|
|
|
spa_shutting_down(spa_t *spa)
|
2008-11-20 20:01:55 +00:00
|
|
|
{
|
2008-12-03 20:09:06 +00:00
|
|
|
return (spa->spa_async_suspended);
|
2008-11-20 20:01:55 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
dsl_pool_t *
|
|
|
|
spa_get_dsl(spa_t *spa)
|
|
|
|
{
|
|
|
|
return (spa->spa_dsl_pool);
|
|
|
|
}
|
|
|
|
|
2012-12-13 23:24:15 +00:00
|
|
|
boolean_t
|
|
|
|
spa_is_initializing(spa_t *spa)
|
|
|
|
{
|
|
|
|
return (spa->spa_is_initializing);
|
|
|
|
}
|
|
|
|
|
2008-11-20 20:01:55 +00:00
|
|
|
blkptr_t *
|
|
|
|
spa_get_rootblkptr(spa_t *spa)
|
|
|
|
{
|
|
|
|
return (&spa->spa_ubsync.ub_rootbp);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
spa_set_rootblkptr(spa_t *spa, const blkptr_t *bp)
|
|
|
|
{
|
|
|
|
spa->spa_uberblock.ub_rootbp = *bp;
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
spa_altroot(spa_t *spa, char *buf, size_t buflen)
|
|
|
|
{
|
|
|
|
if (spa->spa_root == NULL)
|
|
|
|
buf[0] = '\0';
|
|
|
|
else
|
|
|
|
(void) strlcpy(buf, spa->spa_root, buflen);
|
|
|
|
}
|
|
|
|
|
|
|
|
int
|
|
|
|
spa_sync_pass(spa_t *spa)
|
|
|
|
{
|
|
|
|
return (spa->spa_sync_pass);
|
|
|
|
}
|
|
|
|
|
|
|
|
char *
|
|
|
|
spa_name(spa_t *spa)
|
|
|
|
{
|
|
|
|
return (spa->spa_name);
|
|
|
|
}
|
|
|
|
|
|
|
|
uint64_t
|
|
|
|
spa_guid(spa_t *spa)
|
|
|
|
{
|
2012-12-14 20:38:04 +00:00
|
|
|
dsl_pool_t *dp = spa_get_dsl(spa);
|
|
|
|
uint64_t guid;
|
|
|
|
|
2008-11-20 20:01:55 +00:00
|
|
|
/*
|
|
|
|
* If we fail to parse the config during spa_load(), we can go through
|
|
|
|
* the error path (which posts an ereport) and end up here with no root
|
2011-11-11 22:07:54 +00:00
|
|
|
* vdev. We stash the original pool guid in 'spa_config_guid' to handle
|
2008-11-20 20:01:55 +00:00
|
|
|
* this case.
|
|
|
|
*/
|
2012-12-14 20:38:04 +00:00
|
|
|
if (spa->spa_root_vdev == NULL)
|
|
|
|
return (spa->spa_config_guid);
|
|
|
|
|
|
|
|
guid = spa->spa_last_synced_guid != 0 ?
|
|
|
|
spa->spa_last_synced_guid : spa->spa_root_vdev->vdev_guid;
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Return the most recently synced out guid unless we're
|
|
|
|
* in syncing context.
|
|
|
|
*/
|
|
|
|
if (dp && dsl_pool_sync_context(dp))
|
2008-11-20 20:01:55 +00:00
|
|
|
return (spa->spa_root_vdev->vdev_guid);
|
|
|
|
else
|
2012-12-14 20:38:04 +00:00
|
|
|
return (guid);
|
2011-11-11 22:07:54 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
uint64_t
|
|
|
|
spa_load_guid(spa_t *spa)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* This is a GUID that exists solely as a reference for the
|
|
|
|
* purposes of the arc. It is generated at load time, and
|
|
|
|
* is never written to persistent storage.
|
|
|
|
*/
|
|
|
|
return (spa->spa_load_guid);
|
2008-11-20 20:01:55 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
uint64_t
|
|
|
|
spa_last_synced_txg(spa_t *spa)
|
|
|
|
{
|
|
|
|
return (spa->spa_ubsync.ub_txg);
|
|
|
|
}
|
|
|
|
|
|
|
|
uint64_t
|
|
|
|
spa_first_txg(spa_t *spa)
|
|
|
|
{
|
|
|
|
return (spa->spa_first_txg);
|
|
|
|
}
|
|
|
|
|
2010-05-28 20:45:14 +00:00
|
|
|
uint64_t
|
|
|
|
spa_syncing_txg(spa_t *spa)
|
|
|
|
{
|
|
|
|
return (spa->spa_syncing_txg);
|
|
|
|
}
|
|
|
|
|
2008-12-03 20:09:06 +00:00
|
|
|
pool_state_t
|
2008-11-20 20:01:55 +00:00
|
|
|
spa_state(spa_t *spa)
|
|
|
|
{
|
|
|
|
return (spa->spa_state);
|
|
|
|
}
|
|
|
|
|
2010-05-28 20:45:14 +00:00
|
|
|
spa_load_state_t
|
|
|
|
spa_load_state(spa_t *spa)
|
2008-11-20 20:01:55 +00:00
|
|
|
{
|
2010-05-28 20:45:14 +00:00
|
|
|
return (spa->spa_load_state);
|
2008-11-20 20:01:55 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
uint64_t
|
2010-05-28 20:45:14 +00:00
|
|
|
spa_freeze_txg(spa_t *spa)
|
2008-11-20 20:01:55 +00:00
|
|
|
{
|
2010-05-28 20:45:14 +00:00
|
|
|
return (spa->spa_freeze_txg);
|
2008-11-20 20:01:55 +00:00
|
|
|
}
|
|
|
|
|
2010-05-28 20:45:14 +00:00
|
|
|
/* ARGSUSED */
|
2008-11-20 20:01:55 +00:00
|
|
|
uint64_t
|
2010-05-28 20:45:14 +00:00
|
|
|
spa_get_asize(spa_t *spa, uint64_t lsize)
|
2008-11-20 20:01:55 +00:00
|
|
|
{
|
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithm (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPUs; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping on a condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 03:01:20 +00:00
|
|
|
return (lsize * spa_asize_inflation);
|
2008-11-20 20:01:55 +00:00
|
|
|
}
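spa_get_asize() returns a worst-case allocation estimate by scaling the logical size with spa_asize_inflation. The sketch below assumes an inflation factor of 24, which is an assumption about the tunable's default and not something stated in this excerpt. ex_get_asize() is a hypothetical stand-in that takes the factor as a parameter.

```c
#include <assert.h>
#include <stdint.h>

/* Worst-case allocated size for a logical size, given an inflation factor. */
static uint64_t
ex_get_asize(uint64_t lsize, uint64_t inflation)
{
	assert(inflation != 0);
	return (lsize * inflation);
}
```

Under that assumption, a 128 KiB logical write is budgeted as 128 KiB * 24 = 3 MiB of allocated space.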
|
|
|
|
|
2014-11-03 20:28:43 +00:00
|
|
|
/*
|
|
|
|
* Return the amount of slop space in bytes. It is 1/32 of the pool (about
* 3.1%),
|
|
|
|
* or at least 32MB.
|
|
|
|
*
|
|
|
|
* See the comment above spa_slop_shift for details.
|
|
|
|
*/
|
|
|
|
uint64_t
|
|
|
|
spa_get_slop_space(spa_t *spa) {
|
|
|
|
uint64_t space = spa_get_dspace(spa);
|
|
|
|
return (MAX(space >> spa_slop_shift, SPA_MINDEVSIZE >> 1));
|
|
|
|
}
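A worked example of the slop-space calculation may help. The sketch assumes spa_slop_shift is 5 (matching the "1/32" in the comment) and SPA_MINDEVSIZE is 64 MiB (so that half of it is the 32MB floor the comment mentions). Both EX_* constants and ex_slop_space() are hypothetical names for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define	EX_SLOP_SHIFT	5		/* assumed: 1/32 of the pool */
#define	EX_MINDEVSIZE	(64ULL << 20)	/* assumed: 64 MiB */

/* Slop space is the larger of dspace / 32 and half the minimum device size. */
static uint64_t
ex_slop_space(uint64_t dspace)
{
	uint64_t share = dspace >> EX_SLOP_SHIFT;
	uint64_t minimum = EX_MINDEVSIZE >> 1;

	return (share > minimum ? share : minimum);
}
```

With these values a 1 TiB pool reserves 32 GiB, while a 512 MiB pool would compute only 16 MiB and is therefore raised to the 32 MiB floor.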
|
|
|
|
|
2008-11-20 20:01:55 +00:00
|
|
|
uint64_t
|
|
|
|
spa_get_dspace(spa_t *spa)
|
|
|
|
{
|
2010-05-28 20:45:14 +00:00
|
|
|
return (spa->spa_dspace);
|
2008-11-20 20:01:55 +00:00
|
|
|
}
|
|
|
|
|
2010-05-28 20:45:14 +00:00
|
|
|
void
|
|
|
|
spa_update_dspace(spa_t *spa)
|
2008-11-20 20:01:55 +00:00
|
|
|
{
|
2010-05-28 20:45:14 +00:00
|
|
|
spa->spa_dspace = metaslab_class_get_dspace(spa_normal_class(spa)) +
|
|
|
|
ddt_get_dedup_dspace(spa);
|
2008-11-20 20:01:55 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Return the failure mode that has been set to this pool. The default
|
|
|
|
* behavior will be to block all I/Os when a complete failure occurs.
|
|
|
|
*/
|
|
|
|
uint8_t
|
|
|
|
spa_get_failmode(spa_t *spa)
|
|
|
|
{
|
|
|
|
return (spa->spa_failmode);
|
|
|
|
}
|
|
|
|
|
2008-12-03 20:09:06 +00:00
|
|
|
boolean_t
|
|
|
|
spa_suspended(spa_t *spa)
|
|
|
|
{
|
|
|
|
return (spa->spa_suspended);
|
|
|
|
}
|
|
|
|
|
2008-11-20 20:01:55 +00:00
|
|
|
uint64_t
|
|
|
|
spa_version(spa_t *spa)
|
|
|
|
{
|
|
|
|
return (spa->spa_ubsync.ub_version);
|
|
|
|
}
|
|
|
|
|
2010-05-28 20:45:14 +00:00
|
|
|
boolean_t
|
|
|
|
spa_deflate(spa_t *spa)
|
|
|
|
{
|
|
|
|
return (spa->spa_deflate);
|
|
|
|
}
|
|
|
|
|
|
|
|
metaslab_class_t *
|
|
|
|
spa_normal_class(spa_t *spa)
|
|
|
|
{
|
|
|
|
return (spa->spa_normal_class);
|
|
|
|
}
|
|
|
|
|
|
|
|
metaslab_class_t *
|
|
|
|
spa_log_class(spa_t *spa)
|
|
|
|
{
|
|
|
|
return (spa->spa_log_class);
|
|
|
|
}
|
|
|
|
|
2015-04-02 03:44:32 +00:00
|
|
|
void
|
|
|
|
spa_evicting_os_register(spa_t *spa, objset_t *os)
|
|
|
|
{
|
|
|
|
mutex_enter(&spa->spa_evicting_os_lock);
|
|
|
|
list_insert_head(&spa->spa_evicting_os_list, os);
|
|
|
|
mutex_exit(&spa->spa_evicting_os_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
spa_evicting_os_deregister(spa_t *spa, objset_t *os)
|
|
|
|
{
|
|
|
|
mutex_enter(&spa->spa_evicting_os_lock);
|
|
|
|
list_remove(&spa->spa_evicting_os_list, os);
|
|
|
|
cv_broadcast(&spa->spa_evicting_os_cv);
|
|
|
|
mutex_exit(&spa->spa_evicting_os_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
spa_evicting_os_wait(spa_t *spa)
|
|
|
|
{
|
|
|
|
mutex_enter(&spa->spa_evicting_os_lock);
|
|
|
|
while (!list_is_empty(&spa->spa_evicting_os_list))
|
|
|
|
cv_wait(&spa->spa_evicting_os_cv, &spa->spa_evicting_os_lock);
|
|
|
|
mutex_exit(&spa->spa_evicting_os_lock);
|
|
|
|
|
|
|
|
dmu_buf_user_evict_wait();
|
|
|
|
}
|
|
|
|
|
2008-11-20 20:01:55 +00:00
|
|
|
int
|
|
|
|
spa_max_replication(spa_t *spa)
|
|
|
|
{
|
|
|
|
/*
|
|
|
|
* As of SPA_VERSION == SPA_VERSION_DITTO_BLOCKS, we are able to
|
|
|
|
* handle BPs with more than one DVA allocated. Set our max
|
|
|
|
* replication level accordingly.
|
|
|
|
*/
|
|
|
|
if (spa_version(spa) < SPA_VERSION_DITTO_BLOCKS)
|
|
|
|
return (1);
|
|
|
|
return (MIN(SPA_DVAS_PER_BP, spa_max_replication_override));
|
|
|
|
}
|
|
|
|
|
2010-05-28 20:45:14 +00:00
|
|
|
int
|
|
|
|
spa_prev_software_version(spa_t *spa)
|
|
|
|
{
|
|
|
|
return (spa->spa_prev_software_version);
|
|
|
|
}
|
|
|
|
|
2013-04-29 22:49:23 +00:00
|
|
|
uint64_t
|
|
|
|
spa_deadman_synctime(spa_t *spa)
|
|
|
|
{
|
|
|
|
return (spa->spa_deadman_synctime);
|
|
|
|
}
|
|
|
|
|
2008-11-20 20:01:55 +00:00
|
|
|
uint64_t
|
2010-05-28 20:45:14 +00:00
|
|
|
dva_get_dsize_sync(spa_t *spa, const dva_t *dva)
|
2008-11-20 20:01:55 +00:00
|
|
|
{
|
2010-05-28 20:45:14 +00:00
|
|
|
uint64_t asize = DVA_GET_ASIZE(dva);
|
|
|
|
uint64_t dsize = asize;
|
2008-11-20 20:01:55 +00:00
|
|
|
|
2010-05-28 20:45:14 +00:00
|
|
|
ASSERT(spa_config_held(spa, SCL_ALL, RW_READER) != 0);
|
2008-11-20 20:01:55 +00:00
|
|
|
|
2010-05-28 20:45:14 +00:00
|
|
|
if (asize != 0 && spa->spa_deflate) {
|
|
|
|
vdev_t *vd = vdev_lookup_top(spa, DVA_GET_VDEV(dva));
|
2014-05-05 18:28:12 +00:00
|
|
|
if (vd != NULL)
|
|
|
|
dsize = (asize >> SPA_MINBLOCKSHIFT) *
|
|
|
|
vd->vdev_deflate_ratio;
|
2008-11-20 20:01:55 +00:00
|
|
|
}
|
2010-05-28 20:45:14 +00:00
|
|
|
|
|
|
|
return (dsize);
|
|
|
|
}
|
|
|
|
|
|
|
|
uint64_t
|
|
|
|
bp_get_dsize_sync(spa_t *spa, const blkptr_t *bp)
|
|
|
|
{
|
|
|
|
uint64_t dsize = 0;
|
2010-08-26 16:52:39 +00:00
|
|
|
int d;
|
2010-05-28 20:45:14 +00:00
|
|
|
|
2014-06-05 21:19:08 +00:00
|
|
|
for (d = 0; d < BP_GET_NDVAS(bp); d++)
|
2010-05-28 20:45:14 +00:00
|
|
|
dsize += dva_get_dsize_sync(spa, &bp->blk_dva[d]);
|
|
|
|
|
|
|
|
return (dsize);
|
|
|
|
}
|
|
|
|
|
|
|
|
uint64_t
|
|
|
|
bp_get_dsize(spa_t *spa, const blkptr_t *bp)
|
|
|
|
{
|
|
|
|
uint64_t dsize = 0;
|
2010-08-26 16:52:39 +00:00
|
|
|
int d;
|
2010-05-28 20:45:14 +00:00
|
|
|
|
|
|
|
spa_config_enter(spa, SCL_VDEV, FTAG, RW_READER);
|
|
|
|
|
2014-06-05 21:19:08 +00:00
|
|
|
for (d = 0; d < BP_GET_NDVAS(bp); d++)
|
2010-05-28 20:45:14 +00:00
|
|
|
dsize += dva_get_dsize_sync(spa, &bp->blk_dva[d]);
|
|
|
|
|
2008-12-03 20:09:06 +00:00
|
|
|
spa_config_exit(spa, SCL_VDEV, FTAG);
|
2010-05-28 20:45:14 +00:00
|
|
|
|
|
|
|
return (dsize);
|
2008-11-20 20:01:55 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* ==========================================================================
|
|
|
|
* Initialization and Termination
|
|
|
|
* ==========================================================================
|
|
|
|
*/
|
|
|
|
|
|
|
|
static int
|
|
|
|
spa_name_compare(const void *a1, const void *a2)
|
|
|
|
{
|
|
|
|
const spa_t *s1 = a1;
|
|
|
|
const spa_t *s2 = a2;
|
|
|
|
int s;
|
|
|
|
|
|
|
|
s = strcmp(s1->spa_name, s2->spa_name);
|
2016-08-27 18:12:53 +00:00
|
|
|
|
|
|
|
return (AVL_ISIGN(s));
|
2008-11-20 20:01:55 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
2010-08-26 16:52:41 +00:00
|
|
|
spa_boot_init(void)
|
2008-11-20 20:01:55 +00:00
|
|
|
{
|
|
|
|
spa_config_load();
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
spa_init(int mode)
|
|
|
|
{
|
|
|
|
mutex_init(&spa_namespace_lock, NULL, MUTEX_DEFAULT, NULL);
|
|
|
|
mutex_init(&spa_spare_lock, NULL, MUTEX_DEFAULT, NULL);
|
|
|
|
mutex_init(&spa_l2cache_lock, NULL, MUTEX_DEFAULT, NULL);
|
|
|
|
cv_init(&spa_namespace_cv, NULL, CV_DEFAULT, NULL);
|
|
|
|
|
|
|
|
avl_create(&spa_namespace_avl, spa_name_compare, sizeof (spa_t),
|
|
|
|
offsetof(spa_t, spa_avl));
|
|
|
|
|
|
|
|
avl_create(&spa_spare_avl, spa_spare_compare, sizeof (spa_aux_t),
|
|
|
|
offsetof(spa_aux_t, aux_avl));
|
|
|
|
|
|
|
|
avl_create(&spa_l2cache_avl, spa_l2cache_compare, sizeof (spa_aux_t),
|
|
|
|
offsetof(spa_aux_t, aux_avl));
|
|
|
|
|
2009-01-15 21:59:39 +00:00
|
|
|
spa_mode_global = mode;
|
2008-11-20 20:01:55 +00:00
|
|
|
|
2013-05-16 21:18:06 +00:00
|
|
|
#ifndef _KERNEL
|
|
|
|
if (spa_mode_global != FREAD && dprintf_find_string("watch")) {
|
|
|
|
struct sigaction sa;
|
|
|
|
|
|
|
|
sa.sa_flags = SA_SIGINFO;
|
|
|
|
sigemptyset(&sa.sa_mask);
|
|
|
|
sa.sa_sigaction = arc_buf_sigsegv;
|
|
|
|
|
|
|
|
if (sigaction(SIGSEGV, &sa, NULL) == -1) {
|
|
|
|
perror("could not enable watchpoints: "
|
|
|
|
"sigaction(SIGSEGV, ...) = ");
|
|
|
|
} else {
|
|
|
|
arc_watch = B_TRUE;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
#endif
|
|
|
|
|
2010-08-26 18:42:43 +00:00
|
|
|
fm_init();
|
2008-11-20 20:01:55 +00:00
|
|
|
refcount_init();
|
|
|
|
unique_init();
|
Illumos #4101, #4102, #4103, #4105, #4106
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Sebastien Roy <seb@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Prior to this patch, space_maps were preferred solely based on the
amount of free space left in each. Unfortunately, this heuristic didn't
contain any information about the make-up of that free space, which
meant we could keep preferring and loading a highly fragmented space map
that wouldn't actually have enough contiguous space to satisfy the
allocation; then unloading that space_map and repeating the process.
This change modifies the space_maps to store additional information
about the contiguous space in the space_map, so that we can use this
information to make a better decision about which space_map to load.
This requires reallocating all space_map objects to increase their
bonus buffer sizes enough to fit the new metadata.
The above feature can be enabled via a new feature flag introduced by
this change: com.delphix:spacemap_histogram
In addition to the above, this patch allows the space_map block size to
be increased. Currently the block size is set to be 4K in size, which has
certain implications including the following:
* 4K sector devices will not see any compression benefit
* large space_maps require more metadata on-disk
* large space_maps require more time to load (typically random reads)
Now the space_map block size can adjust as needed up to the maximum size
set via the space_map_max_blksz variable.
A bug was fixed which resulted in potentially leaking an object when
removing a mirrored log device. The previous logic for vdev_remove() did
not deal with removing top-level vdevs that are interior vdevs (i.e.
mirror) correctly. The problem would occur when removing a mirrored log
device, and result in the DTL space map object being leaked; because
top-level vdevs don't have DTL space map objects associated with them.
References:
https://www.illumos.org/issues/4101
https://www.illumos.org/issues/4102
https://www.illumos.org/issues/4103
https://www.illumos.org/issues/4105
https://www.illumos.org/issues/4106
https://github.com/illumos/illumos-gate/commit/0713e23
Porting notes:
A handful of kmem_alloc() calls were converted to kmem_zalloc(). Also,
the KM_PUSHPAGE and TQ_PUSHPAGE flags were used as necessary.
Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2488
2013-10-01 21:25:53 +00:00
|
|
|
range_tree_init();
|
2013-11-19 21:34:46 +00:00
|
|
|
ddt_init();
|
2008-11-20 20:01:55 +00:00
|
|
|
zio_init();
|
|
|
|
dmu_init();
|
|
|
|
zil_init();
|
|
|
|
vdev_cache_stat_init();
|
SIMD implementation of vdev_raidz generate and reconstruct routines
This is a new implementation of RAIDZ1/2/3 routines using x86_64
scalar, SSE, and AVX2 instruction sets. Included are 3 parity
generation routines (P, PQ, and PQR) and 7 reconstruction routines,
for all RAIDZ levels. On module load, a quick benchmark of supported
routines will select the fastest for each operation and they will
be used at runtime. Original implementation is still present and
can be selected via module parameter.
Patch contains:
- specialized gen/rec routines for all RAIDZ levels,
- new scalar raidz implementation (unrolled),
- two x86_64 SIMD implementations (SSE and AVX2 instruction sets),
- fastest routines selected on module load (benchmark).
- cmd/raidz_test - verify and benchmark all implementations
- added raidz_test to the ZFS Test Suite
New zfs module parameters:
- zfs_vdev_raidz_impl (str): selects the implementation to use. On
module load, the parameter will only accept first 3 options, and
the other implementations can be set once module is finished
loading. Possible values for this option are:
"fastest" - use the fastest math available
"original" - use the original raidz code
"scalar" - new scalar impl
"sse" - new SSE impl if available
"avx2" - new AVX2 impl if available
See contents of `/sys/module/zfs/parameters/zfs_vdev_raidz_impl` to
get the list of supported values. If an implementation is not supported
on the system, it will not be shown. Currently selected option is
enclosed in `[]`.
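As an illustration of the bracket convention described above, the following is a minimal sketch of how a userspace tool might pick out the currently selected implementation from the text read out of `/sys/module/zfs/parameters/zfs_vdev_raidz_impl`. The helper name `selected_impl` is hypothetical, and the assumption that the file holds a single line of names with exactly one of them wrapped in `[` and `]` goes beyond what the commit message states.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy the token enclosed in '[' and ']' from list into out.
 * Returns 0 on success, or -1 if no bracketed token is found or the
 * token (plus its terminating NUL) does not fit in out.
 * Hypothetical helper, not part of ZFS.
 */
static int
selected_impl(const char *list, char *out, size_t outlen)
{
	const char *lb, *rb;
	size_t len;

	assert(list != NULL && out != NULL);

	lb = strchr(list, '[');
	rb = (lb != NULL) ? strchr(lb, ']') : NULL;
	if (rb == NULL)
		return (-1);

	len = (size_t)(rb - lb - 1);
	if (len + 1 > outlen)
		return (-1);

	memcpy(out, lb + 1, len);
	out[len] = '\0';
	return (0);
}
```

Given the sample string `fastest [original] scalar`, the helper would copy `original` into the output buffer.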
Signed-off-by: Gvozden Neskovic <neskovic@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #4328
2016-04-25 08:04:31 +00:00
|
|
|
vdev_raidz_math_init();
|
2008-11-20 20:01:55 +00:00
|
|
|
zfs_prop_init();
|
|
|
|
zpool_prop_init();
|
2012-12-13 23:24:15 +00:00
|
|
|
zpool_feature_init();
|
2008-11-20 20:01:55 +00:00
|
|
|
spa_config_load();
|
2008-12-03 20:09:06 +00:00
|
|
|
l2arc_start();
|
2008-11-20 20:01:55 +00:00
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
spa_fini(void)
|
|
|
|
{
|
2008-12-03 20:09:06 +00:00
|
|
|
l2arc_stop();
|
|
|
|
|
2008-11-20 20:01:55 +00:00
|
|
|
spa_evict_all();
|
|
|
|
|
|
|
|
vdev_cache_stat_fini();
|
2016-04-25 08:04:31 +00:00
|
|
|
vdev_raidz_math_fini();
|
2008-11-20 20:01:55 +00:00
|
|
|
zil_fini();
|
|
|
|
dmu_fini();
|
|
|
|
zio_fini();
|
2013-11-19 21:34:46 +00:00
|
|
|
ddt_fini();
|
2013-10-01 21:25:53 +00:00
|
|
|
range_tree_fini();
|
2008-11-20 20:01:55 +00:00
|
|
|
unique_fini();
|
|
|
|
refcount_fini();
|
2010-08-26 18:42:43 +00:00
|
|
|
fm_fini();
|
2008-11-20 20:01:55 +00:00
|
|
|
|
|
|
|
avl_destroy(&spa_namespace_avl);
|
|
|
|
avl_destroy(&spa_spare_avl);
|
|
|
|
avl_destroy(&spa_l2cache_avl);
|
|
|
|
|
|
|
|
cv_destroy(&spa_namespace_cv);
|
|
|
|
mutex_destroy(&spa_namespace_lock);
|
|
|
|
mutex_destroy(&spa_spare_lock);
|
|
|
|
mutex_destroy(&spa_l2cache_lock);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Return whether this pool has slogs. No locking needed.
|
|
|
|
* It's not a problem if the wrong answer is returned as it's only for
|
|
|
|
* performance and not correctness.
|
|
|
|
*/
|
|
|
|
boolean_t
|
|
|
|
spa_has_slogs(spa_t *spa)
|
|
|
|
{
|
|
|
|
return (spa->spa_log_class->mc_rotor != NULL);
|
|
|
|
}
|
2008-12-03 20:09:06 +00:00
|
|
|
|
2010-05-28 20:45:14 +00:00
|
|
|
spa_log_state_t
|
|
|
|
spa_get_log_state(spa_t *spa)
|
|
|
|
{
|
|
|
|
return (spa->spa_log_state);
|
|
|
|
}
|
|
|
|
|
|
|
|
void
|
|
|
|
spa_set_log_state(spa_t *spa, spa_log_state_t state)
|
|
|
|
{
|
|
|
|
spa->spa_log_state = state;
|
|
|
|
}
|
|
|
|
|
2008-12-03 20:09:06 +00:00
|
|
|
boolean_t
|
|
|
|
spa_is_root(spa_t *spa)
|
|
|
|
{
|
|
|
|
return (spa->spa_is_root);
|
|
|
|
}
|
2009-01-15 21:59:39 +00:00
|
|
|
|
|
|
|
boolean_t
|
|
|
|
spa_writeable(spa_t *spa)
|
|
|
|
{
|
|
|
|
return (!!(spa->spa_mode & FWRITE));
|
|
|
|
}
|
|
|
|
|
2014-07-18 15:08:31 +00:00
|
|
|
/*
|
|
|
|
* Returns true if there is a pending sync task in any of the current
|
|
|
|
* syncing txg, the current quiescing txg, or the current open txg.
|
|
|
|
*/
|
|
|
|
boolean_t
|
|
|
|
spa_has_pending_synctask(spa_t *spa)
|
|
|
|
{
|
|
|
|
return (!txg_all_lists_empty(&spa->spa_dsl_pool->dp_sync_tasks));
|
|
|
|
}
|
|
|
|
|
2009-01-15 21:59:39 +00:00
|
|
|
int
|
|
|
|
spa_mode(spa_t *spa)
|
|
|
|
{
|
|
|
|
return (spa->spa_mode);
|
|
|
|
}
|
2010-05-28 20:45:14 +00:00
|
|
|
|
|
|
|
uint64_t
|
|
|
|
spa_bootfs(spa_t *spa)
|
|
|
|
{
|
|
|
|
return (spa->spa_bootfs);
|
|
|
|
}
|
|
|
|
|
|
|
|
uint64_t
|
|
|
|
spa_delegation(spa_t *spa)
|
|
|
|
{
|
|
|
|
return (spa->spa_delegation);
|
|
|
|
}
|
|
|
|
|
|
|
|
objset_t *
|
|
|
|
spa_meta_objset(spa_t *spa)
|
|
|
|
{
|
|
|
|
return (spa->spa_meta_objset);
|
|
|
|
}
|
|
|
|
|
|
|
|
enum zio_checksum
|
|
|
|
spa_dedup_checksum(spa_t *spa)
|
|
|
|
{
|
|
|
|
return (spa->spa_dedup_checksum);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Reset pool scan stat per scan pass (or reboot).
|
|
|
|
*/
|
|
|
|
void
|
|
|
|
spa_scan_stat_init(spa_t *spa)
|
|
|
|
{
|
|
|
|
/* data not stored on disk */
|
|
|
|
spa->spa_scan_pass_start = gethrestime_sec();
|
|
|
|
spa->spa_scan_pass_exam = 0;
|
|
|
|
vdev_scan_stat_init(spa->spa_root_vdev);
|
|
|
|
}
|
|
|
|
|
|
|
|
/*
|
|
|
|
* Get scan stats for zpool status reports
|
|
|
|
*/
|
|
|
|
int
|
|
|
|
spa_scan_get_stats(spa_t *spa, pool_scan_stat_t *ps)
|
|
|
|
{
|
|
|
|
dsl_scan_t *scn = spa->spa_dsl_pool ? spa->spa_dsl_pool->dp_scan : NULL;
|
|
|
|
|
|
|
|
if (scn == NULL || scn->scn_phys.scn_func == POOL_SCAN_NONE)
|
2013-03-08 18:41:28 +00:00
|
|
|
return (SET_ERROR(ENOENT));
|
2010-05-28 20:45:14 +00:00
|
|
|
bzero(ps, sizeof (pool_scan_stat_t));
|
|
|
|
|
|
|
|
/* data stored on disk */
|
|
|
|
ps->pss_func = scn->scn_phys.scn_func;
|
|
|
|
ps->pss_start_time = scn->scn_phys.scn_start_time;
|
|
|
|
ps->pss_end_time = scn->scn_phys.scn_end_time;
|
|
|
|
ps->pss_to_examine = scn->scn_phys.scn_to_examine;
|
|
|
|
ps->pss_examined = scn->scn_phys.scn_examined;
|
|
|
|
ps->pss_to_process = scn->scn_phys.scn_to_process;
|
|
|
|
ps->pss_processed = scn->scn_phys.scn_processed;
|
|
|
|
ps->pss_errors = scn->scn_phys.scn_errors;
|
|
|
|
ps->pss_state = scn->scn_phys.scn_state;
|
|
|
|
|
|
|
|
/* data not stored on disk */
|
|
|
|
ps->pss_pass_start = spa->spa_scan_pass_start;
|
|
|
|
ps->pss_pass_exam = spa->spa_scan_pass_exam;
|
|
|
|
|
|
|
|
return (0);
|
|
|
|
}
|
2010-08-26 18:49:16 +00:00
|
|
|
|
2011-07-26 19:08:52 +00:00
|
|
|
boolean_t
|
|
|
|
spa_debug_enabled(spa_t *spa)
|
|
|
|
{
|
|
|
|
return (spa->spa_debug);
|
|
|
|
}
|
|
|
|
|
2014-11-03 20:15:08 +00:00
|
|
|
int
|
|
|
|
spa_maxblocksize(spa_t *spa)
|
|
|
|
{
|
|
|
|
if (spa_feature_is_enabled(spa, SPA_FEATURE_LARGE_BLOCKS))
|
|
|
|
return (SPA_MAXBLOCKSIZE);
|
|
|
|
else
|
|
|
|
return (SPA_OLD_MAXBLOCKSIZE);
|
|
|
|
}
|
|
|
|
|
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
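The read counts quoted above can be reproduced with a little arithmetic. The sketch below is only a worked example under the stated worst-case assumption (every dnode block and every spill block costs one separate read); the function names are hypothetical and do not exist in ZFS.

```c
#include <assert.h>
#include <stdint.h>

#define	DNODE_BLOCK_BYTES	16384u	/* 16k dnode block, per the text */

/* Number of dnodes of the given size that fit in one dnode block. */
static uint32_t
dnodes_per_block(uint32_t dnode_bytes)
{
	assert(dnode_bytes != 0 && DNODE_BLOCK_BYTES % dnode_bytes == 0);
	return (DNODE_BLOCK_BYTES / dnode_bytes);
}

/*
 * Worst-case reads needed to fetch ndnodes dnodes: one read per dnode
 * block touched, plus one read per dnode when each needs a spill block.
 */
static uint32_t
worst_case_reads(uint32_t ndnodes, uint32_t dnode_bytes, int needs_spill)
{
	uint32_t per_block = dnodes_per_block(dnode_bytes);
	uint32_t blocks = (ndnodes + per_block - 1) / per_block;

	return (blocks + (needs_spill ? ndnodes : 0));
}
```

With 32 dnodes of 512 bytes that all spill, this gives 1 + 32 = 33 reads; with 32 dnodes of 1024 bytes and no spill blocks, the dnodes span two blocks and the count drops to 2.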
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill blocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
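The relationship between the two conventions can be sketched as follows. This is an illustrative example only: the helper names and constants are hypothetical, and only the "extra slots plus one" rule, the 512-byte slot size, and the 16384-byte block size come from the description above.

```c
#include <assert.h>
#include <stdint.h>

#define	SLOT_BYTES	512u	/* size of one dnode_phys_t slot */
#define	BLOCK_BYTES	16384u	/* dnode block size stated in the text */

/* In-memory slot count derived from the on-disk "extra slots" value. */
static uint32_t
num_slots_from_extra(uint8_t extra_slots)
{
	return ((uint32_t)extra_slots + 1);
}

/* On-disk size in bytes of a dnode with the given extra-slot count. */
static uint32_t
dnode_bytes_from_extra(uint8_t extra_slots)
{
	uint32_t bytes = num_slots_from_extra(extra_slots) * SLOT_BYTES;

	/* A dnode may not be larger than its dnode block. */
	assert(bytes <= BLOCK_BYTES);
	return (bytes);
}
```

An extra-slot value of 0 therefore describes a legacy 512-byte dnode, and 31 describes a dnode that fills the whole 16k block.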
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
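The two validity rules just described (what the user interface accepts versus what the implementation can handle) might be expressed as the checks below. This is a hedged sketch: the function and macro names are hypothetical, and the real property parsing in ZFS is not shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define	DNSIZE_UNIT	512u	/* one dnode_phys_t slot */
#define	DNSIZE_MAX	16384u	/* dnode block size stated in the text */

/* Internally legal: any nonzero multiple of 512 up to 16k. */
static bool
dnsize_internal_ok(uint32_t bytes)
{
	return (bytes >= DNSIZE_UNIT && bytes <= DNSIZE_MAX &&
	    bytes % DNSIZE_UNIT == 0);
}

/* Accepted as a literal property value: a power of two from 1k to 16k. */
static bool
dnsize_literal_ok(uint32_t bytes)
{
	return (bytes >= 1024 && bytes <= DNSIZE_MAX &&
	    (bytes & (bytes - 1)) == 0);
}

/* Number of 512-byte slots occupied by an internally legal size. */
static uint32_t
dnsize_to_slots(uint32_t bytes)
{
	assert(dnsize_internal_ok(bytes));
	return (bytes / DNSIZE_UNIT);
}
```

For example, 1536 bytes passes the internal check (three slots) but would be rejected as a literal property value because it is not a power of two.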
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interface names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by a large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
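The bit layout described above can be illustrated with a small packing sketch. The helper names are hypothetical, and storing the slot count as-is in the top 8 bits is an assumption made for the example; the actual encoding used by the ZIL macros may differ (for instance by storing an offset value).

```c
#include <assert.h>
#include <stdint.h>

#define	FOID_OBJ_BITS	48	/* object ids are capped at 48 bits */
#define	FOID_SLOT_SHIFT	56	/* slot count lives in the top 8 bits */
#define	FOID_OBJ_MASK	((UINT64_C(1) << FOID_OBJ_BITS) - 1)

/* Combine an object id and a slot count into one 64-bit field. */
static uint64_t
foid_pack(uint64_t object, uint8_t slots)
{
	/* The object id must fit in the low 48 bits. */
	assert((object & ~FOID_OBJ_MASK) == 0);
	return (((uint64_t)slots << FOID_SLOT_SHIFT) | object);
}

/* Extract the object id from the low 48 bits. */
static uint64_t
foid_object(uint64_t foid)
{
	return (foid & FOID_OBJ_MASK);
}

/* Extract the slot count from the uppermost 8 bits. */
static uint8_t
foid_slots(uint64_t foid)
{
	return ((uint8_t)(foid >> FOID_SLOT_SHIFT));
}
```

Because the upper bits were previously zero, a field packed with a slot value of 0 is bit-for-bit identical to a plain object id.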
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 01:25:34 +00:00
|
|
|
int
|
|
|
|
spa_maxdnodesize(spa_t *spa)
|
|
|
|
{
|
|
|
|
if (spa_feature_is_enabled(spa, SPA_FEATURE_LARGE_DNODE))
|
|
|
|
return (DNODE_MAX_SIZE);
|
|
|
|
else
|
|
|
|
return (DNODE_MIN_SIZE);
|
|
|
|
}
|
|
|
|
|
2010-08-26 18:49:16 +00:00
#if defined(_KERNEL) && defined(HAVE_SPL)
/* Namespace manipulation */
EXPORT_SYMBOL(spa_lookup);
EXPORT_SYMBOL(spa_add);
EXPORT_SYMBOL(spa_remove);
EXPORT_SYMBOL(spa_next);

/* Refcount functions */
EXPORT_SYMBOL(spa_open_ref);
EXPORT_SYMBOL(spa_close);
EXPORT_SYMBOL(spa_refcount_zero);

/* Pool configuration lock */
EXPORT_SYMBOL(spa_config_tryenter);
EXPORT_SYMBOL(spa_config_enter);
EXPORT_SYMBOL(spa_config_exit);
EXPORT_SYMBOL(spa_config_held);

/* Pool vdev add/remove lock */
EXPORT_SYMBOL(spa_vdev_enter);
EXPORT_SYMBOL(spa_vdev_exit);

/* Pool vdev state change lock */
EXPORT_SYMBOL(spa_vdev_state_enter);
EXPORT_SYMBOL(spa_vdev_state_exit);

/* Accessor functions */
EXPORT_SYMBOL(spa_shutting_down);
EXPORT_SYMBOL(spa_get_dsl);
EXPORT_SYMBOL(spa_get_rootblkptr);
EXPORT_SYMBOL(spa_set_rootblkptr);
EXPORT_SYMBOL(spa_altroot);
EXPORT_SYMBOL(spa_sync_pass);
EXPORT_SYMBOL(spa_name);
EXPORT_SYMBOL(spa_guid);
EXPORT_SYMBOL(spa_last_synced_txg);
EXPORT_SYMBOL(spa_first_txg);
EXPORT_SYMBOL(spa_syncing_txg);
EXPORT_SYMBOL(spa_version);
EXPORT_SYMBOL(spa_state);
EXPORT_SYMBOL(spa_load_state);
EXPORT_SYMBOL(spa_freeze_txg);
EXPORT_SYMBOL(spa_get_asize);
EXPORT_SYMBOL(spa_get_dspace);
EXPORT_SYMBOL(spa_update_dspace);
EXPORT_SYMBOL(spa_deflate);
EXPORT_SYMBOL(spa_normal_class);
EXPORT_SYMBOL(spa_log_class);
EXPORT_SYMBOL(spa_max_replication);
EXPORT_SYMBOL(spa_prev_software_version);
EXPORT_SYMBOL(spa_get_failmode);
EXPORT_SYMBOL(spa_suspended);
EXPORT_SYMBOL(spa_bootfs);
EXPORT_SYMBOL(spa_delegation);
EXPORT_SYMBOL(spa_meta_objset);
2014-11-03 20:15:08 +00:00
EXPORT_SYMBOL(spa_maxblocksize);
Implement large_dnode pool feature
Justification
-------------
This feature adds support for variable length dnodes. Our motivation is
to eliminate the overhead associated with using spill blocks. Spill
blocks are used to store system attribute data (i.e. file metadata) that
does not fit in the dnode's bonus buffer. By allowing a larger bonus
buffer area the use of a spill block can be avoided. Spill blocks
potentially incur an additional read I/O for every dnode in a dnode
block. As a worst case example, reading 32 dnodes from a 16k dnode block
and all of the spill blocks could issue 33 separate reads. Now suppose
those dnodes have size 1024 and therefore don't need spill blocks. Then
the worst case number of blocks read is reduced from 33 to two--one
per dnode block. In practice spill blocks may tend to be co-located on
disk with the dnode blocks so the reduction in I/O would not be this
drastic. In a badly fragmented pool, however, the improvement could be
significant.
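The worst-case arithmetic above can be sketched in C. This is only an illustration of the numbers quoted in the message; the helper name worst_case_reads() and the constant DNODE_BLOCK_BYTES are hypothetical and do not appear in the ZFS source.

```c
#include <assert.h>
#include <stdint.h>

#define	DNODE_BLOCK_BYTES	16384u	/* dnode block size quoted above */

/*
 * Hypothetical helper: worst-case number of reads needed to fetch
 * `ndnodes` dnodes of `dnode_bytes` each, plus one spill block per
 * dnode when `needs_spill` is nonzero.
 */
static uint32_t
worst_case_reads(uint32_t ndnodes, uint32_t dnode_bytes, int needs_spill)
{
	assert(dnode_bytes >= 512 && dnode_bytes <= DNODE_BLOCK_BYTES);
	assert(dnode_bytes % 512 == 0);

	/* A dnode never spans dnode blocks, so round down here. */
	uint32_t per_block = DNODE_BLOCK_BYTES / dnode_bytes;
	/* Round up: a partially filled dnode block still costs one read. */
	uint32_t dnode_blocks = (ndnodes + per_block - 1) / per_block;

	return (dnode_blocks + (needs_spill ? ndnodes : 0));
}
```

With 32 dnodes of 512 bytes that all spill, this gives 1 + 32 = 33 reads; with 32 dnodes of 1024 bytes and no spill blocks, it gives the two dnode-block reads mentioned above.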
ZFS-on-Linux systems that make heavy use of extended attributes would
benefit from this feature. In particular, ZFS-on-Linux supports the
xattr=sa dataset property which allows file extended attribute data
to be stored in the dnode bonus buffer as an alternative to the
traditional directory-based format. Workloads such as SELinux and the
Lustre distributed filesystem often store enough xattr data to force
spill blocks when xattr=sa is in effect. Large dnodes may therefore
provide a performance benefit to such systems.
Other use cases that may benefit from this feature include files with
large ACLs and symbolic links with long target names. Furthermore,
this feature may be desirable on other platforms in case future
applications or features are developed that could make use of a
larger bonus buffer area.
Implementation
--------------
The size of a dnode may be a multiple of 512 bytes up to the size of
a dnode block (currently 16384 bytes). A dn_extra_slots field was
added to the current on-disk dnode_phys_t structure to describe the
size of the physical dnode on disk. The 8 bits for this field were
taken from the zero filled dn_pad2 field. The field represents how
many "extra" dnode_phys_t slots a dnode consumes in its dnode block.
This convention results in a value of 0 for 512 byte dnodes which
preserves on-disk format compatibility with older software.
Similarly, the in-memory dnode_t structure has a new dn_num_slots field
to represent the total number of dnode_phys_t slots consumed on disk.
Thus dn->dn_num_slots is 1 greater than the corresponding
dnp->dn_extra_slots. This difference in convention was adopted
because, unlike on-disk structures, backward compatibility is not a
concern for in-memory objects, so we used a more natural way to
represent size for a dnode_t.
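The relationship between the two conventions can be sketched as follows. The helper names and SLOT_BYTES are hypothetical, chosen for illustration; only dn_extra_slots, dn_num_slots and the 512-byte slot size come from the description above.

```c
#include <assert.h>
#include <stdint.h>

#define	SLOT_BYTES	512u	/* size of one dnode_phys_t slot */
#define	MAX_SLOTS	32u	/* 16384-byte dnode block / 512 */

/* On-disk dn_extra_slots to in-memory dn_num_slots. */
static uint8_t
extra_to_num_slots(uint8_t extra_slots)
{
	assert(extra_slots < MAX_SLOTS);
	return ((uint8_t)(extra_slots + 1));
}

/* In-memory dn_num_slots to on-disk dn_extra_slots. */
static uint8_t
num_to_extra_slots(uint8_t num_slots)
{
	assert(num_slots >= 1 && num_slots <= MAX_SLOTS);
	return ((uint8_t)(num_slots - 1));
}

/* Total bytes a dnode occupies in its dnode block. */
static uint32_t
slots_to_bytes(uint8_t num_slots)
{
	return ((uint32_t)num_slots * SLOT_BYTES);
}
```

A legacy dnode has dn_extra_slots == 0 on disk, which maps to one slot and 512 bytes; this is why zero-filled padding written by older software reads back as a valid legacy dnode.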
The default size for newly created dnodes is determined by the value of
a new "dnodesize" dataset property. By default the property is set to
"legacy" which is compatible with older software. Setting the property
to "auto" will allow the filesystem to choose the most suitable dnode
size. Currently this just sets the default dnode size to 1k, but future
code improvements could dynamically choose a size based on observed
workload patterns. Dnodes of varying sizes can coexist within the same
dataset and even within the same dnode block. For example, to enable
automatically-sized dnodes, run
# zfs set dnodesize=auto tank/fish
The user can also specify literal values for the dnodesize property.
These are currently limited to powers of two from 1k to 16k. The
power-of-2 limitation is only for simplicity of the user interface.
Internally the implementation can handle any multiple of 512 up to 16k,
and consumers of the DMU API can specify any legal dnode value.
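The two size rules described above (what the implementation can handle versus which literal property values are accepted) can be expressed as a pair of predicates. This is a sketch under the constraints stated in this message; both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Internal rule: any multiple of 512 bytes, from 512 up to 16k. */
static bool
dnode_size_is_legal(uint32_t bytes)
{
	return (bytes >= 512 && bytes <= 16384 && bytes % 512 == 0);
}

/* Property rule for literal values: powers of two from 1k to 16k. */
static bool
dnodesize_literal_is_accepted(uint32_t bytes)
{
	return (bytes >= 1024 && bytes <= 16384 &&
	    (bytes & (bytes - 1)) == 0);
}
```

For example, 1536 bytes is a legal dnode size for a DMU consumer but is not an accepted literal for the dnodesize property, while 2048 satisfies both rules.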
The size of a new dnode is determined at object allocation time and
stored as a new field in the znode in-memory structure. New DMU
interfaces are added to allow the consumer to specify the dnode size
that a newly allocated object should use. Existing interfaces are
unchanged to avoid having to update every call site and to preserve
compatibility with external consumers such as Lustre. The new
interface names are given below. The versions of these functions that
don't take a dnodesize parameter now just call the _dnsize() versions
with a dnodesize of 0, which means use the legacy dnode size.
New DMU interfaces:
dmu_object_alloc_dnsize()
dmu_object_claim_dnsize()
dmu_object_reclaim_dnsize()
New ZAP interfaces:
zap_create_dnsize()
zap_create_norm_dnsize()
zap_create_flags_dnsize()
zap_create_claim_norm_dnsize()
zap_create_link_dnsize()
The constant DN_MAX_BONUSLEN is renamed to DN_OLD_MAX_BONUSLEN. The
spa_maxdnodesize() function should be used to determine the maximum
bonus length for a pool.
These are a few noteworthy changes to key functions:
* The prototype for dnode_hold_impl() now takes a "slots" parameter.
When the DNODE_MUST_BE_FREE flag is set, this parameter is used to
ensure the hole at the specified object offset is large enough to
hold the dnode being created. The slots parameter is also used
to ensure a dnode does not span multiple dnode blocks. In both of
these cases, if a failure occurs, ENOSPC is returned. Keep in mind,
these failure cases are only possible when using DNODE_MUST_BE_FREE.
If the DNODE_MUST_BE_ALLOCATED flag is set, "slots" must be 0.
dnode_hold_impl() will check if the requested dnode is already
consumed as an extra dnode slot by a large dnode, in which case
it returns ENOENT.
* The function dmu_object_alloc() advances to the next dnode block
if dnode_hold_impl() returns an error for a requested object.
This is because the beginning of the next dnode block is the only
location it can safely assume to either be a hole or a valid
starting point for a dnode.
* dnode_next_offset_level() and other functions that iterate
through dnode blocks may no longer use a simple array indexing
scheme. These now use the current dnode's dn_num_slots field to
advance to the next dnode in the block. This is to ensure we
properly skip the current dnode's bonus area and don't interpret it
as a valid dnode.
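The slot-based iteration described in the last item can be sketched with a toy model. The toy_slot_t type and count_dnodes() are hypothetical stand-ins, not the real dnode_phys_t or dnode_next_offset_level(); the point is only that the walk advances by the current dnode's slot count rather than by one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	SLOTS_PER_BLOCK	32	/* 16384-byte block / 512-byte slots */

/* Toy stand-in for dnode_phys_t: only the fields this sketch needs. */
typedef struct toy_slot {
	uint8_t allocated;	/* nonzero if a dnode starts in this slot */
	uint8_t extra_slots;	/* mirrors dn_extra_slots */
} toy_slot_t;

/* Count the dnodes that start in a block, skipping interior slots. */
static size_t
count_dnodes(const toy_slot_t block[SLOTS_PER_BLOCK])
{
	size_t count = 0;
	size_t i = 0;

	while (i < SLOTS_PER_BLOCK) {
		if (!block[i].allocated) {
			i++;		/* hole: advance one slot */
			continue;
		}
		count++;
		/* Skip this dnode's extra slots (its bonus area). */
		i += (size_t)block[i].extra_slots + 1;
	}
	return (count);
}
```

A simple `i++` loop would instead visit the interior slots of a large dnode and could misread bonus-buffer bytes as a dnode header.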
zdb
---
The zdb command was updated to display a dnode's size under the
"dnsize" column when the object is dumped.
For ZIL create log records, zdb will now display the slot count for
the object.
ztest
-----
Ztest chooses a random dnodesize for every newly created object. The
random distribution is more heavily weighted toward small dnodes to
better simulate real-world datasets.
Unused bonus buffer space is filled with non-zero values computed from
the object number, dataset id, offset, and generation number. This
helps ensure that the dnode traversal code properly skips the interior
regions of large dnodes, and that these interior regions are not
overwritten by data belonging to other dnodes. A new test visits each
object in a dataset. It verifies that the actual dnode size matches what
was stored in the ztest block tag when it was created. It also verifies
that the unused bonus buffer space is filled with the expected data
patterns.
ZFS Test Suite
--------------
Added six new large dnode-specific tests, and integrated the dnodesize
property into existing tests for zfs allow and send/recv.
Send/Receive
------------
ZFS send streams for datasets containing large dnodes cannot be received
on pools that don't support the large_dnode feature. A send stream with
large dnodes sets a DMU_BACKUP_FEATURE_LARGE_DNODE flag which will be
unrecognized by an incompatible receiving pool so that the zfs receive
will fail gracefully.
While not implemented here, it may be possible to generate a
backward-compatible send stream from a dataset containing large
dnodes. The implementation may be tricky, however, because the send
object record for a large dnode would need to be resized to a 512
byte dnode, possibly kicking in a spill block in the process. This
means we would need to construct a new SA layout and possibly
register it in the SA layout object. The SA layout is normally just
sent as an ordinary object record. But if we are constructing new
layouts while generating the send stream we'd have to build the SA
layout object dynamically and send it at the end of the stream.
For sending and receiving between pools that do support large dnodes,
the drr_object send record type is extended with a new field to store
the dnode slot count. This field was repurposed from unused padding
in the structure.
ZIL Replay
----------
The dnode slot count is stored in the uppermost 8 bits of the lr_foid
field. The bits were unused as the object id is currently capped at
48 bits.
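A minimal sketch of this packing follows. The macro and function names are hypothetical, and the sketch stores the slot count directly in the top byte; the message above does not say whether the real encoding applies a bias, so treat that detail as an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define	FOID_SLOTS_SHIFT	56	/* uppermost 8 bits of 64 */
#define	FOID_OBJ_MASK		((UINT64_C(1) << 48) - 1)

/* Combine a 48-bit object id and a slot count into one 64-bit field. */
static uint64_t
foid_pack(uint64_t object, uint8_t slots)
{
	assert((object & ~FOID_OBJ_MASK) == 0);
	return (object | ((uint64_t)slots << FOID_SLOTS_SHIFT));
}

/* Extract the object id, ignoring the slot bits. */
static uint64_t
foid_object(uint64_t foid)
{
	return (foid & FOID_OBJ_MASK);
}

/* Extract the slot count from the uppermost 8 bits. */
static uint8_t
foid_slots(uint64_t foid)
{
	return ((uint8_t)(foid >> FOID_SLOTS_SHIFT));
}
```

Because the object id never uses the top bits, a record written without a slot count still decodes to the same object id.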
Resizing Dnodes
---------------
It should be possible to resize a dnode when it is dirtied if the
current dnodesize dataset property differs from the dnode's size, but
this functionality is not currently implemented. Clearly a dnode can
only grow if there are sufficient contiguous unused slots in the
dnode block, but it should always be possible to shrink a dnode.
Growing dnodes may be useful to reduce fragmentation in a pool with
many spill blocks in use. Shrinking dnodes may be useful to allow
sending a dataset to a pool that doesn't support the large_dnode
feature.
Feature Reference Counting
--------------------------
The reference count for the large_dnode pool feature tracks the
number of datasets that have ever contained a dnode of size larger
than 512 bytes. The first time a large dnode is created in a dataset
the dataset is converted to an extensible dataset. This is a one-way
operation and the only way to decrement the feature count is to
destroy the dataset, even if the dataset no longer contains any large
dnodes. The complexity of reference counting on a per-dnode basis was
too high, so we chose to track it on a per-dataset basis similarly to
the large_block feature.
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3542
2016-03-17 01:25:34 +00:00
EXPORT_SYMBOL(spa_maxdnodesize);
2010-08-26 18:49:16 +00:00
/* Miscellaneous support routines */
EXPORT_SYMBOL(spa_rename);
EXPORT_SYMBOL(spa_guid_exists);
EXPORT_SYMBOL(spa_strdup);
EXPORT_SYMBOL(spa_strfree);
EXPORT_SYMBOL(spa_get_random);
EXPORT_SYMBOL(spa_generate_guid);
2013-12-09 18:37:51 +00:00
EXPORT_SYMBOL(snprintf_blkptr);
2010-08-26 18:49:16 +00:00
EXPORT_SYMBOL(spa_freeze);
EXPORT_SYMBOL(spa_upgrade);
EXPORT_SYMBOL(spa_evict_all);
EXPORT_SYMBOL(spa_lookup_by_guid);
EXPORT_SYMBOL(spa_has_spare);
EXPORT_SYMBOL(dva_get_dsize_sync);
EXPORT_SYMBOL(bp_get_dsize_sync);
EXPORT_SYMBOL(bp_get_dsize);
EXPORT_SYMBOL(spa_has_slogs);
EXPORT_SYMBOL(spa_is_root);
EXPORT_SYMBOL(spa_writeable);
EXPORT_SYMBOL(spa_mode);

EXPORT_SYMBOL(spa_namespace_lock);
2013-04-29 22:49:23 +00:00
2014-12-23 00:54:43 +00:00
module_param(zfs_flags, uint, 0644);
Swap DTRACE_PROBE* with Linux tracepoints
This patch leverages Linux tracepoints from within the ZFS on Linux
code base. It also refactors the debug code to bring it back in sync
with Illumos.
The information exported via tracepoints can be used for a variety of
reasons (e.g. debugging, tuning, general exploration/understanding,
etc). It is advantageous to use Linux tracepoints as the mechanism to
export this kind of information (as opposed to something else) for a
number of reasons:
* A number of external tools can make use of our tracepoints
"automatically" (e.g. perf, systemtap)
* Tracepoints are designed to be extremely cheap when disabled
* It's one of the "accepted" ways to export this kind of
information; many other kernel subsystems use tracepoints too.
Unfortunately, though, there are a few caveats as well:
* Linux tracepoints appear to only be available to GPL licensed
modules due to the way certain kernel functions are exported.
Thus, to actually make use of the tracepoints introduced by this
patch, one might have to patch and re-compile the kernel;
exporting the necessary functions to non-GPL modules.
* Prior to upstream kernel version v3.14-rc6-30-g66cc69e, Linux
tracepoints are not available for unsigned kernel modules
(tracepoints will get disabled due to the module's 'F' taint).
Thus, one either has to sign the zfs kernel module prior to
loading it, or use a kernel versioned v3.14-rc6-30-g66cc69e or
newer.
Assuming the above two requirements are satisfied, let's look at an
example of how this patch can be used and what information it exposes
(all commands run as 'root'):
# list all zfs tracepoints available
$ ls /sys/kernel/debug/tracing/events/zfs
enable filter zfs_arc__delete
zfs_arc__evict zfs_arc__hit zfs_arc__miss
zfs_l2arc__evict zfs_l2arc__hit zfs_l2arc__iodone
zfs_l2arc__miss zfs_l2arc__read zfs_l2arc__write
zfs_new_state__mfu zfs_new_state__mru
# enable all zfs tracepoints, clear the tracepoint ring buffer
$ echo 1 > /sys/kernel/debug/tracing/events/zfs/enable
$ echo 0 > /sys/kernel/debug/tracing/trace
# import zpool called 'tank', inspect tracepoint data (each line was
# truncated, they're too long for a commit message otherwise)
$ zpool import tank
$ cat /sys/kernel/debug/tracing/trace | head -n35
# tracer: nop
#
# entries-in-buffer/entries-written: 1219/1219 #P:8
#
# _-----=> irqs-off
# / _----=> need-resched
# | / _---=> hardirq/softirq
# || / _--=> preempt-depth
# ||| / delay
# TASK-PID CPU# |||| TIMESTAMP FUNCTION
# | | | |||| | |
lt-zpool-30132 [003] .... 91344.200050: zfs_arc__miss: hdr...
z_rd_int/0-30156 [003] .... 91344.200611: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.201173: zfs_arc__miss: hdr...
z_rd_int/1-30157 [003] .... 91344.201756: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.201795: zfs_arc__miss: hdr...
z_rd_int/2-30158 [003] .... 91344.202099: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.202126: zfs_arc__hit: hdr ...
lt-zpool-30132 [003] .... 91344.202130: zfs_arc__hit: hdr ...
lt-zpool-30132 [003] .... 91344.202134: zfs_arc__hit: hdr ...
lt-zpool-30132 [003] .... 91344.202146: zfs_arc__miss: hdr...
z_rd_int/3-30159 [003] .... 91344.202457: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.202484: zfs_arc__miss: hdr...
z_rd_int/4-30160 [003] .... 91344.202866: zfs_new_state__mru...
lt-zpool-30132 [003] .... 91344.202891: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.203034: zfs_arc__miss: hdr...
z_rd_iss/1-30149 [001] .... 91344.203749: zfs_new_state__mru...
lt-zpool-30132 [001] .... 91344.203789: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.203878: zfs_arc__miss: hdr...
z_rd_iss/3-30151 [001] .... 91344.204315: zfs_new_state__mru...
lt-zpool-30132 [001] .... 91344.204332: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.204337: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.204352: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.204356: zfs_arc__hit: hdr ...
lt-zpool-30132 [001] .... 91344.204360: zfs_arc__hit: hdr ...
To highlight the kind of detailed information that is being exported
using this infrastructure, I've taken the first tracepoint line from the
output above and reformatted it such that it fits in 80 columns:
lt-zpool-30132 [003] .... 91344.200050: zfs_arc__miss:
hdr {
dva 0x1:0x40082
birth 15491
cksum0 0x163edbff3a
flags 0x640
datacnt 1
type 1
size 2048
spa 3133524293419867460
state_type 0
access 0
mru_hits 0
mru_ghost_hits 0
mfu_hits 0
mfu_ghost_hits 0
l2_hits 0
refcount 1
} bp {
dva0 0x1:0x40082
dva1 0x1:0x3000e5
dva2 0x1:0x5a006e
cksum 0x163edbff3a:0x75af30b3dd6:0x1499263ff5f2b:0x288bd118815e00
lsize 2048
} zb {
objset 0
object 0
level -1
blkid 0
}
For the specific tracepoint shown here, 'zfs_arc__miss', data is
exported detailing the arc_buf_hdr_t (hdr), blkptr_t (bp), and
zbookmark_t (zb) that caused the ARC miss (down to the exact DVA!).
This kind of precise and detailed information can be extremely valuable
when trying to answer certain kinds of questions.
For anybody unfamiliar but looking to build on this, I found the XFS
source code along with the following three web links to be extremely
helpful:
* http://lwn.net/Articles/379903/
* http://lwn.net/Articles/381064/
* http://lwn.net/Articles/383362/
I should also note the more "boring" aspects of this patch:
* The ZFS_LINUX_COMPILE_IFELSE autoconf macro was modified to
support a sixth parameter. This parameter is used to populate the
contents of the new conftest.h file. If no sixth parameter is
provided, conftest.h will be empty.
* The ZFS_LINUX_TRY_COMPILE_HEADER autoconf macro was introduced.
This macro is nearly identical to the ZFS_LINUX_TRY_COMPILE macro,
except it has support for a fifth option that is then passed as
the sixth parameter to ZFS_LINUX_COMPILE_IFELSE.
These autoconf changes were needed to test the availability of the Linux
tracepoint macros. Due to the odd nature of the Linux tracepoint macro
API, a separate ".h" must be created (the path and filename is used
internally by the kernel's define_trace.h file).
* The HAVE_DECLARE_EVENT_CLASS autoconf macro was introduced. This
is to determine if we can safely enable the Linux tracepoint
functionality. We need to selectively disable the tracepoint code
due to the kernel exporting certain functions as GPL only. Without
this check, the build process will fail at link time.
In addition, the SET_ERROR macro was modified into a tracepoint as well.
To do this, the 'sdt.h' file was moved into the 'include/sys' directory
and now contains a userspace portion and a kernel space portion. The
dprintf and zfs_dbgmsg* interfaces are now implemented as tracepoints as
well.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-06-13 17:54:48 +00:00
MODULE_PARM_DESC(zfs_flags, "Set additional debugging flags");

module_param(zfs_recover, int, 0644);
MODULE_PARM_DESC(zfs_recover, "Set to attempt to recover from fatal errors");

module_param(zfs_free_leak_on_eio, int, 0644);
MODULE_PARM_DESC(zfs_free_leak_on_eio,
	"Set to ignore IO errors during free and permanently leak the space");
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithm (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping on a condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32 which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The first counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
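The 32-bit overflow concern raised in the porting notes above (for zfs_dirty_data_max_max, initialized as a percentage of physical RAM) can be illustrated with a small sketch. The function below is hypothetical and is not the code in arc_init(); in particular, the clamp to ULONG_MAX is an assumption added here to show one way of avoiding a silent wrap, not something the port is stated to do.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helper: compute `percent` of physical memory as a value
 * that fits in unsigned long, the type used for the module parameter.
 */
static unsigned long
dirty_data_max_max(uint64_t physmem_bytes, unsigned int percent)
{
	assert(percent <= 100);

	/* Divide first so that the multiplication cannot wrap 64 bits. */
	uint64_t limit = physmem_bytes / 100 * percent;

	/* Assumed policy: saturate rather than truncate on 32-bit. */
	if (limit > ULONG_MAX)
		limit = ULONG_MAX;

	return ((unsigned long)limit);
}
```

With 8 GiB of RAM and the 25% default, the result is a little under 2 GiB, which still fits in a 32-bit unsigned long; larger percentages or memory sizes are where the clamp in this sketch would take effect.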
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 03:01:20 +00:00
module_param(zfs_deadman_synctime_ms, ulong, 0644);
2013-11-01 19:26:11 +00:00
MODULE_PARM_DESC(zfs_deadman_synctime_ms, "Expiration time in milliseconds");
2013-04-29 22:49:23 +00:00
module_param(zfs_deadman_enabled, int, 0644);
MODULE_PARM_DESC(zfs_deadman_enabled, "Enable deadman timer");
Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work
1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver. The scheduler
issues a number of concurrent i/os from each class to the device. Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes). The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is. See the block comment in vdev_queue.c (reproduced
below) for more details.
2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load. The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system. When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount. This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens. One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync(). Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes. See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.
This diff has several other effects, including:
* the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.
* the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently. There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.
* zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc. This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).
--matt
APPENDIX: problems with the current i/o scheduler
The current ZFS i/o scheduler (vdev_queue.c) is deadline based. The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.
For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due". One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).
If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os. This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future. If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due. Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).
Notes on porting to ZFS on Linux:
- zio_t gained new members io_physdone and io_phys_children. Because
object caches in the Linux port call the constructor only once at
allocation time, objects may contain residual data when retrieved
from the cache. Therefore zio_create() was updated to zero out the two
new fields.
- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
(vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
This tree has been replaced by vq->vq_active_tree which is now used
for the same purpose.
- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
the number of vdev I/O buffers to pre-allocate. That global no longer
exists, so we instead use the sum of the *_max_active values for each of
the five I/O classes described above.
- The Illumos implementation of dmu_tx_delay() delays a transaction by
sleeping in condition variable embedded in the thread
(curthread->t_delay_cv). We do not have an equivalent CV to use in
Linux, so this change replaced the delay logic with a wrapper called
zfs_sleep_until(). This wrapper could be adopted upstream and in other
downstream ports to abstract away operating system-specific delay logic.
- These tunables are added as module parameters, and descriptions added
to the zfs-module-parameters.5 man page.
spa_asize_inflation
zfs_deadman_synctime_ms
zfs_vdev_max_active
zfs_vdev_async_write_active_min_dirty_percent
zfs_vdev_async_write_active_max_dirty_percent
zfs_vdev_async_read_max_active
zfs_vdev_async_read_min_active
zfs_vdev_async_write_max_active
zfs_vdev_async_write_min_active
zfs_vdev_scrub_max_active
zfs_vdev_scrub_min_active
zfs_vdev_sync_read_max_active
zfs_vdev_sync_read_min_active
zfs_vdev_sync_write_max_active
zfs_vdev_sync_write_min_active
zfs_dirty_data_max_percent
zfs_delay_min_dirty_percent
zfs_dirty_data_max_max_percent
zfs_dirty_data_max
zfs_dirty_data_max_max
zfs_dirty_data_sync
zfs_delay_scale
The latter four have type unsigned long, whereas they are uint64_t in
Illumos. This accommodates Linux's module_param() supported types, but
means they may overflow on 32-bit architectures.
The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
likely to overflow on 32-bit systems, since they express physical RAM
sizes in bytes. In fact, Illumos initializes zfs_dirty_data_max_max to
2^32, which does overflow. To resolve that, this port instead initializes
it in arc_init() to 25% of physical RAM, and adds the tunable
zfs_dirty_data_max_max_percent to override that percentage. While this
solution doesn't completely avoid the overflow issue, it should be a
reasonable default for most systems, and the minority of affected
systems can work around the issue by overriding the defaults.
- Fixed reversed logic in comment above zfs_delay_scale declaration.
- Clarified comments in vdev_queue.c regarding when per-queue minimums take
effect.
- Replaced dmu_tx_write_limit in the dmu_tx kstat file
with dmu_tx_dirty_delay and dmu_tx_dirty_over_max. The former counts
how many times a transaction has been delayed because the pool dirty
data has exceeded zfs_delay_min_dirty_percent. The latter counts how
many times the pool dirty data has exceeded zfs_dirty_data_max (which
we expect to never happen).
- The original patch would have regressed the bug fixed in
zfsonlinux/zfs@c418410, which prevented users from setting the
zfs_vdev_aggregation_limit tunable larger than SPA_MAXBLOCKSIZE.
A similar fix is added to vdev_queue_aggregate().
- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
heap instead of the stack. In Linux we can't afford such large
structures on the stack.
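The 32-bit overflow concern noted above for zfs_dirty_data_max_max can be sketched in user space. This is only an illustration under stated assumptions, not the port's code: as_32bit_ulong() and percent_of_physmem() are hypothetical helpers, and uint32_t stands in for unsigned long on a 32-bit build.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Stand-in for "unsigned long" on a 32-bit build (an assumption made for
 * illustration): converting a 64-bit value to a 32-bit unsigned type keeps
 * only the low 32 bits.
 */
static uint32_t
as_32bit_ulong(uint64_t value)
{
	return ((uint32_t)value);
}

/*
 * Hypothetical helper, not the port's actual code: compute a cap as a
 * percentage of physical memory using 64-bit arithmetic throughout.
 */
static uint64_t
percent_of_physmem(uint64_t physmem_bytes, unsigned int percent)
{
	assert(percent <= 100);
	return (physmem_bytes * percent / 100);
}
```

Under these assumptions, a fixed value of 2^32 truncates to 0 in a 32-bit variable, while 25% of 8 GiB of RAM (2 GiB) still fits. 25% of 32 GiB (8 GiB) would truncate again, which is why the percentage default reduces the problem without eliminating it.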
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>
References:
http://www.illumos.org/issues/4045
illumos/illumos-gate@69962b5647e4a8b9b14998733b765925381b727e
Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-08-29 03:01:20 +00:00
module_param(spa_asize_inflation, int, 0644);
MODULE_PARM_DESC(spa_asize_inflation,
2013-11-01 19:26:11 +00:00
	"SPA size estimate multiplication factor");
2015-09-01 16:45:10 +00:00
module_param(spa_slop_shift, int, 0644);
MODULE_PARM_DESC(spa_slop_shift, "Reserved free space in pool");
2010-08-26 18:49:16 +00:00
#endif