illumos/illumos-gate@0c94e1af67
https://www.illumos.org/issues/7256
error = dmu_sync(zio, lr->lr_common.lrc_txg,
zfs_get_done, zgd);
ASSERT(error || lr->lr_length <= zp->z_blksz);
It's possible, although extremely rare, that the zfs_get_done() callback is
executed before dmu_sync() returns.
In that case the znode's range lock is dropped and the znode is unreferenced.
Thus, the assertion can access some invalid or wrong data via the zp pointer.
The size variable caches the correct value of z_blksz and can be safely used here.
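A minimal sketch of the fix idea, with size being the local copy of the block
size taken while the range lock was still held (illustrative, not the exact
committed diff):
    error = dmu_sync(zio, lr->lr_common.lrc_txg, zfs_get_done, zgd);
    /*
     * zgd (and thus zp) may already have been freed by zfs_get_done() at
     * this point, so assert against the cached size, not zp->z_blksz.
     */
    ASSERT(error || lr->lr_length <= size);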
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Andriy Gapon <andriy.gapon@clusterhq.com>
MFC after: 1 week
illumos/illumos-gate@7f0bdb4257
https://www.illumos.org/issues/8061
sa_find_idx_tab() is declared as taking and returning "void *" parameters.
These can be declared to be the specific types.
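A sketch of the kind of change described; the exact parameter types in the
committed code may differ:
    /* Before: opaque pointers. */
    void *sa_find_idx_tab(objset_t *os, dmu_object_type_t bonustype, void *data);
    /* After: specific types. */
    sa_idx_tab_t *sa_find_idx_tab(objset_t *os, dmu_object_type_t bonustype,
        sa_hdr_phys_t *hdr);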
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
MFC after: 1 week
illumos/illumos-gate@b127fe3c05
https://www.illumos.org/issues/6101
lzc_create(), or more correctly zfs_ioc_create(), does not reject an attempt to
create a filesystem as a child of a volume; instead, it proceeds until it crashes.
A crash stack obtained on FreeBSD:
page fault while in kernel mode
zap_leaf_lookup()
fzap_lookup()
zap_lookup_norm()
zap_lookup()
zfs_get_zplprop()
zfs_fill_zplprops_impl()
zfs_ioc_create()
zfsdev_ioctl()
devfs_ioctl_f()
kern_ioctl()
sys_ioctl()
This crash happened with a kernel without debugging assertions.
The immediate cause of the crash appears to be an attempt to interpret a zvol
object as a ZAP object.
For filesystems:
#define MASTER_NODE_OBJ 1
For zvols:
#define ZVOL_OBJ 1ULL
#define ZVOL_ZAP_OBJ 2ULL
So, I see two problems here:
1. an attempt to create a filesystem under a zvol should be rejected as
early as possible, maybe in zfs_fill_zplprops()
2. maybe zap_lookup / zap_lockdir should reject objects that are not of one
of the zap object types
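A hedged sketch of fix idea 1 above, rejecting the creation early when the
parent objset is not a ZPL objset; the placement and error code are
illustrative:
    /* In zfs_ioc_create()/zfs_fill_zplprops(): the parent must be a filesystem. */
    if (dmu_objset_type(parent_os) != DMU_OST_ZFS)
            return (SET_ERROR(EINVAL));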
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Andriy Gapon <avg@FreeBSD.org>
MFC after: 2 weeks
illumos/illumos-gate@6b03625981
https://www.illumos.org/issues/8026
zfs_throttle_delay and zfs_throttle_resolution have been unused since the new
write throttling mechanism was introduced.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Andriy Gapon <avg@FreeBSD.org>
MFC after: 1 week
illumos/illumos-gate@313ae1e182
https://www.illumos.org/issues/8027
dsl_pool_dirty_delta() should not wake up waiters when dp->dp_dirty_total ==
zfs_dirty_data_max, because they wait for dp_dirty_total to fall strictly below
the threshold.
It's probably very rare for that condition to occur, but it's better to have
more accurate code.
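A minimal sketch of the adjusted wakeup condition, assuming the existing
dp_spaceavail_cv waiters described above:
    /* Wake waiters only once dirty data has dropped strictly below the limit. */
    if (dp->dp_dirty_total < zfs_dirty_data_max)
            cv_signal(&dp->dp_spaceavail_cv);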
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Andriy Gapon <avg@FreeBSD.org>
MFC after: 1 week
illumos/illumos-gate@94c2d0eb22
https://www.illumos.org/issues/7968
spa_sync() iterates over all the dirty dnodes and processes each of them by
calling dnode_sync(). If there are many dirty dnodes (e.g. because we created
or removed a lot of files), the single thread of spa_sync() calling
dnode_sync() can become a bottleneck. Additionally, if many dnodes are dirtied
concurrently in open context (e.g. due to concurrent file creation), the
os_lock will experience lock contention via dnode_setdirty().
The solution is to track dirty dnodes on a multilist_t, and for spa_sync() to
use separate threads to process each of the sublists in the multilist.
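A hedged sketch of the per-sublist fan-out in the sync path; sync_dnodes_task,
sync_dnodes_arg_t and dp_sync_taskq are illustrative names, not necessarily the
committed ones:
    multilist_t *ml = os->os_dirty_dnodes[txg & TXG_MASK];  /* dirty dnodes for this txg */
    for (int i = 0; i < multilist_get_num_sublists(ml); i++) {
            sync_dnodes_arg_t *sda = kmem_alloc(sizeof (*sda), KM_SLEEP);
            sda->sda_list = ml;
            sda->sda_sublist_idx = i;
            (void) taskq_dispatch(dmu_objset_pool(os)->dp_sync_taskq,
                sync_dnodes_task, sda, TQ_SLEEP);            /* one task per sublist */
    }
    taskq_wait(dmu_objset_pool(os)->dp_sync_taskq);          /* wait for all sublists */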
On the concurrent file creation microbenchmark, the performance improvement
from dnode_setdirty() is up to 7%. Additionally, the wall clock time spent in
spa_sync() is reduced to 15%-40% of the single-threaded case. In terms of cost/
reward, once the other bottlenecks are addressed, fixing this bug will provide
a medium-large performance gain and require a medium amount of effort to
implement.
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
MFC after: 3 weeks
illumos/illumos-gate@10fbdecb05
https://www.illumos.org/issues/7970
The global tunable zfs_arc_num_sublists_per_state is used by the ARC and
the dbuf cache, and other users are planned. We should change this
tunable to be common to all multilists.
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
MFC after: 3 weeks
illumos/illumos-gate@b0c42cd470
https://www.illumos.org/issues/7801
Add *_by_dnode() routines for accessing objects given their
dnode_t *; this is more efficient than accessing the object by
(objset_t *, uint64_t object). This change converts some but
not all of the existing consumers. As performance-sensitive
code paths are discovered they should be converted to use
these routines.
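A small usage sketch of the new form versus the old one; the flag value is
illustrative:
    /* With the dnode already held, skip the per-call object lookup: */
    error = dmu_read_by_dnode(dn, offset, size, buf, DMU_READ_PREFETCH);
    /* ...instead of the heavier (objset, object) form: */
    error = dmu_read(os, object, offset, size, buf, DMU_READ_PREFETCH);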
Ported from: 0eef1bde31
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: bzzz77 <bzzz.tomas@gmail.com>
MFC after: 24 days
illumos/illumos-gate@a3905a4592
https://www.illumos.org/issues/7869
The issue fixed by this patch is a race condition in the deadlist code.
A thread executing an administrative command that uses
`dsl_deadlist_space_range()` holds the lock of the whole `deadlist_t` to
protect access to all of the entries that the deadlist keeps in an
AVL tree.
Sync threads trying to insert a new entry in the deadlist
(through `dsl_deadlist_insert()` -> `dle_enqueue()`) do not hold the
deadlist lock at that moment. If the `dle_bpobj` is the empty bpobj (our
sentinel value), we close and reopen it. Between these two operations,
it is possible for the `dsl_deadlist_space_range()` thread to dereference
that bpobj which is `NULL` during that window.
Threads should hold a deadlist's `dl_lock` when they manipulate its
internal data so scenarios like the one above are avoided. In addition,
threads should also hold the bpobj lock whenever they are allocating the
subobj list of a bpobj, and not just when they actually insert the subobj
to the list. This way we can avoid potential memory leaks.
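A hedged, simplified sketch of the locking change in dsl_deadlist_insert():
    mutex_enter(&dl->dl_lock);          /* protect the dle_bpobj close/reopen window */
    dle_enqueue(dl, dle, bp, tx);
    mutex_exit(&dl->dl_lock);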
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
MFC after: 2 weeks
illumos/illumos-gate@61e255ce72
https://www.illumos.org/issues/7793
Background information: This assertion about tx_space_* verifies that we
are not dirtying more stuff than we thought we would. We “need” to know
how much we will dirty so that we can check if we should fail this
transaction with ENOSPC/EDQUOT, in dmu_tx_assign(). While the
transaction is open (i.e. between dmu_tx_assign() and dmu_tx_commit() —
typically less than a millisecond), we call dbuf_dirty() on the exact
blocks that will be modified. Once this happens, the temporary
accounting in tx_space_* is unnecessary, because we know exactly what
blocks are newly dirtied; we call dnode_willuse_space() to track this
more exact accounting.
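For reference, the open-transaction window described above looks roughly like
this in a DMU consumer (a simplified sketch, not code from this change):
    dmu_tx_t *tx = dmu_tx_create(os);
    dmu_tx_hold_write(tx, object, offset, length);  /* predict what will be dirtied */
    error = dmu_tx_assign(tx, TXG_WAIT);            /* ENOSPC/EDQUOT check happens here */
    if (error != 0) {
            dmu_tx_abort(tx);
            return (error);
    }
    dmu_write(os, object, offset, length, buf, tx); /* dbuf_dirty() on the exact blocks */
    dmu_tx_commit(tx);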
The fundamental problem causing this bug is that dmu_tx_hold_*() relies
on the current state in the DMU (e.g. dn_nlevels) to predict how much
will be dirtied by this transaction, but this state can change before we
actually perform the transaction (i.e. call dbuf_dirty()).
This bug will be fixed by removing the assertion that the tx_space_*
accounting is perfectly accurate (i.e. we never dirty more than was
predicted by dmu_tx_hold_*()). By removing the requirement that this
accounting be perfectly accurate, we can also vastly simplify it, e.g.
removing most of the logic in dmu_tx_count_*().
The new tx space accounting will be very approximate, and may be more or
less than what is actually dirtied. It will still be used to determine
if this transaction will put us over quota. Transactions that are marked
by dmu_tx_mark_netfree() will be excepted from this check. We won’t make
an attempt to determine how much space will be freed by the transaction
— this was rarely accurate enough to determine if a transaction should
be permitted when we are over quota, which is why dmu_tx_mark_netfree()
was introduced in 2014.
We also won’t attempt to give “credit” when overwriting existing blocks,
if those blocks may be freed. This allows us to remove the
do_free_accounting logic in dbuf_dirty() and associated routines.
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
MFC after: 3 weeks
illumos/illumos-gate@1c17160ac5
https://www.illumos.org/issues/1300
FreeBSD note: recent FreeBSD is not affected by the issue being fixed, as the
name cache is completely bypassed when normalization is enabled.
The change is imported for the sake of ZAP infrastructure modifications.
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Kevin Crowe <kevin.crowe@nexenta.com>
MFC after: 3 weeks
Extend the ino_t, dev_t, nlink_t types to 64-bit ints. Modify
struct dirent layout to add d_off, increase the size of d_fileno
to 64-bits, increase the size of d_namlen to 16-bits, and change
the required alignment. Increase struct statfs f_mntfromname[] and
f_mntonname[] array length MNAMELEN to 1024.
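For illustration, the post-ino64 struct dirent looks roughly like this; the
exact field layout is hedged, consult sys/dirent.h for the authoritative
definition:
    struct dirent {
            ino_t      d_fileno;            /* now 64-bit */
            off_t      d_off;               /* new: offset of the next entry */
            __uint16_t d_reclen;            /* length of this record */
            __uint8_t  d_type;              /* file type */
            __uint8_t  d_pad0;
            __uint16_t d_namlen;            /* widened to 16 bits */
            __uint16_t d_pad1;
            char       d_name[MAXNAMLEN + 1];
    };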
ABI breakage is mitigated by providing compatibility using versioned
symbols, ingenious use of the existing padding in structures, and
by employing other tricks. Unfortunately, not everything can be
fixed, especially outside the base system. For instance, third-party
APIs which pass struct stat around are broken in backward and
forward incompatible ways.
Kinfo sysctl MIBs ABI is changed in backward-compatible way, but
there is no general mechanism to handle other sysctl MIBS which
return structures where the layout has changed. It was considered
that the breakage is either in the management interfaces, where we
usually allow ABI slip, or is not important.
Struct xvnode changed layout, no compat shims are provided.
For struct xtty, dev_t tty device member was reduced to uint32_t.
It was decided that keeping ABI compat in this case is more useful
than reporting 64-bit dev_t, for the sake of pstat.
Update note: strictly follow the instructions in UPDATING. Build
and install the new kernel with COMPAT_FREEBSD11 option enabled,
then reboot, and only then install new world.
Credits: The 64-bit inode project, also known as ino64, started life
many years ago as a project by Gleb Kurtsou (gleb). Kirk McKusick
(mckusick) then picked up and updated the patch, and acted as a
flag-waver. Feedback, suggestions, and discussions were carried
by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles),
and Rick Macklem (rmacklem). Kris Moore (kris) performed an initial
ports investigation followed by an exp-run by Antoine Brodin (antoine).
Essential and all-embracing testing was done by Peter Holm (pho).
The heavy lifting of coordinating all these efforts and bringing the
project to completion were done by Konstantin Belousov (kib).
Sponsored by: The FreeBSD Foundation (emaste, kib)
Differential revision: https://reviews.freebsd.org/D10439
The idle thread may process callouts while reloading the timer in
cpu_activeclock(). In this case, provide a representative value, &cpu_idle,
instead of 0 for args[0] so that the active thread can be more easily
identified from the probe.
This addresses intermittent failures of the profile-n/tst.argtest.d test.
MFC after: 2 weeks
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D10651
vdev_geom.c currently uses the g_consumer's private field to point to a
vdev_t. That way, a GEOM event can cause a change to a ZFS vdev. For
example, when you remove a disk, the vdev's status will change to REMOVED.
However, vdev_geom will sometimes attach multiple vdevs to the same GEOM
consumer. If this happens, then geom events will only be propagated to one
of the vdevs.
Fix this by storing a linked list of vdevs in g_consumer's private field.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
* g_consumer.private now stores a linked list of vdev pointers associated
with the consumer instead of just a single vdev pointer.
* Change vdev_geom_set_physpath's signature to more closely match
vdev_geom_set_rotation_rate
* Don't bother calling g_access in vdev_geom_set_physpath. It's guaranteed
that we've already accessed the consumer by the time we get here.
* Don't call vdev_geom_set_physpath in vdev_geom_attach. Instead, call it
in vdev_geom_open, after we know that the open has succeeded.
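A hedged sketch of the new event propagation, assuming list types along the
lines of those used in vdev_geom.c (names approximate):
    /* The consumer's private field now heads a list of attached vdevs. */
    struct consumer_priv_t *priv = cp->private;
    struct consumer_vdev_elem *elem;

    SLIST_FOREACH(elem, priv, elems)
            elem->vd->vdev_remove_wanted = B_TRUE;   /* event reaches every vdev */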
PR: 218634
Reviewed by: gibbs
MFC after: 1 week
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D10391
The current method only sort of works, and usually doesn't work reliably.
Also, on Book-E the return address from DEBUG exceptions does not match the
sentinel addresses, so the loop won't exit correctly.
Fix this by better handling trap frames during unwinding, and using the
common trap handler for debug traps, as the code in that segment is
identical between the two.
MFC after: 1 week
r314370 changed EXC_DTRACE to a different instruction, but neglected to
make the same change to fbt, so dtrace didn't actually pick it up,
resulting in entering KDB instead of trapping for dtrace.
MFC after: 1 week
7740 fix for 6513 only works in hole punching case, not truncation
illumos/illumos-gate@7de35a3ed0
https://www.illumos.org/issues/7740
The problem is that dbuf_findbp will return ENOENT if the block it's
trying to find is beyond the end of the file. If that happens, we assume
there is no birth time, and so we lose that information when we write
out new blkptrs. We should teach dbuf_findbp to look for things that are
beyond the current end, but not beyond the absolute end of the file.
To verify, create a large file, truncate it to a short length, and then
write beyond the end. Check with zdb to make sure that there are no
holes with birth time zero (will appear as gaps).
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Paul Dagnelie <pcd@delphix.com>
7743 per-vdev-zaps have no initialize path on upgrade
illumos/illumos-gate@555da5111b
https://www.illumos.org/issues/7743
When loading a pool that had been created before the existence of
per-vdev zaps, on a system that knows about per-vdev zaps, the
per-vdev zaps will not be allocated and initialized.
This appears to be because the logic that would have done so, in
spa_sync_config_object(), is not reached under normal operation. It is
only reached if spa_config_dirty_list is non-empty.
The fix is to add another `AVZ_ACTION_` enum that will allow this code
to be reached when we detect that we're loading an old pool, even when
there are no dirty configs.
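A hedged sketch of the enum addition; the existing values are abbreviated and
the naming follows the description above:
    typedef enum spa_avz_action {
            AVZ_ACTION_NONE = 0,
            AVZ_ACTION_DESTROY,
            AVZ_ACTION_REBUILD,
            AVZ_ACTION_INITIALIZE   /* new: old pool loaded, allocate per-vdev ZAPs */
    } spa_avz_action_t;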
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Paul Dagnelie <pcd@delphix.com>
7613 ms_freetree[4] is only used in syncing context
illumos/illumos-gate@5f14577801
https://www.illumos.org/issues/7613
metaslab_t:ms_freetree[TXG_SIZE] is only used in syncing context. We should
replace it with two trees: the freeing tree (ranges that we are freeing this
syncing txg) and the freed tree (ranges which have been freed this txg).
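A minimal sketch of the replacement fields in metaslab_t; the comments
paraphrase the description above:
    range_tree_t *ms_freeingtree;  /* ranges being freed in this syncing txg */
    range_tree_t *ms_freedtree;    /* ranges already freed in this txg */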
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
7586 remove #ifdef __lint hack from dmu.h
illumos/illumos-gate@4ba5b96163
https://www.illumos.org/issues/7586
The #ifdef __lint in dmu.h is ugly, and it would be nice not to duplicate it if
we add other inline functions into header files in ZFS, especially since it is
difficult to make any other solution work across all compilation targets. We
should switch to disabling the lint flags that are failing instead.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Dan Kimmel <dan.kimmel@delphix.com>
7580 ztest failure in dbuf_read_impl
illumos/illumos-gate@1a01181fdc
https://www.illumos.org/issues/7580
We need to prevent any reader whenever we're about to zero out all the
blkptrs. To do this we need to grab the dn_struct_rwlock as writer in
dbuf_write_children_ready and free_children just prior to calling bzero.
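A hedged sketch of the locking added around the zeroing of the block pointers:
    rw_enter(&dn->dn_struct_rwlock, RW_WRITER);  /* keep readers out while blkptrs are zeroed */
    bzero(db->db.db_data, db->db.db_size);
    rw_exit(&dn->dn_struct_rwlock);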
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: George Wilson <george.wilson@delphix.com>
7606 dmu_objset_find_dp() takes a long time while importing pool
illumos/illumos-gate@7588687e6b
https://www.illumos.org/issues/7606
When importing a pool with a large number of filesystems within the same
parent filesystem, we see that dmu_objset_find_dp() takes a long time.
It is called from 3 places: spa_check_logs(), spa_ld_claim_log_blocks(),
and spa_load_verify().
There are several ways to improve performance here:
1. We don't really need to do spa_check_logs() or
spa_ld_claim_log_blocks() if the pool was closed cleanly.
2. spa_load_verify() uses dmu_objset_find_dp() to check that no
datasets have names that are too long.
3. dmu_objset_find_dp() is slow because it's doing
zap_value_search() (which is O(N sibling datasets)) to determine
the name of each dsl_dir when it's opened. In this case we
actually know the name when we are opening it, so we can provide
it and avoid the lookup.
This change implements fix #3 from the above list; i.e. make
dmu_objset_find_dp() provide the name of the dataset so that we don't
have to search for it.
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <prashksp@gmail.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Author: Matthew Ahrens <mahrens@delphix.com>
7252 7628 compressed zfs send / receive
illumos/illumos-gate@5602294fda
https://www.illumos.org/issues/7252
This feature includes code to allow a system with compressed ARC enabled to
send data in its compressed form straight out of the ARC, and receive data in
its compressed form directly into the ARC.
7628 create long versions of ZFS send / receive options
https://www.illumos.org/issues/7628
We should have longer, more readable versions of the ZFS send / recv options.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: David Quigley <dpquigl@davequigley.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Dan Kimmel <dan.kimmel@delphix.com>
7386 zfs get does not work properly with bookmarks
illumos/illumos-gate@edb901aab9
https://www.illumos.org/issues/7386
The zfs get command does not work with the bookmark parameter while it works
properly with both filesystems and snapshots:
# zfs get -t all -r creation rpool/test
NAME               PROPERTY  VALUE                  SOURCE
rpool/test         creation  Fri Sep 16 15:00 2016  -
rpool/test@snap    creation  Fri Sep 16 15:00 2016  -
rpool/test#bkmark  creation  Fri Sep 16 15:00 2016  -
# zfs get -t all -r creation rpool/test@snap
NAME             PROPERTY  VALUE                  SOURCE
rpool/test@snap  creation  Fri Sep 16 15:00 2016  -
# zfs get -t all -r creation rpool/test#bkmark
cannot open 'rpool/test#bkmark': invalid dataset name
#
The zfs get command should be modified to work properly with bookmarks too.
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Marcel Telka <marcel@telka.sk>
7490 real checksum errors are silenced when zinject is on
illumos/illumos-gate@6cedfc397d
https://www.illumos.org/issues/7490
When zinject is on, error codes from zfs_checksum_error() can be overwritten
due to an incorrect and overly-complex if condition.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
7448 ZFS doesn't notice when disk vdevs have no write cache
illumos/illumos-gate@295438ba32
https://www.illumos.org/issues/7448
I built a SmartOS image with all the NVMe commits including 7372
(support NVMe volatile write cache) and repeated my dd testing:
> #!/bin/bash
> for i in `seq 1 1000`; do
> dd if=/dev/zero of=file00 bs=1M count=102400 oflag=sync &
> dd if=/dev/zero of=file01 bs=1M count=102400 oflag=sync &
> wait
> rm file00 file01
> done
>
Previously each dd command took ~145 seconds to finish; now it takes
~400 seconds.
Eventually I figured out that it is 7372 that causes unnecessary
nvme_bd_sync() executions, which waste CPU cycles.
If an NVMe device doesn't support a write cache, the nvme_bd_sync function will
return ENOTSUP to indicate this to upper layers.
It seems this returned value is ignored by ZFS, and as such this bug is not
really specific to NVMe. In vdev_disk_io_start() ZFS sends the flush to the
disk driver (blkdev) with a callback to vdev_disk_ioctl_done(). As nvme filled
in the bd_sync_cache function pointer, blkdev will not return ENOTSUP, as the
nvme driver in general does support cache flush. Instead it will issue an
asynchronous flush to nvme and immediately return 0, and hence ZFS will not set
vdev_nowritecache here. The nvme driver will at some point process the cache
flush command, and if there is no write cache on the device it will return
ENOTSUP, which will be delivered to the vdev_disk_ioctl_done() callback. This
function will not check the error code and not set nowritecache.
The right place to check the error code from the cache flush is in
zio_vdev_io_assess(). This would catch both cases, synchronous and asynchronous
cache flushes. This would also be independent of the implementation detail that
some drivers can return ENOTSUP immediately.
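A hedged sketch of the suggested check in zio_vdev_io_assess(), with the
condition simplified:
    if (zio->io_type == ZIO_TYPE_IOCTL &&
        zio->io_cmd == DKIOCFLUSHWRITECACHE &&
        zio->io_error == ENOTSUP)
            zio->io_vd->vdev_nowritecache = B_TRUE;  /* no write cache to flush */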
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
Obtained from: Illumos
7430 Backfill metadnode more intelligently
illumos/illumos-gate@af346df588
https://www.illumos.org/issues/7430
Description and patch brought over from the following ZoL commit:
https://github.com/zfsonlinux/zfs/commit/68cbd56e182ab949f58d004778d463aeb3f595c6
Only attempt to backfill lower metadnode object numbers if at least
4096 objects have been freed since the last rescan, and at most once
per transaction group. This avoids a pathology in dmu_object_alloc()
that caused O(N^2) behavior for create-heavy workloads and
substantially improves object creation rates. As summarized by
@mahrens in #4636:
"Normally, the object allocator simply checks to see if the next
object is available. The slow calls happened when dmu_object_alloc()
checks to see if it can backfill lower object numbers. This happens
every time we move on to a new L1 indirect block (i.e. every 32 *
128 = 4096 objects). When re-checking lower object numbers, we use
the on-disk fill count (blkptr_t:blk_fill) to quickly skip over
indirect blocks that don't have enough free dnodes (defined as an L2
with at least 393,216 of 524,288 dnodes free). Therefore, we may
find that a block of dnodes has a low (or zero) fill count, and yet
we can't allocate any of its dnodes, because they've been allocated
in memory but not yet written to disk. In this case we have to hold
each of the dnodes and then notice that it has been allocated in
memory.
The end result is that allocating N objects in the same TXG can
require CPU usage proportional to N^2."
Add a tunable dmu_rescan_dnode_threshold to define the number of
objects that must be freed before a rescan is performed. Don't bother
to export this as a module option because testing doesn't show a
compelling reason to change it. The vast majority of the performance
gain comes from limiting the rescan to at most once per TXG.
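A hedged sketch of the backfill gate described above; the variable and field
names only approximate the ZoL change:
    uint64_t dmu_rescan_dnode_threshold = 4096;  /* objects freed before a rescan */

    /*
     * In dmu_object_alloc(): only restart the search at lower object numbers
     * once enough dnodes have been freed, at most once per txg.
     */
    if (os->os_freed_dnodes >= dmu_rescan_dnode_threshold) {
            os->os_freed_dnodes = 0;
            restart_search = B_TRUE;
    }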
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Author: Ned Bass <bass6@llnl.gov>
Obtained from: Illumos
in place. To do per-cpu stats, convert all fields that previously were
maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
before we have set up UMA and we can do counter_u64_alloc(), provide an
early counter mechanism:
o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
so that at early stages of boot, before counters are allocated we already
point to a counter that can be safely written to.
o For sparc64 that required a whole dummy pcpu[MAXCPU] array.
Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.
This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html
Reviewed by: kib, gallatin, marius, lidl
Differential Revision: https://reviews.freebsd.org/D10156
While the former name is easier to read, the "_flags" suffix has a special
meaning for loader(8) and, thus, it was impossible to set the knob via
loader.conf(5). The loader interpreted the setting as flags that should
be passed to a kernel module named "vfs.zfs.debug".
Discussed with: smh
MFC after: 2 weeks
When opening a vdev whose path is unknown, vdev_geom must find a geom
provider with a label whose guids match the desired vdev. However, due to
partitioning, it is possible that two non-synonomous providers will share
some labels. For example, if the first partition starts at the beginning of
the drive, then ada0 and ada0p1 will share the first label. More troubling,
if the last partition runs to the end of the drive, then ada0p3 and ada0
will share the last label. If vdev_geom opens ada0 when it should've opened
ada0p3, then the pool won't be readable. If it opens ada0 when it should've
opened ada0p1, then it will corrupt some other partition when it writes the
3rd and 4th labels.
The easiest way to reproduce this problem is to install a mirrored root pool
with the default partition layout, then swap the positions of the two boot
drives and reboot. Whether the bug manifests depends on the order in which
geom lists its providers, which is arbitrary.
Fix this situation by modifying the search algorithm to prefer geom
providers that have all four labels intact. If no such provider exists, then
open whichever provider has the most.
Reviewed by: mav
MFC after: 3 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D10365
The MFC will include a compat definition of smp_no_rendevous_barrier()
that calls smp_no_rendezvous_barrier().
Reviewed by: gnn, kib
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D10313
When a member of a RAIDZ has been replaced with a device smaller than the
original, the top-level vdev can report its expand size as 16.0E.
The reduced child asize causes the RAIDZ to have a vdev_asize lower than its
vdev_max_asize, which then results in an underflow during the calculation of
the parent's expand size.
Fix this by updating the vdev_asize if it shrinks, which is already
protected by a check against vdev_min_asize so should always be safe.
Also for RAIDZ vdevs, ensure that the sum of their child vdev_min_asize is
always greater than the parent's vdev_min_size.
Fixes: https://www.illumos.org/issues/7885
MFC after: 2 weeks
Sponsored by: Multiplay
7603 xuio_stat_wbuf_* should be declared (void)
illumos/illumos-gate@99aa8b5505
https://www.illumos.org/issues/7603
The functions are declared K&R style, where the args are not specified:
void xuio_stat_wbuf_copied();
They should be declared to take no arguments:
void xuio_stat_wbuf_copied(void);
Need to change both .c and .h.
Author: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
illumos/illumos-gate@8363e80ae7
https://github.com/illumos/illumos-gate/commit/8363e80ae72609660f6090766ca8c2c18
https://www.illumos.org/issues/7303
This change introduces a new weighting algorithm to improve metaslab selection.
The new weighting algorithm relies on the SPACEMAP_HISTOGRAM feature. As a result,
the metaslab weight now encodes the type of weighting algorithm used
(size-based vs segment-based).
This also introduces a new allocation tracing facility and two new dcmds to help
debug allocation problems. Each zio now contains a zio_alloc_list_t structure
that is populated as the zio goes through the allocation stage. Here's an
example of how to use the tracing facility:
> c5ec000::print zio_t io_alloc_list | ::walk list | ::metaslab_trace
MSID  DVA  ASIZE  WEIGHT  RESULT           VDEV
   -    0    400       0  NOT_ALLOCATABLE  ztest.0a
   -    0    400       0  NOT_ALLOCATABLE  ztest.0a
   -    0    400       0  ENOSPC           ztest.0a
   -    0    200       0  NOT_ALLOCATABLE  ztest.0a
   -    0    200       0  NOT_ALLOCATABLE  ztest.0a
   -    0    200       0  ENOSPC           ztest.0a
   1    0    400  1 x 8M  17b1a00          ztest.0a
> 1ff2400::print zio_t io_alloc_list | ::walk list | ::metaslab_trace
MSID  DVA  ASIZE  WEIGHT  RESULT           VDEV
   -    0    200       0  NOT_ALLOCATABLE  mirror-2
   -    0    200       0  NOT_ALLOCATABLE  mirror-0
   1    0    200  1 x 4M  112ae00          mirror-1
   -    1    200       0  NOT_ALLOCATABLE  mirror-2
   -    1    200       0  NOT_ALLOCATABLE  mirror-0
   1    1    200  1 x 4M  112b000          mirror-1
   -    2    200       0  NOT_ALLOCATABLE  mirror-2
If the metaslab is using segment-based weighting then the WEIGHT column will
display the number of segments available in the bucket where the allocation
attempt was made.
Author: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Chris Siden <christopher.siden@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
This way we can avoid blocking the whole queue in low-memory
situations. It's better to sacrifice some I/O performance by not doing
the aggregation than to add an indefinite wait for more memory.
Reviewed by: smh
MFC after: 2 weeks
Sponsored by: Panzura
Differential Revision: https://reviews.freebsd.org/D9999
As ZFS can request a memory block of up to SPA_MAXBLOCKSIZE, e.g. during zfs recv,
update the threshold at which we start aggressive reclamation to use
SPA_MAXBLOCKSIZE (16M) instead of the lower zfs_max_recordsize, which
defaults to 1M.
PR: 194513
Reviewed by: avg, mav
MFC after: 1 month
Sponsored by: Multiplay
Differential Revision: https://reviews.freebsd.org/D10012
illumos/illumos-gate@6de76ce2a9
https://www.illumos.org/issues/7867
It seems that in the case where arc_hdr_free_pdata() sees HDR_L2_WRITING() we
would fail to update the ARC space statistics.
In the normal case those statistics are updated in arc_free_data_buf(). But in
the arc_hdr_free_on_write() path we don't do that.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Andriy Gapon <avg@FreeBSD.org>
MFC after: 10 days
illumos/illumos-gate@c5bde7273e
https://www.illumos.org/issues/7843
get_clones_stat() could be very slow if a snapshot has many (thousands of) clones.
Clone names are added to an nvlist that's created with NV_UNIQUE_NAME.
So, each time a new name is appended to the list, the whole list is searched
linearly to see if that name is not already in the list. That results in
quadratic complexity.
That should be easy to fix as we know in advance that we should not get any
duplicate names, so we can drop NV_UNIQUE_NAME when creating the list.
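A minimal sketch of the fix: allocate the clone-name list without
NV_UNIQUE_NAME so appends are O(1); clone_name is a placeholder:
    nvlist_t *val;
    VERIFY0(nvlist_alloc(&val, 0, KM_SLEEP));    /* no NV_UNIQUE_NAME */
    /* for each clone: */
    fnvlist_add_boolean(val, clone_name);        /* appends without a duplicate scan */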
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Andriy Gapon <avg@FreeBSD.org>
MFC after: 1 week
Sponsored by: ClusterHQ
For short transactions, the overhead of a context switch can be too large;
skipping it gives a significant latency reduction. For large ones,
comprising multiple ZIOs, latency is less critical, while throughput
may become limited by the checksumming speed of a single CPU core.
To get the best of both cases, execute the last ZIO directly from the calling
thread context to save latency, while all others (if there are any) are
enqueued to taskqueues in the traditional way.
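A hedged, highly simplified sketch of the dispatch policy; last_zio_in_batch is
a placeholder for however the last ZIO is identified in the actual change:
    if (last_zio_in_batch)
            zio_execute(zio);                    /* run inline: save a context switch */
    else
            zio_taskq_dispatch(zio, ZIO_TASKQ_ISSUE, B_FALSE);  /* queue as before */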
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.