illumos/illumos-gate@6cedfc397d6cedfc397dhttps://www.illumos.org/issues/7490
When zinject is on, error codes from zfs_checksum_error() can be overwritten
due to an incorrect and overly-complex if condition.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
illumos/illumos-gate@295438ba32295438ba32https://www.illumos.org/issues/7448
I built a SmartOS image with all the NVMe commits including 7372
(support NVMe volatile write cache) and repeated my dd testing:
> #!/bin/bash
> for i in `seq 1 1000`; do
> dd if=/dev/zero of=file00 bs=1M count=102400 oflag=sync &
> dd if=/dev/zero of=file01 bs=1M count=102400 oflag=sync &
> wait
> rm file00 file01
> done
>
Previously each dd command took ~145 seconds to finish, now it takes
~400 seconds.
Eventually I figured out it is 7372 that causes unnecessary
nvme_bd_sync() executions which wasted CPU cycles.
If a NVMe device doesn't support a write cache, the nvme_bd_sync function will
return ENOTSUP to indicate this to upper layers.
It seems this returned value is ignored by ZFS, and as such this bug is not
really specific to NVMe. In vdev_disk_io_start() ZFS sends the flush to the
disk driver (blkdev) with a callback to vdev_disk_ioctl_done(). As nvme filled
in the bd_sync_cache function pointer, blkdev will not return ENOTSUP, as the
nvme driver in general does support cache flush. Instead it will issue an
asynchronous flush to nvme and immediately return 0, and hence ZFS will not set
vdev_nowritecache here. The nvme driver will at some point process the cache
flush command, and if there is no write cache on the device it will return
ENOTSUP, which will be delivered to the vdev_disk_ioctl_done() callback. This
function will not check the error code and not set nowritecache.
The right place to check the error code from the cache flush is in
zio_vdev_io_assess(). This would catch both cases, synchronous and asynchronous
cache flushes. This would also be independent of the implementation detail that
some drivers can return ENOTSUP immediately.
Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
illumos/illumos-gate@af346df588af346df588https://www.illumos.org/issues/7430
Description and patch from brought over from the following ZoL commit: https://
github.com/zfsonlinux/zfs/commit/68cbd56e182ab949f58d004778d463aeb3f595c6
Only attempt to backfill lower metadnode object numbers if at least
4096 objects have been freed since the last rescan, and at most once
per transaction group. This avoids a pathology in dmu_object_alloc()
that caused O(N^2) behavior for create-heavy workloads and
substantially improves object creation rates. As summarized by
@mahrens in #4636:
"Normally, the object allocator simply checks to see if the next
object is available. The slow calls happened when dmu_object_alloc()
checks to see if it can backfill lower object numbers. This happens
every time we move on to a new L1 indirect block (i.e. every 32 *
128 = 4096 objects). When re-checking lower object numbers, we use
the on-disk fill count (blkptr_t:blk_fill) to quickly skip over
indirect blocks that don’t have enough free dnodes (defined as an L2
with at least 393,216 of 524,288 dnodes free). Therefore, we may
find that a block of dnodes has a low (or zero) fill count, and yet
we can’t allocate any of its dnodes, because they've been allocated
in memory but not yet written to disk. In this case we have to hold
each of the dnodes and then notice that it has been allocated in
memory.
The end result is that allocating N objects in the same TXG can
require CPU usage proportional to N^2."
Add a tunable dmu_rescan_dnode_threshold to define the number of
objects that must be freed before a rescan is performed. Don't bother
to export this as a module option because testing doesn't show a
compelling reason to change it. The vast majority of the performance
gain comes from limit the rescan to at most once per TXG.
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Author: Ned Bass <bass6@llnl.gov>
illumos/illumos-gate@99aa8b550599aa8b5505https://www.illumos.org/issues/7603
The funcs are declared k&r style, where the args are not specified:
void xuio_stat_wbuf_copied();
They should be declared to take no arguments:
void xuio_stat_wbuf_copied(void);
Need to change both .c and .h.
Author: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
illumos/illumos-gate@8363e80ae7https://github.com/illumos/illumos-gate/commit/8363e80ae72609660f6090766ca8c2c18aa53f0https://www.illumos.org/issues/7303
This change introduces a new weighting algorithm to improve metaslab selection.
The new weighting algorithm relies on the SPACEMAP_HISTOGRAM feature. As a result,
the metaslab weight now encodes the type of weighting algorithm used
(size-based vs segment-based).
This also introduce a new allocation tracing facility and two new dcmds to help
debug allocation problems. Each zio now contains a zio_alloc_list_t structure
that is populated as the zio goes through the allocations stage. Here's an
example of how to use the tracing facility:
> c5ec000::print zio_t io_alloc_list | ::walk list | ::metaslab_trace
MSID DVA ASIZE WEIGHT RESULT VDEV
- 0 400 0 NOT_ALLOCATABLE ztest.0a
- 0 400 0 NOT_ALLOCATABLE ztest.0a
- 0 400 0 ENOSPC ztest.0a
- 0 200 0 NOT_ALLOCATABLE ztest.0a
- 0 200 0 NOT_ALLOCATABLE ztest.0a
- 0 200 0 ENOSPC ztest.0a
1 0 400 1 x 8M 17b1a00 ztest.0a
> 1ff2400::print zio_t io_alloc_list | ::walk list | ::metaslab_trace
MSID DVA ASIZE WEIGHT RESULT VDEV
- 0 200 0 NOT_ALLOCATABLE mirror-2
- 0 200 0 NOT_ALLOCATABLE mirror-0
1 0 200 1 x 4M 112ae00 mirror-1
- 1 200 0 NOT_ALLOCATABLE mirror-2
- 1 200 0 NOT_ALLOCATABLE mirror-0
1 1 200 1 x 4M 112b000 mirror-1
- 2 200 0 NOT_ALLOCATABLE mirror-2
If the metaslab is using segment-based weighting then the WEIGHT column will
display the number of segments available in the bucket where the allocation
attempt was made.
Author: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Chris Siden <christopher.siden@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
illumos/illumos-gate@6de76ce2a96de76ce2a9https://www.illumos.org/issues/7867
It seems that in the case where arc_hdr_free_pdata() sees HDR_L2_WRITING() we
would fail to update the ARC space statistics.
In the normal case those statistics are updated in arc_free_data_buf(). But in
the arc_hdr_free_on_write() path we don't do that.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Andriy Gapon <avg@FreeBSD.org>
illumos/illumos-gate@c5bde7273ec5bde7273ehttps://www.illumos.org/issues/7843
get_clones_stat() could be very slow if a snapshot has many (thousands) clones.
Clone names are added to an nvlist that's created with NV_UNIQUE_NAME.
So, each time a new name is appended to the list, the whole list is searched
linearly to see if that name is not already in the list. That results in the
quadratic complexity.
That should be easy to fix as we know in advance that we should not get any
duplicate names, so we can drop NV_UNIQUE_NAME when creating the list.
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Andriy Gapon <avg@FreeBSD.org>
illumos/illumos-gate@1c9272b8611c9272b861https://www.illumos.org/issues/7570
Based on the discovery that every unmap waits for the commit of the txn to the ZIL,
introducing a very high latency to unmap commands, this behavior was made into a
tunable zvol_unmap_sync_enabled and set to false. The net impact of this change is
that by default SCSI unmap commands will result in space being freed within the zvol
(today they are ignored and returned with good status). However, unlike the code
today, instead of 18+ms per unmap, they take about 30us.
With the testing done on NTFS against a Win2k12 target, the new behavior should work
seamlessly. Files on the zvol that have already been set with the zfree application
will continue to write 0's when deleted, and any new files created since zvol
creation will send unmap commands when deleted. This behavior exists today, but with
this change the unmap commands will be processed and result in reclaim of space.
Author: Stephen Blinick <stephen.blinick@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Approved by: Robert Mustacchi <rm@joyent.com>
illumos/illumos-gate@40510e8eba40510e8ebahttps://www.illumos.org/issues/6676
The fsid of zfs filesystems might change after reboot or remount. The problem seems to
be caused by a race between unique_insert() and unique_remove(). The unique_remove()
is called from dsl_dataset_evict() which is now an asynchronous thread. In a case the
dsl_dataset_evict() thread is very slow and calls unique_remove() too late we will end
up with changed fsid on zfs mount.
This problem is very likely caused by #5056.
Steps to Reproduce
Note: I'm able to reproduce this always on a single core (virtual) machine. On multicore
machines it is not so easy to reproduce.
# uname -a
SunOS openindiana 5.11 illumos-633aa80 i86pc i386 i86pc Solaris
# zfs create rpool/TEST
# FS=$(echo ::fsinfo | mdb -k | grep TEST | awk '{print $1}')
# echo $FS::print vfs_t vfs_fsid | mdb -k
vfs_fsid = {
vfs_fsid.val = [ 0x54d7028a, 0x70311508 ]
}
# zfs umount rpool/TEST
# zfs mount rpool/TEST
# FS=$(echo ::fsinfo | mdb -k | grep TEST | awk '{print $1}')
# echo $FS::print vfs_t vfs_fsid | mdb -k
vfs_fsid = {
vfs_fsid.val = [ 0xd9454e49, 0x6b36d08 ]
}
#
Impact
The persistent fsid (filesystem id) is essential for proper NFS functionality.
If the fsid of a filesystem changes on remount (or after reboot) the NFS
clients might not be able to automatically recover from such event and the
manual remount of the NFS filesystems on every NFS client might be needed.
Author: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Dan Vatca <dan.vatca@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
illumos/illumos-gate@405a5a0f5chttps://github.com/illumos/illumos-gate/commit/405a5a0f5c3ab36cb76559467d1a62ba648bd80https://www.illumos.org/issues/7504
We see long spa_sync(). We are waiting to hold dp_config_rwlock for writer. Some
other thread holds dp_config_rwlock for reader, then calls arc_get_data_buf(),
which finds that arc_is_overflowing()==B_TRUE. So it waits (while holding
dp_config_rwlock for reader) for arc_reclaim_thread to signal arc_reclaim_waiters_cv.
Before signaling, arc_reclaim_thread does arc_kmem_reap_now(), which takes ~seconds.
Author: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
illumos/illumos-gate@653af1b809653af1b809https://www.illumos.org/issues/7500
With the integration of:
commit 0f6d88aded0d165f5954688a9b13bac76c38da84
Author: Alex Reece <alex@delphix.com>
Date: Sat Jul 26 13:40:04 2014 -0800
4873 zvol unmap calls can take a very long time for larger datasets
the dnode's dn_bufs field was changed from a list to a tree. As a result,
the dn_unlisted_l0_blkid field is no longer necessary.
Author: Stephen Blinick <stephen.blinick@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
illumos/illumos-gate@ff5177ee8bff5177ee8bhttps://www.illumos.org/issues/6569
The core issue I've found is that there is no throttle for how many
deletes get assigned to one TXG. As a results when deleting large files
we end up filling consecutive TXGs with deletes/frees, then write
throttling other (more important) ops.
There is an easy test case for this problem. Try deleting several
large files (at least 1/2 TB) while you do write ops on the same
pool. What we've seen is performance of these write ops (let's
call it sideload I/O) would drop to zero.
More specifically the problem is that dmu_free_long_range_impl()
can/will fill up all of the dirty data in the pool "instantly",
before many of the sideload ops can get in. So sideload
performance will be impacted until all the files are freed.
The solution we have tested at Nexenta (with positive results)
creates a relatively simple throttle for how many "free" ops we let
into one TXG.
However this solution exposes other problems that should also be
addressed. If we are to slow down freeing of data that means one
has to wait even longer (assuming vnode ref count of 1) to get shell
back after an rm or for NFS thread to finish the free-ing op.
To avoid this the proposed solution is to call zfs_inactive() async
for "large" files. Async freeing then begs for the reclaimed space
to be accounted for in the zpool's "freeing" prop.
The other issue with having a longer delete is the inability to
export/unmount for a longer period of time. The proposed solution
is to interrupt freeing of blocks when a fs is unmounted.
Author: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
illumos/illumos-gate@43297f973a43297f973ahttps://www.illumos.org/issues/3821
We recently had nodes with some of the latest zfs bits panic on us in a
rollback-heavy environment. The following is from my preliminary analysis:
Let's look at where we died:
> $C
ffffff01ea6b9a10 taskq_dispatch+0x3a(0, fffffffff7d20450, ffffff5551dea920, 1)
ffffff01ea6b9a60 zil_clean+0xce(ffffff4b7106c080, 7e0f1)
ffffff01ea6b9aa0 dsl_pool_sync_done+0x47(ffffff4313065680, 7e0f1)
ffffff01ea6b9b70 spa_sync+0x55f(ffffff4310c1d040, 7e0f1)
ffffff01ea6b9c20 txg_sync_thread+0x20f(ffffff4313065680)
ffffff01ea6b9c30 thread_start+8()
If we dig in we can find that this dataset corresponds to a zone:
> ffffff4b7106c080::print zilog_t zl_os->os_dsl_dataset->ds_dir->dd_myname
zl_os->os_dsl_dataset->ds_dir->dd_myname = [ "8ffce16a-13c2-4efa-a233-
9e378e89877b" ]
Okay so we have a null taskq pointer. That only happens during the calls to
zil_open and zil_close. If we poke around we can see that we're actually in
midst of a rollback:
> ::pgrep zfs | ::printf "0x%x %s\\n" proc_t . p_user.u_psargs
0xffffff43262800a0 zfs rollback zones/15714eb6-f5ea-469f-ac6d-
4b8ab06213c2@marlin_init
0xffffff54e22a1028 zfs rollback zones/8ffce16a-13c2-4efa-a233-
9e378e89877b@marlin_init
0xffffff4362f3a058 zfs rollback zones/0ddb8e49-ca7e-42e1-8fdc-
4ac4ba8fe9f8@marlin_init
0xffffff5748e8d020 zfs rollback zones/426357b5-832d-4430-953e-
10cd45ff8e9f@marlin_init
0xffffff436b867008 zfs rollback zones/8f36bf37-8a9c-4a44-995c-
6d1b2751e6f5@marlin_init
0xffffff4381ad4090 zfs rollback zones/6c8eca18-fbd6-46dd-ac24-
2ed45cd0da70@marlin_init
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: George Wilson <george.wilson@delphix.com>
illumos/illumos-gate@90f2c094b390f2c094b3https://www.illumos.org/issues/7181
zfsvfs_setup() is called in both zfs_mount and zfs_resume_fs paths.
dmu_objset_set_user(zfsvfs->z_os, zfsvfs) is called early in zfsvfs_setup()
before the setup is actually completed,
thus an under-constructed zfsvfs becomes visible.
Additionally, there is nothing to serialize the two call paths. As a result two
threads can step on each other's toes.
assertion failed: zilog->zl_clean_taskq == NULL, file:
../../common/fs/zfs/zil.c, line: 1772
> $c
vpanic()
0xfffffffffbdf6928()
zil_open+0x45(ffffff1bbc5dd000, fffffffff7993880)
zfsvfs_setup+0x84(ffffffb378d77000, 0)
zfs_resume_fs+0x132(ffffffb378d77000, ffffffb37ddcf000)
zfs_ioc_rollback+0x96(ffffffb37ddcf000, ffffff01dcdc4cd0, ffffff01aa091000)
zfsdev_ioctl+0x215(10a00000000, 5a19, 80465f8, 100003, ffffff01ab318368,
ffffff0004b59e58)
cdev_ioctl+0x39(10a00000000, 5a19, 80465f8, 100003, ffffff01ab318368,
ffffff0004b59e58)
spec_ioctl+0x60(ffffff0197737700, 5a19, 80465f8, 100003,
ffffff01ab318368, ffffff0004b59e58)
fop_ioctl+0x55(ffffff0197737700, 5a19, 80465f8, 100003,
ffffff01ab318368, ffffff0004b59e58)
ioctl+0x9b(7, 5a19, 80465f8)
sys_syscall32+0x1f7()
> ffffff1bbc5dd000::print objset_t os_zil
os_zil = 0xffffff1c053cf7c0
> 0xffffff1c053cf7c0::print zilog_t zl_clean_taskq
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Author: Andriy Gapon <andriy.gapon@clusterhq.com>
7199 dsl_dataset_rollback_sync may try to free already free blocks
7200 no blocks must be born in a txg after a snaphot is created
illumos/illumos-gate@bfaed0b91ebfaed0b91ehttps://www.illumos.org/issues/7199
dsl_dataset_rollback_sync may try to free already freed blocks when it calls
dsl_destroy_head_sync_impl to destroy a temporary clone.
That happens if a snapshot to which we are rolling back and from which the
clone is created has some ZIL records.
https://www.illumos.org/issues/7200
No new blocks must be born in a dataset in the same TXG after a snapshot of the
dataset is taken.
Those blocks would have the same blk_birth as the dataset's ds_prev_snap_txg
and as such they would be presumed to belong o the snapshot while in fact they
do not.
All the datasets must be clean before sync tasks are run, so the described
scenario may happen only if one of the sync tasks dirties the dataset and
another sync task takes its snapshot.
Then, there will be another sync pass because of the dirty data and the new
blocks will be born in the same TXG when the data is written out.
It seems that almost all of the existing sync tasks modify only MOS and do not
dirty any objsets.
The only exception that I've been able to identify so far is the rollback which
can modify an objset when it zeroes out the objset's ZIL.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Author: Andriy Gapon <andriy.gapon@clusterhq.com>
illumos/illumos-gate@690041b9ca690041b9cahttps://www.illumos.org/issues/7180
If a filesystem is not unmounted while the rename is being performed, then, for
example, a concurrect zfs rollback may call zfs_suspend_fs followed by
zfs_resume_fs on the same filesystem.
The latter takes the filesystem's name as an argument. If the filesystem name
changes as a result of the rename, then dmu_objset_hold(osname, zfsvfs, &os)
call in zfs_resume_fs would fail resulting in a kernel panic.
So far I have been able to reproduce this problem on FreeBSD where zfs rename
has -u option that skips the unmounting before doing the renaming.
But I think that in theory the same problem can occur on illumos as well,
because the unmounting is done in userland before invoking the rename ioctl and
there could be a race with, e.g., zfs mount.
panic: solaris assert: dmu_objset_hold(osname, zfsvfs, &zfsvfs->z_os) == 0 (0x2
== 0x0), file: /usr/devel/svn/head/sys/cddl/contrib/opensolaris/uts/common/fs/
zfs/zfs_vfsops.c, line: 2210
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe004df30710
vpanic() at vpanic+0x182/frame 0xfffffe004df30790
panic() at panic+0x43/frame 0xfffffe004df307f0
assfail3() at assfail3+0x2c/frame 0xfffffe004df30810
zfs_resume_fs() at zfs_resume_fs+0xb9/frame 0xfffffe004df30860
zfs_ioc_rollback() at zfs_ioc_rollback+0x61/frame 0xfffffe004df308a0
zfsdev_ioctl() at zfsdev_ioctl+0x65c/frame 0xfffffe004df30940
devfs_ioctl_f() at devfs_ioctl_f+0x156/frame 0xfffffe004df309a0
kern_ioctl() at kern_ioctl+0x246/frame 0xfffffe004df30a00
sys_ioctl() at sys_ioctl+0x171/frame 0xfffffe004df30ae0
amd64_syscall() at amd64_syscall+0x2db/frame 0xfffffe004df30bf0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe004df30bf0
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Andriy Gapon <andriy.gapon@clusterhq.com>
illumos/illumos-gate@260af64db7260af64db7https://www.illumos.org/issues/3746
From the original change log:
It was possible for a reference to be added even with the lock held, and
for references added just after a lock release to be lost.
This bug was also independently found and reported in wesunsolve.net
issues 6985013 6995524.
In zrl_add(), always use an atomic operation to update the refcount.
The mutex in the ZRL only guarantees that wakeups occur for waiters on the
lock. It offers no protection against concurrent updates of the refcount.
The only refcount transition that is safe to perform without an atomic
operation is from ZRL_LOCKED back to 0, since this can only be performed
by the thread which has the ZRL locked.
Authored by: Will Andrews <will@freebsd.org>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Reviewed by: Pavel Zakharov <pavel.zakha@gmail.com>
Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Reviewed by: Justin T. Gibbs <gibbs@scsiguy.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author: Youzhong Yang <yyang@mathworks.com>
Using a benchmark which creates 2 million files in one TXG, I observe
that the thread running spa_sync() is on CPU almost the entire time we
are syncing, and therefore can be a performance bottleneck. About 50% of
the time in spa_sync() is in dmu_objset_do_userquota_updates().
The problem is that dmu_objset_do_userquota_updates() calls
zap_increment_int(DMU_USERUSED_OBJECT) once for every file that was
modified (or created). In this benchmark, all the files are owned by the
same user/group, so all 2 million calls to zap_increment_int() are
modifying the same entry in the zap. The same issue exists for the
DMU_GROUPUSED_OBJECT.
We should keep an in-memory map from user to space delta while we are
syncing, and when we finish, iterate over the in-memory map and modify
the ZAP once per entry. This reduces the number of calls to
zap_increment_int() from "number of objects modified" to "number of
owners/groups of modified files".
This reduced the time spent in spa_sync() in the file create benchmark
by ~33%, from 11 seconds to 7 seconds.
Closes#107
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Jinshan Xiong <jinshan.xiong@intel.com>
Author: Matthew Ahrens <mahrens@delphix.com>
openzfs/openzfs@5fc46359c5
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Toomas Soome <tsoome@me.com>
openzfs/openzfs@c8811bd3e2
Until we can resolve the numerous hole_birth bugs that have cropped up
recently, and come up with a way going forwards to protect users from
corruption, we should disable the hole_birth feature. Using a tunable
allows those who are confident that their data is correct to continue to
take advantage of the feature.
Closes#188
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Author: Paul Dagnelie <pcd@delphix.com>
dsl_dataset_space is looking at the ds_bp's fill count while
dmu_objset_write_ready() is concurrently modifying it. This fix adds an
rrwlock to protect the ds_bp.
Closes#180
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Author: Paul Dagnelie <pcd@delphix.com>
The DS_FIELD_LARGE_BLOCKS macro has been unused since the integration of
this patch:
commit ca0cc3918a1789fa839194af2a9245f801a06b1a
Author: Matthew Ahrens <mahrens@delphix.com>
Date: Fri Jul 24 09:53:55 2015 -0700
5959 clean up per-dataset feature count code
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
This patch simply removes this macro from dsl_dataset.h.
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Author: Matthew Ahrens <mahrens@delphix.com>
When changing zfs_arc_max (e.g. as zdb does), it may be set to less
than the default arc_c_min. arc_c_min should decrease to not be more than
arc_c_max, but it doesn't; therefore tuning of arc_c_max is ineffective.
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Author: Matthew Ahrens <mahrens@delphix.com>
openzfs/openzfs@608764bead
Using a benchmark which has 32 threads creating 2 million files in the
same directory, on a machine with 16 CPU cores, I observed poor
performance. I noticed that dmu_tx_hold_zap() was using about 30% of
all CPU, and doing dnode_hold() 7 times on the same object (the ZAP
object that is being held).
dmu_tx_hold_zap() keeps a hold on the dnode_t the entire time it is
running, in dmu_tx_hold_t:txh_dnode, so it would be nice to use the
dnode_t that we already have in hand, rather than repeatedly calling
dnode_hold(). To do this, we need to pass the dnode_t down through
all the intermediate calls that dmu_tx_hold_zap() makes, making these
routines take the dnode_t* rather than an objset_t* and a uint64_t
object number. In particular, the following routines will need to have
analogous *_by_dnode() variants created:
dmu_buf_hold_noread()
dmu_buf_hold()
zap_lookup()
zap_lookup_norm()
zap_count_write()
zap_lockdir()
zap_count_write()
This can improve performance on the benchmark described above by 100%,
from 30,000 file creations per second to 60,000. (This improvement is on
top of that provided by working around the object allocation issue. Peak
performance of ~90,000 creations per second was observed with 8 CPUs;
adding CPUs past that decreased performance due to lock contention.) The
CPU used by dmu_tx_hold_zap() was reduced by 88%, from 340 CPU-seconds
to 40 CPU-seconds.
Sponsored by: Intel Corp.
Closes#109
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Author: Matthew Ahrens <mahrens@delphix.com>
openzfs/openzfs@d3e523d489
This resolves two 'zfs recv' issues. First, when receiving into an
existing filesystem, a snapshot created during the receive process is
not added to the guid->dataset map for the stream, resulting in failed
lookups for deduped streams when a WRITE_BYREF record refers to a
snapshot received earlier in the stream. Second, the newly created
snapshot was also not set properly, referencing the snapshot before the
new receiving dataset rather than the existing filesystem.
Closes#159
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Author: Chris Williamson <chris.williamson@delphix.com>
openzfs/openzfs@b09697c8c1
zap_lockdir() / zap_unlockdir() should take a "void *tag" argument which
tags the hold on the zap. This will help diagnose programming errors
which misuse the hold on the ZAP.
Sponsored by: Intel Corp.
Closes#108
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Author: Matthew Ahrens <mahrens@delphix.com>
openzfs/openzfs@0780b3eab5
7115 6922 generates ESC_ZFS_VDEV_REMOVE_AUX a bit too often
illumos/illumos-gate@b72b6bb10ab72b6bb10ahttps://www.illumos.org/issues/7136
6922 added ESC_ZFS_VDEV_REMOVE_AUX and ESC_ZFS_VDEV_REMOVE_DEV sysevents
whenever an aux device gets removed from a pool. However, those sysevents will
be created without the vdev_guid and vdev_path fields. It would be better to
always populate those fields.
https://www.illumos.org/issues/7115
The addition of spa_event_notify in vdev removal code (see #6922) causes events
to be generated even if the spare failed to be removed with EBUSY.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Alan Somers <asomers@gmail.com>
illumos/illumos-gate@12b90ee2d312b90ee2d3https://www.illumos.org/issues/7230
A test failure occurred where a send stream had only a BEGIN record. This
should not be possible if the send returns without error. Prevented this from
happening in the future by adding an assertion to dmu_send_impl() to verify
that if the function returns 0 (success) both a BEGIN and END record are
present. Did this by adding flags to dmu_sendarg_t (indicating whether BEGIN or
END records sent), having dump_record() set flags appropriately, adding VERIFY
statement to dmu_send_impl().
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matt Krantz <matt.krantz@delphix.com>
illumos/illumos-gate@bd56f80007bd56f80007https://www.illumos.org/issues/7235
The function dsl_dataset_set_blkptr() is unused. We should remove it.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@0f7643c7370f7643c737https://www.illumos.org/issues/7090
When write I/Os are issued, they are issued in block order but the ZIO pipeline
will drive them asynchronously through the allocation stage which can result in
blocks being allocated out-of-order. It would be nice to preserve as much of
the logical order as possible.
In addition, the allocations are equally scattered across all top-level VDEVs
but not all top-level VDEVs are created equally. The pipeline should be able to
detect devices that are more capable of handling allocations and should
allocate more blocks to those devices. This allows for dynamic allocation
distribution when devices are imbalanced as fuller devices will tend to be
slower than empty devices.
The change includes a new pool-wide allocation queue which would throttle and
order allocations in the ZIO pipeline. The queue would be ordered by issued
time and offset and would provide an initial amount of allocation of work to
each top-level vdev. The allocation logic utilizes a reservation system to
reserve allocations that will be performed by the allocator. Once an allocation
is successfully completed it's scheduled on a given top-level vdev. Each top-
level vdev maintains a maximum number of allocations that it can handle
(mg_alloc_queue_depth). The pool-wide reserved allocations (top-levels *
mg_alloc_queue_depth) are distributed across the top-level vdevs metaslab
groups and round robin across all eligible metaslab groups to distribute the
work. As top-levels complete their work, they receive additional work from the
pool-wide allocation queue until the allocation queue is emptied.
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: George Wilson <george.wilson@delphix.com>
illumos/illumos-gate@926549256b926549256bhttps://www.illumos.org/issues/7086
In dbuf_dirty(), we need to grab the dn_struct_rwlock before looking at the
db_blkptr, to prevent it from being changed by syncing context.
Otherwise we may see that ztest got a segfault from this stack:
libzpool.so.1`dva_get_dsize_sync+0x98(872f000, b32b240, fed7811b, 0, b4cda20, 0)
libzpool.so.1`bp_get_dsize+0x60(872f000, b32b240, 0, 97cb780, 9d4c1a8, 0)
libzpool.so.1`dbuf_dirty+0x9b3(ce0a100, 97cb780, 9, fecd2530)
libzpool.so.1`dmu_buf_will_dirty+0xc3(ce0a100, 97cb780, ea293d6c, 1)
libzpool.so.1`zap_lockdir+0x1a0(8aaa3c0, 1, 0, 97cb780, 1, 1)
libzpool.so.1`zap_remove_norm+0x30(8aaa3c0, 1, 0, 8728b10, 0, 97cb780)
libzpool.so.1`zap_remove+0x29(8aaa3c0, 1, 0, 8728b10, 97cb780, a)
ztest_replay_remove+0x225(ea294588, 8728ae8, 0, 38010000, 0, 0)
ztest_remove+0x9f(ea294588, ea293f50, 4, 3)
ztest_object_init+0x78(ea294588, ea293f50, 4e0, 1)
ztest_dmu_object_alloc_free+0x71(ea294588, 13)
ztest_dmu_objset_create_destroy+0x224(80cef08, 13, 0, 805d36c, 9017ad44, 0)
ztest_execute+0x89(a, 807c720, 13, 0)
ztest_thread+0xea(13, 0, 0, 0)
libc.so.1`_thrp_setup+0x88(f0983240)
libc.so.1`_lwp_start(f0983240, 0, 0, 0, 0, 0)
Looking into it a bit, we see that this is an embedded blockpointer, so
BP_GET_NDVAS should have returned 0:
b32b240::blkptr
EMBEDDED [L0 ZAP_OTHER] et=0 LZ4 size=200L/4aP birth=80L
Instead, it looks like another thread is modifying this blockpointer:
b32b240::ugrep | ::whatis
f47a0e0c is in [ stack tid=0x19f ]
ebd6ec40 is in [ stack tid=0x226 ]
ea293bd0 is in [ stack tid=0x244 ]
ea293be4 is in [ stack tid=0x244 ]
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@c39a2aae1ec39a2aae1ehttps://www.illumos.org/issues/7072
upstream:
38733 zfs fails to expand if lun added when os is in shutdown state
DLPX-36910 spares and caches should not display expandable space
DLPX-39262 vdev_disk_open spam zfs_dbgmsg buffer
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: George Wilson <george.wilson@delphix.com>
illumos/illumos-gate@4b5c8e93ca4b5c8e93cahttps://www.illumos.org/issues/7104
The current default indirect block size is 16KB. We can improve
performance by increasing it to 128KB. This is especially helpful for
any workload that needs to read most of the metadata, e.g.
scrub/resilver, file deletion, filesystem deletion, and zfs send.
We also need to fix a few space estimation errors to make the tests
pass.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@25f7d993ad25f7d993adhttps://www.illumos.org/issues/7071
upstream
DLPX-40482 lzc_snapshot does not fill in errlist on ENOENT
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@dcbf3bd6a1dcbf3bd6a1https://www.illumos.org/issues/6950
When reading compressed data from disk, the ARC should keep the compressed
block cached and only decompress it when consumers access the block. The
uncompressed data should be short-lived allowing the ARC to cache a much larger
amount of data. The DMU would also maintain a smaller cache of uncompressed
blocks to minimize the impact of decompressing frequently accessed blocks.
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: George Wilson <george.wilson@delphix.com>
illumos/illumos-gate@10e67aa0db10e67aa0dbhttps://www.illumos.org/issues/7082
upstream
DLPX-40542 bptree_iterate() passes wrong args to zfs_dbgmsg()
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@9adfa60d489adfa60d48https://www.illumos.org/issues/6314
Callers of dsl_dataset_name pass a buffer of size ZFS_MAXNAMELEN, but
dsl_dataset_name copies the datasets' name PLUS the snapshot name to it,
resulting in a max of 2 * ZFS_MAXNAMELEN + '@'.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@20a95fb2c420a95fb2c4https://www.illumos.org/issues/5768
zfsctl_snapshot_inactive() leaks a hold on the dvp (directory vnode) if v_count > 1.
reproduce by:
create a fs with 100 snapshots.
have a thread do:
while true; do ls -l /test/snaps/.zfs/snapshot >/dev/null; done
have another thread do:
while true; do zfs promote test/clone; zfs promote test/snaps; done
use dtrace to delay & observe:
dtrace -w -xd \\
-n 'vn_rele:entry/args0 == (void*)0xffffff01dd42ce80ULL/{[stack()]=count();
chill(100000);}' \\
-n 'zfsctl_snapshot_inactive:entry{ if (args[0]->v_count > 1) trace(args[0]-
>v_count); self->vp=args[0];}' \\
-n 'gfs_vop_inactive:entry/callers["zfsctl_snapshot_inactive"]/{self->good=1;
[stack()]=count()}' \\
-n 'zfsctl_snapshot_inactive:return{if (self->good) self->good=0; else printf
("bad return");}' \\
-n 'gfs_dir_lookup:return/callers["zfsctl_snapshot_inactive"] && self->vp-
>v_count > 1/{trace(self->vp->v_count)}'
the address is found by selecting one of the output of this at random:
dtrace -n 'zfsctl_snapshot_inactive:entry{print(args[0]);'
when you see "bad return", we have hit the bug. Then doing "zfs umount test/
snaps" will fail with EBUSY.
When we hit this case, we also leak the hold on the target vnode (vn). When the
inactive callback is called on a vnode with v_count > 1, it needs to be
decremented.
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Rich Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@0c779ad4240c779ad424https://www.illumos.org/issues/7054
upstream:
ee0003de7d3e598499be7ac3fe6b61efcc47cb7f
DLPX-40399 dmu_tx_hold_t should use refcount_t to track space
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@99189164df99189164dfhttps://www.illumos.org/issues/6940
Similar to #6334, but this time with empty directories:
$ zfs create tank/quota
$ zfs set quota=10M tank/quota
$ zfs snapshot tank/quota@snap1
$ zfs set mountpoint=/mnt/tank/quota tank/quota
$ mkdir /mnt/tank/quota/dir # create an empty directory
$ mkfile 11M /mnt/tank/quota/11M
/mnt/tank/quota/11M: initialized 9830400 of 11534336 bytes: Disc quota exceeded
$ rmdir /mnt/tank/quota/dir # now unlink the empty directory
rmdir: directory "/mnt/tank/quota/dir": Disc quota exceeded
From user perspective, I would expect that ZFS is always able to remove files
and directories even when the quota is exceeded.
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Simon Klinkert <simon.klinkert@gmail.com>
7020 sdev_cleandir can loop forever
Note that the bulk of the upstream change is not applicable to FreeBSD
and the affected files are not even in the vendor area.
illumos/illumos-gate@45b174751545b1747515https://www.illumos.org/issues/7019
Currently zfsdev_ioctl, when confronted by a request with the FKIOCTL flag set,
skips all processing of secpolicy functions. This means that ZFS is not doing
any kind of verification of the credentials or access rights of the caller and
assuming that (as it is an in-kernel client) all such checks have already been
done.
This turns out to be quite a dangerous assumption, especially with respect to
sdev. In general I don't think it's particularly reasonable to offload this
enforcement of access rights onto other kernel subsystems when ZFS has some
particular local semantics in this area (delegated datasets etc) and does not
provide any kind of API to allow other subsystems to avoid code duplication
when doing it. ZFS should apply its normal access policy to requests from
within the kernel, and callers should take care to give it the correct
credentials and call it from the correct context in order to get the results
they need.
You can observe the currently unfortunate consequences of this bug in any non-
global zone that has access to /dev/zvol or any subset of it via sdev profiles.
In particular, a zone used to contain a KVM or similar which has a single zvol
passed through to it using a <device match= block in its zone XML.
Even though sdev makes something of an attempt to control for whether the
caller should have access to nodes in /dev/zvol, it doesn't do this correctly,
or really at all in the lookup call path. So, if we have a zone that's been
given access to any part of /dev/zvol, it can simply look up the full path to
any other zvol on the entire system, and the node will appear and be able to be
used.
https://www.illumos.org/issues/7020
sdev_cleandir can currently hang forever when it encounters a child node that
is busy, or when it is given a matching expr and the first entry on the list
does not match.
The previous code (circa 2013) iterated over the children of the node using a
for loop with SDEV_NEXT_ENTRY, which was then changed to a while ((dv =
SDEV_FIRST_ENTRY(ddv)) { loop. Unfortunately the continue statements that
previously made it skip over an entry were left as they were, which now result
in an infinite busy-loop in the kernel.
You can trigger this pretty easily by setting up an sdev exclude rule in
zonecfg.
Diagnosis: look for a runaway process consuming 100% CPU in kernel -- they have
a distinctive stack:
# mdb -k
> 0t1234::pid2proc | ::walk thread | ::findstack -v
[ ffffd001efcd3310 _resume_from_idle+0x112() ]
ffffd001efcd3360 apix_hilevel_intr_epilog+0xc1(ffffd001efcd33d0, 0)
ffffd001efcd33c0 apix_do_interrupt+0x34a(ffffd001efcd33d0, 0)
ffffd001efcd33d0 _sys_rtt_ints_disabled+8()
ffffd001efcd3550 rw_enter+0x58()
ffffd001efcd35e0 sdev_cleandir+0x60(ffffd0631b6d75d8, 0, 0)
ffffd001efcd3630 devzvol_prunedir+0xec(ffffd0631b6d76e8)
ffffd001efcd36d0 devzvol_readdir+0x150(ffffd06333250e00, ffffd001efcd3790,
ffffd062dc990e18, ffffd001efcd37dc, 0, 0)
ffffd001efcd3760 fop_readdir+0x6b(ffffd06333250e00, ffffd001efcd3790,
ffffd062dc990e18, ffffd001efcd37dc, 0, 0)
ffffd001efcd3830 walk_dir+0xee(ffffd06333250e00, ffffd0669e4483c8,
fffffffffbbdf410)
ffffd001efcd3850 prof_make_names_walk+0x2e(ffffd0669e4483c8,
fffffffffbbdf410)
ffffd001efcd38b0 prof_make_names+0xfc(ffffd0669e4483c8)
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Alex Wilson <alex.wilson@joyent.com>
illumos/illumos-gate@63364b0ee263364b0ee2https://www.illumos.org/issues/6922
ZFS does not do a config_sync after removing an aux (spare, log, or cache)
device. AFAICT this isn't being done because it is slow and was deemed
unnecessary. However, it should be such a rare operation that speed doesn't
matter, and not doing it results in two problems:
1) It is theoretically possible to remove an aux device from one pool and
attach it to another, then lose power. When power is restored, both pools would
think that they own the aux device.
2) Removal of the aux device doesn't send any useful sysevents to userland.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Alan Somers <asomers@gmail.com>
illumos/illumos-gate@1825bc56e51825bc56e5https://www.illumos.org/issues/6878
Summary of changes:
* Replace generic "scan done" message with "scan aborted, restarting",
"scan cancelled", or "scan done"
* Log number of errors using spa_get_errlog_size
* Refactor scan restarting check into static function
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Nav Ravindranath <nav@delphix.com>
illumos/illumos-gate@8df0bcf0df8df0bcf0dfhttps://www.illumos.org/issues/6513
If a ZFS object contains a hole at level one, and then a data block is created
at level 0 underneath that l1 block, l0 holes will be created. However, these
l0 holes do not have the birth time property set; as a result, incremental
sends will not send those holes.
Fix is to modify the dbuf_read code to fill in birth time data.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Paul Dagnelie <pcd@delphix.com>
illumos/illumos-gate@0d8fa8f8eb0d8fa8f8ebhttps://www.illumos.org/issues/6902
pjd has authored and commited a patch in Jan 21, 2012 that substanially speeds
up zfs snapshot listing if requesting only the name property and sorting by
name.
In this special case, the snapshot properties do not need to be loaded. This
code has been adopted by zfsonlinux on May 29, 2012.
Commit message from pjd:
Dramatically optimize listing snapshots when user requests only
snapshot
names and wants to sort them by name, ie. when executes:
1. zfs list -t snapshot -o name -s name
Because only name is needed we don't have to read all snapshot
properties.
Below you can find how long does it take to list 34509 snapshots from
a single
disk pool before and after this change with cold and warm cache:
before:
1. time zfs list -t snapshot -o name -s name > /dev/null
cold cache: 525s
warm cache: 218s
after:
1. time zfs list -t snapshot -o name -s name > /dev/null
cold cache: 1.7s
warm cache: 1.1s
References:
http://svnweb.freebsd.org/base?view=revision&revision=230438https://github.com/freebsd/freebsd/commit/8e3e9863https://github.com/zfsonlinux/zfs/commit/0cee2406
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pawel Dawidek <pjd@freebsd.org>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Martin Matuska <martin@matuska.org>
illumos/illumos-gate@c971037baac971037baahttps://www.illumos.org/issues/6876
Calling dsl_dataset_name on a dataset with a 256 byte buffer is asking for
trouble. We should check every dataset on import, using a 1024 byte buffer and
checking each time to see if the dataset's new name is longer than 256 bytes.
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Paul Dagnelie <pcd@delphix.com>
illumos/illumos-gate@11ceac77ea11ceac77eahttps://www.illumos.org/issues/6844
dnode_next_offset is used in a variety of places to iterate over the holes or
allocated blocks in a dnode. It operates under the premise that it can iterate
over the blockpointers of a dnode in open context while holding only the
dn_struct_rwlock as reader. Unfortunately, this premise does not hold.
When we create the zio for a dbuf, we pass in the actual block pointer in the
indirect block above that dbuf. When we later zero the bp in
zio_write_compress, we are directly modifying the bp. The state of the bp is
now inconsistent from the perspective of dnode_next_offset: the bp will appear
to be a hole until zio_dva_allocate finally finishes filling it in. In the
meantime, dnode_next_offset can detect a hole in the dnode when none exists.
I was able to experimentally demonstrate this behavior with the following
setup:
1. Create a file with 1 million dbufs.
2. Create a thread that randomly dirties L2 blocks by writing to the first L0
block under them.
3. Observe dnode_next_offset, waiting for it to skip over a hole in the middle
of a file.
4. Do dnode_next_offset in a loop until we skip over such a non-existent hole.
The fix is to ensure that it is valid to iterate over the indirect blocks in a
dnode while holding the dn_struct_rwlock by passing the zio a copy of the BP
and updating the actual BP in dbuf_write_ready while holding the lock.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Alex Reece <alex@delphix.com>