illumos/illumos-gate@93c618e0f4https://www.illumos.org/issues/8930:
We normally remove an unlinked node when its last user goes away and the
node becomes inactive. However, we should not do that if the filesystem
is mounted read-only including the case where it has its readonly
property set. The node will remain on the unlinked queue, so it will
not be leaked.
One particular scenario is when we receive an incremental stream into a
mounted read-only filesystem and that stream contains an unlinked file
(still on the unlinked queue). If that file is opened before the
receive and some time later after the receive it becomes inactive we
would remove it and, thus, modify the read-only filesystem. As a
result, the filesystem would diverge from its source and further
incremental receives would not be possible (without forcing a rollback).
Another related scenario, that may or may not be possible depending on an
OS / VFS policy, is when an open file is unlinked, then the filesystem is
remounted read-only, and then the file is closed.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Andriy Gapon <avg@FreeBSD.org>
illumos/illumos-gate@94ddd0900ahttps://www.illumos.org/issues/8909:
There's a race condition that exists if `zil_free_lwb` races with either
`zil_commit_waiter_timeout` and/or `zil_lwb_flush_vdevs_done`.
Here's an example panic due to this bug:
> ::status
debugging crash dump vmcore.0 (64-bit) from ip-10-110-205-40
operating system: 5.11 dlpx-5.2.2.0_2017-12-04-17-28-32b6ba51fb (i86pc)
image uuid: 4af0edfb-e58e-6ed8-cafc-d3e9167c7513
panic message:
BAD TRAP: type=e (#pf Page fault) rp=ffffff0010555970 addr=60 occurred in module "zfs" due to a NULL pointer dereference
dump content: kernel pages only
> $c
zio_shrink+0x12()
zil_lwb_write_issue+0x30d(ffffff03dcd15cc0, ffffff03e0730e20)
zil_commit_waiter_timeout+0xa2(ffffff03dcd15cc0, ffffff03d97ffcf8)
zil_commit_waiter+0xf3(ffffff03dcd15cc0, ffffff03d97ffcf8)
zil_commit+0x80(ffffff03dcd15cc0, 9a9)
zfs_write+0xc34(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
fop_write+0x5b(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
write+0x250(42, fffffd7ff4832000, 2000)
sys_syscall+0x177()
If there's an outstanding lwb that's in `zil_commit_waiter_timeout`
waiting to timeout, waiting on it's waiter's CV, we must be sure not to
call `zil_free_lwb`. If we end up calling `zil_free_lwb`, then that LWB
may be freed and can result in a use-after-free situation where the
stale lwb pointer stored in the `zil_commit_waiter_t` structure of the
thread waiting on the waiter's CV is used.
A similar situation can occur if an lwb is issued to disk, and thus in
the `LWB_STATE_ISSUED` state, and `zil_free_lwb` is called while the
disk is servicing that lwb. In this situation, the lwb will be freed by
`zil_free_lwb`, which will result in a use-after-free situation when the
lwb's zio completes, and `zil_lwb_flush_vdevs_done` is called.
This race condition is prevented in `zil_close` by calling `zil_commit`
before `zil_free_lwb` is called, which will ensure all outstanding (i.e.
all lwb's in the `LWB_STATE_OPEN` and/or `LWB_STATE_ISSUED` states)
reach the `LWB_STATE_DONE` state before the lwb's are freed
(`zil_commit` will not return untill all the lwb's are
`LWB_STATE_DONE`).
Further, this race condition is prevented in `zil_sync` by only calling
`zil_free_lwb` for lwb's that do not have their `lwb_buf` pointer set.
All lwb's not in the `LWB_STATE_DONE` state will have a non-null value
for this pointer; the pointer is only cleared in
`zil_lwb_flush_vdevs_done`, at which point the lwb's state will be
changed to `LWB_STATE_DONE`.
This race is present in `zil_suspend`, leading to this bug.
At first glance, it would appear as though this would not be true
because `zil_suspend` will call `zil_commit`, just like `zil_close`, but
the problem is that `zil_suspend` will set the zilog's `zl_suspend`
field prior to calling `zil_commit`. Further, in `zil_commit`, if
`zl_suspend` is set, `zil_commit` will take a special branch of logic
and use `txg_wait_synced` instead of performing the normal `zil_commit`
logic.
This call to `txg_wait_synced` might be good enough for the data to
reach disk safely before it returns, but it does not ensure that all
outstanding lwb's reach the `LWB_STATE_DONE` state before it returns.
This is because, if there's an lwb "stuck" in
`zil_commit_waiter_timeout`, waiting for it's lwb to timeout, it will
maintain a non-null value for it's `lwb_buf` field and thus `zil_sync`
will not free that lwb. Thus, even though the lwb's data is already on
disk, the lwb will be left lingering, waiting on the CV, and will
eventually timeout and be issued to disk even though the write is
unnesseary.
So, after `zil_commit` is called from `zil_suspend`, we incorrectly
assume that there are not outstanding lwb's, and proceed to free all
lwb's found on the zilog's lwb list. As a result, we free the lwb that
will later be used `zil_commit_waiter_timeout`.
Reviewed by: John Kennedy <jwk404@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>
illumos/illumos-gate@a3b2868063https://www.illumos.org/issues/8677
We want to be able to run channel programs outside of synching context.
This would greatly improve performance of channel program that just gather
information, as we won't have to wait for synching context anymore.
This feature should introduce the following:
- A new command line flag in "zfs program" to specify our intention to
run in open context.
- A new flag/option within the channel program ioctl which selects the
context.
- Appropriate error handling whenever we try a channel program in
open-context that contains zfs.sync* expressions.
- Documentation for the new feature in the manual pages.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
illumos/illumos-gate@27295216542729521654https://www.illumos.org/issues/7531
I found that some buffers that could be L2ARC eligible are not flagged
such, leading to some performance impact. As a test I ran the same IO
workload 10 times in a raw. It is a metadata only workload (files
listing). l2arc_noprefetch=0.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: benrubson <ben.rubson@gmail.com>
MFC after: 8 days
illumos/illumos-gate@3f7978d02b3f7978d02bhttps://www.illumos.org/issues/8081
zdb(8) is full of minor problems that generate compiler warnings. On FreeBSD,
which uses -WError, the only way to build it is to disable all compiler
warnings. This makes it much harder to detect newly introduced bugs. We should
cleanup all the warnings.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Alan Somers <asomers@gmail.com>
illumos/illumos-gate@42b141117242b1411172https://www.illumos.org/issues/8648
I'm opening this bug to track integration of the following ZFS on Linux
commit into illumos:
commit f763c3d1df
Author: LOLi <loli10K@users.noreply.github.com>
Date: Mon Aug 21 17:59:48 2017 +0200
Fix range locking in ZIL commit codepath
Since OpenZFS 7578 (1b7c1e5) if we have a ZVOL with logbias=throughput
we will force WR_INDIRECT itxs in zvol_log_write() setting itx->itx_lr
offset and length to the offset and length of the BIO from
zvol_write()->zvol_log_write(): these offset and length are later used
to take a range lock in zillog->zl_get_data function: zvol_get_data().
Now suppose we have a ZVOL with blocksize=8K and push 4K writes to
offset 0: we will only be range-locking 0-4096. This means the
ASSERTion we make in dbuf_unoverride() is no longer valid because now
dmu_sync() is called from zilog's get_data functions holding a partial
lock on the dbuf.
Fix this by taking a range lock on the whole block in zvol_get_data().
Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Alexander Motin <mav@FreeBSD.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: LOLi <loli10K@users.noreply.github.com>
illumos/illumos-gate@bd9d3f9046bd9d3f9046https://www.illumos.org/issues/8661
The "zil-cw1" dtrace probe was previously removed in 8558, and the "zil-cw2"
probe should have been removed in that patch as well. Unfortunately, the "zil-
cw2" was not removed in 8558, so this bug is to track it's removal.
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>
illumos/illumos-gate@2840dce1a02840dce1a0https://www.illumos.org/issues/8600
ZFS channel programs should be able to create snapshots.
In addition to the base snapshot functionality, this will likely entail adding
extra logic to handle edge cases which were formerly not possible, such as
creating then destroying a snapshot in the same transaction sync.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Chris Williamson <chris.williamson@delphix.com>
illumos/illumos-gate@000cce6b6f000cce6b6fhttps://www.illumos.org/issues/8592
ZFS channel programs should be able to perform a rollback. This logic will
probably look pretty similar to zfs.sync.destroy().
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Brad Lewis <brad.lewis@delphix.com>
illumos/illumos-gate@ed992b0aaced992b0aachttps://www.illumos.org/issues/8604
Every time we want to unmount a snapshot (happens during snapshot deletion or
renaming) we unnecessarily iterate through all the mountpoints in the VFS layer
(see zfs_get_vfs).
Ideally we would just put a hold on the snapshot and access its respective VFS
resource directly.
gwilson_snap_unmount.svg - Flamegraph indicating the issue discussed (138 KB)
Serapheim Dimitropoulos, 2017-09-14 06:36 PM
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
illumos/illumos-gate@5f39f884e25f39f884e2https://www.illumos.org/issues/8605
zfs.exists() in channel programs doesn't return any result, and should have a
man page entry.
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Chris Williamson <chris.williamson@delphix.com>
illumos/illumos-gate@2bcb5458542bcb545854https://www.illumos.org/issues/8602
When I landed the fix for 8558, I incorrectly added the "dp_early_sync_tasks"
field to the "dsl_pool" structure. This field is used in DelphixOS, but not in
illumos. It was incorrectly pulled into illumos, so this bug is to remove it
from the structure.
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>
illumos/illumos-gate@554675eee7554675eee7https://www.illumos.org/issues/8473
Scrubbing is supposed to detect and repair all errors in the pool. However, it
wrongly ignores active spare devices. The problem can easily be reproduced in
OpenZFS at git rev 0ef125d with these commands:
truncate -s 64m /tmp/a /tmp/b /tmp/c
sudo zpool create testpool mirror /tmp/a /tmp/b spare /tmp/c
sudo zpool replace testpool /tmp/a /tmp/c
/bin/dd if=/dev/zero bs=1024k count=63 oseek=1 conv=notrunc of=/tmp/c
sync
sudo zpool scrub testpool
zpool status testpool # Will show 0 errors, which is wrong
sudo zpool offline testpool /tmp/a
sudo zpool scrub testpool
zpool status testpool # Will show errors on /tmp/c, which should've already been fixed
FreeBSD head is partially affected: the first scrub will detect some errors,
but the second scrub will detect more.
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Alan Somers <asomers@gmail.com>
illumos/illumos-gate@1271e4b10d1271e4b10dhttps://www.illumos.org/issues/8585
The current implementation of zil_commit() can introduce significant
latency, beyond what is inherent due to the latency of the underlying
storage. The additional latency comes from two main problems:
1. When there's outstanding ZIL blocks being written (i.e. there's
already a "writer thread" in progress), then any new calls to
zil_commit() will block waiting for the currently oustanding ZIL
blocks to complete. The blocks written for each "writer thread" is
coined a "batch", and there can only ever be a single "batch" being
written at a time. When a batch is being written, any new ZIL
transactions will have to wait for the next batch to be written,
which won't occur until the current batch finishes.
As a result, the underlying storage may not be used as efficiently
as possible. While "new" threads enter zil_commit() and are blocked
waiting for the next batch, it's possible that the underlying
storage isn't fully utilized by the current batch of ZIL blocks. In
that case, it'd be better to allow these new threads to generate
(and issue) a new ZIL block, such that it could be serviced by the
underlying storage concurrently with the other ZIL blocks that are
being serviced.
2. Any call to zil_commit() must wait for all ZIL blocks in its "batch"
to complete, prior to zil_commit() returning. The size of any given
batch is proportional to the number of ZIL transaction in the queue
at the time that the batch starts processing the queue; which
doesn't occur until the previous batch completes. Thus, if there's a
lot of transactions in the queue, the batch could be composed of
many ZIL blocks, and each call to zil_commit() will have to wait for
all of these writes to complete (even if the thread calling
zil_commit() only cared about one of the transactions in the batch).
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>
illumos/illumos-gate@e6ab4525d1e6ab4525d1https://www.illumos.org/issues/8590
In dsl_destroy_snapshots_nvl(), "snaps_normalized" is not freed after it is
added to "arg".
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@916c8d8811916c8d8811https://www.illumos.org/issues/8552
In the LUA interpreter used by "zfs program", the lua format() function
accidentally includes support for '%f' and friends, which can cause compilation
problems when building on platforms that don't support floating-point math in
the kernel (e.g. sparc). Support for '%f' friends (%f %e %E %g %G) should be
removed, since there's no way to supply a floating-point value anyway (all
numbers in ZFS LUA are int64_t's).
Reviewed by: Yuri Pankov <yuripv@gmx.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@dfc115332cdfc115332chttps://www.illumos.org/issues/7431
ZFS channel programs (ZCP) adds support for performing compound ZFS
administrative actions via Lua scripts in a sandboxed environment (with time
and memory limits).
This initial commit includes both base support for running ZCP scripts, and a
small initial library of API calls which support getting properties and
listing, destroying, and promoting datasets.
Testing: in addition to the included unit tests, channel programs have been in
use at Delphix for several months for batch destroying filesystems. The
dsl_destroy_snaps_nvl() call has also been replaced with
For reference, the new zfs-program manpage is included below.
ZFS-PROGRAM(1M) 1M ZFS-PROGRAM(1M)
NAME
zfs program – executes ZFS channel programs
SYNOPSIS
zfs program [-t timeout] [-m memory-limit] pool script
DESCRIPTION
The ZFS channel program interface allows ZFS administrative operations to
be run programmatically as a Lua script. The entire script is executed
atomically, with no other administrative operations taking effect
concurrently. A library of ZFS calls is made available to channel program
scripts. Channel programs may only be run with root privileges.
A modified version of the Lua 5.2 interpreter is used to run channel
program scripts. The Lua 5.2 manual can be found at:
http://www.lua.org/manual/5.2/
...
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Chris Williamson <chris.williamson@delphix.com>
illumos/illumos-gate@9d26e4fc029d26e4fc02https://www.illumos.org/issues/5115
Intel's NICs based on the XL710 chipset exist 1 . There exist drivers for Linux
and FreeBSD 2 .
It does not appear to be derived from the ixgbe driver source, so it would
probably require porting the i40e source from FBSD to Illumos, unless a driver
exists for a GLDv3-like platform under CDDL or similar license (none are known
to currently be available or being worked on).
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Robert Mustacchi <rm@joyent.com>
Note: this change has nothing to do with FreeBSD.
illumos/illumos-gate@e2fc3408efe2fc3408efhttps://www.illumos.org/issues/8376
The logic for generating and maintaining the cached v_path value on vnodes
could stand to be improved. If vnodes were purely ephemeral, then freshly
calculating v_path at the time of lookup() would result in correct values (at a
performance cost). When they persist, either as referenced by other structures
(such as open files, process cwd, dnlc entries, etc), the opportunity for the
v_path to become stale arises. This is exacerbated by the current behavior
that, when v_path is found to be invalid (during a vnodetopath operation) will
strive to recalculate it, but not preserve the result. The overall situation
leads to both performance and correctness (due to lack of results) problems
relating to v_path.
This has been addressed in SmartOS through a series of changes. Firstly, to do
proper invalidation of v_path when it's found to be stale:
- OS-3891 stale v_path slows vfs lookups
OS-3891 revealed that some logic made assumptions about v_path never
transitioning from non-NULL to NULL. It was addressed here:
- OS-4317 v_path accesses can race
While the pathological stale v_path behavior had been addressed, there are
still cases where the absence of valid v_path information was causing problems.
The largest patch in this series addressed it by performing v_path checking and
updates during vnode lookups/updates, when it is most convenient:
- OS-5167 cached v_path should be kept fresh
Two smaller updates are included too, to prevent erroneous behavior introduced
by the prior changes:
- OS-5846 procfs should follow VFS rules
- OS-6134 vn_reinit balks on zeroed vnodes
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Patrick Mooney <pmooney@pfmooney.com>
illumos/illumos-gate@37e84ab74e37e84ab74ehttps://www.illumos.org/issues/8569
C [C99] has peculiar rules for inline functions that are different from the
C++ rules. Unlike C++ where inline is "fire and forget", in C a programmer
must pay attention to the function's storage class / visibility. The main
problem is with the case where a compiler decides to not inline a call to the
function declared as inline.
Some relevant links:
- http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka15831.html
- http://www.drdobbs.com/the-new-c-inline-functions/184401540
The summary is that either the inline functions should be declared 'static
inline' or one of the compilation units (.c files) must provide a callable
externally visible function definition. In the former case, the compiler would
automatically create a local non-inlined function instance in every compilation
unit where it's needed. In the latter case the single external definition is
used to satisfy any non-inlined calls in all compilation units. As things
stand right now, we can get an undefined reference error under certain
combinations of compilers and compiler options. For example, this is what I
get on FreeBSD when compiling with clang 4.0.0 and -O1:
In function `abd_free': /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/abd.c:385:
undefined reference to `abd_is_linear'
So, there are two alternatives. Either to qualify each inline function in
abd.h with static storage class. Or to add declarations like the following to
abd.c: extern inline boolean_t abd_is_linear(abd_t *abd); Both work. I am not
sure which one would be more consistent with the illumos development rules.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>
illumos/illumos-gate@216d7723a1216d7723a1https://www.illumos.org/issues/8558
On a system with more than 80K ZFS filesystems, we've seen cases where
lwp_create() will start to fail by returning EAGAIN. The problem being,
for each of those 80K ZFS filesystems, a taskq will be created for each
dataset as part of the ZIL for each dataset.
For each of these taskq's, a kernel thread will be created which results
in 24KB being allocated for each thread. With enough of these 24KB
allocations, we eventually exhaust the memory region set aside for these
allocations. Currently, segkpsize is set to a value of 2GB, which means
we can only support about 80K filesystems; 2GB / 24KB = ~80K.
The lwp_create() failure comes into play due to the fact that LWP
creation also allocates 24KB from this same region of memory. Thus, if
we've exhausted this region of memory due to the number of ZIL taskq's,
there won't be any memory avaible to allow the call to lwp_create() to
succeed.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>
illumos/illumos-gate@1702cce7511702cce751https://www.illumos.org/issues/8414
This issue tracks the port of scrub pause from ZoL: https://github.com/zfsonlinux/zfs/pull/6167
Currently, there is no way to pause a scrub. Pausing may be useful when
the pool is busy with other I/O to preserve bandwidth.
Description
This patch adds the ability to pause and resume scrubbing. This is achieved
by maintaining a persistent on-disk scrub state. While the state is 'paused'
we do not scrub any more blocks. We do however perform regular scan
housekeeping such as freeing async destroyed and deadlist blocks while paused.
If you're testing this change, you probably want to include the patch from #6164
Motivation and Context
Scrub pausing can be an I/O intensive operation and people have been asking
for the ability to pause a scrub for a while. This allows one to preserve scrub
progress while freeing up bandwidth for other I/O.
How Has This Been Tested?
Unit testing and zfs-tests. to the pool. This patch will also include the
patch from https://github.com/zfsonlinux/zfs/ pull/6164 In certain cases
(dsl_scan_sync() is one), we may end up calling
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Alek Pinchuk <apinchuk@datto.com>
illumos/illumos-gate@b11fe8c014b11fe8c014https://www.illumos.org/issues/8508
Mounting a zpool on a 32-bit system triggers a panic in spa_history_log_version
() due to a type format mismatch for ZPL_VERSION. ZPL_VERSION is a unsigned
long long, but the format expects an integer. On 64-bit platforms this may not
be an issue due to word size and alignment. On 32-bit platforms a word size is
half that of a long long, causing the second word of the long long to be seen
as the string pointer for utsname.nodename.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Justin Hibbits <chmeeedalf@gmail.com>
illumos/illumos-gate@d28671a3b0d28671a3b0https://www.illumos.org/issues/8373
The code that writes ZIL blocks uses dmu_tx_assign(TXG_WAIT) to assign a
transaction to a transaction group.
That seems to be logically incorrect as writing of the ZIL block does not
introduce any new dirty data.
Also, when there is a lot of dirty data, the call can introduce significant
delays into the ZIL commit path,
thus affecting all synchronous writes. Additionally, ARC throttling may affect
the ZIL writing.
We probably need a new mechanism similar to dmu_tx_create_assigned to assign
ZIL transactions.
(Ab)using TXG_WAITED does not seem to be sufficient.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>
illumos/illumos-gate@79c2b812ee79c2b812eehttps://www.illumos.org/issues/8491
The zpool checkpoint feature in DxOS added a new field in the uberblock.
The Multi-Modifier Protection Pull Request from ZoL adds two new fields in the
uberblock (Reference: https://github.com/zfsonlinux/zfs/pull/6279).
As these two changes come from two different sources and once upstreamed and
deployed will introduce an incompatibility with each other we want
to upstream a change that will reserve the padding for both of them so
integration goes smoothly and everyone gets both features.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Olaf Faaland <faaland1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
illumos/illumos-gate@267ae6c3a8267ae6c3a8https://www.illumos.org/issues/7915
l2arc_evict() is strictly serialized with respect to l2arc_write_buffers() and
l2arc_write_done().
Normally, l2arc_evict() and l2arc_write_buffers() are called from the same
thread, so they can not be concurrent.
Also, l2arc_write_buffers() uses zio_wait() on the parent zio of all cache zio-
s.
That ensures that l2arc_write_done() is completed before l2arc_write_buffers()
returns.
Finally, if a cache device is removed, then l2arc_evict() is called under
SCL_ALL in the exclusive mode.
That ensures that it can not be concurrent with the normal L2ARC accesses to
the device (including writing and evicting buffers).
Given the above, some checks and actions in l2arc_evict() do not make sense.
For instance, it must never encounter the write head header let alone remove it
from the buffer list.
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Andriy Gapon <avg@FreeBSD.org>
illumos/illumos-gate@dcb6872c56dcb6872c56https://www.illumos.org/issues/8126
The sync thread is concurrently modifying dn_phys->dn_nlevels
while dbuf_dirty() is trying to assert something about it, without
holding the necessary lock. We need to move this assertion further down
in the function, after we have acquired the dn_struct_rwlock.
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@4923c69fdd4923c69fddhttps://www.illumos.org/issues/8067
Add an option to zdb to print a literal embedded block pointer supplied on the
command line:
zdb -E [-A] word0:word1:...:word15
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@9b195260e29b195260e2https://www.illumos.org/issues/8426
abd_copy_from_buf and abd_cmp_buf do not modify their void *buf arguments, so
qualify them with const.
abd_copy_from_buf_off and abd_cmp_buf_off already had that type for the
corresponding arguments.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>
illumos/illumos-gate@77b171372e77b171372ehttps://www.illumos.org/issues/7600
At present, the kernel side code seems to blindly rollback to whatever happens
to be the latest snapshot at the time when the rollback task is processed.
The expected target's name should be passed to the kernel driver and the sync
task should validate that the target exists and that it is the latest snapshot
indeed.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>
illumos/illumos-gate@42418f9e7342418f9e73https://www.illumos.org/issues/8377
The problem is that when dsl_bookmark_destroy_check() is executed from open
context (the pre-check), it fills in dbda_success based on the existence of the
bookmark.
But the bookmark (or containing filesystem as in this case) can be destroyed
before we get to syncing context. When we re-run dsl_bookmark_destroy_check()
in syncing
context, it will not add the deleted bookmark to dbda_success, intending for
dsl_bookmark_destroy_sync() to not process it. But because the bookmark is
still in dbda_success
from the open-context call, we do try to destroy it.
The fix is that dsl_bookmark_destroy_check() should not modify dbda_success
when called from open context.
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@b7edcb9408b7edcb9408https://www.illumos.org/issues/8378
The problem is that zfs_get_data() supplies a stale zgd_bp to dmu_sync(), which
we then nopwrite against.
zfs_get_data() doesn't hold any DMU-related locks, so after it copies db_blkptr
to zgd_bp, dbuf_write_ready()
could change db_blkptr, and dbuf_write_done() could remove the dirty record.
dmu_sync() then sees the stale
BP and that the dbuf it not dirty, so it is eligible for nop-writing.
The fix is for dmu_sync() to copy db_blkptr to zgd_bp after acquiring the
db_mtx. We could still see a stale
db_blkptr, but if it is stale then the dirty record will still exist and thus
we won't attempt to nopwrite.
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@16a7e5ac1116a7e5ac11https://www.illumos.org/issues/7910
It seems that the change in issue #6950 resurrected the problem that was
earlier fixed by the change in issue #5219.
Please also see the following FreeBSD bug report: https://bugs.freebsd.org/
bugzilla/show_bug.cgi?id=216178
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>
illumos/illumos-gate@5e2a0747255e2a074725https://www.illumos.org/issues/8416
A C++ compiler fails to compile abd_is_linear(), which is an inline function
defined in abd.h, with the following error:
error: cannot initialize return object of type 'boolean_t' with an
rvalue of type 'bool'
That happens because a bool can not be converted to an enum in C++.
That's a problem because abd.h can be visible through other header files that a
C++ program that works with ZFS can include.
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>
illumos/illumos-gate@2889ec41c02889ec41c0https://www.illumos.org/issues/8311
Description:
There was a misunderstanding about the enforcement details of the "Read-only"
flag introduced for SMB/CIFS compatibility, way back in 2007 in the Sun PSARC
2007/315 case.
The original authors thought enforcement of the READONLY flag should work
similarly as the IMMUTABLE flag. Unfortunately, that enforcement is
incompatible with the expectations of Windows applications using this feature
through the SMB service. Applications assume (and the MS File System Algorithms
MS-FSA confirms they should) that an SMB client can:
(a) Open an SMB handle on a file with read/write access,
(b) Set the DOS attributes to include the READONLY flag,
(c) continue to have write access via that handle.
This access model is essentially the same as a Unix/POSIX application that
creates a file (with read/write access), uses fchmod() to change the file mode
to something not granting write access (i.e. 0444), and then continues to write
that file using the open handle it got before the mode change.
Currently, the SMB server works-around this problem in a way that will become
difficult to maintain as we implement support for SMB3 persistent handles, so
SMB depends on this fix.
I've written a test program that can be used to demonstrate this problem, and
added it to zfs-tests (tests/functional/acl/cifs/cifs_attr_004_pos).
It currently fails, but will pass when this problem fixed.
Steps to Reproduce:
Run the test program on a ZFS file system.
Expected Results:
Pass
Actual Results:
Fail.
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Approved by: Prakash Surya <prakash.surya@delphix.com>
Author: Gordon Ross <gwr@nexenta.com>
illumos/illumos-gate@403a8da73c403a8da73chttps://www.illumos.org/issues/5220
There are disk devices that have logical sector size larger than 512B, for
example 4KB. That is, their physical sector size is larger than 512B and they
do not provide emulation for 512B sector sizes. For such devices both a data
offset and a data size must be properly aligned. L2ARC should arrange that
because it uses physical I/O.
zio_vdev_io_start() performs a necessary transformation if io_size is not
aligned to vdev_ashift, but that is done only for logical I/O. Something
similar should be done in L2ARC code.
* a temporary write buffer should be allocated if the original buffer is
not going to be compressed and its size is not aligned
* size of a temporary compression buffer should be ashift aligned
* for the reads, if a size of a target buffer is not sufficiently large and
it is not aligned then a temporary read buffer should be allocated
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>
illumos/illumos-gate@a4b8c9aa65a4b8c9aa65https://www.illumos.org/issues/8264
Oddly there is a lzc_clone function, but no lzc_promote function.
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@kebe.com>
Approved by: Dan McDonald <danmcd@kebe.com>
Author: Andrew Stormont <astormont@racktopsystems.com>
illumos/illumos-gate@0255edcc850255edcc85https://www.illumos.org/issues/8056
The send size estimate for a zvol can be too low, if the size of the record
headers (dmu_replay_record_t's) is a significant portion of the size.
This is typically the case when the data is highly compressible, especially
with embedded blocks.
The problem is that dmu_adjust_send_estimate_for_indirects() assumes that
blocks are the size of the "recordsize" property (128KB).
However, for zvols, the blocks are the size of the "volblocksize" property
(8KB). Therefore, we estimate that there will be 16x less record headers than
there really will be.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Paul Dagnelie <pcd@delphix.com>
illumos/illumos-gate@dbfd9f9300dbfd9f9300https://www.illumos.org/issues/8156
dbuf_evict_notify() holds the dbuf_evict_lock while checking if it should do
the eviction itself (because the evict thread is not able to keep up).
This can result in massive lock contention.
It isn't necessary to hold the lock, because if we make the wrong choice
occasionally, nothing bad will happen.
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@5b062782535b06278253https://www.illumos.org/issues/8005
RAID-Z requires that space be allocated in multiples of P+1 sectors,
because this is the minimum size block that can have the required amount
of parity. Thus blocks on RAIDZ1 must be allocated in a multiple of 2
sectors; on RAIDZ2 multiple of 3; and on RAIDZ3 multiple of 4. A sector
is a unit of 2^ashift bytes, typically 512B or 4KB.
To satisfy this constraint, the allocation size is rounded up to the
proper multiple, resulting in up to 3 "pad sectors" at the end of some
blocks. The contents of these pad sectors are not used, so we do not
need to read or write these sectors. However, some storage hardware
performs much worse (around 1/2 as fast) on mostly-contiguous writes
when there are small gaps of non-overwritten data between the writes.
Therefore, ZFS creates "optional" zio's when writing RAID-Z blocks that
include pad sectors. If writing a pad sector will fill the gap between
two (required) writes, we will issue the optional zio, thus doubling
performance. The gap-filling performance improvement was introduced in
July 2009.
Writing the optional zio is done by the io aggregation code in
vdev_queue.c. The problem is that it is also subject to the limit on
the size of aggregate writes, zfs_vdev_aggregation_limit, which is by
default 128KB. For a given block, if the amount of data plus padding
written to a leaf device exceeds zfs_vdev_aggregation_limit, the
optional zio will not be written, resulting in a ~2x performance
degradation.
The problem occurs only for certain values of ashift, compressed block
size, and RAID-Z configuration (number of parity and data disks). It
cannot occur with the default recordsize=128KB. If compression is
enabled, all configurations with recordsize=1MB or larger will be
impacted to some degree.
The problem notably occurs with recordsize=1MB, compression=off, with 10
disks in a RAIDZ2 or RAIDZ3 group (with 512B or 4KB sectors). Therefore
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@adaec86ad2adaec86ad2https://www.illumos.org/issues/8155
When writing pre-compressed buffers, arc_write() requires that the compression
algorithm used to compress the buffer matches the compression algorithm
requested by the zio_prop_t, which is set by dmu_write_policy().
This makes dmu_write_policy() and its callers a bit more complicated.
We can simplify this by making arc_write() trust the caller to supply the type
of pre-compressed buffer that it wants to write, and override the compression
setting in the zio_prop_t.
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@5f10ef697f5f10ef697fhttps://www.illumos.org/issues/6396
LVM = SVM = Solaris Volume Manager
dead code and not using with ZFS based platform.
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Yuri Pankov <yuri.pankov@nexenta.com>
illumos/illumos-gate@c5ee46810fc5ee46810fhttps://www.illumos.org/issues/7578
After some ZIL changes 6 years ago zil_slog_limit got partially broken
due to zl_itx_list_sz not updated when async itx'es upgraded to sync.
Actually because of other changes about that time zl_itx_list_sz is not
really required to implement the functionality, so this patch removes
some unneeded broken code and variables.
Original idea of zil_slog_limit was to reduce chance of SLOG abuse by
single heavy logger, that increased latency for other (more latency critical)
loggers, by pushing heavy log out into the main pool instead of SLOG. Beside
huge latency increase for heavy writers, this implementation caused double
write of all data, since the log records were explicitly prepared for SLOG.
Since we now have I/O scheduler, I've found it can be much more efficient
to reduce priority of heavy logger SLOG writes from ZIO_PRIORITY_SYNC_WRITE
to ZIO_PRIORITY_ASYNC_WRITE, while still leave them on SLOG.
Existing ZIL implementation had problem with space efficiency when it
has to write large chunks of data into log blocks of limited size. In some
cases efficiency stopped to almost as low as 50%. In case of ZIL stored on
spinning rust, that also reduced log write speed in half, since head had to
uselessly fly over allocated but not written areas. This change improves
the situation by offloading problematic operations from z*_log_write() to
zil_lwb_commit(), which knows real situation of log blocks allocation and
can split large requests into pieces much more efficiently. Also as side
effect it removes one of two data copy operations done by ZIL code WR_COPIED
case.
While there, untangle and unify code of z*_log_write() functions.
Also zfs_log_write() alike to zvol_log_write() can now handle writes crossing
block boundary, that may also improve efficiency if ZPL is made to do that.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Steven Hartland <steven.hartland@multiplay.co.uk>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Alexander Motin <mav@FreeBSD.org>
8100 8021 seems to cause random BAD TRAP: type=d (#gp General protection)
illumos/illumos-gate@770499e185770499e185https://www.illumos.org/issues/8021
The ARC buf data project (known simply as "ABD" since its genesis in the ZoL
community) changes the way the ARC allocates `b_pdata` memory from using linear
`void *` buffers to using scatter/gather lists of fixed-size 1KB chunks. This
improves ZFS's performance by helping to defragment the address space occupied
by the ARC, in particular for cases where compressed ARC is enabled. It could
also ease future work to allocate pages directly from `segkpm` for minimal-
overhead memory allocations, bypassing the `kmem` subsystem.
This is essentially the same change as the one which recently landed in ZFS on
Linux, although they made some platform-specific changes while adapting this
work to their codebase:
1. Implemented the equivalent of the `segkpm` suggestion for future work
mentioned above to bypass issues that they've had with the Linux kernel memory
allocator.
2. Changed the internal representation of the ABD's scatter/gather list so it
could be used to pass I/O directly into Linux block device drivers. (This
feature is not available in the illumos block device interface yet.)
https://www.illumos.org/issues/8100
My supermicro system is getting random BAD TRAP: type=d (#gp General
protection) at about the stage where ZFS filesystems are mounted - usually
console login prompt is already present but the services are still starting.
After backing out 8021, the boot is completed and no panics do occur.
Machine does dump, however savecore fails:
savecore: bad magic number baddcafe
I can get more data out with boot -k, if needed.
# psrinfo -vp
The physical processor has 4 cores and 8 virtual processors (0-7)
The core has 2 virtual processors (0 4)
The core has 2 virtual processors (1 5)
The core has 2 virtual processors (2 6)
The core has 2 virtual processors (3 7)
x86 (GenuineIntel 306C3 family 6 model 60 step 3 clock 3500 MHz)
Intel(r) Xeon(r) CPU E3-1246 v3 @ 3.50GHz
# prtconf -m
32657
$ zpool status
pool: rpool
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
c3t0d0 ONLINE 0 0 0
c3t1d0 ONLINE 0 0 0
Reviewed by: Matthew Ahrens mahrens@delphix.com
Reviewed by: George Wilson george.wilson@delphix.com
Reviewed by: Paul Dagnelie pcd@delphix.com
Reviewed by: John Kennedy john.kennedy@delphix.com
Reviewed by: Prakash Surya prakash.surya@delphix.com
Reviewed by: Prashanth Sreenivasa pks@delphix.com
Reviewed by: Pavel Zakharov pavel.zakharov@delphix.com
Reviewed by: Chris Williamson chris.williamson@delphix.com
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Dan Kimmel <dan.kimmel@delphix.com>