illumos/illumos-gate@d6e1c446d7d6e1c446d7https://www.illumos.org/issues/8857
I had an OS panic on one of our servers:
ffffff01809128c0 vpanic()
ffffff01809128e0 mutex_panic+0x58(fffffffffb94c904, ffffff597dde7f80)
ffffff0180912950 mutex_vector_enter+0x347(ffffff597dde7f80)
ffffff01809129b0 zio_remove_child+0x50(ffffff597dde7c58, ffffff32bd901ac0,
ffffff3373370908)
ffffff0180912a40 zio_done+0x390(ffffff32bd901ac0)
ffffff0180912a70 zio_execute+0x78(ffffff32bd901ac0)
ffffff0180912b30 taskq_thread+0x2d0(ffffff33bae44140)
ffffff0180912b40 thread_start+8()
It panicked here:
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/
zio.c#430
pio->io_lock is DEAD, thus a panic. Further analysis shows the "pio"
(parent zio of "cio") has already been destroyed.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Youzhong Yang <youzhong@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: George Wilson <george.wilson@delphix.com>
PR: 223803
Tested by: shiva.bhanujan@quorum.com
MFC after: 2 weeks
zfsctl_common_pathconf will report all the same variables that regular ZFS
volumes report. zfsctl_common_getacl will report an ACL equivalent to 555,
except that you can't read xattrs or edit attributes.
Fixes a bug where "ls .zfs" will occasionally print something like:
ls: .zfs/.: Operation not supported
PR: 225793
Reviewed by: avg
MFC after: 3 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D14365
global to per-domain state. Protect reservations with the free lock
from the domain that they belong to. Refactor to make vm domains more
of a first class object.
Reviewed by: markj, kib, gallatin
Tested by: pho
Sponsored by: Netflix, Dell/EMC Isilon
Differential Revision: https://reviews.freebsd.org/D14000
We did that in the case of success to prevent the use of stale cached
data, but it makes even less sense to keep the cached data when we fail.
Ideally, we should call vgone() on the vnode in the case of zfs_rezget
failure, but the current lock order prevents us from doing that.
The change also rearranges the order of unlinked check and the size
change check.
While there, add missing SET_ERROR in one of the error paths.
MFC after: 2 weeks
illumos/illumos-gate@5cb8d943bchttps://www.illumos.org/issues/8835:
Sequential reads not aligned to block size are not detected by ZFS
prefetcher as sequential, killing prefetch and severely hurting
performance. It is caused by dmu_zfetch() in case of misaligned
sequential accesses being called with overlap of one block.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Allan Jude <allanjude@freebsd.org>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Alexander Motin <mav@FreeBSD.org>
illumos/illumos-gate@4ae5f5f06chttps://www.illumos.org/issues/8652:
Clang and GCC prefer to use unsigned ints to store enums. With Clang, that
causes tautological comparison warnings when comparing a zfs_prop_t or
zpool_prop_t variable to the macro ZPROP_INVAL. It's likely that error
handling code is being silently removed as a result.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Alan Somers <asomers@gmail.com>
illumos/illumos-gate@301fd1d6f2
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Sean Eric Fagan <sef@ixsystems.com>
illumos/illumos-gate@01a059ee0chttps://www.illumos.org/issues/8856:
arc_cksum_is_equal() calls zio_push_transform() that requires abd_t*
(second arg), but a void* is passed.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Roman Strashkin <roman.strashkin@nexenta.com>
8930 zfs_zinactive: do not remove the node if the filesystem is readonly
illumos/illumos-gate@93c618e0f4https://www.illumos.org/issues/8930:
We normally remove an unlinked node when its last user goes away and the
node becomes inactive. However, we should not do that if the filesystem
is mounted read-only including the case where it has its readonly
property set. The node will remain on the unlinked queue, so it will
not be leaked.
One particular scenario is when we receive an incremental stream into a
mounted read-only filesystem and that stream contains an unlinked file
(still on the unlinked queue). If that file is opened before the
receive and some time later after the receive it becomes inactive we
would remove it and, thus, modify the read-only filesystem. As a
result, the filesystem would diverge from its source and further
incremental receives would not be possible (without forcing a rollback).
Another related scenario, that may or may not be possible depending on an
OS / VFS policy, is when an open file is unlinked, then the filesystem is
remounted read-only, and then the file is closed.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Andriy Gapon <avg@FreeBSD.org>
illumos/illumos-gate@94ddd0900ahttps://www.illumos.org/issues/8909:
There's a race condition that exists if `zil_free_lwb` races with either
`zil_commit_waiter_timeout` and/or `zil_lwb_flush_vdevs_done`.
Here's an example panic due to this bug:
> ::status
debugging crash dump vmcore.0 (64-bit) from ip-10-110-205-40
operating system: 5.11 dlpx-5.2.2.0_2017-12-04-17-28-32b6ba51fb (i86pc)
image uuid: 4af0edfb-e58e-6ed8-cafc-d3e9167c7513
panic message:
BAD TRAP: type=e (#pf Page fault) rp=ffffff0010555970 addr=60 occurred in mo
dule "zfs" due to a NULL pointer dereference
dump content: kernel pages only
> $c
zio_shrink+0x12()
zil_lwb_write_issue+0x30d(ffffff03dcd15cc0, ffffff03e0730e20)
zil_commit_waiter_timeout+0xa2(ffffff03dcd15cc0, ffffff03d97ffcf8)
zil_commit_waiter+0xf3(ffffff03dcd15cc0, ffffff03d97ffcf8)
zil_commit+0x80(ffffff03dcd15cc0, 9a9)
zfs_write+0xc34(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
fop_write+0x5b(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
write+0x250(42, fffffd7ff4832000, 2000)
sys_syscall+0x177()
If there's an outstanding lwb that's in `zil_commit_waiter_timeout`
waiting to timeout, waiting on it's waiter's CV, we must be sure not to
call `zil_free_lwb`. If we end up calling `zil_free_lwb`, then that LWB
may be freed and can result in a use-after-free situation where the
stale lwb pointer stored in the `zil_commit_waiter_t` structure of the
thread waiting on the waiter's CV is used.
A similar situation can occur if an lwb is issued to disk, and thus in
the `LWB_STATE_ISSUED` state, and `zil_free_lwb` is called while the
disk is servicing that lwb. In this situation, the lwb will be freed by
`zil_free_lwb`, which will result in a use-after-free situation when the
lwb's zio completes, and `zil_lwb_flush_vdevs_done` is called.
This race condition is prevented in `zil_close` by calling `zil_commit`
before `zil_free_lwb` is called, which will ensure all outstanding (i.e.
all lwb's in the `LWB_STATE_OPEN` and/or `LWB_STATE_ISSUED` states)
reach the `LWB_STATE_DONE` state before the lwb's are freed
(`zil_commit` will not return untill all the lwb's are
`LWB_STATE_DONE`).
Further, this race condition is prevented in `zil_sync` by only calling
`zil_free_lwb` for lwb's that do not have their `lwb_buf` pointer set.
All lwb's not in the `LWB_STATE_DONE` state will have a non-null value
for this pointer; the pointer is only cleared in
`zil_lwb_flush_vdevs_done`, at which point the lwb's state will be
changed to `LWB_STATE_DONE`.
This race is present in `zil_suspend`, leading to this bug.
At first glance, it would appear as though this would not be true
because `zil_suspend` will call `zil_commit`, just like `zil_close`, but
the problem is that `zil_suspend` will set the zilog's `zl_suspend`
field prior to calling `zil_commit`. Further, in `zil_commit`, if
`zl_suspend` is set, `zil_commit` will take a special branch of logic
and use `txg_wait_synced` instead of performing the normal `zil_commit`
logic.
This call to `txg_wait_synced` might be good enough for the data to
reach disk safely before it returns, but it does not ensure that all
outstanding lwb's reach the `LWB_STATE_DONE` state before it returns.
This is because, if there's an lwb "stuck" in
`zil_commit_waiter_timeout`, waiting for it's lwb to timeout, it will
maintain a non-null value for it's `lwb_buf` field and thus `zil_sync`
will not free that lwb. Thus, even though the lwb's data is already on
disk, the lwb will be left lingering, waiting on the CV, and will
eventually timeout and be issued to disk even though the write is
unnesseary.
So, after `zil_commit` is called from `zil_suspend`, we incorrectly
assume that there are not outstanding lwb's, and proceed to free all
lwb's found on the zilog's lwb list. As a result, we free the lwb that
will later be used `zil_commit_waiter_timeout`.
Reviewed by: John Kennedy <jwk404@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>
illumos/illumos-gate@cf07d3da99https://www.illumos.org/issues/8603:
To help make the ZIL's code more understandable, it was suggested that
the zilog_t's "zl_writer_lock" field should be renamed to "zl_issuer_lock".
Reviewed by: C Fraire <cfraire@me.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>
illumos/illumos-gate@a3b2868063https://www.illumos.org/issues/8677
We want to be able to run channel programs outside of synching context.
This would greatly improve performance of channel program that just gather
information, as we won't have to wait for synching context anymore.
This feature should introduce the following:
- A new command line flag in "zfs program" to specify our intention to
run in open context.
- A new flag/option within the channel program ioctl which selects the
context.
- Appropriate error handling whenever we try a channel program in
open-context that contains zfs.sync* expressions.
- Documentation for the new feature in the manual pages.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
Nowadays we do not pass zfs_cmd_t directly through the ioctl interface.
Instead a small zfs_iocparm_t object is passed and the command is
explicitly copied in and out. So, the check has become irrelevant.
MFC after: 3 weeks
Sponsored by: Panzura
These return the jail ID and jail name for the traced process,
respectively, and are analogous to "zonename" on Solaris/illumos.
"zonename" is now aliased to "jailname".
Also add some stress tests for the new variables.
Submitted by: Domagoj Stolfa <domagoj.stolfa@gmail.com>
Reviewed by: dteske (previous version)
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D13877
rather than kmem arena size to determine available memory.
Initialize the UMA limit to LONG_MAX to avoid spurious wakeups on boot before
the real limit is set.
PR: 224330 (partial), 224080
Reviewed by: markj, avg
Sponsored by: Netflix / Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D13494
r326987 enabled two #if 0'd-out EMLINK checks in zfs_link_create() for
link overflow. However, one of the checks (when the vnode adding a link
is a directory such as for mkdir) always returned even if the link did not
overflow. Change this to only return early if it needs to report an
EMLINK error.
Reported by: db, shurd
Sponsored by: Chelsio Communications
On the one hand, FIFOs should respect other variables not supported by
the fifofs vnode operation (such as _PC_NAME_MAX, _PC_LINK_MAX, etc.).
These values are fs-specific and must come from a fs-specific method.
On the other hand, filesystems that support FIFOs are required to
support _PC_PIPE_BUF on directory vnodes that can contain FIFOs.
Given this latter requirement, once the fs-specific VOP_PATHCONF
method supports _PC_PIPE_BUF for directories, it is also suitable for
FIFOs permitting a single VOP_PATHCONF method to be used for both
FIFOs and non-FIFOs.
To that end, retire all of the FIFO-specific pathconf methods from
filesystems and change FIFO-specific vnode operation switches to use
the existing fs-specific VOP_PATHCONF method. For fifofs, set it's
VOP_PATHCONF to VOP_PANIC since it should no longer be used.
While here, move _PC_PIPE_BUF handling out of vop_stdpathconf() so that
only filesystems supporting FIFOs will report a value. In addition,
only report a valid _PC_PIPE_BUF for directories and FIFOs.
Discussed with: bde
Reviewed by: kib (part of a larger patch)
MFC after: 1 month
Sponsored by: Chelsio Communications
Differential Revision: https://reviews.freebsd.org/D12572
Having all filesystems fall through to default values isn't always correct
and these values can vary for different filesystem implementations. Most
of these changes just use the existing default values with a few exceptions:
- Don't report CHOWN_RESTRICTED for ZFS since it doesn't do the exact
permissions check this claims for chown().
- Use NANDFS_NAME_LEN for NAME_MAX for nandfs.
- Don't report a LINK_MAX of 0 on smbfs. Now fail with EINVAL to
indicate hard links aren't supported.
Requested by: bde (though perhaps not this exact implementation)
Reviewed by: kib (earlier version)
MFC after: 1 month
Sponsored by: Chelsio Communications
- Define a ZFS_LINK_MAX as the ZFS version of LINK_MAX which is set to
UINT64_MAX to match the on-disk format.
- Enable the currently #if 0'd code to check for link overflows and
return EMLINK.
- Don't clamp the link count reported in stat() to LINK_MAX as that is
still the 16-bit limit, but report the full link counts. Also,
avoid possibly overflowing the reported link count to 0 when adjusting
the link count to account for ".snapshot".
- Update the LINK_MAX reported by pathconf() to report ZFS_LINK_MAX
rather than LINK_MAX (but clamped to LONG_MAX for 32-bit systems).
Reviewed by: avg (earlier version)
Sponsored by: Chelsio Communications
dtrace_gethrtime() may be called outside of probe context, and in
particular, from the DTRACEIOC_BUFSNAP handler.
Disable interrupts rather than using sched_pin() to help ensure that
we don't call any external functions when in probe context.
PR: 218452
MFC after: 1 week
Otherwise a poorly timed lowmem event may attempt to acquire a destroyed
lock. Unregister the handler before destroying the ARC reclaim thread.
Reported by: gjb
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D13480
The DTrace fasttrap entry points expect a struct reg containing the
register values of the calling thread. Perform the conversion in
fasttrap rather than in the trap handler: this reduces the number of
ifdefs and avoids wasting stack space for traps that don't involve
DTrace.
MFC after: 2 weeks
Default WARNS to 0 still, since there's still some warnings on other
architectures.
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D13301
"panic: vdev_geom_close_locked: cp->private is NULL"
This panic will result if ZFS fails to open a device due to either of the
following reasons:
1) The device's sector size is greater than 8KB.
2) ZFS wants to open the device RW, but it can't be opened for writing.
The solution is to change the initialization order to ensure that the
assertion will be satisfied.
PR: 221066
Reported by: David NewHamlet <wheelcomplex@gmail.com>
Reviewed by: avg
MFC after: 3 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D13278
"panic: vdev_geom_close_locked: cp->private is NULL"
This panic will result if ZFS fails to open a device due to either of the
following reasons:
1) The device's sector size is greater than 8KB.
2) ZFS wants to open the device RW, but it can't be opened for writing.
The solution is to change the initialization order to ensure that the
assertion will be satisfied.
PR: 221066
Reported by: David NewHamlet <wheelcomplex@gmail.com>
Reviewed by: avg
MFC after: 3 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D13278
We may create probes in the nascent child process, so we first need to
ensure that any inherited tracepoints are first removed. Otherwise the
probe sites will not be in the state expected by fasttrap, and it won't
be able to enable the probes.
MFC after: 2 weeks
The problem happens when the writes have offsets and sizes aligned with
a filesystem's recordsize (maximum block size). In this scenario
dmu_tx_assign() would fail because of being over the quota, but the uio
would already be modified in the code path where we copy data from the
uio into a borrowed ARC buffer. That makes an appearance of a partial
write, so zfs_write() would return success and the uio would be modified
consistently with writing a single block.
That bug can result in a data loss because the writes over the quota
would appear to succeed while the actual data is being discarded.
This commit fixes the bug by ensuring that the uio is not changed until
after all error checks are done. To achieve that the code now uses
uiocopy() + uioskip() as in the original illumos design. We can do that
now that uiocopy() has been updated in r326067 to use
vn_io_fault_uiomove().
Reported by: mav
Analyzed by: mav
Reviewed by: mav
Pointyhat to: avg (myself)
MFC after: 1 week
X-MFC after: r326067
X-Erratum: wanted
uiocopy() is currently unused, its purpose is copy data from a uio
without modifying the uio. It was in use before the vn_io_fault support
was added to ZFS, at which point our code diverged from the illumos code
a little bit. Because ZFS is the only (potential) user of the function
we are free to modify it to better suit ZFS needs.
The intention behind this change is to remove the differences introduced
earlier in zfs_write().
While here, re-implement uioskip() using uiomove() with
uio_segflg == UIO_NOCOPY.
The story of uioskip is the same as with uiocopy.
Reviewed by: mav
MFC after: 1 week
In general, higher-level code will atomically verify that the process
is not exiting and hold the process. In one case, we were using uwrite()
to copy a probed instruction to a per-thread scratch space block, but
copyout() can be used for this purpose instead; this change effectively
reverts r227291.
MFC after: 1 week
the ARC reclaim thread running longer than needed.
Update the arc::needfree dtrace probe triggered in arc_lowmem() to also report
the value we may want to free.
Submitted by: Nikita Kozlov <nikita.kozlov at blade-group.com>
Reviewed by: avg
Approved by: avg
MFC after: 3 weeks
Sponsored by: blade
Differential Revision: https://reviews.freebsd.org/D12163
illumos/illumos-gate@27295216542729521654https://www.illumos.org/issues/7531
I found that some buffers that could be L2ARC eligible are not flagged
such, leading to some performance impact. As a test I ran the same IO
workload 10 times in a raw. It is a metadata only workload (files
listing). l2arc_noprefetch=0.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: benrubson <ben.rubson@gmail.com>
MFC after: 8 days
illumos/illumos-gate@f37ae9a714f37ae9a714https://www.illumos.org/issues/8713
If we're creating a pool with version >= SPA_VERSION_DSL_SCRUB (v11) we need to
account for additional space needed by the origin dataset which will also be
snapshotted: "poolname"+"/"+"$ORIGIN"+"@"+"$ORIGIN".
Enforce this limit in pool_namecheck().
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: loli10K <ezomori.nozomu@gmail.com>
MFC after: 1 week
The generic (naive) implementation of posix_fallocate cannot provide the
standard mandated guarantee that overwrites would never fail due to the lack
of free space. The fundamental reason is the copy-on-write architecture
of ZFS. Other features like compression and deduplication can also
increase the size difference between the (pre-)allocated dummy content
and the future content.
So, until ZFS can properly implement the feature it's better to report
that it is unsupported rather than providing an ersatz implementation.
Please note that EINVAL is used to report that the underlying file system
does not support the operation (POSIX.1-2008).
illumos and ZoL seem to do the same.
MFC after: 3 weeks
Sponsored by: Panzura
If vdev_geom_close doesn't close the consumer, then the subsequent call
to vdev_geom_open() would be just a NOP and would always return success.
Thus, at present vdev_reopen() would always succeed for vdev_geom devices
even if the underlying provider is in error state.
The problem was introduced as a result of an optimization in rS308055.
The most significant manifistation of the problem is that
zio_vdev_io_done() --> vdev_probe() --> SPA_ASYNC_PROBE -->
spa_async_probe() --> vdev_reopen()
chain of calls and events becomes a NOP as well.
This chain is invoked when zio_vdev_io_done() detects an "unexpected"
error from the lower level I/O.
Additionally, that call path may race with SPA_ASYNC_REMOVE path because
of the asynchronous nature of them both. So, the SPA_ASYNC_PROBE may
erroneously mark a vdev as being healthy after SPA_ASYNC_REMOVE marked
it as removed.
Reviewed by: asomers, mav
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D12731
Don't check for SPA_MINDEVSIZE in vdev_geom_attach when opening by path.
It's redundant with the check in vdev_open, and failing to attach here
results in the wrong error message being printed. However, still check for
it in some other situations:
* When opening by guids, so we don't get bogged down reading from slow
devices like floppy drives.
* In vdev_geom_read_pool_label for the same reason, because we iterate over
all providers.
* If the caller requests that we verify the guid, because then we'll have to
read from the device before vdev_open verifies the size.
PR: 222227
Reported by: Marie Helene Kvello-Aune <marieheleneka@gmail.com>
Reviewed by: avg, mav
MFC after: 3 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D12531
Unlike spa_async_thread that can get started only from spa_sync()
spa_async_thread_vd can get started from other contexts.
Additionally, spa_async_thread_vd does not really depend on
spa sync being enabled.
The incorrect assert could be triggered by importing a pool in the
read-only mode and then disconnecting one of its disks.
In this case spa_sync_on was false because the pool was read-only
and spa_async_thread_vd was started to handle SPA_ASYNC_REMOVE event.
Note: spa_async_thread_vd() currently exists only in FreeBSD, it was
split out of spa_async_thread() in r253990.
Discussed with: mav
MFC after: 2 weeks
illumos/illumos-gate@4923c69fdd4923c69fdd
FreeBSD note: the manual page is to be updated separately.
https://www.illumos.org/issues/8067
Add an option to zdb to print a literal embedded block pointer supplied on the
command line:
zdb -E [-A] word0:word1:...:word15
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
MFC after: 3 weeks
Zero, <TYPE>_MIN and <TYPE>_MAX values can result from valid conversions.
They don't necessarily imply any error.
Since we do not have any reliable error signaling from libkern's strto*(),
it's better to always assume success rather than to report an error when
there is none.
Reviewed by: tsoome
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D12565
illumos/illumos-gate@2840dce1a02840dce1a0https://www.illumos.org/issues/8600
ZFS channel programs should be able to create snapshots.
In addition to the base snapshot functionality, this will likely entail adding
extra logic to handle edge cases which were formerly not possible, such as
creating then destroying a snapshot in the same transaction sync.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Chris Williamson <chris.williamson@delphix.com>
MFC after: 5 weeks
X-MFC after: r324163
illumos/illumos-gate@000cce6b6f000cce6b6fhttps://www.illumos.org/issues/8592
ZFS channel programs should be able to perform a rollback. This logic will
probably look pretty similar to zfs.sync.destroy().
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Brad Lewis <brad.lewis@delphix.com>
MFC after: 5 weeks
X-MFC after: r324163
illumos/illumos-gate@ed992b0aaced992b0aachttps://www.illumos.org/issues/8604
Every time we want to unmount a snapshot (happens during snapshot deletion or
renaming) we unnecessarily iterate through all the mountpoints in the VFS layer
(see zfs_get_vfs).
Ideally we would just put a hold on the snapshot and access its respective VFS
resource directly.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
FreeBSD note: I added a FreeBSD specific function getzfsvfs_ref() which
is like getzfsvfs() but returns a filesystem referenced, not busied.
We want a busied filesystem in most cases, because we access its private
data and, thus, we need to prevent the filesystem from being unmounted
and its private data destroyed. But in some cases we can either get
away with just a referenced filesystem or we must not busy the
filesystem. Unmounting the filesystem is one of such cases.
MFC after: 5 weeks
X-MFC after: r324163
illumos/illumos-gate@5f39f884e25f39f884e2https://www.illumos.org/issues/8605
zfs.exists() in channel programs doesn't return any result, and should have a
man page entry.
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Chris Williamson <chris.williamson@delphix.com>
MFC after: 5 weeks
X-MFC after: r324163
7431 ZFS Channel Programs
illumos/illumos-gate@dfc115332cdfc115332chttps://www.illumos.org/issues/7431
ZFS channel programs (ZCP) adds support for performing compound ZFS
administrative actions via Lua scripts in a sandboxed environment (with time
and memory limits).
This initial commit includes both base support for running ZCP scripts, and a
small initial library of API calls which support getting properties and
listing, destroying, and promoting datasets.
Testing: in addition to the included unit tests, channel programs have been in
use at Delphix for several months for batch destroying filesystems. The
dsl_destroy_snaps_nvl() call has also been replaced with
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Chris Williamson <chris.williamson@delphix.com>
8552 ZFS LUA code uses floating point math
illumos/illumos-gate@916c8d8811916c8d8811https://www.illumos.org/issues/8552
In the LUA interpreter used by "zfs program", the lua format() function
accidentally includes support for '%f' and friends, which can cause compilation
problems when building on platforms that don't support floating-point math in
the kernel (e.g. sparc). Support for '%f' friends (%f %e %E %g %G) should be
removed, since there's no way to supply a floating-point value anyway (all
numbers in ZFS LUA are int64_t's).
Reviewed by: Yuri Pankov <yuripv@gmx.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
8590 memory leak in dsl_destroy_snapshots_nvl()
illumos/illumos-gate@e6ab4525d1e6ab4525d1https://www.illumos.org/issues/8590
In dsl_destroy_snapshots_nvl(), "snaps_normalized" is not freed after it is
added to "arg".
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
FreeBSD notes:
- zfs-program.8 manual page is taken almost as is from the vendor repository,
no FreeBSD-ification done
- fixed multiple instances of NULL being used where an integer is expected
- replaced ETIME and ECHRNG with ETIMEDOUT and EDOM respectively
This commit adds a modified version of Lua 5.2.4 under
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/lua, mirroring the
upstream. See README.zfs in that directory for the description of Lua
customizations.
See zfs-program.8 on how to use the new feature.
MFC after: 5 weeks
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D12528
I managed to commit an older version of the change.
Plus, even the latest version was not ready for userland compilation.
Reported by: "O. Hartmann" <ohartmann@walstatt.org>,
cy
MFC after: 1 week
X-MFC with: r324011
FreeBSD notes:
- this MFV reverts FreeBSD commit r314549 to make the merge easier
- at present our emulation of cv_timedwait_hires is rather poor,
so I elected to use cv_timedwait_sbt directly
Please see the differential revision for details.
Unfortunately, I did not get any positive reviews, so there could be
bugs in the FreeBSD-specific piece of the merge.
Hence, the long MFC timeout.
illumos/illumos-gate@1271e4b10d1271e4b10dhttps://www.illumos.org/issues/8585
The current implementation of zil_commit() can introduce significant
latency, beyond what is inherent due to the latency of the underlying
storage. The additional latency comes from two main problems:
1. When there's outstanding ZIL blocks being written (i.e. there's
already a "writer thread" in progress), then any new calls to
zil_commit() will block waiting for the currently oustanding ZIL
blocks to complete. The blocks written for each "writer thread" is
coined a "batch", and there can only ever be a single "batch" being
written at a time. When a batch is being written, any new ZIL
transactions will have to wait for the next batch to be written,
which won't occur until the current batch finishes.
As a result, the underlying storage may not be used as efficiently
as possible. While "new" threads enter zil_commit() and are blocked
waiting for the next batch, it's possible that the underlying
storage isn't fully utilized by the current batch of ZIL blocks. In
that case, it'd be better to allow these new threads to generate
(and issue) a new ZIL block, such that it could be serviced by the
underlying storage concurrently with the other ZIL blocks that are
being serviced.
2. Any call to zil_commit() must wait for all ZIL blocks in its "batch"
to complete, prior to zil_commit() returning. The size of any given
batch is proportional to the number of ZIL transaction in the queue
at the time that the batch starts processing the queue; which
doesn't occur until the previous batch completes. Thus, if there's a
lot of transactions in the queue, the batch could be composed of
many ZIL blocks, and each call to zil_commit() will have to wait for
all of these writes to complete (even if the thread calling
zil_commit() only cared about one of the transactions in the batch).
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D12355
illumos/illumos-gate@42b141117242b1411172https://www.illumos.org/issues/8648
I'm opening this bug to track integration of the following ZFS on Linux
commit into illumos:
commit f763c3d1df
Author: LOLi <loli10K@users.noreply.github.com>
Date: Mon Aug 21 17:59:48 2017 +0200
Fix range locking in ZIL commit codepath
Since OpenZFS 7578 (1b7c1e5) if we have a ZVOL with logbias=throughput
we will force WR_INDIRECT itxs in zvol_log_write() setting itx->itx_lr
offset and length to the offset and length of the BIO from
zvol_write()->zvol_log_write(): these offset and length are later used
to take a range lock in zillog->zl_get_data function: zvol_get_data().
Now suppose we have a ZVOL with blocksize=8K and push 4K writes to
offset 0: we will only be range-locking 0-4096. This means the
ASSERTion we make in dbuf_unoverride() is no longer valid because now
dmu_sync() is called from zilog's get_data functions holding a partial
lock on the dbuf.
Fix this by taking a range lock on the whole block in zvol_get_data().
Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: loli10K <ezomori.nozomu@gmail.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Alexander Motin <mav@FreeBSD.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: LOLi <loli10K@users.noreply.github.com>
MFC after: 10 days
illumos/illumos-gate@bd9d3f9046bd9d3f9046https://www.illumos.org/issues/8661
The "zil-cw1" dtrace probe was previously removed in 8558, and the "zil-cw2"
probe should have been removed in that patch as well. Unfortunately, the "zil-
cw2" was not removed in 8558, so this bug is to track it's removal.
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>
MFC after: 1 week
illumos/illumos-gate@554675eee7554675eee7https://www.illumos.org/issues/8473
Scrubbing is supposed to detect and repair all errors in the pool. However,
it wrongly ignores active spare devices. The problem can easily be
reproduced in OpenZFS at git rev 0ef125d with these commands:
truncate -s 64m /tmp/a /tmp/b /tmp/c
sudo zpool create testpool mirror /tmp/a /tmp/b spare /tmp/c
sudo zpool replace testpool /tmp/a /tmp/c
/bin/dd if=/dev/zero bs=1024k count=63 oseek=1 conv=notrunc of=/tmp/c
sync
sudo zpool scrub testpool
zpool status testpool # Will show 0 errors, which is wrong
sudo zpool offline testpool /tmp/a
sudo zpool scrub testpool
zpool status testpool # Will show errors on /tmp/c,
# which should've already been fixed
FreeBSD head is partially affected: the first scrub will detect some errors, but the second scrub will detect more.
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
MFC after: 1 week
Sponsored by: Spectra Logic Corp
It is reported that the default value of 4KB results in a substantial
memory use overhead (at least, on some configurations). Using 1KB seems
to reduce the overhead significantly.
PR: 222377
Reported by: Sean Chittenden <sean@chittenden.org>
MFC after: 1 week
I overlooked the fact that that ZIO_IOCTL_PIPELINE does not include
ZIO_STAGE_VDEV_IO_DONE stage. We do allocate a struct bio for an ioctl
zio (a disk cache flush), but we never freed it.
This change splits bio handling into two groups, one for normal
read/write i/o that passes data around and, thus, needs the abd data
tranform; the other group is for "data-less" i/o such as trim and cache
flush.
PR: 222288
Reported by: Dan Nelson <dnelson@allantgroup.com>
Tested by: Borja Marcos <borjam@sarenet.es>
MFC after: 10 days
illumos/illumos-gate@2bcb5458542bcb545854https://www.illumos.org/issues/8602
When I landed the fix for 8558, I incorrectly added the "dp_early_sync_tasks"
field to the "dsl_pool" structure. This field is used in DelphixOS, but not in
illumos. It was incorrectly pulled into illumos, so this bug is to remove it
from the structure.
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>
MFC after: 1 week
As long as mnt_ref is not zero there can be a consumer that might try
to access mnt_vnodecovered. For this reason the covered vnode must not
be freed until mnt_ref goes to zero.
So, move the release of the covered vnode to vfs_mount_destroy.
Reviewed by: kib
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D12329
The zfsonlinux feature large_dnode is not yet supported by the loader.
Reviewed by: avg, allanjude
Differential Revision: https://reviews.freebsd.org/D12288
The uncovered vnode is possible because there is no guarantee that
its hold count would go to zero (and it would be inactivated and reclaimed)
immediately after a covering filesystem is unmounted.
So, such a vnode should be expected and it is possible to re-use it
without any trouble.
MFC after: 3 weeks
Sponsored by: Panzura
The only consumer of zfs_get_vfs, zfs_unmount_snap, does not need
the filesystem to be busy, it just need a reference that it can pass
to dounmount.
Also, previously the code was racy as it unbusied the filesystem
before taking a reference on it.
Now the code should be simpler and safer.
MFC after: 2 weeks
Sponsored by: Panzura
illumos/illumos-gate@37e84ab74e37e84ab74ehttps://www.illumos.org/issues/8569
C [C99] has peculiar rules for inline functions that are different from the
C++ rules. Unlike C++ where inline is "fire and forget", in C a programmer
must pay attention to the function's storage class / visibility. The main
problem is with the case where a compiler decides to not inline a call to the
function declared as inline.
Some relevant links:
- http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka15831.html
- http://www.drdobbs.com/the-new-c-inline-functions/184401540
The summary is that either the inline functions should be declared 'static
inline' or one of the compilation units (.c files) must provide a callable
externally visible function definition. In the former case, the compiler would
automatically create a local non-inlined function instance in every compilation
unit where it's needed. In the latter case the single external definition is
used to satisfy any non-inlined calls in all compilation units. As things
stand right now, we can get an undefined reference error under certain
combinations of compilers and compiler options. For example, this is what I
get on FreeBSD when compiling with clang 4.0.0 and -O1:
In function `abd_free': /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/abd.c:385:
undefined reference to `abd_is_linear'
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>
MFC after: 1 week
illumos/illumos-gate@216d7723a1216d7723a1https://www.illumos.org/issues/8558
On a system with more than 80K ZFS filesystems, we've seen cases where
lwp_create() will start to fail by returning EAGAIN. The problem being,
for each of those 80K ZFS filesystems, a taskq will be created for each
dataset as part of the ZIL for each dataset.
For each of these taskq's, a kernel thread will be created which results
in 24KB being allocated for each thread. With enough of these 24KB
allocations, we eventually exhaust the memory region set aside for these
allocations. Currently, segkpsize is set to a value of 2GB, which means
we can only support about 80K filesystems; 2GB / 24KB = ~80K.
The lwp_create() failure comes into play due to the fact that LWP
creation also allocates 24KB from this same region of memory. Thus, if
we've exhausted this region of memory due to the number of ZIL taskq's,
there won't be any memory avaible to allow the call to lwp_create() to
succeed.
FreeBSD note: I haven't created sysctl-s for the new ZIL clean
parameters. Let's add them if anyone requires to tune them.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>
MFC after: 3 weeks
illumos/illumos-gate@1702cce7511702cce751
FreeBSD note: rather than merging the zpool.8 update I copied the zpool
scrub section from the illumos zpool.1m to FreeBSD zpool.8 almost
verbatim. Now that the illumos page uses the mdoc format, it was an
easier option. Perhaps the change is not in perfect compliance with the
FreeBSD style, but I think that it is acceptible.
https://www.illumos.org/issues/8414
This issue tracks the port of scrub pause from ZoL: https://github.com/zfsonlinux/zfs/pull/6167
Currently, there is no way to pause a scrub. Pausing may be useful when
the pool is busy with other I/O to preserve bandwidth.
Description
This patch adds the ability to pause and resume scrubbing. This is achieved
by maintaining a persistent on-disk scrub state. While the state is 'paused'
we do not scrub any more blocks. We do however perform regular scan
housekeeping such as freeing async destroyed and deadlist blocks while paused.
Motivation and Context
Scrub pausing can be an I/O intensive operation and people have been asking
for the ability to pause a scrub for a while. This allows one to preserve scrub
progress while freeing up bandwidth for other I/O.
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Alek Pinchuk <apinchuk@datto.com>
MFC after: 2 weeks
Turn on the required options in the ERL config file, and ensure
that the fbt module is listed as a dependency for mips in
the modules/dtrace/dtraceall/dtraceall.c file.
PR: 220346
Reviewed by: gnn, markj
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D12227
"zfs mount -o" passes a list of mount options directly to nmount(2) after
sanity checking them. In particular, zfs(8) will refuse to mount an already
existing file system unless "remount" is specified in the option list.
However, the "remount" option only exists in Illumos. FreeBSD's equivalent is
"update".
PR: 221985
Reviewed by: avg
MFC after: 3 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D12233
The default value for arc_no_grow_shift may not be optimal when using
several GiB ARC. Expose it via sysctl allows users to tune it easily.
Also expose arc_grow_retry via sysctl for the same reason. The default value of
60s might, in case of intensive load, be too long.
Submitted by: Nikita Kozlov <nikita.kozlov@blade-group.com>
Reviewed by: mav, manu, bapt
MFC after: 2 weeks
Sponsored by: blade
Differential Revision: https://reviews.freebsd.org/D12144
illumos 4185 ("add new cryptographic checksums to ZFS: SHA-512,
Skein, Edon-R") was intentionally merged only partially in r289422,
without adding support for skein, sha512 and edonr on FreeBSD.
Support for skein and sha512 was added later on, but edonr is still not
implemented in FreeBSD.
Prior to this commit zfs(8) correctly rejected edonr, but with an error
message that claimed support:
fk@r500 ~ $zfs set checksum=edonr tank
cannot set property for 'tank': 'checksum' must be one of 'on | off | fletcher2 | fletcher4 | sha256 | sha512 | skein | edonr'
PR: 204055
Submitted by: Fabian Keil
Approved by: allanjude
Obtained from: ElectroBSD
MFC after: 1 week
When built with -fno-inline-functions zfs.ko contains undefined references
to these functions if they are only marked inline.
Reviewed by: avg (earlier version)
MFC after: 1 week
Sponsored by: Chelsio Communications
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
Be more careful about the use of provider names vs vdev names in
ZFS_LOG statements.
MFC after: 3 weeks
Sponsored by: Spectra Logic Corp
illumos/illumos-gate@d28671a3b0d28671a3b0https://www.illumos.org/issues/8373
The code that writes ZIL blocks uses dmu_tx_assign(TXG_WAIT) to assign
a transaction to a transaction group. That seems to be logically
incorrect as writing of the ZIL block does not introduce any new dirty
data. Also, when there is a lot of dirty data, the call can introduce
significant delays into the ZIL commit path, thus affecting all
synchronous writes. Additionally, ARC throttling may affect the ZIL
writing.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>
MFC after: 2 weeks
illumos/illumos-gate@79c2b812ee79c2b812eehttps://www.illumos.org/issues/8491
The zpool checkpoint feature in DxOS added a new field in the uberblock.
The Multi-Modifier Protection Pull Request from ZoL adds two new fields in the
uberblock (Reference: https://github.com/zfsonlinux/zfs/pull/6279).
As these two changes come from two different sources and once upstreamed and
deployed will introduce an incompatibility with each other we want
to upstream a change that will reserve the padding for both of them so
integration goes smoothly and everyone gets both features.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Olaf Faaland <faaland1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
MFC after: 3 weeks
illumos/illumos-gate@267ae6c3a8267ae6c3a8https://www.illumos.org/issues/7915
l2arc_evict() is strictly serialized with respect to
l2arc_write_buffers() and l2arc_write_done(). Normally, l2arc_evict()
and l2arc_write_buffers() are called from the same thread, so they can
not be concurrent. Also, l2arc_write_buffers() uses zio_wait() on the
parent zio of all cache zio-s. That ensures that l2arc_write_done()
is completed before l2arc_write_buffers() returns. Finally, if a
cache device is removed, then l2arc_evict() is called under SCL_ALL in
the exclusive mode. That ensures that it can not be concurrent with
the normal L2ARC accesses to the device (including writing and
evicting buffers). Given the above, some checks and actions in
l2arc_evict() do not make sense. For instance, it must never
encounter the write head header let alone remove it from the buffer
list.
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Andriy Gapon <avg@FreeBSD.org>
MFC after: 2 weeks
illumos/illumos-gate@dcb6872c56dcb6872c56https://www.illumos.org/issues/8126
The sync thread is concurrently modifying dn_phys->dn_nlevels
while dbuf_dirty() is trying to assert something about it, without
holding the necessary lock. We need to move this assertion further down
in the function, after we have acquired the dn_struct_rwlock.
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
MFC after: 2 weeks