Commit Graph

1977 Commits

Author SHA1 Message Date
Andriy Gapon
70639f9a5e MFV r329710: 8966 Source file zfs_acl.c, function zfs_aclset_common contains a use after end of the lifetime of a local variable
illumos/illumos-gate@82693e09cc
82693e09cc
https://www.illumos.org/issues/8966

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: WHR <msl0000023508@gmail.com>
PR:		225162
Submitted by:	WHR <msl0000023508@gmail.com>
Reported by:	WHR <msl0000023508@gmail.com>
MFC after:	1 week
2018-02-21 14:17:07 +00:00
Alexander Motin
d8e89539c8 MFV r324198: 8081 Compiler warnings in zdb
illumos/illumos-gate@3f7978d02b
3f7978d02b

https://www.illumos.org/issues/8081
  zdb(8) is full of minor problems that generate compiler warnings. On FreeBSD,
  which uses -WError, the only way to build it is to disable all compiler
  warnings. This makes it much harder to detect newly introduced bugs. We should
  cleanup all the warnings.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Alan Somers <asomers@gmail.com>
2018-02-21 03:08:47 +00:00
Alexander Motin
03618fe74d MFV r319737: 6939 add sysevents to zfs core for commands
illumos/illumos-gate@ce1577b049
ce1577b049

https://www.illumos.org/issues/6939
  Originally created https://smartos.org/bugview/OS-4489
       sysevents should be fired in the kernel from ZFS whenever a command
       is run that is logged in zpool history.
  Example output
  Terminal 1
  root - gz sunos ~ # zfs create zones/foobar
  root - gz sunos ~ # zfs set quota=10g zones/foobar
  root - gz sunos ~ # zfs destroy zones/foobar
  Terminal 2
  root - gz sunos ~ # sysevent EC_zfs
  nvlist version: 0
      date = 2016-04-28T14:50:08.964Z
      vendor = SUNW
      publisher = zfs
      class = EC_zfs
      subclass = ESC_ZFS_history_event
      pid = 0
      data = (embedded nvlist)
      nvlist version: 0
          pool_name = zones
          pool_guid = 0x40c964e8f9a7a694
          history_record = (embedded nvlist)
          nvlist version: 0
              dsname = zones/foobar
              dsid = 0x1525
              history internal str =
              internal_name = create
              history txg = 0x4c4ef3

Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Joshua M. Clulow <jmc@joyent.com>
Reviewed by: Josh Wilsdon <jwilsdon@joyent.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Alan Somers <asomers@gmail.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Dave Eddy <dave@daveeddy.com>
2018-02-21 02:19:42 +00:00
Alexander Motin
63e739af67 MFV r319736: 6396 remove SVM
illumos/illumos-gate@5f10ef697f
5f10ef697f

https://www.illumos.org/issues/6396
  LVM = SVM = Solaris Volume Manager
  dead code and not using with ZFS based platform.

Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Yuri Pankov <yuri.pankov@nexenta.com>
2018-02-21 00:24:54 +00:00
Alexander Motin
e5a4a83784 MFV r318941: 7446 zpool create should support efi system partition
illumos/illumos-gate@7855d95b30
7855d95b30

https://www.illumos.org/issues/7446
  Since we support whole-disk configuration for boot pool, we also will need
  whole disk support with UEFI boot and for this, zpool create should create efi-
  system partition.
  I have borrowed the idea from oracle solaris, and introducing zpool create -
  B switch to provide an way to specify that boot partition should be created.
  However, there is still an question, how big should the system partition be.
  For time being, I have set default size 256MB (thats minimum size for FAT32
  with 4k blocks). To support custom size, the set on creation "bootsize"
  property is created and so the custom size can be set as: zpool create B -
  o bootsize=34MB rpool c0t0d0
  After pool is created, the "bootsize" property is read only. When -B switch is
  not used, the bootsize defaults to 0 and is shown in zpool get output with
  value ''. Older zfs/zpool implementations are ignoring this property.
  https://www.illumos.org/rb/r/219/

Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Approved by: Dan McDonald <danmcd@kebe.com>
Author: Toomas Soome <tsoome@me.com>

This commit makes no sense for FreeBSD, that is why I blocked the option,
but it should be good to stay closer to upstream.
2018-02-21 00:18:57 +00:00
Alexander Motin
89dabdb4ea MFC r316910: 7812 Remove gender specific language
illumos/illumos-gate@48bbca8168
48bbca8168

https://www.illumos.org/issues/7812
  This change removes all gendered language that did not refer specifically
  to an individual person or pet. The convention taken was to use
  variations on "they" when referring to users and/or human beings, while
  using "it" when referring to code, functions, and/or libraries.
  Additionally, we took the liberty to fix up any whitespace issues that
  were found in any files that were already being modified.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Daniel Hoffman <dj.hoffman@delphix.com>
2018-02-20 05:07:21 +00:00
Alexander Motin
3542f1bd3a MFV r307315:
7301 zpool export -f should be able to interrupt file freeing

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Author: Alek Pinchuk <alek@nexenta.com>

Closes #175
2018-02-20 04:36:51 +00:00
Alexander Motin
4a0867c8d2 MFV r302649: 7016 arc_available_memory is not 32-bit safe
illumos/illumos-gate@0dd053d7d8
0dd053d7d8

https://www.illumos.org/issues/7016
  upstream DLPX-39446 arc_available_memory is not 32-bit safe
  https://github.com/delphix/delphix-os/commit/
  6b353ea3b8a1610be22e71e657d051743c64190b
  related to this upstream:
  DLPX-38547 delphix engine hang
  https://github.com/delphix/delphix-os/commit/
  3183a567b3e8c62a74a65885ca60c86f3d693783
  DLPX-38547 delphix engine hang (fix static global)
  https://github.com/delphix/delphix-os/commit/
  22ac551d8ef085ad66cc8f65e51ac372b12993b9
  DLPX-38882 system hung waiting on free segment
  https://github.com/delphix/delphix-os/commit/
  cdd6beef7548cd3b12f0fc0328eeb3af540079c2

Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
Author: Prakash Surya <prakash.surya@delphix.com>
2018-02-20 04:14:12 +00:00
Andriy Gapon
8cebd0e419 relax an assert in zfsctl_snapdir_lookup to match r323578
Since r323578 we may remove the last reference to a covered vnode with
vrele() instead of vput().  So, v_usecount may be decremented before
the vnode is locked and zfsctl_snapdir_lookup may "catch" the vnode
with v_usecount of zero and v_holdcnt of one.

PR:		225795
Reported by:	asomers
MFC after:	1 week
2018-02-19 08:55:22 +00:00
Alan Somers
4571b2776f zfs: fix formatting in a log statement
Submitted by:	Dave Baukus <daveb@spectralogic.com>
MFC after:	3 weeks
Sponsored by:	Spectra Logic Corp
2018-02-16 21:59:08 +00:00
Alan Somers
dfbc272d5d Handle generic pathconf attributes in the .zfs ctldir
MFC instructions: change the value of _PC_LINK_MAX to INT_MAX

Reported by:	jhb
MFC after:	19 days
X-MFC-With:	329265
Sponsored by:	Spectra Logic Corp
2018-02-16 16:56:09 +00:00
Andriy Gapon
c945107d23 read-behind / read-ahead support for zfs_getpages()
ZFS caches blocks it reads in its ARC, so in general the optional
pages are not as useful as with filesystems that read the data
directly into the target pages.  But still the optional pages
are useful to reduce the number of page faults and associated
VM / VFS / ZFS calls.
Another case that gets optimized (as a side effect) is paging in
from a hole.  ZFS DMU does not currently provide a convenient
API to check for a hole.  Instead it creates a temporary zero-filled
block and allows accessing it as if it were a normal data block.
Getting multiple pages one by one from a hole results in repeated
creation and destruction of the temporary block (and an associated
ARC header).

Tested with fsx using various supported blocks sizes from 512 bytes
to 128 KB and additionally 1 MB.

Please note that in illumos and ZoL they do not do the range-locking in
the page-in path. This is because ZFS has a double-caching problem
between ARC and page cache and that requires zfs_read() and zfs_write()
to consult pages in the page cache. So, in those functions they first
lock a range and then lock pages corresponding to the range. While in
the page-in (and maybe page-out) path they first lock the pages and then
would lock the range. So, they would have a deadlock.

I believe that FreeBSD does not have that problem, because the page-in
deals only with invalid pages while zfs_read() and zfs_write() need to
access only valid pages. They do not wait on a busy page unless it's
already valid.

Reviewed by:	kib
MFC after:	3 weeks
Differential Revision: https://reviews.freebsd.org/D14263
2018-02-16 06:59:35 +00:00
Andriy Gapon
113ce413ba MFV r329313: 8857 zio_remove_child() panic due to already destroyed parent zio
illumos/illumos-gate@d6e1c446d7
d6e1c446d7

https://www.illumos.org/issues/8857
  I had an OS panic on one of our servers:

  ffffff01809128c0 vpanic()
  ffffff01809128e0 mutex_panic+0x58(fffffffffb94c904, ffffff597dde7f80)
  ffffff0180912950 mutex_vector_enter+0x347(ffffff597dde7f80)
  ffffff01809129b0 zio_remove_child+0x50(ffffff597dde7c58, ffffff32bd901ac0,
  ffffff3373370908)
  ffffff0180912a40 zio_done+0x390(ffffff32bd901ac0)
  ffffff0180912a70 zio_execute+0x78(ffffff32bd901ac0)
  ffffff0180912b30 taskq_thread+0x2d0(ffffff33bae44140)
  ffffff0180912b40 thread_start+8()

  It panicked here:
  http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/
  zio.c#430

  pio->io_lock is DEAD, thus a panic. Further analysis shows the "pio"
  (parent zio of "cio") has already been destroyed.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Youzhong Yang <youzhong@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: George Wilson <george.wilson@delphix.com>

PR:		223803
Tested by:	shiva.bhanujan@quorum.com
MFC after:	2 weeks
2018-02-15 14:46:29 +00:00
Alan Somers
d64bae1f1a Implement .vop_pathconf and .vop_getacl for the .zfs ctldir
zfsctl_common_pathconf will report all the same variables that regular ZFS
volumes report. zfsctl_common_getacl will report an ACL equivalent to 555,
except that you can't read xattrs or edit attributes.

Fixes a bug where "ls .zfs" will occasionally print something like:
ls: .zfs/.: Operation not supported

PR:		225793
Reviewed by:	avg
MFC after:	3 weeks
Sponsored by:	Spectra Logic Corp
Differential Revision:	https://reviews.freebsd.org/D14365
2018-02-14 15:49:31 +00:00
Alexander Motin
bc55f7298e Add sysctls for dnode block and indirect block shifts.
MFC after:	2 weeks
2018-02-09 23:29:50 +00:00
Andriy Gapon
f11548e1da remove a duplicate assignment
There should be no functional change.

MFC after:	1 week
2018-02-08 13:22:40 +00:00
Jeff Roberson
e2068d0bcd Use per-domain locks for vm page queue free. Move paging control from
global to per-domain state.  Protect reservations with the free lock
from the domain that they belong to.  Refactor to make vm domains more
of a first class object.

Reviewed by:    markj, kib, gallatin
Tested by:      pho
Sponsored by:   Netflix, Dell/EMC Isilon
Differential Revision:  https://reviews.freebsd.org/D14000
2018-02-06 22:10:07 +00:00
Andriy Gapon
be4be0e5ca zfs: move a utility function, ioflags, closer to its consumers
No functional change.

MFC after:	1 week
2018-02-05 14:19:36 +00:00
Andriy Gapon
6c1e03251e ZFS ARC: restore illumos uses of 'needfree' that were removed in r325851
This is purely a cosmetic change to have a more complete copy of
ifdef-ed out illumos code.

MFC after:	1 week
2018-02-02 12:57:33 +00:00
Andriy Gapon
1fc46a9ba0 zfs_rezget: drop cached pages before doing anything else
We did that in the case of success to prevent the use of stale cached
data, but it makes even less sense to keep the cached data when we fail.

Ideally, we should call vgone() on the vnode in the case of zfs_rezget
failure, but the current lock order prevents us from doing that.

The change also rearranges the order of unlinked check and the size
change check.

While there, add missing SET_ERROR in one of the error paths.

MFC after:	2 weeks
2018-01-31 14:44:51 +00:00
Alexander Motin
1dbc0fc352 MFV r328253: 8835 Speculative prefetch in ZFS not working for misaligned reads
illumos/illumos-gate@5cb8d943bc

https://www.illumos.org/issues/8835:
Sequential reads not aligned to block size are not detected by ZFS
prefetcher as sequential, killing prefetch and severely hurting
performance.  It is caused by dmu_zfetch() in case of misaligned
sequential accesses being called with overlap of one block.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Allan Jude <allanjude@freebsd.org>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Alexander Motin <mav@FreeBSD.org>
2018-01-22 05:57:14 +00:00
Alexander Motin
4eb2803697 MFV r328251: 8652 Tautological comparisons with ZPROP_INVAL
illumos/illumos-gate@4ae5f5f06c

https://www.illumos.org/issues/8652:
Clang and GCC prefer to use unsigned ints to store enums. With Clang, that
causes tautological comparison warnings when comparing a zfs_prop_t or
zpool_prop_t variable to the macro ZPROP_INVAL. It's likely that error
handling code is being silently removed as a result.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Alan Somers <asomers@gmail.com>
2018-01-22 05:52:39 +00:00
Alexander Motin
b5eb78f824 MFV r328247: 8959 Add notifications when a scrub is paused or resumed
illumos/illumos-gate@301fd1d6f2

Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Sean Eric Fagan <sef@ixsystems.com>
2018-01-22 04:31:48 +00:00
Alexander Motin
abe7ff88e1 MFV r328245: 8856 arc_cksum_is_equal() doesn't take into account ABD-logic
illumos/illumos-gate@01a059ee0c

https://www.illumos.org/issues/8856:
arc_cksum_is_equal() calls zio_push_transform() that requires abd_t*
(second arg), but a void* is passed.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Roman Strashkin <roman.strashkin@nexenta.com>
2018-01-22 04:23:48 +00:00
Alexander Motin
226cd471b4 MFV r328229:
8930 zfs_zinactive: do not remove the node if the filesystem is readonly

illumos/illumos-gate@93c618e0f4

https://www.illumos.org/issues/8930:
We normally remove an unlinked node when its last user goes away and the
node becomes inactive. However, we should not do that if the filesystem
is mounted read-only including the case where it has its readonly
property set. The node will remain on the unlinked queue, so it will
not be leaked.

One particular scenario is when we receive an incremental stream into a
mounted read-only filesystem and that stream contains an unlinked file
(still on the unlinked queue). If that file is opened before the
receive and some time later after the receive it becomes inactive we
would remove it and, thus, modify the read-only filesystem. As a
result, the filesystem would diverge from its source and further
incremental receives would not be possible (without forcing a rollback).

Another related scenario, that may or may not be possible depending on an
OS / VFS policy, is when an open file is unlinked, then the filesystem is
remounted read-only, and then the file is closed.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Andriy Gapon <avg@FreeBSD.org>
2018-01-21 23:49:17 +00:00
Alexander Motin
272f833cde MFV r328227: 8909 8585 can cause a use-after-free kernel panic
illumos/illumos-gate@94ddd0900a

https://www.illumos.org/issues/8909:
There's a race condition that exists if `zil_free_lwb` races with either
`zil_commit_waiter_timeout` and/or `zil_lwb_flush_vdevs_done`.

Here's an example panic due to this bug:

> ::status
    debugging crash dump vmcore.0 (64-bit) from ip-10-110-205-40
    operating system: 5.11 dlpx-5.2.2.0_2017-12-04-17-28-32b6ba51fb (i86pc)
    image uuid: 4af0edfb-e58e-6ed8-cafc-d3e9167c7513
    panic message:
    BAD TRAP: type=e (#pf Page fault) rp=ffffff0010555970 addr=60 occurred in mo
dule "zfs" due to a NULL pointer dereference
    dump content: kernel pages only

> $c
    zio_shrink+0x12()
    zil_lwb_write_issue+0x30d(ffffff03dcd15cc0, ffffff03e0730e20)
    zil_commit_waiter_timeout+0xa2(ffffff03dcd15cc0, ffffff03d97ffcf8)
    zil_commit_waiter+0xf3(ffffff03dcd15cc0, ffffff03d97ffcf8)
    zil_commit+0x80(ffffff03dcd15cc0, 9a9)
    zfs_write+0xc34(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
    fop_write+0x5b(ffffff03dc38b140, ffffff0010555e60, 40, ffffff03e00fb758, 0)
    write+0x250(42, fffffd7ff4832000, 2000)
    sys_syscall+0x177()

If there's an outstanding lwb that's in `zil_commit_waiter_timeout`
waiting to timeout, waiting on it's waiter's CV, we must be sure not to
call `zil_free_lwb`. If we end up calling `zil_free_lwb`, then that LWB
may be freed and can result in a use-after-free situation where the
stale lwb pointer stored in the `zil_commit_waiter_t` structure of the
thread waiting on the waiter's CV is used.

A similar situation can occur if an lwb is issued to disk, and thus in
the `LWB_STATE_ISSUED` state, and `zil_free_lwb` is called while the
disk is servicing that lwb. In this situation, the lwb will be freed by
`zil_free_lwb`, which will result in a use-after-free situation when the
lwb's zio completes, and `zil_lwb_flush_vdevs_done` is called.

This race condition is prevented in `zil_close` by calling `zil_commit`
before `zil_free_lwb` is called, which will ensure all outstanding (i.e.
all lwb's in the `LWB_STATE_OPEN` and/or `LWB_STATE_ISSUED` states)
reach the `LWB_STATE_DONE` state before the lwb's are freed
(`zil_commit` will not return untill all the lwb's are
`LWB_STATE_DONE`).

Further, this race condition is prevented in `zil_sync` by only calling
`zil_free_lwb` for lwb's that do not have their `lwb_buf` pointer set.
All lwb's not in the `LWB_STATE_DONE` state will have a non-null value
for this pointer; the pointer is only cleared in
`zil_lwb_flush_vdevs_done`, at which point the lwb's state will be
changed to `LWB_STATE_DONE`.

This race is present in `zil_suspend`, leading to this bug.

At first glance, it would appear as though this would not be true
because `zil_suspend` will call `zil_commit`, just like `zil_close`, but
the problem is that `zil_suspend` will set the zilog's `zl_suspend`
field prior to calling `zil_commit`. Further, in `zil_commit`, if
`zl_suspend` is set, `zil_commit` will take a special branch of logic
and use `txg_wait_synced` instead of performing the normal `zil_commit`
logic.

This call to `txg_wait_synced` might be good enough for the data to
reach disk safely before it returns, but it does not ensure that all
outstanding lwb's reach the `LWB_STATE_DONE` state before it returns.
This is because, if there's an lwb "stuck" in
`zil_commit_waiter_timeout`, waiting for it's lwb to timeout, it will
maintain a non-null value for it's `lwb_buf` field and thus `zil_sync`
will not free that lwb. Thus, even though the lwb's data is already on
disk, the lwb will be left lingering, waiting on the CV, and will
eventually timeout and be issued to disk even though the write is
unnesseary.

So, after `zil_commit` is called from `zil_suspend`, we incorrectly
assume that there are not outstanding lwb's, and proceed to free all
lwb's found on the zilog's lwb list. As a result, we free the lwb that
will later be used `zil_commit_waiter_timeout`.

Reviewed by: John Kennedy <jwk404@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>
2018-01-21 23:18:42 +00:00
Alexander Motin
d252c97fc0 MFV r328225: 8603 rename zilog's "zl_writer_lock" to "zl_issuer_lock"
illumos/illumos-gate@cf07d3da99

https://www.illumos.org/issues/8603:
  To help make the ZIL's code more understandable, it was suggested that
  the zilog_t's "zl_writer_lock" field should be renamed to "zl_issuer_lock".

Reviewed by: C Fraire <cfraire@me.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>
2018-01-21 23:11:20 +00:00
Alexander Motin
442814b7e7 MFV r328220: 8677 Open-Context Channel Programs
illumos/illumos-gate@a3b2868063

https://www.illumos.org/issues/8677
  We want to be able to run channel programs outside of synching context.
  This would greatly improve performance of channel program that just gather
  information, as we won't have to wait for synching context anymore.

  This feature should introduce the following:
  - A new command line flag in "zfs program" to specify our intention to
  run in open context.
  - A new flag/option within the channel program ioctl which selects the
  context.
  - Appropriate error handling whenever we try a channel program in
  open-context that contains zfs.sync* expressions.
  - Documentation for the new feature in the manual pages.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
2018-01-21 23:02:05 +00:00
Andriy Gapon
e09304d8f3 zfs: no need to check that size of zfs_cmd_t is not greater than IOCPARM_MAX
Nowadays we do not pass zfs_cmd_t directly through the ioctl interface.
Instead a small zfs_iocparm_t object is passed and the command is
explicitly copied in and out.  So, the check has become irrelevant.

MFC after:	3 weeks
Sponsored by:	Panzura
2018-01-21 11:19:18 +00:00
Mark Johnston
94a889089b Use the thread's ucred struct when fetching jid or jailname.
Reported by:	mjg
X-MFC with:	r327888
2018-01-14 17:55:40 +00:00
Mark Johnston
224e0c2f61 Add "jid" and "jailname" variables to DTrace.
These return the jail ID and jail name for the traced process,
respectively, and are analogous to "zonename" on Solaris/illumos.
"zonename" is now aliased to "jailname".

Also add some stress tests for the new variables.

Submitted by:	Domagoj Stolfa <domagoj.stolfa@gmail.com>
Reviewed by:	dteske (previous version)
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D13877
2018-01-12 19:59:46 +00:00
Andriy Gapon
be5060c116 zfs_mount: restore a bit of ifdef-out illumos code
And correctly mark the end of the replacement FreeBSD code.

MFC after:	1 week
2018-01-09 13:43:04 +00:00
Jeff Roberson
ad5b0f5b51 Fix arc after r326347 broke various memory limit queries. Use UMA features
rather than kmem arena size to determine available memory.

Initialize the UMA limit to LONG_MAX to avoid spurious wakeups on boot before
the real limit is set.

PR:		224330 (partial), 224080
Reviewed by:	markj, avg
Sponsored by:	Netflix / Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D13494
2018-01-02 04:35:56 +00:00
Dimitry Andric
e42df78a1a Remove obsolete register keyword from opensolaris's sysmacros.h. When
compiling zfsd with recent clang, it leads to a warning about the
register storage class being incompatible with C++17.

MFC after:	3 days
2017-12-24 19:17:15 +00:00
John Baldwin
f27d3a8a72 Don't return early for non-failure for one of the EMLINK checks.
r326987 enabled two #if 0'd-out EMLINK checks in zfs_link_create() for
link overflow.  However, one of the checks (when the vnode adding a link
is a directory such as for mkdir) always returned even if the link did not
overflow.  Change this to only return early if it needs to report an
EMLINK error.

Reported by:	db, shurd
Sponsored by:	Chelsio Communications
2017-12-19 23:54:44 +00:00
John Baldwin
b501cc5da6 Rework pathconf handling for FIFOs.
On the one hand, FIFOs should respect other variables not supported by
the fifofs vnode operation (such as _PC_NAME_MAX, _PC_LINK_MAX, etc.).
These values are fs-specific and must come from a fs-specific method.
On the other hand, filesystems that support FIFOs are required to
support _PC_PIPE_BUF on directory vnodes that can contain FIFOs.
Given this latter requirement, once the fs-specific VOP_PATHCONF
method supports _PC_PIPE_BUF for directories, it is also suitable for
FIFOs permitting a single VOP_PATHCONF method to be used for both
FIFOs and non-FIFOs.

To that end, retire all of the FIFO-specific pathconf methods from
filesystems and change FIFO-specific vnode operation switches to use
the existing fs-specific VOP_PATHCONF method.  For fifofs, set it's
VOP_PATHCONF to VOP_PANIC since it should no longer be used.

While here, move _PC_PIPE_BUF handling out of vop_stdpathconf() so that
only filesystems supporting FIFOs will report a value.  In addition,
only report a valid _PC_PIPE_BUF for directories and FIFOs.

Discussed with:	bde
Reviewed by:	kib (part of a larger patch)
MFC after:	1 month
Sponsored by:	Chelsio Communications
Differential Revision:	https://reviews.freebsd.org/D12572
2017-12-19 22:39:05 +00:00
John Baldwin
599afe53a8 Move NAME_MAX, LINK_MAX, and CHOWN_RESTRICTED out of vop_stdpathconf().
Having all filesystems fall through to default values isn't always correct
and these values can vary for different filesystem implementations.  Most
of these changes just use the existing default values with a few exceptions:
- Don't report CHOWN_RESTRICTED for ZFS since it doesn't do the exact
  permissions check this claims for chown().
- Use NANDFS_NAME_LEN for NAME_MAX for nandfs.
- Don't report a LINK_MAX of 0 on smbfs.  Now fail with EINVAL to
  indicate hard links aren't supported.

Requested by:	bde (though perhaps not this exact implementation)
Reviewed by:	kib (earlier version)
MFC after:	1 month
Sponsored by:	Chelsio Communications
2017-12-19 19:51:36 +00:00
John Baldwin
697a86b6bf Adjust ZFS' link count handling for ino64.
- Define a ZFS_LINK_MAX as the ZFS version of LINK_MAX which is set to
  UINT64_MAX to match the on-disk format.
- Enable the currently #if 0'd code to check for link overflows and
  return EMLINK.
- Don't clamp the link count reported in stat() to LINK_MAX as that is
  still the 16-bit limit, but report the full link counts.  Also,
  avoid possibly overflowing the reported link count to 0 when adjusting
  the link count to account for ".snapshot".
- Update the LINK_MAX reported by pathconf() to report ZFS_LINK_MAX
  rather than LINK_MAX (but clamped to LONG_MAX for 32-bit systems).

Reviewed by:	avg (earlier version)
Sponsored by:	Chelsio Communications
2017-12-19 19:07:24 +00:00
Mark Johnston
6c7828a280 Avoid CPU migration in dtrace_gethrtime() on x86.
dtrace_gethrtime() may be called outside of probe context, and in
particular, from the DTRACEIOC_BUFSNAP handler.

Disable interrupts rather than using sched_pin() to help ensure that
we don't call any external functions when in probe context.

PR:		218452
MFC after:	1 week
2017-12-18 17:26:24 +00:00
Mark Johnston
7a177c2d5e Unregister the ARC lowmem event handler earlier in arc_fini().
Otherwise a poorly timed lowmem event may attempt to acquire a destroyed
lock. Unregister the handler before destroying the ARC reclaim thread.

Reported by:	gjb
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D13480
2017-12-17 18:21:40 +00:00
Mark Johnston
a981eff82e MFV r326785: 8880 improve DTrace error checking
illumos/illumos-gate@2cf374268f
2cf374268f

https://www.illumos.org/issues/8880

Reviewed by: Tim Kordas <tim.kordas@joyent.com>
Reviewed by: Bryan Cantrill <bryan@joyent.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Jerry Jelinek <jerry.jelinek@joyent.com>

MFC after:	1 week
2017-12-12 22:08:34 +00:00
Mark Johnston
f74deabac8 Correct initialization of pc on powerpc.
PR:		224293
Submitted by:	Breno Leitao <breno.leitao@gmail.com>
X-MFC with:	r326774
Pointy hat:	markj
2017-12-12 20:41:11 +00:00
Mark Johnston
5bab623438 Pass the trap frame to fasttrap hooks.
The DTrace fasttrap entry points expect a struct reg containing the
register values of the calling thread. Perform the conversion in
fasttrap rather than in the trap handler: this reduces the number of
ifdefs and avoids wasting stack space for traps that don't involve
DTrace.

MFC after:	2 weeks
2017-12-11 19:21:39 +00:00
Warner Losh
f5b24e1c9f Mark two things as unused (since they are only sometimes used) and
toss in a DECONST to remove a const in some tricky code that would
require too extensive a change to unwind otherwise.

Sponsored by: Netflix
2017-12-03 04:55:33 +00:00
Warner Losh
1227a4f4ea Fix all warnings related to geli and ZFS support on x86.
Default WARNS to 0 still, since there's still some warnings on other
architectures.

Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D13301
2017-12-02 00:07:37 +00:00
Alan Somers
3613ea59c7 Fix assertion when ZFS fails to open certain devices
"panic: vdev_geom_close_locked: cp->private is NULL"
This panic will result if ZFS fails to open a device due to either of the
following reasons:

1) The device's sector size is greater than 8KB.
2) ZFS wants to open the device RW, but it can't be opened for writing.

The solution is to change the initialization order to ensure that the
assertion will be satisfied.

PR:		221066
Reported by:	David NewHamlet <wheelcomplex@gmail.com>
Reviewed by:	avg
MFC after:	3 weeks
Sponsored by:	Spectra Logic Corp
Differential Revision:	https://reviews.freebsd.org/D13278
2017-11-30 15:36:06 +00:00
Alan Somers
fb20566033 Revert r326399
Accidentally committed wrong file

Pointy hat to:	asomers
Sponsored by:	Spectra Logic Corp
2017-11-30 15:34:55 +00:00
Alan Somers
de65823b48 Fix assertion when ZFS fails to open certain devices
"panic: vdev_geom_close_locked: cp->private is NULL"
This panic will result if ZFS fails to open a device due to either of the
following reasons:

1) The device's sector size is greater than 8KB.
2) ZFS wants to open the device RW, but it can't be opened for writing.

The solution is to change the initialization order to ensure that the
assertion will be satisfied.

PR:		221066
Reported by:	David NewHamlet <wheelcomplex@gmail.com>
Reviewed by:	avg
MFC after:	3 weeks
Sponsored by:	Spectra Logic Corp
Differential Revision:	https://reviews.freebsd.org/D13278
2017-11-30 15:28:29 +00:00
Mark Johnston
0037455148 Don't use pcpu_find() to determine if a CPU ID is valid.
This addresses assertion failures after r326218.

MFC after:	1 week
2017-11-27 18:42:23 +00:00
Mark Johnston
e9f63df76d Duplicate helpers after disabling inherited tracepoints during a fork.
We may create probes in the nascent child process, so we first need to
ensure that any inherited tracepoints are first removed. Otherwise the
probe sites will not be in the state expected by fasttrap, and it won't
be able to enable the probes.

MFC after:	2 weeks
2017-11-23 14:29:07 +00:00
Justin Hibbits
efa8edd5bb PowerPC has 12 artificial frames for the profiler
It may need to be different between AIM and Book-E, this was tested only on
Book-E (64- and 32-bit)

MFC after:	3 weeks
2017-11-22 01:53:59 +00:00
Andriy Gapon
7bcc2cfc86 zfs_write: fix problem with writes appearing to succeed when over quota
The problem happens when the writes have offsets and sizes aligned with
a filesystem's recordsize (maximum block size).  In this scenario
dmu_tx_assign() would fail because of being over the quota, but the uio
would already be modified in the code path where we copy data from the
uio into a borrowed ARC buffer.  That makes an appearance of a partial
write, so zfs_write() would return success and the uio would be modified
consistently with writing a single block.

That bug can result in a data loss because the writes over the quota
would appear to succeed while the actual data is being discarded.

This commit fixes the bug by ensuring that the uio is not changed until
after all error checks are done.  To achieve that the code now uses
uiocopy() + uioskip() as in the original illumos design.  We can do that
now that uiocopy() has been updated in r326067 to use
vn_io_fault_uiomove().

Reported by:	mav
Analyzed by:	mav
Reviewed by:	mav
Pointyhat to:	avg (myself)
MFC after:	1 week
X-MFC after:	r326067
X-Erratum:	wanted
2017-11-21 18:28:14 +00:00
Andriy Gapon
bf71fe61af make illumos uiocopy use vn_io_fault_uiomove
uiocopy() is currently unused, its purpose is copy data from a uio
without modifying the uio.  It was in use before the vn_io_fault support
was added to ZFS, at which point our code diverged from the illumos code
a little bit.  Because ZFS is the only (potential) user of the function
we are free to modify it to better suit ZFS needs.

The intention behind this change is to remove the differences introduced
earlier in zfs_write().

While here, re-implement uioskip() using uiomove() with
uio_segflg == UIO_NOCOPY.
The story of uioskip is the same as with uiocopy.

Reviewed by:	mav
MFC after:	1 week
2017-11-21 18:01:43 +00:00
Mark Johnston
e9a2e17d1b Avoid holding the process in uread() and uwrite().
In general, higher-level code will atomically verify that the process
is not exiting and hold the process. In one case, we were using uwrite()
to copy a probed instruction to a per-thread scratch space block, but
copyout() can be used for this purpose instead; this change effectively
reverts r227291.

MFC after:	1 week
2017-11-16 07:25:12 +00:00
Baptiste Daroussin
c07d14deb5 remove the poor emulation of the IllumOS needfree global variable to prevent
the ARC reclaim thread running longer than needed.

Update the arc::needfree dtrace probe triggered in arc_lowmem() to also report
the value we may want to free.

Submitted by:	Nikita Kozlov <nikita.kozlov at blade-group.com>
Reviewed by:	avg
Approved by:	avg
MFC after:	3 weeks
Sponsored by:	blade
Differential Revision:	https://reviews.freebsd.org/D12163
2017-11-15 12:48:36 +00:00
Andriy Gapon
da1bfa506f MFV r325609: 7531 Assign correct flags to prefetched buffers
illumos/illumos-gate@2729521654
2729521654

https://www.illumos.org/issues/7531
  I found that some buffers that could be L2ARC eligible are not flagged
  such, leading to some performance impact.  As a test I ran the same IO
  workload 10 times in a raw.  It is a metadata only workload (files
  listing).  l2arc_noprefetch=0.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: benrubson <ben.rubson@gmail.com>

MFC after:	8 days
2017-11-09 18:22:42 +00:00
Andriy Gapon
885a5425f3 MFV r325607: 8607 zfs: variable set but not used
illumos/illumos-gate@b852c2f543
b852c2f543

https://www.illumos.org/issues/8607

Reviewed by: Yuri Pankov <yuripv@gmx.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Toomas Soome <tsoome@me.com>

MFC after:	1 week
2017-11-09 18:14:42 +00:00
Andriy Gapon
2e06836fee MFV r325605: 8713 Buffer overflow in dsl_dataset_name()
illumos/illumos-gate@f37ae9a714
f37ae9a714

https://www.illumos.org/issues/8713
  If we're creating a pool with version >= SPA_VERSION_DSL_SCRUB (v11) we need to
  account for additional space needed by the origin dataset which will also be
  snapshotted: "poolname"+"/"+"$ORIGIN"+"@"+"$ORIGIN".
  Enforce this limit in pool_namecheck().

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: loli10K <ezomori.nozomu@gmail.com>

MFC after:	1 week
2017-11-09 18:12:21 +00:00
Andriy Gapon
96ed2690df Disable posix_fallocate(2) for ZFS
The generic (naive) implementation of posix_fallocate cannot provide the
standard mandated guarantee that overwrites would never fail due to the lack
of free space.  The fundamental reason is the copy-on-write architecture
of ZFS.  Other features like compression and deduplication can also
increase the size difference between the (pre-)allocated dummy content
and the future content.

So, until ZFS can properly implement the feature it's better to report
that it is unsupported rather than providing an ersatz implementation.
Please note that EINVAL is used to report that the underlying file system
does not support the operation (POSIX.1-2008).

illumos and ZoL seem to do the same.

MFC after:	3 weeks
Sponsored by:	Panzura
2017-11-02 13:49:08 +00:00
Andriy Gapon
40d47bb4eb vdev_geom_close: close errored consumer even if vdev_reopening is set
If vdev_geom_close doesn't close the consumer, then the subsequent call
to vdev_geom_open() would be just a NOP and would always return success.
Thus, at present vdev_reopen() would always succeed for vdev_geom devices
even if the underlying provider is in error state.
The problem was introduced as a result of an optimization in rS308055.

The most significant manifistation of the problem is that
zio_vdev_io_done() --> vdev_probe() --> SPA_ASYNC_PROBE -->
spa_async_probe() --> vdev_reopen()
chain of calls and events becomes a NOP as well.
This chain is invoked when zio_vdev_io_done() detects an "unexpected"
error from the lower level I/O.
Additionally, that call path may race with SPA_ASYNC_REMOVE path because
of the asynchronous nature of them both.  So, the SPA_ASYNC_PROBE may
erroneously mark a vdev as being healthy after SPA_ASYNC_REMOVE marked
it as removed.

Reviewed by:	asomers, mav
MFC after:	2 weeks
Differential Revision: https://reviews.freebsd.org/D12731
2017-10-31 10:15:03 +00:00
Alan Somers
659058b06f Fix the error message when creating a zpool on a too-small device
Don't check for SPA_MINDEVSIZE in vdev_geom_attach when opening by path.
It's redundant with the check in vdev_open, and failing to attach here
results in the wrong error message being printed.  However, still check for
it in some other situations:

* When opening by guids, so we don't get bogged down reading from slow
  devices like floppy drives.
* In vdev_geom_read_pool_label for the same reason, because we iterate over
  all providers.
* If the caller requests that we verify the guid, because then we'll have to
  read from the device before vdev_open verifies the size.

PR:		222227
Reported by:	Marie Helene Kvello-Aune <marieheleneka@gmail.com>
Reviewed by:	avg, mav
MFC after:	3 weeks
Sponsored by:	Spectra Logic Corp
Differential Revision:	https://reviews.freebsd.org/D12531
2017-10-23 23:05:29 +00:00
Mateusz Guzik
5a17c5524f sdt: make all sdt probe sites test one variable
This saves on cache misses at the expense of a slight grow of .text.

Note this is a bandaid for lack of hotpatching.

Discussed with:	markj
2017-10-22 20:22:23 +00:00
Andriy Gapon
13bacc7144 remove spa_sync_on assert from spa_async_thread_vd
Unlike spa_async_thread that can get started only from spa_sync()
spa_async_thread_vd can get started from other contexts.
Additionally, spa_async_thread_vd does not really depend on
spa sync being enabled.

The incorrect assert could be triggered by importing a pool in the
read-only mode and then disconnecting one of its disks.
In this case spa_sync_on was false because the pool was read-only
and spa_async_thread_vd was started to handle SPA_ASYNC_REMOVE event.

Note: spa_async_thread_vd() currently exists only in FreeBSD, it was
split out of spa_async_thread() in r253990.

Discussed with:	mav
MFC after:	2 weeks
2017-10-19 16:36:07 +00:00
Andriy Gapon
1dce7acd37 illumos mutex_init: use SX_NEW instead of bzero
There should be no functional change, but SX_NEW seems to be more
idiomatic to the use-case.

MFC after:	2 weeks
X-MFC note:	stable/11 only
2017-10-09 07:44:09 +00:00
Andriy Gapon
e117882ba2 MFV r322235: 8067 zdb should be able to dump literal embedded block pointer
illumos/illumos-gate@4923c69fdd
4923c69fdd

FreeBSD note: the manual page is to be updated separately.

https://www.illumos.org/issues/8067
  Add an option to zdb to print a literal embedded block pointer supplied on the
  command line:
  zdb -E [-A] word0:word1:...:word15

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

MFC after:	3 weeks
2017-10-06 08:21:06 +00:00
Andriy Gapon
a4bc304da5 remove heuristic error detection from ddi_strto*()
Zero, <TYPE>_MIN and <TYPE>_MAX values can result from valid conversions.
They don't necessarily imply any error.
Since we do not have any reliable error signaling from libkern's strto*(),
it's better to always assume success rather than to report an error when
there is none.

Reviewed by:	tsoome
MFC after:	2 weeks
Differential Revision: https://reviews.freebsd.org/D12565
2017-10-05 12:25:18 +00:00
Andriy Gapon
fed20fc736 really unbreak kernel builds on sparc64 and powerpc64 after r324163, ZFS Channel Programs
This commit also reverts r324178 that did not fix the problem on powerpc64
where char is usigned.

MFC after:	4 weeks
X-MFC with:	r324163
2017-10-05 06:39:57 +00:00
Andriy Gapon
620c2c801b MFV r323913: 8600 ZFS channel programs - snapshot
illumos/illumos-gate@2840dce1a0
2840dce1a0

https://www.illumos.org/issues/8600
  ZFS channel programs should be able to create snapshots.
  In addition to the base snapshot functionality, this will likely entail adding
  extra logic to handle edge cases which were formerly not possible, such as
  creating then destroying a snapshot in the same transaction sync.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Chris Williamson <chris.williamson@delphix.com>

MFC after:	5 weeks
X-MFC after:	r324163
2017-10-02 11:32:08 +00:00
Andriy Gapon
a1f65a15ce MFV r323912: 8592 ZFS channel programs - rollback
illumos/illumos-gate@000cce6b6f
000cce6b6f

https://www.illumos.org/issues/8592
  ZFS channel programs should be able to perform a rollback. This logic will
  probably look pretty similar to zfs.sync.destroy().

Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Brad Lewis <brad.lewis@delphix.com>

MFC after:	5 weeks
X-MFC after:	r324163
2017-10-02 11:23:31 +00:00
Andriy Gapon
8cdc315e9e MFV r323795: 8604 Avoid unnecessary work search in VFS when unmounting snapshots
illumos/illumos-gate@ed992b0aac
ed992b0aac

https://www.illumos.org/issues/8604
  Every time we want to unmount a snapshot (happens during snapshot deletion or
  renaming) we unnecessarily iterate through all the mountpoints in the VFS layer
  (see zfs_get_vfs).
  Ideally we would just put a hold on the snapshot and access its respective VFS
  resource directly.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>

FreeBSD note: I added a FreeBSD specific function getzfsvfs_ref() which
is like getzfsvfs() but returns a filesystem referenced, not busied.
We want a busied filesystem in most cases, because we access its private
data and, thus, we need to prevent the filesystem from being unmounted
and its private data destroyed.  But in some cases we can either get
away with just a referenced filesystem or we must not busy the
filesystem.  Unmounting the filesystem is one of such cases.

MFC after:	5 weeks
X-MFC after:	r324163
2017-10-02 11:15:32 +00:00
Andriy Gapon
55b73eb9c5 fix incorrect use of getzfsvfs_impl in r324163, ZFS Channel Programs
getzfsvfs_impl() returns a referenced, not busied, filesystem,
so the matching call is vfs_rel, not vfs_unbusy.

MFC after:	4 weeks
X-MFC with:	r324163
2017-10-02 11:07:48 +00:00
Andriy Gapon
6a2f82bdf8 unbreak kernel builds on sparc64 and powerpc after r324163, ZFS Channel Programs
The custom iscntrl() in ZFS Lua code expects a signed argumnet, so
remove the harmful cast.

Reported by:	ian
MFC after:	5 weeks
X-MFC with:	r324163
2017-10-01 20:12:30 +00:00
Andriy Gapon
3e52a05570 MFV r323794: 8605 zfs channel programs: zfs.exists undocumented and non-working
illumos/illumos-gate@5f39f884e2
5f39f884e2

https://www.illumos.org/issues/8605
  zfs.exists() in channel programs doesn't return any result, and should have a
  man page entry.

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Chris Williamson <chris.williamson@delphix.com>

MFC after:	5 weeks
X-MFC after:	r324163
2017-10-01 16:51:05 +00:00
Andriy Gapon
529b326e63 MFV r323531: 8521 nvlist memory leak in get_clones_stat() and spa_load_best()
illumos/illumos-gate@7d3000f774
7d3000f774

https://www.illumos.org/issues/8521
  Yuri reported this to the mailing list:
  doing a `reboot -d` on current illumos-gate HEAD gives the following "::
  findleaks -dv" output:
  findleaks: maximum buffers => 301061
  findleaks: actual buffers => 297587
  findleaks:
  findleaks: potential pointers => 29289774
  findleaks: dismissals => 26242305 (89.5%)
  findleaks: misses => 331153 ( 1.1%)
  findleaks: dups => 2419681 ( 8.2%)
  findleaks: follows => 296635 ( 1.0%)
  findleaks:
  findleaks: peak memory usage => 7353 kB
  findleaks: elapsed CPU time => 1.5 seconds
  findleaks: elapsed wall time => 2.0 seconds
  findleaks:
  CACHE LEAKED BUFCTL CALLER
  ffffff03d222b008 120 ffffff03ef7ceb78 nv_alloc_sys+0x1f
  ffffff03d222a448 123 ffffff03f4150cc8 nv_alloc_sys+0x1f
  ffffff03d222b448 5 ffffff03f28bd598 nv_alloc_sys+0x1f
  ffffff03d222b888 87 ffffff03f28c10f0 nv_alloc_sys+0x1f
  ffffff03d222c008 21 ffffff03f4139310 nv_alloc_sys+0x1f
  ffffff03d222b888 43 ffffff040ef3f3e8 nv_alloc_sys+0x1f
  ffffff03d222c008 120 ffffff03f4591e58 nv_alloc_sys+0x1f
  ffffff03d222b008 121 ffffff03f352c068 nv_alloc_sys+0x1f
  ffffff03d222a448 112 ffffff03f414e5f8 nv_alloc_sys+0x1f
  ffffff03d222b008 119 ffffff03ee92fdc0 nv_alloc_sys+0x1f
  ffffff03d222b888 46 ffffff03f28c1378 nv_alloc_sys+0x1f
  ffffff03d222b448 4 ffffff03f28c7708 nv_alloc_sys+0x1f
  ffffff03d222c008 20 ffffff03f2a6e7e8 nv_alloc_sys+0x1f

Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Yuri Pankov <yuripv@gmx.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>

MFC after:	5 weeks
X-MFC after:	r324163
2017-10-01 16:41:05 +00:00
Andriy Gapon
550374efe6 revert r324166, it has an unrelated change in it 2017-10-01 16:37:54 +00:00
Andriy Gapon
7d5c6491f0 MFV r323531: 8521 nvlist memory leak in get_clones_stat() and spa_load_best()
illumos/illumos-gate@7d3000f774
7d3000f774

https://www.illumos.org/issues/8521
  Yuri reported this to the mailing list:
  doing a `reboot -d` on current illumos-gate HEAD gives the following "::
  findleaks -dv" output:
  findleaks: maximum buffers => 301061
  findleaks: actual buffers => 297587
  findleaks:
  findleaks: potential pointers => 29289774
  findleaks: dismissals => 26242305 (89.5%)
  findleaks: misses => 331153 ( 1.1%)
  findleaks: dups => 2419681 ( 8.2%)
  findleaks: follows => 296635 ( 1.0%)
  findleaks:
  findleaks: peak memory usage => 7353 kB
  findleaks: elapsed CPU time => 1.5 seconds
  findleaks: elapsed wall time => 2.0 seconds
  findleaks:
  CACHE LEAKED BUFCTL CALLER
  ffffff03d222b008 120 ffffff03ef7ceb78 nv_alloc_sys+0x1f
  ffffff03d222a448 123 ffffff03f4150cc8 nv_alloc_sys+0x1f
  ffffff03d222b448 5 ffffff03f28bd598 nv_alloc_sys+0x1f
  ffffff03d222b888 87 ffffff03f28c10f0 nv_alloc_sys+0x1f
  ffffff03d222c008 21 ffffff03f4139310 nv_alloc_sys+0x1f
  ffffff03d222b888 43 ffffff040ef3f3e8 nv_alloc_sys+0x1f
  ffffff03d222c008 120 ffffff03f4591e58 nv_alloc_sys+0x1f
  ffffff03d222b008 121 ffffff03f352c068 nv_alloc_sys+0x1f
  ffffff03d222a448 112 ffffff03f414e5f8 nv_alloc_sys+0x1f
  ffffff03d222b008 119 ffffff03ee92fdc0 nv_alloc_sys+0x1f
  ffffff03d222b888 46 ffffff03f28c1378 nv_alloc_sys+0x1f
  ffffff03d222b448 4 ffffff03f28c7708 nv_alloc_sys+0x1f
  ffffff03d222c008 20 ffffff03f2a6e7e8 nv_alloc_sys+0x1f

Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Yuri Pankov <yuripv@gmx.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>

MFC after:	5 weeks
X-MFC after:	r324163
2017-10-01 16:34:16 +00:00
Andriy Gapon
bda88d07d9 MFV r323530,r323533,r323534: 7431 ZFS Channel Programs, and followups
7431 ZFS Channel Programs

illumos/illumos-gate@dfc115332c
dfc115332c

https://www.illumos.org/issues/7431
  ZFS channel programs (ZCP) adds support for performing compound ZFS
  administrative actions via Lua scripts in a sandboxed environment (with time
  and memory limits).
  This initial commit includes both base support for running ZCP scripts, and a
  small initial library of API calls which support getting properties and
  listing, destroying, and promoting datasets.
  Testing: in addition to the included unit tests, channel programs have been in
  use at Delphix for several months for batch destroying filesystems. The
  dsl_destroy_snaps_nvl() call has also been replaced with

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Chris Williamson <chris.williamson@delphix.com>

8552 ZFS LUA code uses floating point math

illumos/illumos-gate@916c8d8811
916c8d8811

https://www.illumos.org/issues/8552
  In the LUA interpreter used by "zfs program", the lua format() function
  accidentally includes support for '%f' and friends, which can cause compilation
  problems when building on platforms that don't support floating-point math in
  the kernel (e.g. sparc). Support for '%f' friends (%f %e %E %g %G) should be
  removed, since there's no way to supply a floating-point value anyway (all
  numbers in ZFS LUA are int64_t's).

Reviewed by: Yuri Pankov <yuripv@gmx.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

8590 memory leak in dsl_destroy_snapshots_nvl()

illumos/illumos-gate@e6ab4525d1
e6ab4525d1

https://www.illumos.org/issues/8590
  In dsl_destroy_snapshots_nvl(), "snaps_normalized" is not freed after it is
  added to "arg".

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

FreeBSD notes:
- zfs-program.8 manual page is taken almost as is from the vendor repository,
  no FreeBSD-ification done
- fixed multiple instances of NULL being used where an integer is expected
- replaced ETIME and ECHRNG with ETIMEDOUT and EDOM respectively

This commit adds a modified version of Lua 5.2.4 under
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/lua, mirroring the
upstream.  See README.zfs in that directory for the description of Lua
customizations.
See zfs-program.8 on how to use the new feature.

MFC after:	5 weeks
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D12528
2017-10-01 16:11:07 +00:00
Mark Johnston
47f11baaca Use C99 initializers for DTrace provider methods.
This makes the definitions easier to read and more cscope-friendly.

MFC after:	1 week
2017-09-27 17:46:38 +00:00
Andriy Gapon
443efc868c fix r324011, MFV of r323535, 8585 improve batching done in zil_commit()
I managed to commit an older version of the change.
Plus, even the latest version was not ready for userland compilation.

Reported by:	"O. Hartmann" <ohartmann@walstatt.org>,
		cy
MFC after:	1 week
X-MFC with:	r324011
2017-09-26 15:38:16 +00:00
Andriy Gapon
c13f1d82c8 MFV r323535: 8585 improve batching done in zil_commit()
FreeBSD notes:
- this MFV reverts FreeBSD commit r314549 to make the merge easier
- at present our emulation of cv_timedwait_hires is rather poor,
  so I elected to use cv_timedwait_sbt directly
Please see the differential revision for details.
Unfortunately, I did not get any positive reviews, so there could be
bugs in the FreeBSD-specific piece of the merge.
Hence, the long MFC timeout.

illumos/illumos-gate@1271e4b10d
1271e4b10d

https://www.illumos.org/issues/8585
  The current implementation of zil_commit() can introduce significant
  latency, beyond what is inherent due to the latency of the underlying
  storage. The additional latency comes from two main problems:
  1. When there's outstanding ZIL blocks being written (i.e. there's
      already a "writer thread" in progress), then any new calls to
      zil_commit() will block waiting for the currently oustanding ZIL
      blocks to complete. The blocks written for each "writer thread" is
      coined a "batch", and there can only ever be a single "batch" being
      written at a time. When a batch is being written, any new ZIL
      transactions will have to wait for the next batch to be written,
      which won't occur until the current batch finishes.
  As a result, the underlying storage may not be used as efficiently
      as possible. While "new" threads enter zil_commit() and are blocked
      waiting for the next batch, it's possible that the underlying
      storage isn't fully utilized by the current batch of ZIL blocks. In
      that case, it'd be better to allow these new threads to generate
      (and issue) a new ZIL block, such that it could be serviced by the
      underlying storage concurrently with the other ZIL blocks that are
      being serviced.
  2. Any call to zil_commit() must wait for all ZIL blocks in its "batch"
      to complete, prior to zil_commit() returning. The size of any given
      batch is proportional to the number of ZIL transaction in the queue
      at the time that the batch starts processing the queue; which
      doesn't occur until the previous batch completes. Thus, if there's a
      lot of transactions in the queue, the batch could be composed of
      many ZIL blocks, and each call to zil_commit() will have to wait for
      all of these writes to complete (even if the thread calling
      zil_commit() only cared about one of the transactions in the batch).

Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>

MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D12355
2017-09-26 11:04:08 +00:00
Ian Lepore
a78b4d1462 Use nstosbt() instead of multiplying by SBT_1NS to avoid roundoff errors.
Differential Revision:	https://reviews.freebsd.org/D11779
2017-09-25 15:03:27 +00:00
Andriy Gapon
f94aa61c33 MFV r323917: 8648 Fix range locking in ZIL commit codepath
illumos/illumos-gate@42b1411172
42b1411172

https://www.illumos.org/issues/8648
  I'm opening this bug to track integration of the following ZFS on Linux
  commit into illumos:

  commit f763c3d1df569a8d6b60bcb5e95cf07aa7a189e6
  Author: LOLi <loli10K@users.noreply.github.com>
  Date:   Mon Aug 21 17:59:48 2017 +0200

      Fix range locking in ZIL commit codepath

      Since OpenZFS 7578 (1b7c1e5) if we have a ZVOL with logbias=throughput
      we will force WR_INDIRECT itxs in zvol_log_write() setting itx->itx_lr
      offset and length to the offset and length of the BIO from
      zvol_write()->zvol_log_write(): these offset and length are later used
      to take a range lock in zillog->zl_get_data function: zvol_get_data().

      Now suppose we have a ZVOL with blocksize=8K and push 4K writes to
      offset 0: we will only be range-locking 0-4096. This means the
      ASSERTion we make in dbuf_unoverride() is no longer valid because now
      dmu_sync() is called from zilog's get_data functions holding a partial
      lock on the dbuf.

      Fix this by taking a range lock on the whole block in zvol_get_data().

      Reviewed-by: Chunwei Chen <tuxoko@gmail.com>
      Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
      Signed-off-by: loli10K <ezomori.nozomu@gmail.com>

Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Alexander Motin <mav@FreeBSD.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: LOLi <loli10K@users.noreply.github.com>

MFC after:	10 days
2017-09-22 08:27:27 +00:00
Andriy Gapon
1c6ea90df5 MFV r323914: 8661 remove "zil-cw2" dtrace probe
illumos/illumos-gate@bd9d3f9046
bd9d3f9046

https://www.illumos.org/issues/8661
  The "zil-cw1" dtrace probe was previously removed in 8558, and the "zil-cw2"
  probe should have been removed in that patch as well. Unfortunately, the "zil-
  cw2" was not removed in 8558, so this bug is to track it's removal.

Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>

MFC after:	1 week
2017-09-22 08:21:14 +00:00
Alan Somers
cd037f075c MFV r323789: 8473 scrub does not detect errors on active spares
illumos/illumos-gate@554675eee7
554675eee7

https://www.illumos.org/issues/8473
Scrubbing is supposed to detect and repair all errors in the pool. However,
it wrongly ignores active spare devices. The problem can easily be
reproduced in OpenZFS at git rev 0ef125d with these commands:

truncate -s 64m /tmp/a /tmp/b /tmp/c
sudo zpool create testpool mirror /tmp/a /tmp/b spare /tmp/c
sudo zpool replace testpool /tmp/a /tmp/c
/bin/dd if=/dev/zero bs=1024k count=63 oseek=1 conv=notrunc of=/tmp/c
sync
sudo zpool scrub testpool
zpool status testpool # Will show 0 errors, which is wrong
sudo zpool offline testpool /tmp/a
sudo zpool scrub testpool
zpool status testpool # Will show errors on /tmp/c,
		      # which should've already been fixed

FreeBSD head is partially affected: the first scrub will detect some errors, but the second scrub will detect more.

Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

MFC after:	1 week
Sponsored by:	Spectra Logic Corp
2017-09-20 16:31:00 +00:00
Andriy Gapon
aacd0b4bb2 add vfs_zfs.abd_chunk_size tunable
It is reported that the default value of 4KB results in a substantial
memory use overhead (at least, on some configurations).  Using 1KB seems
to reduce the overhead significantly.

PR:		222377
Reported by:	Sean Chittenden <sean@chittenden.org>
MFC after:	1 week
2017-09-20 08:36:31 +00:00
Andriy Gapon
3d5487d981 fix memory leak in g_bio zone introduced in r320452, another ABD fallout
I overlooked the fact that that ZIO_IOCTL_PIPELINE does not include
ZIO_STAGE_VDEV_IO_DONE stage.  We do allocate a struct bio for an ioctl
zio (a disk cache flush), but we never freed it.

This change splits bio handling into two groups, one for normal
read/write i/o that passes data around and, thus, needs the abd data
tranform; the other group is for "data-less" i/o such as trim and cache
flush.

PR:		222288
Reported by:	Dan Nelson <dnelson@allantgroup.com>
Tested by:	Borja Marcos <borjam@sarenet.es>
MFC after:	10 days
2017-09-20 08:27:21 +00:00
Andriy Gapon
8c9377cde7 MFV r323792: 8602 remove unused "dp_early_sync_tasks" field from "dsl_pool" structure
illumos/illumos-gate@2bcb545854
2bcb545854

https://www.illumos.org/issues/8602
  When I landed the fix for 8558, I incorrectly added the "dp_early_sync_tasks"
  field to the "dsl_pool" structure. This field is used in DelphixOS, but not in
  illumos. It was incorrectly pulled into illumos, so this bug is to remove it
  from the structure.

Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>

MFC after:	1 week
2017-09-20 07:26:52 +00:00
Andriy Gapon
cbc785c293 dounmount: do not release the mount point's reference on the covered vnode
As long as mnt_ref is not zero there can be a consumer that might try
to access mnt_vnodecovered.  For this reason the covered vnode must not
be freed until mnt_ref goes to zero.
So, move the release of the covered vnode to vfs_mount_destroy.

Reviewed by:	kib
MFC after:	3 weeks
Differential Revision: https://reviews.freebsd.org/D12329
2017-09-14 08:47:06 +00:00
Andriy Gapon
86261a95ed slightly simplify zfs_vptocnp
It's not necessary to look up the parent's ID to check if the node is
the root node of the filesystem.

MFC after:	2 weeks
2017-09-13 07:09:58 +00:00
Toomas Soome
b40aaca6dd loader should support large_dnode
The zfsonlinux feature large_dnode is not yet supported by the loader.

Reviewed by:	avg, allanjude
Differential Revision:	https://reviews.freebsd.org/D12288
2017-09-12 13:45:04 +00:00
Andriy Gapon
1a2ddb2997 fix a fallout from the ZTOV tightening, r323479
MFC after:	13 days
X-MFC with:	r323479
2017-09-12 13:21:14 +00:00
Andriy Gapon
bcab65cab5 zfsctl_snapdir_lookup should be able to handle an uncovered vnode
The uncovered vnode is possible because there is no guarantee that
its hold count would go to zero (and it would be inactivated and reclaimed)
immediately after a covering filesystem is unmounted.
So, such a vnode should be expected and it is possible to re-use it
without any trouble.

MFC after:	3 weeks
Sponsored by:	Panzura
2017-09-12 06:06:58 +00:00
Andriy Gapon
c09d0da8d1 zfs_ctldir: remove obsolete / bogus ARGSUSED lint directives
None of the tagged functions had unused parameters.

MFC after:	1 week
2017-09-12 06:05:30 +00:00
Andriy Gapon
65b38f7311 zfsvfs_hold: assert that the busied filesystem can not be unmounted
This is a FreeBSD specific feature.

MFC after:	3 weeks
Sponsored by:	Panzura
2017-09-12 06:04:50 +00:00
Andriy Gapon
d092f79489 zfs_get_vfs: reference a requested filesystem instead of vfs_busy-ing it
The only consumer of zfs_get_vfs, zfs_unmount_snap, does not need
the filesystem to be busy, it just need a reference that it can pass
to dounmount.

Also, previously the code was racy as it unbusied the filesystem
before taking a reference on it.

Now the code should be simpler and safer.

MFC after:	2 weeks
Sponsored by:	Panzura
2017-09-12 06:04:01 +00:00
Andriy Gapon
f7519dbb76 zfs: tighten debug versions of ZTOV and VTOZ
MFC after:	2 weeks
Sponsored by:	Panzura
2017-09-12 06:02:21 +00:00
Andriy Gapon
970165f190 MFV r323111: 8569 problem with inline functions in abd.h
illumos/illumos-gate@37e84ab74e
37e84ab74e

https://www.illumos.org/issues/8569
  C [C99] has peculiar rules for inline functions that are different from the
  C++ rules.  Unlike C++ where inline is "fire and forget", in C a programmer
  must pay attention to the function's storage class / visibility.  The main
  problem is with the case where a compiler decides to not inline a call to the
  function declared as inline.
  Some relevant links:
  - http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.faqs/ka15831.html
  - http://www.drdobbs.com/the-new-c-inline-functions/184401540
  The summary is that either the inline functions should be declared 'static
  inline' or one of the compilation units (.c files) must provide a callable
  externally visible function definition.  In the former case, the compiler would
  automatically create a local non-inlined function instance in every compilation
  unit where it's needed.  In the latter case the single external definition is
  used to satisfy any non-inlined calls in all compilation units.  As things
  stand right now, we can get an undefined reference error under certain
  combinations of compilers and compiler options.  For example, this is what I
  get on FreeBSD when compiling with clang 4.0.0 and -O1:
    In function `abd_free': /usr/src/sys/cddl/contrib/opensolaris/uts/common/fs/zfs/abd.c:385:
    undefined reference to `abd_is_linear'

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>

MFC after:	1 week
2017-09-11 12:15:49 +00:00
Andriy Gapon
25625d8746 Revert r322601, Mark ZFS ABD inline functions static
An alternative fix is to be merged from illumos shortly.
2017-09-11 12:08:20 +00:00
Andriy Gapon
3d9a0e564d MFV r323110: 8558 lwp_create() returns EAGAIN on system with more than 80K ZFS filesystems
illumos/illumos-gate@216d7723a1
216d7723a1

https://www.illumos.org/issues/8558
  On a system with more than 80K ZFS filesystems, we've seen cases where
  lwp_create() will start to fail by returning EAGAIN. The problem being,
  for each of those 80K ZFS filesystems, a taskq will be created for each
  dataset as part of the ZIL for each dataset.
  For each of these taskq's, a kernel thread will be created which results
  in 24KB being allocated for each thread. With enough of these 24KB
  allocations, we eventually exhaust the memory region set aside for these
  allocations. Currently, segkpsize is set to a value of 2GB, which means
  we can only support about 80K filesystems; 2GB / 24KB = ~80K.
  The lwp_create() failure comes into play due to the fact that LWP
  creation also allocates 24KB from this same region of memory. Thus, if
  we've exhausted this region of memory due to the number of ZIL taskq's,
  there won't be any memory avaible to allow the call to lwp_create() to
  succeed.

FreeBSD note: I haven't created sysctl-s for the new ZIL clean
parameters.  Let's add them if anyone requires to tune them.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>
MFC after:	3 weeks
2017-09-11 11:31:43 +00:00
Andriy Gapon
90354c3200 MFV r323107: 8414 Implemented zpool scrub pause/resume
illumos/illumos-gate@1702cce751
1702cce751

FreeBSD note:  rather than merging the zpool.8 update I copied the zpool
scrub section from the illumos zpool.1m to FreeBSD zpool.8 almost
verbatim.  Now that the illumos page uses the mdoc format, it was an
easier option.  Perhaps the change is not in perfect compliance with the
FreeBSD style, but I think that it is acceptible.

https://www.illumos.org/issues/8414
  This issue tracks the port of scrub pause from ZoL: https://github.com/zfsonlinux/zfs/pull/6167
  Currently, there is no way to pause a scrub. Pausing may be useful when
  the pool is busy with other I/O to preserve bandwidth.

  Description

  This patch adds the ability to pause and resume scrubbing.  This is achieved
  by maintaining a persistent on-disk scrub state.  While the state is 'paused'
  we do not scrub any more blocks.  We do however perform regular scan
  housekeeping such as freeing async destroyed and deadlist blocks while paused.

  Motivation and Context

  Scrub pausing can be an I/O intensive operation and people have been asking
  for the ability to pause a scrub for a while. This allows one to preserve scrub
  progress while freeing up bandwidth for other I/O.

Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Alek Pinchuk <apinchuk@datto.com>

MFC after:	2 weeks
2017-09-09 11:00:07 +00:00
Kurt Lidl
a8273e4371 Enable dtrace support for mips64 and the ERL kernel config
Turn on the required options in the ERL config file, and ensure
that the fbt module is listed as a dependency for mips in
the modules/dtrace/dtraceall/dtraceall.c file.

PR: 		220346
Reviewed by:	gnn, markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D12227
2017-09-06 03:19:52 +00:00
Alan Somers
0df3d85af4 Fix remounting ZFS filesystem with "zfs mount"
"zfs mount -o" passes a list of mount options directly to nmount(2) after
sanity checking them. In particular, zfs(8) will refuse to mount an already
existing file system unless "remount" is specified in the option list.
However, the "remount" option only exists in Illumos. FreeBSD's equivalent is
"update".

PR:		221985
Reviewed by:	avg
MFC after:	3 weeks
Sponsored by:	Spectra Logic Corp
Differential Revision:	https://reviews.freebsd.org/D12233
2017-09-05 19:40:04 +00:00
Baptiste Daroussin
91f1caeccb Add sysctls for arc shrinking and growing values
The default value for arc_no_grow_shift may not be optimal when using
several GiB ARC. Expose it via sysctl allows users to tune it easily.

Also expose arc_grow_retry via sysctl for the same reason. The default value of
60s might, in case of intensive load, be too long.

Submitted by:	Nikita Kozlov <nikita.kozlov@blade-group.com>
Reviewed by:	mav, manu, bapt
MFC after:	2 weeks
Sponsored by:	blade
Differential Revision:	https://reviews.freebsd.org/D12144
2017-08-31 13:02:17 +00:00
Ed Maste
3c3d2ba6fe zfs: do not advertise edonr which is not yet supported
illumos 4185 ("add new cryptographic checksums to ZFS: SHA-512,
Skein, Edon-R") was intentionally merged only partially in r289422,
without adding support for skein, sha512 and edonr on FreeBSD.

Support for skein and sha512 was added later on, but edonr is still not
implemented in FreeBSD.

Prior to this commit zfs(8) correctly rejected edonr, but with an error
message that claimed support:

fk@r500 ~ $zfs set checksum=edonr tank
cannot set property for 'tank': 'checksum' must be one of 'on | off | fletcher2 | fletcher4 | sha256 | sha512 | skein | edonr'

PR:		204055
Submitted by:	Fabian Keil
Approved by:	allanjude
Obtained from:	ElectroBSD
MFC after:	1 week
2017-08-29 22:24:22 +00:00
John Baldwin
ac3b479ec8 Add a guard around _ILP32 for mips.
This is already done for other architectures in this file and fixes the
build with clang.
2017-08-21 17:45:06 +00:00
John Baldwin
b99836cea9 Mark ZFS ABD inline functions static.
When built with -fno-inline-functions zfs.ko contains undefined references
to these functions if they are only marked inline.

Reviewed by:	avg (earlier version)
MFC after:	1 week
Sponsored by:	Chelsio Communications
2017-08-16 23:40:32 +00:00
Alan Somers
69b14f7acd Fix some ZFS debugging messages
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
	Be more careful about the use of provider names vs vdev names in
	ZFS_LOG statements.

MFC after:	3 weeks
Sponsored by:	Spectra Logic Corp
2017-08-15 15:20:04 +00:00
Andriy Gapon
984d43cca5 MFV r322242: 8373 TXG_WAIT in ZIL commit path
illumos/illumos-gate@d28671a3b0
d28671a3b0

https://www.illumos.org/issues/8373
  The code that writes ZIL blocks uses dmu_tx_assign(TXG_WAIT) to assign
  a transaction to a transaction group.  That seems to be logically
  incorrect as writing of the ZIL block does not introduce any new dirty
  data.  Also, when there is a lot of dirty data, the call can introduce
  significant delays into the ZIL commit path, thus affecting all
  synchronous writes. Additionally, ARC throttling may affect the ZIL
  writing.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>

MFC after:	2 weeks
2017-08-08 11:26:03 +00:00
Andriy Gapon
2653426e89 MFV r322240: 8491 uberblock on-disk padding to reserve space for smoothly merging zpool checkpoint & MMP in ZFS
illumos/illumos-gate@79c2b812ee
79c2b812ee

https://www.illumos.org/issues/8491
  The zpool checkpoint feature in DxOS added a new field in the uberblock.
  The Multi-Modifier Protection Pull Request from ZoL adds two new fields in the
  uberblock (Reference: https://github.com/zfsonlinux/zfs/pull/6279).
  As these two changes come from two different sources and once upstreamed and
  deployed will introduce an incompatibility with each other we want
  to upstream a change that will reserve the padding for both of them so
  integration goes smoothly and everyone gets both features.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Olaf Faaland <faaland1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>

MFC after:	3 weeks
2017-08-08 11:21:58 +00:00
Andriy Gapon
6f2f8727e3 MFV r322238: 7915 checks in l2arc_evict could use some cleaning up
illumos/illumos-gate@267ae6c3a8
267ae6c3a8

https://www.illumos.org/issues/7915
  l2arc_evict() is strictly serialized with respect to
  l2arc_write_buffers() and l2arc_write_done().  Normally, l2arc_evict()
  and l2arc_write_buffers() are called from the same thread, so they can
  not be concurrent.  Also, l2arc_write_buffers() uses zio_wait() on the
  parent zio of all cache zio-s.  That ensures that l2arc_write_done()
  is completed before l2arc_write_buffers() returns.  Finally, if a
  cache device is removed, then l2arc_evict() is called under SCL_ALL in
  the exclusive mode.  That ensures that it can not be concurrent with
  the normal L2ARC accesses to the device (including writing and
  evicting buffers).  Given the above, some checks and actions in
  l2arc_evict() do not make sense.  For instance, it must never
  encounter the write head header let alone remove it from the buffer
  list.

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Andriy Gapon <avg@FreeBSD.org>

MFC after:	2 weeks
2017-08-08 11:19:14 +00:00
Andriy Gapon
3cf2ea1aea MFV r322236: 8126 ztest assertion failed in dbuf_dirty due to dn_nlevels changing
illumos/illumos-gate@dcb6872c56
dcb6872c56

https://www.illumos.org/issues/8126
  The sync thread is concurrently modifying dn_phys->dn_nlevels
  while dbuf_dirty() is trying to assert something about it, without
  holding the necessary lock. We need to move this assertion further down
  in the function, after we have acquired the dn_struct_rwlock.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

MFC after:	2 weeks
2017-08-08 11:14:40 +00:00
Andriy Gapon
98a9c8e68a zfs: no need for __DECONST after abd constification in r322233
Note that vdev_label_write_pad2() is FreeBSD specific.

MFC after:	2 weeks
X-MFC after:	r322233
2017-08-08 11:07:34 +00:00
Andriy Gapon
b9a4f29445 MFV r322232: 8426 mark immutable buffer arguments as such in abd.h
illumos/illumos-gate@9b195260e2
9b195260e2

https://www.illumos.org/issues/8426
  abd_copy_from_buf and abd_cmp_buf do not modify their void *buf arguments, so
  qualify them with const.
  abd_copy_from_buf_off and abd_cmp_buf_off already had that type for the
  corresponding arguments.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>

MFC after:	2 weeks
2017-08-08 10:59:18 +00:00
Andriy Gapon
9c48e95dd9 MFV r322229: 7600 zfs rollback should pass target snapshot to kernel
illumos/illumos-gate@77b171372e
77b171372e

https://www.illumos.org/issues/7600
  At present, the kernel side code seems to blindly rollback to whatever happens
  to be the latest snapshot at the time when the rollback task is processed.
  The expected target's name should be passed to the kernel driver and the sync
  task should validate that the target exists and that it is the latest snapshot
  indeed.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>

MFC after:	3 weeks
2017-08-08 10:52:01 +00:00
Andriy Gapon
8605a08bd2 MFV r322227: 8377 Panic in bookmark deletion
illumos/illumos-gate@42418f9e73
42418f9e73

https://www.illumos.org/issues/8377
  The problem is that when dsl_bookmark_destroy_check() is executed from open
  context (the pre-check), it fills in dbda_success based on the existence of the
  bookmark.
  But the bookmark (or containing filesystem as in this case) can be destroyed
  before we get to syncing context. When we re-run dsl_bookmark_destroy_check()
  in syncing
  context, it will not add the deleted bookmark to dbda_success, intending for
  dsl_bookmark_destroy_sync() to not process it. But because the bookmark is
  still in dbda_success
  from the open-context call, we do try to destroy it.
  The fix is that dsl_bookmark_destroy_check() should not modify dbda_success
  when called from open context.

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

MFC after:	2 weeks
2017-08-08 10:48:52 +00:00
Andriy Gapon
b4e4140d13 MFV r322223: 8378 crash due to bp in-memory modification of nopwrite block
illumos/illumos-gate@b7edcb9408
b7edcb9408

https://www.illumos.org/issues/8378
  The problem is that zfs_get_data() supplies a stale zgd_bp to dmu_sync(), which
  we then nopwrite against.
  zfs_get_data() doesn't hold any DMU-related locks, so after it copies db_blkptr
  to zgd_bp, dbuf_write_ready()
  could change db_blkptr, and dbuf_write_done() could remove the dirty record.
  dmu_sync() then sees the stale
  BP and that the dbuf it not dirty, so it is eligible for nop-writing.
  The fix is for dmu_sync() to copy db_blkptr to zgd_bp after acquiring the
  db_mtx. We could still see a stale
  db_blkptr, but if it is stale then the dirty record will still exist and thus
  we won't attempt to nopwrite.

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

MFC after:	2 weeks
2017-08-08 10:46:51 +00:00
Andriy Gapon
c6fb364293 MFV r322221: 7910 l2arc_write_buffers() may write beyond target_sz
FreeBD note: the essence of this change was committed to FreeBSD in
r314274.  This commit catches up with differences between what was
committed to FreeBSD and what was committed to OpenZFS, mainly more
logical variable names.

illumos/illumos-gate@16a7e5ac11
16a7e5ac11

https://www.illumos.org/issues/7910
  It seems that the change in issue #6950 resurrected the problem that was
  earlier fixed by the change in issue #5219.
  Please also see the following FreeBSD bug report:
  https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=216178

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>

MFC after:	2 weeks
2017-08-08 10:43:41 +00:00
Ruslan Bukin
ca20f8ec29 o Replace __riscv__ with __riscv
o Replace __riscv64 with (__riscv && __riscv_xlen == 64)

This is required to support new GCC 7.1 compiler.
This is compatible with current GCC 6.1 compiler.

RISC-V is extensible ISA and the idea here is to have built-in define
per each extension, so together with __riscv we will have some subset
of these as well (depending on -march string passed to compiler):

__riscv_compressed
__riscv_atomic
__riscv_mul
__riscv_div
__riscv_muldiv
__riscv_fdiv
__riscv_fsqrt
__riscv_float_abi_soft
__riscv_float_abi_single
__riscv_float_abi_double
__riscv_cmodel_medlow
__riscv_cmodel_medany
__riscv_cmodel_pic
__riscv_xlen

Reviewed by:	ngie
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D11901
2017-08-07 14:09:57 +00:00
Andriy Gapon
f6040f9e8e spa_import_rootpool should be able to handle an imported root pool
That is required to support reboot -r with a new root filesystem being
on an already imported pool.

PR:		210721
Reported by:	Jan Bramkamp <crest_maintainer@rlwinm.de>
MFC after:	2 weeks
2017-07-25 13:17:06 +00:00
Ed Maste
5c12f7c3e2 zfs: Fix a typo in the delay_min_dirty_percent sysctl description
The description is FreeBSD-specific and was added in r266497
to fix PR189865.

PR:		220825
Submitted by:	Fabian Keil
Obtained from:	ElectroBSD
MFC after:	1 week
2017-07-19 18:17:41 +00:00
Andriy Gapon
37ec52ca7a fix a regression in r320452, ZFS ABD import
I overlooked the fact that vdev_op_io_done hook is called even if the
actual I/O is skipped, for example, in the case of a missing vdev.
Arguably, this could be considered an issue in the zio pipeline engine,
but for now I am adding defensive code to check for io_bp being NULL
along with assertions that that happens only when it can be really
expected.

PR:		220691
Reported by:	peter, cy
Tested by:	cy
MFC after:	1 week
X-MFC with:	r320156, r320452
2017-07-18 07:41:38 +00:00
Justin Hibbits
8fe026c641 Make ZFS not crash on mount on 32-bit systems
ZPL_VERSION is unsigned long long, not an int.  With this change, a zpool can be
created on a 32-bit system (tested on powerpcspe) and mounted correctly.

Reviewed by:	allanjude
2017-07-18 01:08:45 +00:00
Andriy Gapon
1db5f1724b fix an architectural problem introduced in r320156, ZFS ABD import
The implementation of ZFS refcount_t uses the emulated illumos mutex
(the sx lock) and the waiting memory allocation when ZFS_DEBUG is
enabled.  This makes refcount_t unsuitable for use in GEOM g_up
thread where sleeping is prohibited.

When importing the ABD change I modified vdev_geom using illumos
vdev_disk as an example.  As a result, I added a call to abd_return_buf
in vdev_geom_io_intr.  The latter is called on g_up thread while the
former uses refcount_t.

This change fixes the problem by deferring the abd_return_buf call to
the previously unused vdev_geom_io_done that is called on a ZFS zio
taskqueue thread where sleeping is allowed.

A side bonus of this change is that now a vdev zio has a pointer
to its corresponding bio while the zio is active.

Reported by:	Shawn Webb <shawn.webb@hardenedbsd.org>
Tested by:	Shawn Webb <shawn.webb@hardenedbsd.org>
MFC after:	1 week
X-MFC with:	r320156
2017-06-28 13:59:20 +00:00
Andriy Gapon
c20b00c6af zfs: port vdev_file part of illumos change 3306
3306 zdb should be able to issue reads in parallel
illumos/illumos-gate/31d7e8fa33fae995f558673adb22641b5aa8b6e1
https://www.illumos.org/issues/3306

The upstream change was made before we started to import upstream commits
individually.  It was imported into the illumos vendor area as r242733.
That commit was MFV-ed in r260138, but as the commit message says
vdev_file.c was left intact.

This commit actually implements the parallel I/O for vdev_file using a
taskqueue with multiple thread.  This implementation does not depend on
the illumos or FreeBSD bio interface at all, but uses zio_t to pass
around all the relevent data.  So, the code looks a bit different from
the upstream.

This commit also incorporates ZoL commit
zfsonlinux/zfs/bc25c9325b0e5ced897b9820dad239539d561ec9 that fixed
https://github.com/zfsonlinux/zfs/issues/2270
We need to use a dedicated taskqueue for exactly the same reason as ZoL
as we do not implement TASKQ_DYNAMIC.

Obtained from:	illumos, ZFS on Linux
MFC after:	2 weeks
2017-06-26 09:10:09 +00:00
Andriy Gapon
ee2d3c0a5b fix gcc-specific fallout from r320156, MFV of r318946, ZFS ABD
Reported by:	jhibbits
MFC after:	1 week
X-MFC with:	r320156
2017-06-23 08:42:53 +00:00
Andriy Gapon
3385c74539 MFV r319950: 5220 L2ARC does not support devices that do not provide 512B access
FreeBSD note: the actual change has been in FreeBSD since r297848.  This
commit accounts for integration of that change with subsequent changes,
especially r320156 (MFV of r318946) and r314274.

illumos/illumos-gate@403a8da73c
403a8da73c

https://www.illumos.org/issues/5220
  There are disk devices that have logical sector size larger than 512B, for
  example 4KB. That is, their physical sector size is larger than 512B and they
  do not provide emulation for 512B sector sizes. For such devices both a data
  offset and a data size must be properly aligned. L2ARC should arrange that
  because it uses physical I/O.
  zio_vdev_io_start() performs a necessary transformation if io_size is not
  aligned to vdev_ashift, but that is done only for logical I/O. Something
  similar should be done in L2ARC code.
      * a temporary write buffer should be allocated if the original buffer is
        not going to be compressed and its size is not aligned
      * size of a temporary compression buffer should be ashift aligned
      * for the reads, if a size of a target buffer is not sufficiently large and
        it is not aligned then a temporary read buffer should be allocated

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>

MFC after:	3 weeks
2017-06-22 17:10:34 +00:00
Andriy Gapon
ae5ec64b88 MFV r319742: 8056 zfs send size estimate is inaccurate for some zvols
illumos/illumos-gate@0255edcc85
0255edcc85

https://www.illumos.org/issues/8056
  The send size estimate for a zvol can be too low, if the size of the record
  headers (dmu_replay_record_t's) is a significant portion of the size.
  This is typically the case when the data is highly compressible, especially
  with embedded blocks.
  The problem is that dmu_adjust_send_estimate_for_indirects() assumes that
  blocks are the size of the "recordsize" property (128KB).
  However, for zvols, the blocks are the size of the "volblocksize" property
  (8KB). Therefore, we estimate that there will be 16x less record headers than
  there really will be.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Paul Dagnelie <pcd@delphix.com>

MFC after:	3 weeks
2017-06-22 16:58:09 +00:00
Andriy Gapon
e70097b50f MFV r318947: 7578 Fix/improve some aspects of ZIL writing.
FreeBSD note: this commit removes small differences between what mav
committed to FreeBSD in r308782 and what ended up committed to illumos
after addressing all review comments.

illumos/illumos-gate@c5ee46810f
c5ee46810f

https://www.illumos.org/issues/7578
  After some ZIL changes 6 years ago zil_slog_limit got partially broken
  due to zl_itx_list_sz not updated when async itx'es upgraded to sync.
  Actually because of other changes about that time zl_itx_list_sz is not
  really required to implement the functionality, so this patch removes
  some unneeded broken code and variables.
  Original idea of zil_slog_limit was to reduce chance of SLOG abuse by
  single heavy logger, that increased latency for other (more latency critical)
  loggers, by pushing heavy log out into the main pool instead of SLOG. Beside
  huge latency increase for heavy writers, this implementation caused double
  write of all data, since the log records were explicitly prepared for SLOG.
  Since we now have I/O scheduler, I've found it can be much more efficient
  to reduce priority of heavy logger SLOG writes from ZIO_PRIORITY_SYNC_WRITE
  to ZIO_PRIORITY_ASYNC_WRITE, while still leave them on SLOG.
  Existing ZIL implementation had problem with space efficiency when it
  has to write large chunks of data into log blocks of limited size. In some
  cases efficiency stopped to almost as low as 50%. In case of ZIL stored on
  spinning rust, that also reduced log write speed in half, since head had to
  uselessly fly over allocated but not written areas. This change improves
  the situation by offloading problematic operations from z*_log_write() to
  zil_lwb_commit(), which knows real situation of log blocks allocation and
  can split large requests into pieces much more efficiently. Also as side
  effect it removes one of two data copy operations done by ZIL code WR_COPIED
  case.
  While there, untangle and unify code of z*_log_write() functions.
  Also zfs_log_write() alike to zvol_log_write() can now handle writes crossing
  block boundary, that may also improve efficiency if ZPL is made to do that.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Steven Hartland <steven.hartland@multiplay.co.uk>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Alexander Motin <mav@FreeBSD.org>

MFC after:	3 weeks
2017-06-22 16:52:22 +00:00
Andriy Gapon
a4a2976d8a fix several fallouts from r320156, ZFS ABD import
All of the problems were related to the FreeBSD-only features.
One was caused by a mismerge in the zfsbootcfg support code.
All others were in the TRIM support code.

MFC after:	1 week
X-MFC with:	r320156
2017-06-21 08:12:07 +00:00
Andriy Gapon
ebf3b53dac fix several fallouts from r320156, ZFS ABD import
All of the problems were related to the FreeBSD-only features.
One was caused by a mismerge in the zfsbootcfg support code.
All others were in the TRIM support code.

Reported by:	ken,
		O. Hartmann <ohartmann@walstatt.org>,
		Trond Endrestøl <Trond.Endrestol@fagskolen.gjovik.no>
MFC after:	1 week
X-MFC with:	r320156
2017-06-21 08:10:45 +00:00
Andriy Gapon
f9cdbaba8d MFV r318946: 8021 ARC buf data scatter-ization
illumos/illumos-gate@770499e185
770499e185

https://www.illumos.org/issues/8021
  The ARC buf data project (known simply as "ABD" since its genesis in the ZoL
  community) changes the way the ARC allocates `b_pdata` memory from using linear
  `void *` buffers to using scatter/gather lists of fixed-size 1KB chunks. This
  improves ZFS's performance by helping to defragment the address space occupied
  by the ARC, in particular for cases where compressed ARC is enabled. It could
  also ease future work to allocate pages directly from `segkpm` for minimal-
  overhead memory allocations, bypassing the `kmem` subsystem.
  This is essentially the same change as the one which recently landed in ZFS on
  Linux, although they made some platform-specific changes while adapting this
  work to their codebase:
  1. Implemented the equivalent of the `segkpm` suggestion for future work
  mentioned above to bypass issues that they've had with the Linux kernel memory
  allocator.
  2. Changed the internal representation of the ABD's scatter/gather list so it
  could be used to pass I/O directly into Linux block device drivers. (This
  feature is not available in the illumos block device interface yet.)

FreeBSD notes:
- the actual (default) chunk size is 4KB (despite the text above saying 1KB)
- we can try to reimplement ABDs, so that they are not permanently
  mapped into the KVA unless explicitly requested, especially on
  platforms with scarce KVA
- we can try to use unmapped I/O and avoid intermediate allocation of a
  linear, virtual memory mapped buffer
- we can try to avoid extra data copying by referring to chunks / pages
  in the original ABD

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Dan Kimmel <dan.kimmel@delphix.com>

MFC after:	3 weeks
2017-06-20 17:39:24 +00:00
Andriy Gapon
42ce346fcc revert r315852 which introduced zio_buf_alloc_nowait for use in vdev_queue_aggregate
I think that the change is still good, but reconciling it with a planned
merge of the ARC buf data scatter-ization is a bit more tedious
than I can handle.

MFC after:	17 days
2017-06-20 16:55:30 +00:00
Andriy Gapon
602cf4e4a7 MFV r319951: 8311 ZFS_READONLY is a little too strict
illumos/illumos-gate@2889ec41c0
2889ec41c0

https://www.illumos.org/issues/8311
  Description:
  There was a misunderstanding about the enforcement details of the "Read-only"
  flag introduced for SMB/CIFS compatibility, way back in 2007 in the Sun PSARC
  2007/315 case.
  The original authors thought enforcement of the READONLY flag should work
  similarly as the IMMUTABLE flag. Unfortunately, that enforcement is
  incompatible with the expectations of Windows applications using this feature
  through the SMB service. Applications assume (and the MS File System Algorithms
  MS-FSA confirms they should) that an SMB client can:
  (a) Open an SMB handle on a file with read/write access,
  (b) Set the DOS attributes to include the READONLY flag,
  (c) continue to have write access via that handle.
  This access model is essentially the same as a Unix/POSIX application that
  creates a file (with read/write access), uses fchmod() to change the file mode
  to something not granting write access (i.e. 0444), and then continues to write
  that file using the open handle it got before the mode change.
  Currently, the SMB server works-around this problem in a way that will become
  difficult to maintain as we implement support for SMB3 persistent handles, so
  SMB depends on this fix.
  I've written a test program that can be used to demonstrate this problem, and
  added it to zfs-tests (tests/functional/acl/cifs/cifs_attr_004_pos).
  It currently fails, but will pass when this problem fixed.
  Steps to Reproduce:
    Run the test program on a ZFS file system.
  Expected Results:
    Pass
  Actual Results:
    Fail.

Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Approved by: Prakash Surya <prakash.surya@delphix.com>
Author: Gordon Ross <gwr@nexenta.com>
MFC after:	2 weeks
2017-06-14 16:55:47 +00:00
Andriy Gapon
7d506d0d57 MFV r319948: 5428 provide fts(), reallocarray(), and strtonum()
illumos/illumos-gate@4585130b25
4585130b25

https://www.illumos.org/issues/5428

Most of the upstream change is not applicable to FreeBSD.
Only the renaming of strtonum to zfs_strtonum is relevant to us.
And we already had it partially done.

Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Joshua M. Clulow <josh@sysmgr.org>
Author: Yuri Pankov <yuri.pankov@nexenta.com>
MFC after:	1 week
2017-06-14 16:42:38 +00:00
Andriy Gapon
b8d341fe26 MFV r319945,r319946: 8264 want support for promoting datasets in libzfs_core
illumos/illumos-gate@a4b8c9aa65
a4b8c9aa65

https://www.illumos.org/issues/8264
  Oddly there is a lzc_clone function, but no lzc_promote function.

Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@kebe.com>
Approved by: Dan McDonald <danmcd@kebe.com>
Author: Andrew Stormont <astormont@racktopsystems.com>
MFC after:	1 week
2017-06-14 16:31:36 +00:00
Justin Hibbits
880870b41a Follow up r313841 on powerpc
Close a potential race in reading the CPU dtrace flags, where a thread can
start on one CPU, and partway through retrieving the flags be swapped out,
while another thread traps and sets the CPU_DTRACE_NOFAULT.  This could
cause the first thread to return without handling the fault.

Discussed with:	markj@
2017-06-09 20:26:42 +00:00
Andriy Gapon
667002fa27 MFV r319741: 8156 dbuf_evict_notify() does not need dbuf_evict_lock
illumos/illumos-gate@dbfd9f9300
dbfd9f9300

https://www.illumos.org/issues/8156
  dbuf_evict_notify() holds the dbuf_evict_lock while checking if it should do
  the eviction itself (because the evict thread is not able to keep up).
  This can result in massive lock contention.
  It isn't necessary to hold the lock, because if we make the wrong choice
  occasionally, nothing bad will happen.

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
MFC after:	1 week
2017-06-09 15:28:57 +00:00
Andriy Gapon
5f9cf93878 MFV r319739: 8005 poor performance of 1MB writes on certain RAID-Z configurations
illumos/illumos-gate@5b06278253
5b06278253

https://www.illumos.org/issues/8005
  RAID-Z requires that space be allocated in multiples of P+1 sectors,
  because this is the minimum size block that can have the required amount
  of parity. Thus blocks on RAIDZ1 must be allocated in a multiple of 2
  sectors; on RAIDZ2 multiple of 3; and on RAIDZ3 multiple of 4. A sector
  is a unit of 2^ashift bytes, typically 512B or 4KB.
  To satisfy this constraint, the allocation size is rounded up to the
  proper multiple, resulting in up to 3 "pad sectors" at the end of some
  blocks. The contents of these pad sectors are not used, so we do not
  need to read or write these sectors. However, some storage hardware
  performs much worse (around 1/2 as fast) on mostly-contiguous writes
  when there are small gaps of non-overwritten data between the writes.
  Therefore, ZFS creates "optional" zio's when writing RAID-Z blocks that
  include pad sectors. If writing a pad sector will fill the gap between
  two (required) writes, we will issue the optional zio, thus doubling
  performance. The gap-filling performance improvement was introduced in
  July 2009.
  Writing the optional zio is done by the io aggregation code in
  vdev_queue.c. The problem is that it is also subject to the limit on
  the size of aggregate writes, zfs_vdev_aggregation_limit, which is by
  default 128KB. For a given block, if the amount of data plus padding
  written to a leaf device exceeds zfs_vdev_aggregation_limit, the
  optional zio will not be written, resulting in a ~2x performance
  degradation.
  The problem occurs only for certain values of ashift, compressed block
  size, and RAID-Z configuration (number of parity and data disks). It
  cannot occur with the default recordsize=128KB. If compression is
  enabled, all configurations with recordsize=1MB or larger will be
  impacted to some degree.
  The problem notably occurs with recordsize=1MB, compression=off, with 10
  disks in a RAIDZ2 or RAIDZ3 group (with 512B or 4KB sectors). Therefore

Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
MFC after:	10 days
2017-06-09 15:27:22 +00:00
Andriy Gapon
9f141f8d71 MFV r319738: 8155 simplify dmu_write_policy handling of pre-compressed buffers
illumos/illumos-gate@adaec86ad2
adaec86ad2

https://www.illumos.org/issues/8155
  When writing pre-compressed buffers, arc_write() requires that the compression
  algorithm used to compress the buffer matches the compression algorithm
  requested by the zio_prop_t, which is set by dmu_write_policy().
  This makes dmu_write_policy() and its callers a bit more complicated.
  We can simplify this by making arc_write() trust the caller to supply the type
  of pre-compressed buffer that it wants to write, and override the compression
  setting in the zio_prop_t.

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
MFC after:	10 days
2017-06-09 15:26:03 +00:00
Andriy Gapon
1628f75af1 zfs_lookup: fix bogus arguments to lookup of "snapshot" directory
When a parent directory lookup is done at the root of a snapshot mounted
under .zfs/snapshot directory, we need to look up that directory in
the parent filesystem.  We achieve that by doing a VOP_LOOKUP operation
on a .zfs vnode with "snapshot" as a target name.  But previously we
also passed ISDOTDOT flag to the lookup and, because of that, the lookup
actually returned the parent of the .zfs vnode, that is, a root vnode of
the parent filesystem.

Reported by:	lev
Tested by:	lev
MFC after:	3 days
2017-05-29 06:30:34 +00:00
Konstantin Belousov
03311f117b Use whole mnt_stat.f_fsid bits for st_dev.
Since ino64 expanded dev_t to 64bit, make VOP_GETATTR(9) provide all
bits of mnt_stat.f_fsid as va_fsid for vnodes on filesystems which use
f_fsid.  In particular, NFSv3 and sometimes NFSv4, and ZFS use this
method or reporting st_dev by stat(2).

Provide a new helper vn_fsid() to avoid duplicating code to copy
f_fsid to va_fsid.

Note that the change is mostly cosmetic.  Its motivation is to avoid
sign-extension of f_fsid[0] into 64bit dev_t value which happens after
dev_t becomes 64bit..

Reviewed by:	avg(zfs), rmacklem (nfs) (both for previous version)
Sponsored by:	The FreeBSD Foundation
2017-05-27 17:00:30 +00:00
Andriy Gapon
32ecf81aff MFV r318944: 8265 Reserve send stream flag for large dnode feature
illumos/illumos-gate@bc83969fdb
bc83969fdb

https://www.illumos.org/issues/8265
  Reserve bit 23 in the zfs send stream flags for the large
  dnode feature which has been implemented for Linux.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Brian Behlendorf <behlendorf1@llnl.gov>

MFC after:	1 week
2017-05-26 12:08:38 +00:00
Andriy Gapon
a51eb0a964 MFV r318942: 8166 zpool scrub thinks it repaired offline device
illumos/illumos-gate@2d2f193a21
2d2f193a21

https://www.illumos.org/issues/8166
  If we do a scrub while a leaf device is offline (via "zpool offline"),
  we will inadvertently clear the DTL (dirty time log) of the offline
  device, even though it is still damaged. When the device comes back
  online, we will incompletely resilver it, thinking that the scrub
  repaired blocks written before the scrub was started. The incomplete
  resilver can lead to data loss if there is a subsequent failure of a
  different leaf device.
  The fix is to never clear the DTL of offline devices. Note that if a
  device is onlined while a scrub is in progress, the scrub will be
  restarted.
  The problem can be worked around by running "zpool scrub" after
  "zpool online".
  See also https://github.com/zfsonlinux/zfs/issues/5806

Reviewed by: George Wilson george.wilson@delphix.com
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>
2017-05-26 12:04:21 +00:00
Andriy Gapon
2cd05c2473 MFV r318934: 8070 Add some ZFS comments
illumos/illumos-gate@40713f2b24
40713f2b24

https://www.illumos.org/issues/8070
  Add some ZFS comments left by various developers at different times

Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Alan Somers <asomers@gmail.com>

MFC after:	1 week
2017-05-26 11:49:42 +00:00
Andriy Gapon
0a07ea0e2f MFV r318931: 8063 verify that we do not attempt to access inactive txg
illumos/illumos-gate@b7b2590dd9
b7b2590dd9

https://www.illumos.org/issues/8063
  A standard practice in ZFS is to keep track of "per-txg" state. Any of
  the 3 active TXG's (open, quiescing, syncing) can have different values
  for this state. We should assert that we do not attempt to modify other
  (inactive) TXG's.

Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>

MFC after:	2 weeks
2017-05-26 11:37:11 +00:00
Andriy Gapon
28c5e43e36 MFV r318929: 7786 zfs`vdev_online() needs better notification about state changes
illumos/illumos-gate@5f368aef86
5f368aef86

https://www.illumos.org/issues/7786
  Currently, vdev_online() will only post sysevent if previous state was
  "offline". It should also post the event when the state changes from "removed"
  or "faulted" to "healthy" or "degraded".
  This will fix the following scenario:
  - pull disk from slot A
  - check that hotspare has taken its place (if available)
  - insert disk into slot B
  - check that hotspare moved back to "avail" state (if spare was used)
  The problem here is that we don't get any ESC_ZFS_VDEV_* notification and fail
  to update the vdev FRU.

Reviewed by: Matthew Ahrens mahrens@delphix.com
Reviewed by: George Wilson george.wilson@delphix.com
Approved by: Albert Lee <trisk@forkgnu.org>
Author: Yuri Pankov <yuri.pankov@nexenta.com>

MFC after:	1 week
2017-05-26 11:33:34 +00:00
Andriy Gapon
9c2a3c861f MFV r318927: 8025 dbuf_read() creates unnecessary zio_root() for bonus buf
illumos/illumos-gate@def4fac588
def4fac588

https://www.illumos.org/issues/8025
  dbuf_read() creates a zio_root() to track and wait for all the zio's
  that may happen as part of this call. However, if the blkptr_t for
  this buffer is NULL or a hole, we will not create any more zio's, so
  this zio_root() is unnecessary. This is always the case when calling
  dbuf_read() on a bonus buffer, because it has no blkptr (it's part of
  the containing dnode). For workloads that read a lot of bonus buffers
  (e.g. file creation and removal), creating and destroying these
  unnecessary zio's can decrease performance by around 3%.

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2017-05-26 11:30:55 +00:00
Andriy Gapon
ebaf416f95 MFV r316929: 6914 kernel virtual memory fragmentation leads to hang
illumos/illumos-gate@af868f46a5
af868f46a5

https://www.illumos.org/issues/6914

FreeBSD note: only a ZFS part of the change is merged, changes to the VM
subsystem are not ported (obviously).  Also, now that FreeBSD has
vmem(9) we don't have to ifdef-out the code that uses it.

MFC after:	2 weeks
2017-05-26 11:23:16 +00:00
Andriy Gapon
8629ec8394 arc_init: make code closer to upstream by introducing 'allmem' variable
All the differences in calculations are kept.
A comment about arc_max being 1/2 of all memory is fixed to reflect the
actual code that uses 5/8 as a factor.

MFC after:	1 week
2017-05-26 11:05:56 +00:00
Andriy Gapon
cf781c9b60 zfs_putpages: assert that sa_bulk_update() must succeed
Same as the upstream does in r316927.

MFC after:	1 week
2017-05-26 10:37:55 +00:00