Commit Graph

1298 Commits

Author SHA1 Message Date
Andriy Gapon
d26312a4e4 fix vnode lock assertion for extended attributes directory
Background.  In ZFS a file with extended attributes has a special
directory associated with it where each extended attribute is a file.
The attribute's name is a file name and its value is a file content.
When the ownership of a file with extended attributes is changed, ZFS
also changes ownership of the special directory.  This is where the bug
was hit.

The bug was introduced in r209158.

Nota bene.  ZFS vnode locks are typically acquired before
z_teardown_lock (i.e., before ZFS_ENTER).  But this is not the case for
the vnodes that represent the extended attribute directory and files.
Those are always locked after ZFS_ENTER.  This is confusing and fragile.

PR:		212702
Reported by:	Christian Fuss to FreeNAS
Tested by:	mav
MFC after:	1 week
2016-09-24 08:13:15 +00:00
Allan Jude
c2b475d0ee MFV r268120:
4936 lz4 could theoretically overflow a pointer with a certain input

  illumos/illumos-gate@58d0718061

Reviewed by:	delphij
MFC after:	2 weeks
Sponsored by:	ScaleEngine Inc.
Differential Revision:	https://reviews.freebsd.org/D7850
2016-09-11 17:48:06 +00:00
Alexander Motin
4605bf63c4 MFV r305562: 7259 DS_FIELD_LARGE_BLOCKS is unused
The DS_FIELD_LARGE_BLOCKS macro has been unused since the integration of
this patch:

    commit ca0cc3918a1789fa839194af2a9245f801a06b1a
    Author: Matthew Ahrens <mahrens@delphix.com>
    Date:   Fri Jul 24 09:53:55 2015 -0700

        5959 clean up per-dataset feature count code
        Reviewed by: Toomas Soome <tsoome@me.com>
        Reviewed by: George Wilson <george@delphix.com>
        Reviewed by: Alex Reece <alex@delphix.com>
        Approved by: Richard Lowe <richlowe@richlowe.net>

This patch simply removes this macro from dsl_dataset.h.

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2016-09-07 20:09:24 +00:00
Alexander Motin
de1fdddeda MFV r305560: 7278 tuning zfs_arc_max does not impact arc_c_min
When changing zfs_arc_max (e.g. as zdb does), it may be set to less
than the default arc_c_min. arc_c_min should decrease to not be more than
arc_c_max, but it doesn't; therefore tuning of arc_c_max is ineffective.

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Author: Matthew Ahrens <mahrens@delphix.com>

openzfs/openzfs@608764bead
2016-09-07 20:05:10 +00:00
Andriy Gapon
1a82707cd7 fix zfs pool creation accidentally broken by r305331
The upstream change introduced a new load state, SPA_LOAD_CREATE,
and vdev_geom code needs to be aware of it.

Tested by:	cy
MFC after:	1 week
X-MFC with:	r305331
2016-09-06 06:09:12 +00:00
Alexander Motin
9b9258a12a Missed FreeBSD-specific piece of r305338. 2016-09-03 11:17:33 +00:00
Alexander Motin
d7e781bda3 MFC r305337: 7004 dmu_tx_hold_zap() does dnode_hold() 7x on same object
Using a benchmark which has 32 threads creating 2 million files in the
same directory, on a machine with 16 CPU cores, I observed poor
performance. I noticed that dmu_tx_hold_zap() was using about 30% of
all CPU, and doing dnode_hold() 7 times on the same object (the ZAP
object that is being held).

dmu_tx_hold_zap() keeps a hold on the dnode_t the entire time it is
running, in dmu_tx_hold_t:txh_dnode, so it would be nice to use the
dnode_t that we already have in hand, rather than repeatedly calling
dnode_hold(). To do this, we need to pass the dnode_t down through
all the intermediate calls that dmu_tx_hold_zap() makes, making these
routines take the dnode_t* rather than an objset_t* and a uint64_t
object number. In particular, the following routines will need to have
analogous *_by_dnode() variants created:

dmu_buf_hold_noread()
dmu_buf_hold()
zap_lookup()
zap_lookup_norm()
zap_count_write()
zap_lockdir()
zap_count_write()

This can improve performance on the benchmark described above by 100%,
from 30,000 file creations per second to 60,000. (This improvement is on
top of that provided by working around the object allocation issue. Peak
performance of ~90,000 creations per second was observed with 8 CPUs;
adding CPUs past that decreased performance due to lock contention.) The
CPU used by dmu_tx_hold_zap() was reduced by 88%, from 340 CPU-seconds
to 40 CPU-seconds.

Sponsored by: Intel Corp.

Closes #109

Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Author: Matthew Ahrens <mahrens@delphix.com>

openzfs/openzfs@d3e523d489
2016-09-03 11:00:29 +00:00
Alexander Motin
4ad4b70e77 MFV r305336: 7247 zfs receive of deduplicated stream fails
This resolves two 'zfs recv' issues. First, when receiving into an
existing filesystem, a snapshot created during the receive process is
not added to the guid->dataset map for the stream, resulting in failed
lookups for deduped streams when a WRITE_BYREF record refers to a
snapshot received earlier in the stream. Second, the newly created
snapshot was also not set properly, referencing the snapshot before the
new receiving dataset rather than the existing filesystem.

Closes #159

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Author: Chris Williamson <chris.williamson@delphix.com>

openzfs/openzfs@b09697c8c1
2016-09-03 10:59:05 +00:00
Alexander Motin
070da3f779 MFV r305335: 7003 zap_lockdir() should tag hold
zap_lockdir() / zap_unlockdir() should take a "void *tag" argument which
tags the hold on the zap. This will help diagnose programming errors
which misuse the hold on the ZAP.

Sponsored by: Intel Corp.

Closes #108

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Author: Matthew Ahrens <mahrens@delphix.com>

openzfs/openzfs@0780b3eab5
2016-09-03 10:58:14 +00:00
Alexander Motin
d3ec2cdb4a MFV r304157:
7230 add assertions to dmu_send_impl() to verify that stream includes BEGIN and END records

illumos/illumos-gate@12b90ee2d3
https://github.com/illumos/illumos-gate/commit/12b90ee2d3b10689fc45f4930d2392f5f
e1d9cfa

https://www.illumos.org/issues/7230
  A test failure occurred where a send stream had only a BEGIN record. This
  should not be possible if the send returns without error. Prevented this from
  happening in the future by adding an assertion to dmu_send_impl() to verify
  that if the function returns 0 (success) both a BEGIN and END record are
  present. Did this by adding flags to dmu_sendarg_t (indicating whether BEGIN o
r
  END records sent), having dump_record() set flags appropriately, adding VERIFY
  statement to dmu_send_impl().

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matt Krantz <matt.krantz@delphix.com>
2016-09-03 10:10:58 +00:00
Alexander Motin
7aafc9d4c8 MFV r304156: 7235 remove unused func dsl_dataset_set_blkptr
illumos/illumos-gate@bd56f80007
https://github.com/illumos/illumos-gate/commit/bd56f80007857b960e0981ed0797ad8ec
844a96b

https://www.illumos.org/issues/7235
  The function dsl_dataset_set_blkptr() is unused. We should remove it.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2016-09-03 10:09:23 +00:00
Alexander Motin
c9fa25c110 MFV r304155: 7090 zfs should improve allocation order and throttle allocations
illumos/illumos-gate@0f7643c737
https://github.com/illumos/illumos-gate/commit/0f7643c7376dd69a08acbfc9d1d7d548b
10c846a

https://www.illumos.org/issues/7090
  When write I/Os are issued, they are issued in block order but the ZIO pipelin
e
  will drive them asynchronously through the allocation stage which can result i
n
  blocks being allocated out-of-order. It would be nice to preserve as much of
  the logical order as possible.
  In addition, the allocations are equally scattered across all top-level VDEVs
  but not all top-level VDEVs are created equally. The pipeline should be able t
o
  detect devices that are more capable of handling allocations and should
  allocate more blocks to those devices. This allows for dynamic allocation
  distribution when devices are imbalanced as fuller devices will tend to be
  slower than empty devices.
  The change includes a new pool-wide allocation queue which would throttle and
  order allocations in the ZIO pipeline. The queue would be ordered by issued
  time and offset and would provide an initial amount of allocation of work to
  each top-level vdev. The allocation logic utilizes a reservation system to
  reserve allocations that will be performed by the allocator. Once an allocatio
n
  is successfully completed it's scheduled on a given top-level vdev. Each top-
  level vdev maintains a maximum number of allocations that it can handle
  (mg_alloc_queue_depth). The pool-wide reserved allocations (top-levels *
  mg_alloc_queue_depth) are distributed across the top-level vdevs metaslab
  groups and round robin across all eligible metaslab groups to distribute the
  work. As top-levels complete their work, they receive additional work from the
  pool-wide allocation queue until the allocation queue is emptied.

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: George Wilson <george.wilson@delphix.com>
2016-09-03 10:04:37 +00:00
Alexander Motin
0b51a59fc7 MFV r303078:
7086 ztest attempts dva_get_dsize_sync on an embedded blockpointer

illumos/illumos-gate@926549256b
https://github.com/illumos/illumos-gate/commit/926549256b71acd595f69b236779ff6b7
8fa08ef

https://www.illumos.org/issues/7086
  In dbuf_dirty(), we need to grab the dn_struct_rwlock before looking at the
  db_blkptr, to prevent it from being changed by syncing context.
  Otherwise we may see that ztest got a segfault from this stack:
  libzpool.so.1`dva_get_dsize_sync+0x98(872f000, b32b240, fed7811b, 0, b4cda20,
0)
  libzpool.so.1`bp_get_dsize+0x60(872f000, b32b240, 0, 97cb780, 9d4c1a8, 0)
  libzpool.so.1`dbuf_dirty+0x9b3(ce0a100, 97cb780, 9, fecd2530)
  libzpool.so.1`dmu_buf_will_dirty+0xc3(ce0a100, 97cb780, ea293d6c, 1)
  libzpool.so.1`zap_lockdir+0x1a0(8aaa3c0, 1, 0, 97cb780, 1, 1)
  libzpool.so.1`zap_remove_norm+0x30(8aaa3c0, 1, 0, 8728b10, 0, 97cb780)
  libzpool.so.1`zap_remove+0x29(8aaa3c0, 1, 0, 8728b10, 97cb780, a)
  ztest_replay_remove+0x225(ea294588, 8728ae8, 0, 38010000, 0, 0)
  ztest_remove+0x9f(ea294588, ea293f50, 4, 3)
  ztest_object_init+0x78(ea294588, ea293f50, 4e0, 1)
  ztest_dmu_object_alloc_free+0x71(ea294588, 13)
  ztest_dmu_objset_create_destroy+0x224(80cef08, 13, 0, 805d36c, 9017ad44, 0)
  ztest_execute+0x89(a, 807c720, 13, 0)
  ztest_thread+0xea(13, 0, 0, 0)
  libc.so.1`_thrp_setup+0x88(f0983240)
  libc.so.1`_lwp_start(f0983240, 0, 0, 0, 0, 0)
  Looking into it a bit, we see that this is an embedded blockpointer, so
  BP_GET_NDVAS should have returned 0:
       b32b240::blkptr
  EMBEDDED [L0 ZAP_OTHER] et=0 LZ4 size=200L/4aP birth=80L
  Instead, it looks like another thread is modifying this blockpointer:
       b32b240::ugrep | ::whatis
  f47a0e0c is in [ stack tid=0x19f ]
  ebd6ec40 is in [ stack tid=0x226 ]
  ea293bd0 is in [ stack tid=0x244 ]
  ea293be4 is in [ stack tid=0x244 ]

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2016-09-03 08:43:43 +00:00
Alexander Motin
84c3781ac9 MFV r303077:
7072 zfs fails to expand if lun added when os is in shutdown state

illumos/illumos-gate@c39a2aae1e
c39a2aae1e

https://www.illumos.org/issues/7072
  upstream:
  38733 zfs fails to expand if lun added when os is in shutdown state
  DLPX-36910 spares and caches should not display expandable space
  DLPX-39262 vdev_disk_open spam zfs_dbgmsg buffer

Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: George Wilson <george.wilson@delphix.com>
2016-09-03 08:42:12 +00:00
Alexander Motin
efa0867fb0 MFV r302991: 6950 ARC should cache compressed data
illumos/illumos-gate@dcbf3bd6a1
dcbf3bd6a1

https://www.illumos.org/issues/6950
  When reading compressed data from disk, the ARC should keep the compressed
  block cached and only decompress it when consumers access the block. The
  uncompressed data should be short-lived allowing the ARC to cache a much larger
  amount of data. The DMU would also maintain a smaller cache of uncompressed
  blocks to minimize the impact of decompressing frequently accessed blocks.

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: George Wilson <george.wilson@delphix.com>
2016-09-03 08:30:51 +00:00
Alexander Motin
c543b519be MFV r304158:
7136 ESC_VDEV_REMOVE_AUX ought to always include vdev information

7115 6922 generates ESC_ZFS_VDEV_REMOVE_AUX a bit too often

illumos/illumos-gate@b72b6bb10a
https://github.com/illumos/illumos-gate/commit/b72b6bb10ad55121a1b352c6f68ebdc8e
20c9086

https://www.illumos.org/issues/7136
  6922 added ESC_ZFS_VDEV_REMOVE_AUX and ESC_ZFS_VDEV_REMOVE_DEV sysevents
  whenever an aux device gets removed from a pool. However, those sysevents will
  be created without the vdev_guid and vdev_path fields. It would be better to
  always populate those fields.

https://www.illumos.org/issues/7115
  The addition of spa_event_notify in vdev removal code (see #6922) causes event
s
  to be generated even if the spare failed to be removed with EBUSY.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Alan Somers <asomers@gmail.com>
2016-09-01 18:37:11 +00:00
Alexander Motin
25584d12e7 MFV r302993: 7104 increase indirect block size
illumos/illumos-gate@4b5c8e93ca
https://github.com/illumos/illumos-gate/commit/4b5c8e93cab28d3c65ba9d407fd8f46e3
be1db1c

https://www.illumos.org/issues/7104
  The current default indirect block size is 16KB. We can improve
  performance by increasing it to 128KB. This is especially helpful for
  any workload that needs to read most of the metadata, e.g.
  scrub/resilver, file deletion, filesystem deletion, and zfs send.
  We also need to fix a few space estimation errors to make the tests
  pass.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2016-09-01 18:33:39 +00:00
Alexander Motin
dd7f7cb7ac MFV r302992: 7071 lzc_snapshot does not fill in errlist on ENOENT
illumos/illumos-gate@25f7d993ad
https://github.com/illumos/illumos-gate/commit/25f7d993adbfb3452ac4625b379167074
6d35ae3

https://www.illumos.org/issues/7071
  upstream
  DLPX-40482 lzc_snapshot does not fill in errlist on ENOENT

Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2016-09-01 18:25:49 +00:00
Alexander Motin
3d1e0e0830 MFV r302662: 6447 handful of nvpair cleanups
illumos/illumos-gate@759e89be35
https://github.com/illumos/illumos-gate/commit/759e89be359f2af635e4122d147df56bc
e948773

https://www.illumos.org/issues/6447
  I got a patch from someone who uses nvpair code outside of illumos. It fixes a
  couple of gcc warnings/bugs for him.
     1. silence uninitialized use warnings
     2. add parentheses around assignment used as truth value
     3. fix printf format specifier (ll is for integers only)
     4. strstr, strspn, strcspn, and strcmp are declared in string.h, not
        strings.h.
     5. avoid scanning integer into boolean variable

Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Steve Dougherty <sdougherty@barracuda.com>
2016-09-01 15:17:39 +00:00
Alexander Motin
3421688c2d MFV r302661: 7082 bptree_iterate() passes wrong args to zfs_dbgmsg()
illumos/illumos-gate@10e67aa0db
https://github.com/illumos/illumos-gate/commit/10e67aa0db0823d5464aafdd681f3c966
155c68e

https://www.illumos.org/issues/7082
  upstream
  DLPX-40542 bptree_iterate() passes wrong args to zfs_dbgmsg()

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2016-09-01 15:10:40 +00:00
Alexander Motin
41b9077ef6 MFV r302660: 6314 buffer overflow in dsl_dataset_name
illumos/illumos-gate@9adfa60d48
https://github.com/illumos/illumos-gate/commit/9adfa60d484ce2435f5af77cc99dcd4e6
92b6660

https://www.illumos.org/issues/6314
  Callers of dsl_dataset_name pass a buffer of size ZFS_MAXNAMELEN, but
  dsl_dataset_name copies the datasets' name PLUS the snapshot name to it,
  resulting in a max of 2 * ZFS_MAXNAMELEN + '@'.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2016-09-01 15:08:27 +00:00
Alexander Motin
e12a269749 MFV r302659: 6931 lib/libzfs: cleanup gcc warnings
illumos/illumos-gate@88f61dee20
88f61dee20

https://www.illumos.org/issues/6931
  need cleanup:
  CERRWARN += -_gcc=-Wno-switch
  CERRWARN += -_gcc=-Wno-parentheses
  CERRWARN += -_gcc=-Wno-unused-function

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Igor Kozhukhov <ikozhukhov@gmail.com>
2016-09-01 14:57:06 +00:00
Alexander Motin
a95a9fe945 MFV r302651: 7054 dmu_tx_hold_t should use refcount_t to track space
illumos/illumos-gate@0c779ad424
https://github.com/illumos/illumos-gate/commit/0c779ad424a92a84d1e07d47cab7f8009
189202b

https://www.illumos.org/issues/7054
  upstream:
  ee0003de7d3e598499be7ac3fe6b61efcc47cb7f
  DLPX-40399 dmu_tx_hold_t should use refcount_t to track space

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2016-09-01 14:38:25 +00:00
Alexander Motin
96bf48b8cb MFV r302648: 7019 zfsdev_ioctl skips secpolicy when FKIOCTL is set
Note that the bulk of the upstream change is not applicable to FreeBSD
and the affected files are not even in the vendor area.

illumos/illumos-gate@45b1747515
45b1747515

https://www.illumos.org/issues/7019
  Currently zfsdev_ioctl, when confronted by a request with the FKIOCTL flag set,
  skips all processing of secpolicy functions. This means that ZFS is not doing
  any kind of verification of the credentials or access rights of the caller and
  assuming that (as it is an in-kernel client) all such checks have already been
  done.
  This turns out to be quite a dangerous assumption, especially with respect to
  sdev. In general I don't think it's particularly reasonable to offload this
  enforcement of access rights onto other kernel subsystems when ZFS has some
  particular local semantics in this area (delegated datasets etc) and does not
  provide any kind of API to allow other subsystems to avoid code duplication
  when doing it. ZFS should apply its normal access policy to requests from
  within the kernel, and callers should take care to give it the correct
  credentials and call it from the correct context in order to get the results
  they need.
  You can observe the currently unfortunate consequences of this bug in any non-
  global zone that has access to /dev/zvol or any subset of it via sdev profiles.
  In particular, a zone used to contain a KVM or similar which has a single zvol
  passed through to it using a <device match= block in its zone XML.
  Even though sdev makes something of an attempt to control for whether the
  caller should have access to nodes in /dev/zvol, it doesn't do this correctly,
  or really at all in the lookup call path. So, if we have a zone that's been
  given access to any part of /dev/zvol, it can simply look up the full path to
  any other zvol on the entire system, and the node will appear and be able to be
  used.

Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Alex Wilson <alex.wilson@joyent.com>
2016-09-01 14:24:54 +00:00
Alexander Motin
13876b47d7 MFV r302647: 6922 Emit ESC_ZFS_VDEV_REMOVE_AUX after removing an aux device
illumos/illumos-gate@63364b0ee2
https://github.com/illumos/illumos-gate/commit/63364b0ee2604783e7a55f84258888677
68eafa4

https://www.illumos.org/issues/6922
  ZFS does not do a config_sync after removing an aux (spare, log, or cache)
  device. AFAICT this isn't being done because it is slow and was deemed
  unnecessary. However, it should be such a rare operation that speed doesn't
  matter, and not doing it results in two problems:
  1) It is theoretically possible to remove an aux device from one pool and
  attach it to another, then lose power. When power is restored, both pools woul
d
  think that they own the aux device.
  2) Removal of the aux device doesn't send any useful sysevents to userland.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Alan Somers <asomers@gmail.com>
2016-09-01 14:17:30 +00:00
Alexander Motin
1c7d88abed MFV r302646:
6980 6902 causes zfs send to break due to 32-bit/64-bit struct mismatch

illumos/illumos-gate@ea4a67f462
https://github.com/illumos/illumos-gate/commit/ea4a67f462de0a39a9adea8197bcdef84
9de5371

https://www.illumos.org/issues/6980
  doing zfs send -i snap1 snap2 >testfile results in
  internal error: Invalid argument
  Abort (core dumped)

Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2016-09-01 14:06:30 +00:00
Alexander Motin
4536fd9bed MFV r302643:
6902 speed up listing of snapshots if requesting name only and sorting by name

This was our change from the beginning, so just reduce the upstream diff.
2016-09-01 13:29:53 +00:00
Alexander Motin
5fd28943d6 MFV r302642:
6876 Stack corruption after importing a pool with a too-long name

illumos/illumos-gate@c971037baa
c971037baa

https://www.illumos.org/issues/6876
  Calling dsl_dataset_name on a dataset with a 256 byte buffer is asking for
  trouble. We should check every dataset on import, using a 1024 byte buffer and
  checking each time to see if the dataset's new name is longer than 256 bytes.

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Paul Dagnelie <pcd@delphix.com>
2016-09-01 13:04:36 +00:00
Alexander Motin
9007a8679a Fix kernel panic when inheriting properties without default.
There are two writable hidden properties "iscsioptions" and "stmf_sbd_lu",
that have no default string value.  Attempt to unset them or replicate
caused kernel panic.  This simple bandaid seems fixes the problem nicely.

MFC after:	2 weeks
2016-08-31 11:55:31 +00:00
Mark Johnston
59ceeddecf MFV r301526:
7035 string-related subroutines should validate input earlier

Reviewed by: Alex Wilson <alex.wilson@joyent.com>
Reviewed by: Bryan Cantrill <bryan@joyent.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Patrick Mooney <pmooney@pfmooney.com>

illumos/illumos-gate@771e39c3b1

MFC after:	2 weeks
2016-08-16 02:25:19 +00:00
Mark Johnston
f66200ee22 MFV r301525:
7033 ustack helper should fault on bad return values

Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Bryan Cantrill <bryan@joyent.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Alex Wilson <alex.wilson@joyent.com>

illumos/illumos-gate@a2f72b65eb

MFC after:	2 weeks
2016-08-16 02:20:02 +00:00
Mark Johnston
4aea8f31b1 MFV r301524:
7034 negative record sizes should be rejected

Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Bryan Cantrill <bryan@joyent.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Alex Wilson <alex.wilson@joyent.com>

illumos/illumos-gate@0b8049bfb0

MFC after:	2 weeks
2016-08-16 02:18:34 +00:00
Mark Johnston
b7125fa9cd MFV r296989:
6734 dtrace_canstore_statvar() fails for some valid static variables

Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Bryan Cantrill <bryan@joyent.com>

illumos/illumos-gate@d65f2bb4e5

MFC after:	2 weeks
2016-08-16 02:16:54 +00:00
Andriy Gapon
96762fe314 fix a zfs cross-device rename crash introduced in r303763
The problem was that 'zfsvfs' variable was not initialized if the error
was detected, but in the exit path the variable was dereferenced before
the error code was checked.

Reported by:	np
MFC after:	3 days
X-MFC with:	r303763
2016-08-09 06:11:24 +00:00
Andriy Gapon
4fb51b52ef fix .zfs-related cases in zfs_lookup that were broken by r303763
The problem is that the special .zfs nodes are not represented by
znodes but by special gfs-based nodes.
r303763 changed interface of zfs_dirlook such that started operating on
znodes rather than on vnodes and, thus, the function became unsuitable
for handling .zfs entities.
The solution is to move the handling of the special cases to zfs_lookup,
the only consumer of zfs_dirlook.
I already had this solution applied in D7421, but for different reasons.

Reported by:	asomers
MFC after:	3 days
X-MFC with:	r303763
2016-08-06 11:02:07 +00:00
Andriy Gapon
f79bc17233 zfs: honour and make use of vfs vnode locking protocol
ZFS POSIX Layer is originally written for Solaris VFS which is very
different from FreeBSD VFS.  Most importantly many things that FreeBSD VFS
manages on behalf of all filesystems are implemented in ZPL in a different
way.
Thus, ZPL contains code that is redundant on FreeBSD or duplicates VFS
functionality or, in the worst cases, badly interacts / interferes
with VFS.

The most prominent problem is a deadlock caused by the lock order reversal
of vnode locks that may happen with concurrent zfs_rename() and lookup().
The deadlock is a result of zfs_rename() not observing the vnode locking
contract expected by VFS.

This commit removes all ZPL internal locking that protects parent-child
relationships of filesystem nodes.  These relationships are protected
by vnode locks and the code is changed to take advantage of that fact
and to properly interact with VFS.

Removal of the internal locking allowed all ZPL dmu_tx_assign calls to
use TXG_WAIT mode.

Another victim, disputable perhaps, is ZFS support for filesystems with
mixed case sensitivity.  That support is not provided by the OS anyway,
so in ZFS it was a buch of dead code.

To do:
- replace ZFS_ENTER mechanism with VFS managed / visible mechanism
- replace zfs_zget with zfs_vget[f] as much as possible
- get rid of not really useful now zfs_freebsd_* adapters
- more cleanups of unneeded / unused code
- fix / replace .zfs support

PR:		209158
Reported by:	many
Tested by:	many (thank you all!)
MFC after:	5 days
Sponsored by:	HybridCluster / ClusterHQ
Differential Revision: https://reviews.freebsd.org/D6533
2016-08-05 06:23:06 +00:00
Ruslan Bukin
98f50c44e3 Update RISC-V port to Privileged Architecture Version 1.9.
Sponsored by:	DARPA, AFRL
Sponsored by:	HEIF5
2016-08-02 14:50:14 +00:00
Enji Cooper
4fcae4df7e Conditionalize code which defines sysctls per _KERNEL #ifdef guard
This resolves several issues when compiling libzpool (userspace library), i.e.
-Wimplicit-function-declaration and -Wmissing-declarations issues.

MFC after:	2 weeks
Reported by:	clang
Tested with:	clang 3.8.1, gcc 4.2.1, gcc 5.3.0
Sponsored by:	EMC / Isilon Storage Division
2016-07-31 06:34:49 +00:00
Mark Johnston
57185c52de Restore an ifdef that should not have been removed in r303535.
X-MFC-With:	r303535
2016-07-30 07:05:32 +00:00
Mark Johnston
6d1ffb50fc Include fasttrap handling for DATAMODEL_ILP32 when compiling for amd64.
MFC after:	1 month
2016-07-30 03:11:53 +00:00
Andriy Gapon
70e3da3892 MFV r302645: 6878 Add scrub completion info to "zpool history"
illumos/illumos-gate@1825bc56e5
1825bc56e5

https://www.illumos.org/issues/6878
  Summary of changes:
      * Replace generic "scan done" message with "scan aborted, restarting",
        "scan cancelled", or "scan done"
      * Log number of errors using spa_get_errlog_size
      * Refactor scan restarting check into static function

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Nav Ravindranath <nav@delphix.com>
MFC after:	2 weeks
2016-07-14 11:53:39 +00:00
Andriy Gapon
39a6b17491 MFV r302650: 6940 Cannot unlink directories when over quota
illumos/illumos-gate@99189164df
99189164df

https://www.illumos.org/issues/6940
  Similar to #6334, but this time with empty directories:
  $ zfs create tank/quota
  $ zfs set quota=10M tank/quota
  $ zfs snapshot tank/quota@snap1
  $ zfs set mountpoint=/mnt/tank/quota tank/quota
  $ mkdir /mnt/tank/quota/dir # create an empty directory
  $ mkfile 11M /mnt/tank/quota/11M
  /mnt/tank/quota/11M: initialized 9830400 of 11534336 bytes: Disc quota exceeded
  $ rmdir /mnt/tank/quota/dir # now unlink the empty directory
  rmdir: directory "/mnt/tank/quota/dir": Disc quota exceeded
  From user perspective, I would expect that ZFS is always able to remove files
  and directories even when the quota is exceeded.

Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Simon Klinkert <simon.klinkert@gmail.com>
MFC after:	2 weeks
2016-07-14 11:51:01 +00:00
Andriy Gapon
fe0cc75230 MFV r302644: 6513 partially filled holes lose birth time
illumos/illumos-gate@8df0bcf0df
8df0bcf0df

https://www.illumos.org/issues/6513
  If a ZFS object contains a hole at level one, and then a data block is created
  at level 0 underneath that l1 block, l0 holes will be created. However, these
  l0 holes do not have the birth time property set; as a result, incremental
  sends will not send those holes.
  Fix is to modify the dbuf_read code to fill in birth time data.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Paul Dagnelie <pcd@delphix.com>
MFC after:	3 weeks
2016-07-14 11:48:42 +00:00
Andriy Gapon
e7ed92bbbc MFV r302641: 6844 dnode_next_offset can detect fictional holes
illumos/illumos-gate@11ceac77ea
11ceac77ea

https://www.illumos.org/issues/6844
  dnode_next_offset is used in a variety of places to iterate over the holes or
  allocated blocks in a dnode. It operates under the premise that it can iterate
  over the blockpointers of a dnode in open context while holding only the
  dn_struct_rwlock as reader. Unfortunately, this premise does not hold.
  When we create the zio for a dbuf, we pass in the actual block pointer in the
  indirect block above that dbuf. When we later zero the bp in
  zio_write_compress, we are directly modifying the bp. The state of the bp is
  now inconsistent from the perspective of dnode_next_offset: the bp will appear
  to be a hole until zio_dva_allocate finally finishes filling it in. In the
  meantime, dnode_next_offset can detect a hole in the dnode when none exists.
  I was able to experimentally demonstrate this behavior with the following
  setup:
  1. Create a file with 1 million dbufs.
  2. Create a thread that randomly dirties L2 blocks by writing to the first L0
  block under them.
  3. Observe dnode_next_offset, waiting for it to skip over a hole in the middle
  of a file.
  4. Do dnode_next_offset in a loop until we skip over such a non-existent hole.
  The fix is to ensure that it is valid to iterate over the indirect blocks in a
  dnode while holding the dn_struct_rwlock by passing the zio a copy of the BP
  and updating the actual BP in dbuf_write_ready while holding the lock.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Alex Reece <alex@delphix.com>
MFC after:	3 weeks
2016-07-14 11:42:53 +00:00
Andriy Gapon
875e6e5b04 MFV r302640: 6874 rollback and receive need to reset ZPL state to what's on disk
illumos/illumos-gate@1fdcbd00c9
1fdcbd00c9

https://www.illumos.org/issues/6874
  When we do a clone swap (caused by "zfs rollback" or "zfs receive"), the ZPL
  doesn't completely reload the state from the DMU; some values remain cached in
  the zfsvfs_t.
  steps to reproduce:
  ```
  #!/bin/bash -x
  zfs destroy -R test/fs
  zfs destroy -R test/recvd
  zfs create test/fs
  zfs snapshot test/fs@a
  zfs set userquota@$USER=1m test/fs
  zfs snapshot test/fs@b
  zfs send test/fs@a | zfs recv test/recvd
  zfs send -i @a test/fs@b | zfs recv test/recvd
  zfs userspace test/recvd
     1. should show 1m quota
        dd if=/dev/urandom of=/test/recvd/file bs=1k count=1024
        sync
        dd if=/dev/urandom of=/test/recvd/file2 bs=1k count=1024
     2. should fail with ENOSPC
        sync
        zfs unmount test/recvd
        zfs mount test/recvd
        zfs userspace test/recvd
     3. if bug above, now shows 1m quota
        dd if=/dev/urandom of=/test/recvd/file3 bs=1k count=1024
     4. if bug above, now fails with ENOSPC
  ```

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Matthew Ahrens <mahrens@delphix.com>
MFC after:	3 weeks
2016-07-14 11:39:36 +00:00
Andriy Gapon
ac3623e090 re-apply r299908: zfsctl_snapdir_lookup: clear VV_ROOT of snapshot's root
The change has been undone in r301275 on the assumption that it was no
longer required.  But that was incorrect, because in this case (and only
in this case) the snapshot root vnode is looked up before z_parent is
fixed up.

MFC after:	5 days
2016-07-13 15:16:51 +00:00
Mark Johnston
ca1ef36cf4 Avoid truncating the return value of DTrace predicates.
Predicates are DIF objects whose return value is compared with zero to
determine whether the corresponding probe body is to be executed. The return
value itself is the contents of a 64-bit DIF register, but it was being
truncated to an int before the comparison. This meant that a predicate such
as /0x100000000/ would evaluate to false.

Reported by:	rwatson
MFC after:	3 days
2016-07-09 22:41:21 +00:00
Steven Hartland
ae8420ed72 Fix ZFS ARC min / max tunable
Due to ARC initial configuration not being done and kmem information
not being available we need to blindly set zfs_arc_max and zfs_arc_min
when configured via the tunable.

This fixes vfs.zfs.arc_(min|max) configuration via loader.conf broken by
r302265.

Approved by:	re(gjb)
MFC after:	1 week
2016-07-06 23:49:19 +00:00
Alexander Motin
e36599916f Revert r299454 and r299448.
Those changes were found confusing FreeBSD libc ACL code, that doesn't
differentiate ACL for directories and files, and report ACLs for all
directories created after those patches as non-trivial.  On the other
side these changes were considered wrong from POSIX and NFSv4 points of
view.  Until further investigation done upstream, revert those changes
locally in preparation for FreeBSD 11.0 release.

Approved by:	re (hrs)
2016-06-30 14:55:49 +00:00
Steven Hartland
f535b2d7d8 Allow ZFS ARC min / max to be tuned at runtime
Prior to this change ZFS ARC min / max could only be changed using
boot time tunables, this allows the values to be tuned at runtime
using the sysctls:
* vfs.zfs.arc_max
* vfs.zfs.arc_min

When adjusting ZFS ARC minimum the memory used  will only reduce
to the new minimum given memory pressure.

Reviewed by:	allanjude
Approved by:	re (gjb)
MFC after:	2 weeks
Relnotes:	yes
Sponsored by:	Multiplay
Differential Revision:	https://reviews.freebsd.org/D5907
2016-06-29 07:55:45 +00:00