Commit Graph

1822 Commits

Author SHA1 Message Date
Andriy Gapon
e1b8f10a5e MFV r316920: 8023 Panic destroying a metaslab deferred range tree
illumos/illumos-gate@3991b535a8
3991b535a8

https://www.illumos.org/issues/8023
       $C
  ffffff0011bc0970 vpanic()
  ffffff0011bc0a00 strlog()
  ffffff0011bc0a30 range_tree_destroy+0x72(ffffff043769ad00)
  ffffff0011bc0a70 metaslab_fini+0xd5(ffffff0449acf380)
  ffffff0011bc0ab0 vdev_metaslab_fini+0x56(ffffff0462bae800)
  ffffff0011bc0af0 spa_unload+0x9b(ffffff03e3dac000)
  ffffff0011bc0b70 spa_export_common+0x115(ffffff047f4b4000, 2, 0, 0, 0)
  ffffff0011bc0b90 spa_destroy+0x1d(ffffff047f4b4000)
  ffffff0011bc0bd0 zfs_ioc_pool_destroy+0x20(ffffff047f4b4000)
  ffffff0011bc0c80 zfsdev_ioctl+0x4d7(11400000000, 5a01, 8040190, 100003,
  ffffff03e1956b10, ffffff0011bc0e68)
  ffffff0011bc0cc0 cdev_ioctl+0x39(11400000000, 5a01, 8040190, 100003,
  ffffff03e1956b10, ffffff0011bc0e68)
  ffffff0011bc0d10 spec_ioctl+0x60(ffffff03d9153b00, 5a01, 8040190, 100003,
  ffffff03e1956b10, ffffff0011bc0e68, 0)
  ffffff0011bc0da0 fop_ioctl+0x55(ffffff03d9153b00, 5a01, 8040190, 100003,
  ffffff03e1956b10, ffffff0011bc0e68, 0)
  ffffff0011bc0ec0 ioctl+0x9b(3, 5a01, 8040190)
  ffffff0011bc0f10 _sys_sysenter_post_swapgs+0x149()

Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: George Wilson <george.wilson@delphix.com>
MFC after:	2 weeks
2017-05-24 22:25:26 +00:00
Andriy Gapon
5386d7295a MFV r316917: 7968 multi-threaded spa_sync()
illumos/illumos-gate@94c2d0eb22
94c2d0eb22

https://www.illumos.org/issues/7968
  spa_sync() iterates over all the dirty dnodes and processes each of them by
  calling dnode_sync(). If there are many dirty dnodes (e.g. because we created
  or removed a lot of files), the single thread of spa_sync() calling
  dnode_sync() can become a bottleneck. Additionally, if many dnodes are dirtied
  concurrently in open context (e.g. due to concurrent file creation), the
  os_lock will experience lock contention via dnode_setdirty().
  The solution is to track dirty dnodes on a multilist_t, and for spa_sync() to
  use separate threads to process each of the sublists in the multilist.
  On the concurrent file creation microbenchmark, the performance improvement
  from dnode_setdirty() is up to 7%. Additionally, the wall clock time spent in
  spa_sync() is reduced to 15%-40% of the single-threaded case. In terms of cost/
  reward, once the other bottlenecks are addressed, fixing this bug will provide
  a medium-large performance gain and require a medium amount of effort to
  implement.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
MFC after:	3 weeks
2017-05-24 22:21:24 +00:00
Andriy Gapon
2ba631553c MFV r316916: 7970 zfs_arc_num_sublists_per_state should be common to all multilists
illumos/illumos-gate@10fbdecb05
10fbdecb05

https://www.illumos.org/issues/7970
  The global tunable zfs_arc_num_sublists_per_state is used by the ARC and
  the dbuf cache, and other users are planned. We should change this
  tunable to be common to all multilists.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
MFC after:	3 weeks
2017-05-24 22:15:16 +00:00
Andriy Gapon
1d7634429c MFC r316915: 7801 add more by-dnode routines (lint)
illumos/illumos-gate@411be58a6e
411be58a6e
MFC after:	24 days
X-MFC with:	r318823
2017-05-24 21:52:20 +00:00
Andriy Gapon
31fd119cc2 MFC r316914: 7801 add more by-dnode routines
illumos/illumos-gate@b0c42cd470
b0c42cd470

https://www.illumos.org/issues/7801
  Add *_by_dnode() routines for accessing objects given their
  dnode_t *, this is more efficient than accessing the object by
  (objset_t *, uint64_t object). This change converts some but
  not all of the existing consumers. As performance-sensitive
  code paths are discovered they should be converted to use
  these routines.
  Ported from: 0eef1bde31

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: bzzz77 <bzzz.tomas@gmail.com>
MFC after:	24 days
2017-05-24 21:49:21 +00:00
Andriy Gapon
5aab788866 MFC r316913: 7869 panic in bpobj_space(): null pointer dereference
illumos/illumos-gate@a3905a4592
a3905a4592

https://www.illumos.org/issues/7869
  The issue fixed by this patch is a race condition in the deadlist code.
  A thread executing an administrative command that uses
  `dsl_deadlist_space_range()` holds the lock of the whole `deadlist_t` to
  protect the access of all its entries that the deadlist contains in an
  avl tree.
  Sync threads trying to insert a new entry in the deadlist
  (through `dsl_deadlist_insert()` -> `dle_enqueue()`) do not hold the
  deadlist lock at that moment. If the `dle_bpobj` is the empty bpobj (our
  sentinel value), we close and reopen it. Between these two operations,
  it is possible for the `dsl_deadlist_space_range()` thread to dereference
  that bpobj which is `NULL` during that window.
  Threads should hold the a deadlist's `dl_lock` when they manipulate its
  internal data so scenarios like the one above are avoided. In addition,
  threads should also hold the bpobj lock whenever they are allocating the
  subobj list of a bpobj, and not just when they actually insert the subobj
  to the list. This way we can avoid potential memory leaks.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
MFC after:	2 weeks
2017-05-24 21:45:52 +00:00
Andriy Gapon
930f1af491 MFC r316912: 7793 ztest fails assertion in dmu_tx_willuse_space
illumos/illumos-gate@61e255ce72
61e255ce72

https://www.illumos.org/issues/7793
  Background information: This assertion about tx_space_* verifies that we
  are not dirtying more stuff than we thought we would. We “need” to know
  how much we will dirty so that we can check if we should fail this
  transaction with ENOSPC/EDQUOT, in dmu_tx_assign(). While the
  transaction is open (i.e. between dmu_tx_assign() and dmu_tx_commit() —
  typically less than a millisecond), we call dbuf_dirty() on the exact
  blocks that will be modified. Once this happens, the temporary
  accounting in tx_space_* is unnecessary, because we know exactly what
  blocks are newly dirtied; we call dnode_willuse_space() to track this
  more exact accounting.
  The fundamental problem causing this bug is that dmu_tx_hold_*() relies
  on the current state in the DMU (e.g. dn_nlevels) to predict how much
  will be dirtied by this transaction, but this state can change before we
  actually perform the transaction (i.e. call dbuf_dirty()).
  This bug will be fixed by removing the assertion that the tx_space_*
  accounting is perfectly accurate (i.e. we never dirty more than was
  predicted by dmu_tx_hold_*()). By removing the requirement that this
  accounting be perfectly accurate, we can also vastly simplify it, e.g.
  removing most of the logic in dmu_tx_count_*().
  The new tx space accounting will be very approximate, and may be more or
  less than what is actually dirtied. It will still be used to determine
  if this transaction will put us over quota. Transactions that are marked
  by dmu_tx_mark_netfree() will be excepted from this check. We won’t make
  an attempt to determine how much space will be freed by the transaction
  — this was rarely accurate enough to determine if a transaction should
  be permitted when we are over quota, which is why dmu_tx_mark_netfree()
  was introduced in 2014.
  We also won’t attempt to give “credit” when overwriting existing blocks,
  if those blocks may be freed. This allows us to remove the
  do_free_accounting logic in dbuf_dirty(), and associated routines. This

Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
MFC after:	3 weeks
2017-05-24 21:43:34 +00:00
Andriy Gapon
3a9c923927 MFC r316907: 1300 filename normalization doesn't work for removes
illumos/illumos-gate@1c17160ac5
1c17160ac5

https://www.illumos.org/issues/1300

FreeBSD note: recent FreeBSD was not affected by the issue fixed as the
name cache is completely bypassed when normalization is enabled.
The change is imported for the sake of ZAP infrastructure modifications.

Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Kevin Crowe <kevin.crowe@nexenta.com>

MFC after:	3 weeks
2017-05-24 21:29:31 +00:00
Konstantin Belousov
6992112349 Commit the 64-bit inode project.
Extend the ino_t, dev_t, nlink_t types to 64-bit ints.  Modify
struct dirent layout to add d_off, increase the size of d_fileno
to 64-bits, increase the size of d_namlen to 16-bits, and change
the required alignment.  Increase struct statfs f_mntfromname[] and
f_mntonname[] array length MNAMELEN to 1024.

ABI breakage is mitigated by providing compatibility using versioned
symbols, ingenious use of the existing padding in structures, and
by employing other tricks.  Unfortunately, not everything can be
fixed, especially outside the base system.  For instance, third-party
APIs which pass struct stat around are broken in backward and
forward incompatible ways.

Kinfo sysctl MIBs ABI is changed in backward-compatible way, but
there is no general mechanism to handle other sysctl MIBS which
return structures where the layout has changed. It was considered
that the breakage is either in the management interfaces, where we
usually allow ABI slip, or is not important.

Struct xvnode changed layout, no compat shims are provided.

For struct xtty, dev_t tty device member was reduced to uint32_t.
It was decided that keeping ABI compat in this case is more useful
than reporting 64-bit dev_t, for the sake of pstat.

Update note: strictly follow the instructions in UPDATING.  Build
and install the new kernel with COMPAT_FREEBSD11 option enabled,
then reboot, and only then install new world.

Credits: The 64-bit inode project, also known as ino64, started life
many years ago as a project by Gleb Kurtsou (gleb).  Kirk McKusick
(mckusick) then picked up and updated the patch, and acted as a
flag-waver.  Feedback, suggestions, and discussions were carried
by Ed Maste (emaste), John Baldwin (jhb), Jilles Tjoelker (jilles),
and Rick Macklem (rmacklem).  Kris Moore (kris) performed an initial
ports investigation followed by an exp-run by Antoine Brodin (antoine).
Essential and all-embracing testing was done by Peter Holm (pho).
The heavy lifting of coordinating all these efforts and bringing the
project to completion were done by Konstantin Belousov (kib).

Sponsored by:	The FreeBSD Foundation (emaste, kib)
Differential revision:	https://reviews.freebsd.org/D10439
2017-05-23 09:29:05 +00:00
Mark Johnston
de3a96e3b1 Ensure that profile and tick probes provide a non-zero PC value.
The idle thread may process callouts while reloading the timer in
cpu_activeclock(). In this case, provide a representative value, &cpu_idle,
instead of 0 for args[0] so that the active thread can be more easily
identified from the probe.

This addresses intermittent failures of the profile-n/tst.argtest.d test.

MFC after:	2 weeks
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D10651
2017-05-15 21:44:40 +00:00
Alan Somers
7ac72c256f vdev_geom may associate multiple vdevs per g_consumer
vdev_geom.c currently uses the g_consumer's private field to point to a
vdev_t. That way, a GEOM event can cause a change to a ZFS vdev. For
example, when you remove a disk, the vdev's status will change to REMOVED.
However, vdev_geom will sometimes attach multiple vdevs to the same GEOM
consumer. If this happens, then geom events will only be propagated to one
of the vdevs.

Fix this by storing a linked list of vdevs in g_consumer's private field.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c

* g_consumer.private now stores a linked list of vdev pointers associated
  with the consumer instead of just a single vdev pointer.

* Change vdev_geom_set_physpath's signature to more closely match
  vdev_geom_set_rotation_rate

* Don't bother calling g_access in vdev_geom_set_physpath. It's guaranteed
  that we've already accessed the consumer by the time we get here.

* Don't call vdev_geom_set_physpath in vdev_geom_attach. Instead, call it
  in vdev_geom_open, after we know that the open has succeeded.

PR:		218634
Reviewed by:	gibbs
MFC after:	1 week
Sponsored by:	Spectra Logic Corp
Differential Revision:	https://reviews.freebsd.org/D10391
2017-05-11 16:26:56 +00:00
Justin Hibbits
675cad71e7 Fix stack tracing in dtrace for powerpc
The current method only sort of works, and usually doesn't work reliably.
Also, on Book-E the return address from DEBUG exceptions is not the sentinel
addresses, so it won't exit the loop correctly.

Fix this by better handling trap frames during unwinding, and using the
common trap handler for debug traps, as the code in that segment is
identical between the two.

MFC after:	1 week
2017-05-11 00:23:51 +00:00
Justin Hibbits
679ea09441 Fix the encoded instruction for FBT traps on powerpc
r314370 changed EXC_DTRACE to a different instruction, but neglected to
make the same change to fbt, so dtrace didn't actually pick it up,
resulting in entering KDB instead of trapping for dtrace.

MFC after:	1 week
2017-05-10 03:47:22 +00:00
Justin Hibbits
0440a7f539 Fix check for fbt_excluded() in powerpc
fbt_excluded() returns 1 if the symbol is to be excluded.  Every other
arch has this correct, powerpc was the only broken one

MFC after:	1 week
2017-05-10 03:20:20 +00:00
Mark Johnston
23bff6073b Fix a harmless LOR in dtrace_load().
MFC after:	1 week
2017-05-01 17:01:00 +00:00
Josh Paetzel
ba13ab83f2 Fix misport of compressed ZFS send/recv from 317414
Reported by:	Michael Jung <mikej@mikej.com>
Reviewed by:	avg
2017-05-01 12:56:12 +00:00
Mark Johnston
babf030fd6 Get rid of some ifdef soup in the fasttrap ioctl handler.
No functional change intended.

MFC after:	1 week
2017-04-28 22:25:22 +00:00
Josh Paetzel
285d85ab04 MFV 316905
7740 fix for 6513 only works in hole punching case, not truncation

illumos/illumos-gate@7de35a3ed0
7de35a3ed0

https://www.illumos.org/issues/7740
  The problem is that dbuf_findbp will return ENOENT if the block it's
  trying to find is beyond the end of the file. If that happens, we assume
  there is no birth time, and so we lose that information when we write
  out new blkptrs. We should teach dbuf_findbp to look for things that are
  beyond the current end, but not beyond the absolute end of the file.
  To verify, create a large file, truncate it to a short length, and then
  write beyond the end. Check with zdb to make sure that there are no
  holes with birth time zero (will appear as gaps).

Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Paul Dagnelie <pcd@delphix.com>
2017-04-28 02:11:29 +00:00
Josh Paetzel
358f157522 MFV 316900
7743 per-vdev-zaps have no initialize path on upgrade

illumos/illumos-gate@555da5111b
555da5111b

https://www.illumos.org/issues/7743
  When loading a pool that had been created before the existance of
  per-vdev zaps, on a system that knows about per-vdev zaps, the
  per-vdev zaps will not be allocated and initialized.
  This appears to be because the logic that would have done so, in
  spa_sync_config_object(), is not reached under normal operation. It is
  only reached if spa_config_dirty_list is non-empty.
  The fix is to add another `AVZ_ACTION_` enum that will allow this code
  to be reached when we detect that we're loading an old pool, even when
  there are no dirty configs.

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Paul Dagnelie <pcd@delphix.com>
2017-04-27 23:31:38 +00:00
Josh Paetzel
8ad5797208 MFV 316898
7613 ms_freetree[4] is only used in syncing context

illumos/illumos-gate@5f14577801
5f14577801

https://www.illumos.org/issues/7613
  metaslab_t:ms_freetree[TXG_SIZE] is only used in syncing context. We should
  replace it with two trees: the freeing tree (ranges that we are freeing this
  syncing txg) and the freed tree (ranges which have been freed this txg).

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2017-04-27 22:00:03 +00:00
Josh Paetzel
e3cb0e99f8 MFV 316897
7586 remove #ifdef __lint hack from dmu.h

illumos/illumos-gate@4ba5b96163
4ba5b96163

https://www.illumos.org/issues/7586
  The #ifdef __lint in dmu.h is ugly, and it would be nice not to duplicate it if
  we add other inline functions into header files in ZFS, especially since it is
  difficult to make any other solution work across all compilation targets. We
  should switch to disabling the lint flags that are failing instead.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Dan Kimmel <dan.kimmel@delphix.com>
2017-04-27 21:11:57 +00:00
Josh Paetzel
011275233c MFV 316896
7580 ztest failure in dbuf_read_impl

illumos/illumos-gate@1a01181fdc
1a01181fdc

https://www.illumos.org/issues/7580
  We need to prevent any reader whenever we're about the zero out all the
  blkptrs. To do this we need to grab the dn_struct_rwlock as writer in
  dbuf_write_children_ready and free_children just prior to calling bzero.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: George Wilson <george.wilson@delphix.com>
2017-04-27 16:38:28 +00:00
Josh Paetzel
fa88c78914 MFV 316895
7606 dmu_objset_find_dp() takes a long time while importing pool

illumos/illumos-gate@7588687e6b
7588687e6b

https://www.illumos.org/issues/7606
  When importing a pool with a large number of filesystems within the same
  parent filesystem, we see that dmu_objset_find_dp() takes a long time.
  It is called from 3 places: spa_check_logs(), spa_ld_claim_log_blocks(),
  and spa_load_verify().
  There are several ways to improve performance here:
  1. We don't really need to do spa_check_logs() or
         spa_ld_claim_log_blocks() if the pool was closed cleanly.
  2. spa_load_verify() uses dmu_objset_find_dp() to check that no
         datasets have too long of names.
  3. dmu_objset_find_dp() is slow because it's doing
         zap_value_search() (which is O(N sibling datasets)) to determine
         the name of each dsl_dir when it's opened. In this case we
         actually know the name when we are opening it, so we can provide
         it and avoid the lookup.
  This change implements fix #3 from the above list; i.e. make
  dmu_objset_find_dp() provide the name of the dataset so that we don't
  have to search for it.

Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <prashksp@gmail.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2017-04-27 15:10:45 +00:00
Josh Paetzel
c78abb8b50 MFV 316894
7252 7628 compressed zfs send / receive

illumos/illumos-gate@5602294fda
5602294fda

https://www.illumos.org/issues/7252
  This feature includes code to allow a system with compressed ARC enabled to
  send data in its compressed form straight out of the ARC, and receive data in
  its compressed form directly into the ARC.

https://www.illumos.org/issues/7628
  We should have longer, more readable versions of the ZFS send / recv options.

7628 create long versions of ZFS send / receive options

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: David Quigley <dpquigl@davequigley.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Dan Kimmel <dan.kimmel@delphix.com>
2017-04-25 17:57:43 +00:00
Josh Paetzel
ef18459108 MFV 316891
7386 zfs get does not work properly with bookmarks

illumos/illumos-gate@edb901aab9
edb901aab9

https://www.illumos.org/issues/7386
  The zfs get command does not work with the bookmark parameter while it works
  properly with both filesystem and snapshot:
  # zfs get -t all -r creation rpool/test
  NAME               PROPERTY  VALUE                  SOURCE
  rpool/test         creation  Fri Sep 16 15:00 2016  -
  rpool/test@snap    creation  Fri Sep 16 15:00 2016  -
  rpool/test#bkmark  creation  Fri Sep 16 15:00 2016  -
  # zfs get -t all -r creation rpool/test@snap
  NAME             PROPERTY  VALUE                  SOURCE
  rpool/test@snap  creation  Fri Sep 16 15:00 2016  -
  # zfs get -t all -r creation rpool/test#bkmark
  cannot open 'rpool/test#bkmark': invalid dataset name
  #
  The zfs get command should be modified to work properly with bookmarks too.

Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Marcel Telka <marcel@telka.sk>
2017-04-21 19:53:52 +00:00
Josh Paetzel
36064ac2d5 MFV 316871
7490 real checksum errors are silenced when zinject is on

illumos/illumos-gate@6cedfc397d
6cedfc397d

https://www.illumos.org/issues/7490
  When zinject is on, error codes from zfs_checksum_error() can be overwritten
  due to an incorrect and overly-complex if condition.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
2017-04-21 00:24:59 +00:00
Josh Paetzel
9a625bd31c MFV 316870
7448 ZFS doesn't notice when disk vdevs have no write cache

illumos/illumos-gate@295438ba32
295438ba32

https://www.illumos.org/issues/7448
       I built a SmartOS image with all the NVMe commits including 7372
       (support NVMe volatile write cache) and repeated my dd testing:
       > #!/bin/bash
       > for i in `seq 1 1000`; do
       > dd if=/dev/zero of=file00 bs=1M count=102400 oflag=sync &
       > dd if=/dev/zero of=file01 bs=1M count=102400 oflag=sync &
       > wait
       > rm file00 file01
       > done
       >
       Previously each dd command took ~145 seconds to finish, now it takes
       ~400 seconds.
       Eventually I figured out it is 7372 that causes unnecessary
       nvme_bd_sync() executions which wasted CPU cycles.
  If a NVMe device doesn't support a write cache, the nvme_bd_sync function will
  return ENOTSUP to indicate this to upper layers.
  It seems this returned value is ignored by ZFS, and as such this bug is not
  really specific to NVMe. In vdev_disk_io_start() ZFS sends the flush to the
  disk driver (blkdev) with a callback to vdev_disk_ioctl_done(). As nvme filled
  in the bd_sync_cache function pointer, blkdev will not return ENOTSUP, as the
  nvme driver in general does support cache flush. Instead it will issue an
  asynchronous flush to nvme and immediately return 0, and hence ZFS will not set
  vdev_nowritecache here. The nvme driver will at some point process the cache
  flush command, and if there is no write cache on the device it will return
  ENOTSUP, which will be delivered to the vdev_disk_ioctl_done() callback. This
  function will not check the error code and not set nowritecache.
  The right place to check the error code from the cache flush is in
  zio_vdev_io_assess(). This would catch both cases, synchronous and asynchronous
  cache flushes. This would also be independent of the implementation detail that
  some drivers can return ENOTSUP immediately.

Reviewed by: Dan Fields <dan.fields@nexenta.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
Obtained from:	Illumos
2017-04-21 00:17:54 +00:00
Josh Paetzel
47e222432b MFV 316868
7430 Backfill metadnode more intelligently

illumos/illumos-gate@af346df588
af346df588

https://www.illumos.org/issues/7430
  Description and patch from brought over from the following ZoL commit: https://
  github.com/zfsonlinux/zfs/commit/68cbd56e182ab949f58d004778d463aeb3f595c6
  Only attempt to backfill lower metadnode object numbers if at least
  4096 objects have been freed since the last rescan, and at most once
  per transaction group. This avoids a pathology in dmu_object_alloc()
  that caused O(N^2) behavior for create-heavy workloads and
  substantially improves object creation rates. As summarized by
  @mahrens in #4636:
  "Normally, the object allocator simply checks to see if the next
  object is available. The slow calls happened when dmu_object_alloc()
  checks to see if it can backfill lower object numbers. This happens
  every time we move on to a new L1 indirect block (i.e. every 32 *
  128 = 4096 objects). When re-checking lower object numbers, we use
  the on-disk fill count (blkptr_t:blk_fill) to quickly skip over
  indirect blocks that don?t have enough free dnodes (defined as an L2
  with at least 393,216 of 524,288 dnodes free). Therefore, we may
  find that a block of dnodes has a low (or zero) fill count, and yet
  we can?t allocate any of its dnodes, because they've been allocated
  in memory but not yet written to disk. In this case we have to hold
  each of the dnodes and then notice that it has been allocated in
  memory.
  The end result is that allocating N objects in the same TXG can
  require CPU usage proportional to N^2."
  Add a tunable dmu_rescan_dnode_threshold to define the number of
  objects that must be freed before a rescan is performed. Don't bother
  to export this as a module option because testing doesn't show a
  compelling reason to change it. The vast majority of the performance
  gain comes from limit the rescan to at most once per TXG.

Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Author: Ned Bass <bass6@llnl.gov>

Obtained from:	Illumos
2017-04-21 00:12:47 +00:00
Gleb Smirnoff
83c9dea1ba - Remove 'struct vmmeter' from 'struct pcpu', leaving only global vmmeter
in place.  To do per-cpu stats, convert all fields that previously were
  maintained in the vmmeters that sit in pcpus to counter(9).
- Since some vmmeter stats may be touched at very early stages of boot,
  before we have set up UMA and we can do counter_u64_alloc(), provide an
  early counter mechanism:
  o Leave one spare uint64_t in struct pcpu, named pc_early_dummy_counter.
  o Point counter(9) fields of vmmeter to pcpu[0].pc_early_dummy_counter,
    so that at early stages of boot, before counters are allocated we already
    point to a counter that can be safely written to.
  o For sparc64 that required a whole dummy pcpu[MAXCPU] array.

Further related changes:
- Don't include vmmeter.h into pcpu.h.
- vm.stats.vm.v_swappgsout and vm.stats.vm.v_swappgsin changed to 64-bit,
  to match kernel representation.
- struct vmmeter hidden under _KERNEL, and only vmstat(1) is an exclusion.

This is based on benno@'s 4-year old patch:
https://lists.freebsd.org/pipermail/freebsd-arch/2013-July/014471.html

Reviewed by:	kib, gallatin, marius, lidl
Differential Revision:	https://reviews.freebsd.org/D10156
2017-04-17 17:34:47 +00:00
Gleb Smirnoff
9ed01c32e0 All these files need sys/vmmeter.h, but now they got it implicitly
included via sys/pcpu.h.
2017-04-17 17:07:00 +00:00
Andriy Gapon
656074ea60 rename vfs.zfs.debug_flags to vfs.zfs.debugflags
While the former name is easier to read, the "_flags" suffix has a special
meaning for loader(8) and, thus, it was impossible to set the knob via
loader.conf(5).  The loader interpreted the setting as flags that should
be passed to a kernel module named "vfs.zfs.debug".

Discussed with:	smh
MFC after:	2 weeks
2017-04-14 15:35:07 +00:00
Alan Somers
d255847d9e Fix vdev_geom_attach_by_guids for partitioned disks
When opening a vdev whose path is unknown, vdev_geom must find a geom
provider with a label whose guids match the desired vdev. However, due to
partitioning, it is possible that two non-synonomous providers will share
some labels. For example, if the first partition starts at the beginning of
the drive, then ada0 and ada0p1 will share the first label. More troubling,
if the last partition runs to the end of the drive, then ada0p3 and ada0
will share the last label. If vdev_geom opens ada0 when it should've opened
ada0p3, then the pool won't be readable. If it opens ada0 when it should've
opened ada0p1, then it will corrupt some other partition when it writes the
3rd and 4th labels.

The easiest way to reproduce this problem is to install a mirrored root pool
with the default partition layout, then swap the positions of the two boot
drives and reboot.  Whether the bug manifests depends on the order in which
geom lists its providers, which is arbitrary.

Fix this situation by modifying the search algorithm to prefer geom
providers that have all four labels intact. If no such provider exists, then
open whichever provider has the most.

Reviewed by:	mav
MFC after:	3 weeks
Sponsored by:	Spectra Logic Corp
Differential Revision:	https://reviews.freebsd.org/D10365
2017-04-13 14:51:34 +00:00
Patrick Kelsey
67d955aab4 Corrected misspelled versions of rendezvous.
The MFC will include a compat definition of smp_no_rendevous_barrier()
that calls smp_no_rendezvous_barrier().

Reviewed by:	gnn, kib
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D10313
2017-04-09 02:00:03 +00:00
Steven Hartland
3e856909b7 Fix expandsz 16.0E vals and vdev_min_asize of RAIDZ children
When a member of a RAIDZ has been replaced with a device smaller than the
original, then the top level vdev can report its expand size as 16.0E.

The reduced child asize causes the RAIDZ to have a vdev_asize lower than its
vdev_max_asize which then results in an underflow during the calculation of
the parents expand size.

Fix this by updating the vdev_asize if it shrinks, which is already
protected by a check against vdev_min_asize so should always be safe.

Also for RAIDZ vdevs, ensure that the sum of their child vdev_min_asize is
always greater than the parents vdev_min_size.

Fixes: https://www.illumos.org/issues/7885

MFC after:	2 weeks
Sponsored by:	Multiplay
2017-04-03 13:11:28 +00:00
Josh Paetzel
e106234416 MFV: 315989
7603 xuio_stat_wbuf_* should be declared (void)

illumos/illumos-gate@99aa8b5505
99aa8b5505

https://www.illumos.org/issues/7603

  The funcs are declared k&r style, where the args are not specified:

  void xuio_stat_wbuf_copied();
  They should be declared to take no arguments:

  void xuio_stat_wbuf_copied(void);
  Need to change both .c and .h.

Author: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
2017-03-27 17:27:46 +00:00
Alexander Motin
3aef5b286a MFV r315290, r315291: 7303 dynamic metaslab selection
illumos/illumos-gate@8363e80ae7
https://github.com/illumos/illumos-gate/commit/8363e80ae72609660f6090766ca8c2c18

https://www.illumos.org/issues/7303

  This change introduces a new weighting algorithm to improve metaslab selection.
  The new weighting algorithm relies on the SPACEMAP_HISTOGRAM feature. As a result,
  the metaslab weight now encodes the type of weighting algorithm used
  (size-based vs segment-based).

  This also introduce a new allocation tracing facility and two new dcmds to help
  debug allocation problems. Each zio now contains a zio_alloc_list_t structure
  that is populated as the zio goes through the allocations stage. Here's an
  example of how to use the tracing facility:

> c5ec000::print zio_t io_alloc_list | ::walk list | ::metaslab_trace
  MSID    DVA    ASIZE      WEIGHT             RESULT               VDEV
     -      0      400           0    NOT_ALLOCATABLE           ztest.0a
     -      0      400           0    NOT_ALLOCATABLE           ztest.0a
     -      0      400           0             ENOSPC           ztest.0a
     -      0      200           0    NOT_ALLOCATABLE           ztest.0a
     -      0      200           0    NOT_ALLOCATABLE           ztest.0a
     -      0      200           0             ENOSPC           ztest.0a
     1      0      400      1 x 8M            17b1a00           ztest.0a

> 1ff2400::print zio_t io_alloc_list | ::walk list | ::metaslab_trace
  MSID    DVA    ASIZE      WEIGHT             RESULT               VDEV
     -      0      200           0    NOT_ALLOCATABLE           mirror-2
     -      0      200           0    NOT_ALLOCATABLE           mirror-0
     1      0      200      1 x 4M            112ae00           mirror-1
     -      1      200           0    NOT_ALLOCATABLE           mirror-2
     -      1      200           0    NOT_ALLOCATABLE           mirror-0
     1      1      200      1 x 4M            112b000           mirror-1
     -      2      200           0    NOT_ALLOCATABLE           mirror-2

  If the metaslab is using segment-based weighting then the WEIGHT column will
  display the number of segments available in the bucket where the allocation
  attempt was made.

Author: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Chris Siden <christopher.siden@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
2017-03-24 09:37:00 +00:00
Andriy Gapon
16b46572fa zfs_putpages: use TXG_WAIT
Explicit looping using TXG_NOWAIT is more verbose and may harm performance
under heavy load because of multiple waits.

MFC after:	1 week
2017-03-23 09:13:21 +00:00
Andriy Gapon
3d775e193e zfs: add zio_buf_alloc_nowait and use it in vdev_queue_aggregate
This way we can avoid blocking the whole queue in the low memory
situations.  It's better to sacrifice some I/O performance by not doing
the aggregation than to add an indefinite wait for more memory.

Reviewed by:	smh
MFC after:	2 weeks
Sponsored by:	Panzura
Differential Revision: https://reviews.freebsd.org/D9999
2017-03-23 08:59:17 +00:00
Steven Hartland
c76da62acf Reduce ARC fragmentation threshold
As ZFS can request up to SPA_MAXBLOCKSIZE memory block e.g. during zfs recv,
update the threshold at which we start agressive reclamation to use
SPA_MAXBLOCKSIZE (16M) instead of the lower zfs_max_recordsize which
defaults to 1M.

PR:		194513
Reviewed by:	avg, mav
MFC after:	1 month
Sponsored by:	Multiplay
Differential Revision:	https://reviews.freebsd.org/D10012
2017-03-17 12:34:57 +00:00
Mark Johnston
9fc47d244c Fix a backwards comparison in the code to dump a DTrace debug buffer.
PR:		217739
MFC after:	1 week
2017-03-13 18:43:00 +00:00
Andriy Gapon
520758a51d zfs: provide a special vptocnp method for the .zfs vnode
vop_stdvptocnp() doesn't work properly if .zfs directory is hidden.

Reported by:	swills, des
Tested by:	des
MFC after:	1 week
MFC with:	r314048
2017-03-11 16:00:49 +00:00
Andriy Gapon
1a3c849840 MFV r314911: 7867 ARC space accounting leak
illumos/illumos-gate@6de76ce2a9
6de76ce2a9

https://www.illumos.org/issues/7867
  It seems that in the case where arc_hdr_free_pdata() sees HDR_L2_WRITING() we
  would fail to update the ARC space statistics.
  In the normal case those statistics are updated in arc_free_data_buf(). But in
  the arc_hdr_free_on_write() path we don't do that.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Andriy Gapon <avg@FreeBSD.org>

MFC after:	10 days
2017-03-08 13:52:45 +00:00
Andriy Gapon
7e4b3a6fa2 MFV r314910: 7843 get_clones_stat() is suboptimal for lots of clones
illumos/illumos-gate@c5bde7273e
c5bde7273e

https://www.illumos.org/issues/7843
  get_clones_stat() could be very slow if a snapshot has many (thousands) clones.
  Clone names are added to an nvlist that's created with NV_UNIQUE_NAME.
  So, each time a new name is appended to the list, the whole list is searched
  linearly to see if that name is not already in the list. That results in the
  quadratic complexity.
  That should be easy to fix as we know in advance that we should not get any
  duplicate names, so we can drop NV_UNIQUE_NAME when creating the list.

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Andriy Gapon <avg@FreeBSD.org>

MFC after:	1 week
Sponsored by:	ClusterHQ
2017-03-08 13:48:26 +00:00
Martin Matuska
0132c9cd4a Fix null pointer dereference in zfs_freebsd_setacl().
Prevents unprivileged users from panicking the kernel by calling
__acl_delete_*() on files or directories inside a ZFS mount.

MFC after:	3 days
2017-03-02 23:23:28 +00:00
Alexander Motin
6d1ccf40cc Execute last ZIO of log commit synchronously.
For short transactions overhead of context switch can be too large.
Skipping it gives significant latency reduction.  For large ones,
including multiple ZIOs, latency is less critical, while throughput
there may become limited by checksumming speed of single CPU core.
To get best of both cases, execute last ZIO directly from calling
thread context to save latency, while all others (if there are any)
enqueue to taskqueues in traditional way.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2017-03-02 07:55:47 +00:00
Alexander Motin
e93f9c7708 Completely skip cache flushing for not supporting log devices.
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2017-03-02 07:50:06 +00:00
Andrey V. Elsukov
19b60f70c0 Do not invoke the resize event when previous provider's size was zero.
This is similar to r303637 fix for geom_disk.

Reported by:	avg
Tested by:	avg
MFC after:	1 week
2017-03-01 18:03:32 +00:00
Josh Paetzel
b98d22744f MFV 314276
7570 tunable to allow zvol SCSI unmap to return on commit of txn to ZIL

illumos/illumos-gate@1c9272b861
1c9272b861

https://www.illumos.org/issues/7570

  Based on the discovery that every unmap waits for the commit of the txn to the ZIL,
  introducing a very high latency to unmap commands, this behavior was made into a
  tunable zvol_unmap_sync_enabled and set to false. The net impact of this change is
  that by default SCSI unmap commands will result in space being freed within the zvol
  (today they are ignored and returned with good status). However, unlike the code
  today, instead of 18+ms per unmap, they take about 30us.

  With the testing done on NTFS against a Win2k12 target, the new behavior should work
  seamlessly. Files on the zvol that have already been set with the zfree application
  will continue to write 0's when deleted, and any new files created since zvol
  creation will send unmap commands when deleted. This behavior exists today, but with
  this change the unmap commands will be processed and result in reclaim of space.

Author: Stephen Blinick <stephen.blinick@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Approved by: Robert Mustacchi <rm@joyent.com>
2017-02-25 20:01:17 +00:00
Andriy Gapon
9211bb327f l2arc: try to fix write size calculation broken by Compressed ARC commit
While there, make a change to not evict a first buffer outside the
requested eviciton range.

To do:
- give more consistent names to the size variables
- upstream to OpenZFS

PR:		216178
Reported by:	lev
Tested by:	lev
MFC after:	2 weeks
2017-02-25 17:03:48 +00:00
Andriy Gapon
1e1065b60f zfs: call spa_deadman on a taskqueue thread
callout(9) prohibits callout functions from sleeping.
illumos mutexes are emulated using sx(9).
spa_deadman() calls vdev_deadman() and the latter acquires vq_lock.

As a result we can get a more confusing panic instead of a specific
panic or no panic:
sleepq_add: td 0xfffff80019669960 to sleep on wchan 0xfffff8001cff4d88 with sleeping prohibited

This change adds another level of indirection where the deadman
callout schedules spa_deadman() to be executed on taskqueue_thread.

While there, use callout_schedule(0 instead of callout_reset()
in spa_sync().

Discussed with:	mav
MFC after:	1 week
Differential Revision: https://reviews.freebsd.org/D9762
2017-02-25 16:45:53 +00:00
Josh Paetzel
029c0bfdbd MFV 314243
6676 Race between unique_insert() and unique_remove() causes ZFS fsid change

illumos/illumos-gate@40510e8eba
40510e8eba

https://www.illumos.org/issues/6676

  The fsid of zfs filesystems might change after reboot or remount. The problem seems to
  be caused by a race between unique_insert() and unique_remove(). The unique_remove()
  is called from dsl_dataset_evict() which is now an asynchronous thread. In a case the
  dsl_dataset_evict() thread is very slow and calls unique_remove() too late we will end
  up with changed fsid on zfs mount.

  This problem is very likely caused by #5056.

  Steps to Reproduce
  Note: I'm able to reproduce this always on a single core (virtual) machine. On multicore
  machines it is not so easy to reproduce.

# uname -a
SunOS openindiana 5.11 illumos-633aa80 i86pc i386 i86pc Solaris
# zfs create rpool/TEST
# FS=$(echo ::fsinfo | mdb -k | grep TEST | awk '{print $1}')
# echo $FS::print vfs_t vfs_fsid | mdb -k
vfs_fsid = {
    vfs_fsid.val = [ 0x54d7028a, 0x70311508 ]
}
# zfs umount rpool/TEST
# zfs mount rpool/TEST
# FS=$(echo ::fsinfo | mdb -k | grep TEST | awk '{print $1}')
# echo $FS::print vfs_t vfs_fsid | mdb -k
vfs_fsid = {
    vfs_fsid.val = [ 0xd9454e49, 0x6b36d08 ]
}
#

  Impact
  The persistent fsid (filesystem id) is essential for proper NFS functionality.
  If the fsid of a filesystem changes on remount (or after reboot) the NFS
  clients might not be able to automatically recover from such event and the
  manual remount of the NFS filesystems on every NFS client might be needed.

Author: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Dan Vatca <dan.vatca@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
2017-02-25 14:45:54 +00:00
Andriy Gapon
7fa27112f3 zfs: clean up unused files and definitions
MFC after:	1 month
X-MFC after:	r314048
2017-02-24 07:53:56 +00:00
Toomas Soome
84a6eddc43 loader: update symlink support in zfs reader
As the current zfs file system is providing symlink via system attributes, need
to update the code accordingly.

Note, as the zfsboot code does not free the memory at this time, the
object list will put some stress on the boot2 heap, eventually we should
address the issue.

Reviewed by:	allanjude, smh
Approved by:	allanjude (mentor)
Differential Revision:	https://reviews.freebsd.org/D9706
2017-02-22 22:00:50 +00:00
Andriy Gapon
b93763e55d zfs: move zio_taskq_basedc under SYSDC
That knob is useless without SDC (or alike) scheduling class support.
That is, it's unused on FreeBSD.

MFC after:	4 days
2017-02-21 21:11:58 +00:00
Andriy Gapon
2b1bedaf06 zfs: lower priority of zio_write_issue threads by four
The difference of one was insignificant because zio_write_issue threads
ended up on the same run queues as other zio threads.
See sys/priority.h and sys/runq.h for more details.

Add a comment describing FreeBSD priority considerations and restore
the illumos variant of the code for comparison.

Obtained from:	Panzura
MFC after:	2 weeks
Sponsored by:	Panzura
2017-02-21 21:09:21 +00:00
Andriy Gapon
47c8e3d912 reimplement zfsctl (.zfs) support
The current code is written on top of GFS, a library with the generic
support for writing filesystems, which was ported from illumos.
Because of significant differences between illumos VFS and FreeBSD
VFS models, both the GFS and zfsctl code were heavily modified to
work on FreeBSD.  Nonetheless, they still contain quite a few ugly
hacks and bugs.

This is a reimplementation of the zfsctl code where the VFS-specific
bits are written from scratch and only the code that interacts with
the rest of ZFS is reused.

Some highlights.

We use two types of nodes, static and on-demand. The static nodes
are used for permanent directories like .zfs, .zfs/snapshot, etc. The
on-demand nodes are used for ephemeral directories that act as snapshot
mount points.
Initially only static nodes are created. Their vnodes are instantiated
when they are looked up. The on-demand nodes and vnodes are instantiated
as needed and the nodes are destroyed as soon as the corresponding
vnodes are reclaimed.
We also try very hard to ensure that uncovered snapshot vnodes do not
linger.  They are supposed to become inactive as soon as they are
uncovered and we try to recycle them immediately.
When a filesystem is unmounted all snapshots under .zfs are unmounted
first, then all vnodes are flushed and finally the static .zfs nodes
are destroyed.

There are some changes outside of zfsctl code too.
z_ctldir is never used directly (as it is an opaque pointer),
zfsctl_root() has to be used instead.  The function returns a locked
vnode now, so it accepts a lock flags parameter.  The function can
also fail now, e.g. during force unmounting, whereas previously it
was infallible.
zfsctl_root_lookup() is retired, instead of it VOP_LOOKUP() on the .zfs
vnode (obtained with zfsctl_root) is used.

Some ideas are picked from an independent work by will.

Reviewed by:	asomers, smh
MFC after:	1 month
Relnotes:	maybe
Differential Revision: https://reviews.freebsd.org/D7421
2017-02-21 17:47:08 +00:00
Josh Paetzel
aedc925301 MVF: 313876
7504 kmem_reap hangs spa_sync and administrative tasks

illumos/illumos-gate@405a5a0f5c
https://github.com/illumos/illumos-gate/commit/405a5a0f5c3ab36cb76559467d1a62ba648bd80

https://www.illumos.org/issues/7504

  We see long spa_sync(). We are waiting to hold dp_config_rwlock for writer. Some
  other thread holds dp_config_rwlock for reader, then calls arc_get_data_buf(),
  which finds that arc_is_overflowing()==B_TRUE. So it waits (while holding
  dp_config_rwlock for reader) for arc_reclaim_thread to signal arc_reclaim_waiters_cv.
  Before signaling, arc_reclaim_thread does arc_kmem_reap_now(), which takes ~seconds.

Author: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
2017-02-17 17:52:12 +00:00
Mark Johnston
7174af791e Directly include needed headers rather than relying on pollution.
We get machine/cpu.h via kmem.h -> proc.h -> _vm_domain.h -> seq.h.

Reported by:	Ryan Libby
Sponsored by:	Dell EMC Isilon
X-MFC with:	r313841
2017-02-17 03:27:20 +00:00
Mark Johnston
a11ac730a7 Prevent CPU migration when checking the DTrace nofault flag on x86.
dtrace_trap() consumes page and protection faults triggered by code running
in DTrace probe context. Such faults occur with interrupts disabled and are
detected using a per-CPU flag. Regular faults cause dtrace_trap() to be
called with interrupts enabled, and nothing was ensuring that the flag was
read from the correct CPU. This may result in dtrace_trap() consuming
unrelated page and protection faults when DTrace is enabled, causing the
fault handler to return without actually having handled the fault.

Diagnosed by:	Ryan Libby <rlibby@gmail.com>
MFC after:	3 days
Sponsored by:	Dell EMC Isilon
2017-02-16 23:05:20 +00:00
Josh Paetzel
c53cc7187c MFV 313786
7500 Simplify dbuf_free_range by removing dn_unlisted_l0_blkid

illumos/illumos-gate@653af1b809
653af1b809

https://www.illumos.org/issues/7500
  With the integration of:

    commit 0f6d88aded0d165f5954688a9b13bac76c38da84
    Author: Alex Reece <alex@delphix.com>
    Date:   Sat Jul 26 13:40:04 2014 -0800
    4873 zvol unmap calls can take a very long time for larger datasets

  the dnode's dn_bufs field was changed from a list to a tree. As a result,
  the dn_unlisted_l0_blkid field is no longer necessary.

Author: Stephen Blinick <stephen.blinick@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
2017-02-16 19:00:09 +00:00
Mark Johnston
28180eff9e Use pget() instead of pfind() in fasttrap_pid_{enable,disable}().
Suggested by:	mjg
MFC after:	1 week
2017-02-15 06:07:01 +00:00
Mark Johnston
2fce30fa8f Check for an exiting process when enabling PID provider probes.
MFC after:	1 week
2017-02-15 01:35:26 +00:00
Andriy Gapon
909bacfc59 remove l2_padding_needed statistic from zfs arc
It became obsolete when the Compressed ARC support was committed.

MFC after:	1 week
2017-02-12 19:45:30 +00:00
Andriy Gapon
e776c4054f check remaining space in zfs implementations of vptocnp
PR:		216939
Submitted by:	Iouri V. Ivliev <fbsd@any.com.ru>
MFC after:	1 week
2017-02-12 19:40:59 +00:00
Alan Somers
6159fb2f9c Fix setting birthtime in ZFS
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
	* In zfs_freebsd_setattr, if the caller wants to set the birthtime,
	  set the bits that zfs_settattr expects

	* In zfs_setattr, if XAT_CREATETIME is set, set xoa_createtime,
	  expected by zfs_xvattr_set.  The two levels of indirection seem
	  excessive, but it minimizes diffs vs OpenZFS.

	* In zfs_setattr, check for overflow of va_birthtime (from delphij)

	* Remove red herring in zfs_getattr

sys/cddl/contrib/opensolaris/uts/common/sys/vnode.h
	* Un-booby-trap some macros

New tests are under review at https://github.com/pjd/pjdfstest/pull/6

Reviewed by:	avg
MFC after:	3 weeks
Sponsored by:	Spectra Logic Corp
Differential Revision:	https://reviews.freebsd.org/D9353
2017-02-09 21:30:53 +00:00
George V. Neville-Neil
c499408f8b Fix the ifdef protection and remove superfluous extern statements
Reported by:	Konstantin Belousov
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
2017-02-07 01:21:18 +00:00
Mark Johnston
9613442e83 Ensure that the DOF string length is divisible by 2.
It is an ASCII encoding of a hexadecimal representation of the DOF file
used to enable anonymous tracing, so its length should always be even.

MFC after:	1 week
2017-02-05 02:47:34 +00:00
Mark Johnston
e801af6fba Use PC-relative relocations for USDT probe sites on i386 and amd64.
When recording probe site addresses in the output DOF file, dtrace -G
needs to emit relocations for the .SUNW_dof section in order to obtain
the addresses of functions containing probe sites. DTrace expects the
addresses to be relative to the base address of the final ELF file,
and the amd64 USDT implementation was relying on some unspecified and
incorrect behaviour in the base system GNU ld to achieve this.

This change reimplements the probe site relocation handling to allow
USDT to be used with lld and newer GNU binutils. Specifically, it
makes use of R_X86_64_PC64/R_386_PC32 relocations to obtain the
probe site address relative to the DOF file address, and adds and uses a
new DOF relocation type which computes the final probe site address using
these relative offsets.

Reported by and discussed with:	Rafael Espíndola
MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D9374
2017-02-05 02:39:12 +00:00
George V. Neville-Neil
c613d0c2ba Files which implement the new random number system code for DTrace
Submitted by:	Graeme Jenkinson
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
2017-02-03 22:40:13 +00:00
George V. Neville-Neil
00bb01a40c Replace the implementation of DTrace's RAND subroutine for generating
low-quality random numbers with a modern implementation (xoroshiro128+)
that is capable of generating better quality randomness without compromising performance.

Submitted by:	Graeme Jenkinson
Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D9051
2017-02-03 22:26:19 +00:00
Mark Johnston
b3b5bfeb22 Sync the x86 dis_tables.c with upstream.
This corresponds to the following illumos issues:

  5755 want support for Intel FMA instrs
  5756 want support for Intel BMI1 instrs
  5757 want support for Intel BMI2 instrs
  5758 want support for Intel AVX2 instrs
  7204 Want broadwell rdseed and adx support
  7208 Want stac/clac disasm support
  7733 Need SHA Instruction dis support
  7756 dis can't handle x86 SSE 3 instructions
  7757 want avx2 disasm tests
  7758 want SSE 4.1 disasm tests

MFC after:	2 weeks
2017-02-03 03:22:47 +00:00
Baptiste Daroussin
b4b4b5304b Revert crap accidentally committed 2017-01-28 16:31:23 +00:00
Baptiste Daroussin
814aaaa7da Revert r312923 a better approach will be taken later 2017-01-28 16:30:14 +00:00
Mark Johnston
da5320b9d0 Fix an off-by-one in an assertion on fasttrap tracepoint sizes.
FASTTRAP_MAX_INSTR_SIZE is the largest valid value of a tracepoint, so
correct the assertion accordingly. This limit was hit with a 15-byte NOP.

Reported by:	bdrewery
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
2017-01-27 17:58:41 +00:00
Mark Johnston
61ef24a5a3 Fix initialization of "p" after r312658.
CID:	1369410
2017-01-25 16:35:57 +00:00
Mark Johnston
792e2f09ee Remove the DTRACEHIOC_ADD ioctl.
This ioctl has been considered legacy by upstream since the DTrace code
was first imported, and is unused. The removal also allows some
simplification of dtrace_helper_slurp().

Also remove a bogus copyout in the DTRACEHIOC_ADDDOF handler. Due to a
bug, it would overwrite an in-memory copy of the DOF header rather than
the passed-in DOF helper. Moreover, DTRACEHIOC_ADDDOF already copies the
helper back out automatically since its argument has the IOC_OUT attribute.
2017-01-23 02:21:06 +00:00
Josh Paetzel
f2be81e92c MFV 312436
6569 large file delete can starve out write ops

  illumos/illumos-gate@ff5177ee8b
  ff5177ee8b

  https://www.illumos.org/issues/6569
    The core issue I've found is that there is no throttle for how many
    deletes get assigned to one TXG. As a results when deleting large files
    we end up filling consecutive TXGs with deletes/frees, then write
    throttling other (more important) ops.

    There is an easy test case for this problem. Try deleting several
    large files (at least 1/2 TB) while you do write ops on the same
    pool. What we've seen is performance of these write ops (let's
    call it sideload I/O) would drop to zero.

    More specifically the problem is that dmu_free_long_range_impl()
    can/will fill up all of the dirty data in the pool "instantly",
    before many of the sideload ops can get in. So sideload
    performance will be impacted until all the files are freed.

    The solution we have tested at Nexenta (with positive results)
    creates a relatively simple throttle for how many "free" ops we let
    into one TXG.

    However this solution exposes other problems that should also be
    addressed. If we are to slow down freeing of data that means one
    has to wait even longer (assuming vnode ref count of 1) to get shell
    back after an rm or for NFS thread to finish the free-ing op.
    To avoid this the proposed solution is to call zfs_inactive() async
    for "large" files. Async freeing then begs for the reclaimed space
    to be accounted for in the zpool's "freeing" prop.

    The other issue with having a longer delete is the inability to
    export/unmount for a longer period of time. The proposed solution
    is to interrupt freeing of blocks when a fs is unmounted.

  Author: Alek Pinchuk <alek@nexenta.com>
  Reviewed by: Matt Ahrens <mahrens@delphix.com>
  Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
  Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
  Approved by: Dan McDonald <danmcd@omniti.com>

Reviewed by:	avg
Differential Revision:	D9008
2017-01-20 15:01:04 +00:00
Andrew Turner
ae69172343 Use the kernel stack in the ARM FBT DTrace provider. This is used to find
the fifth argument to functions being traced, however there was an error
where the userspace stack was being used. This may be invalid leading to
a kernel panic if this address is unmapped.

Submitted by:	Graeme Jenkinson <graeme.jenkinson@cl.cam.ac.uk>
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D9229
2017-01-18 13:27:24 +00:00
Mark Johnston
d01e6ad41b Have DTrace handle faults when dereferencing a lock object pointer.
MFC after:	1 week
2017-01-11 01:18:06 +00:00
Mark Johnston
4153c9b932 Ignore LC_SLEEPABLE when testing whether a mutex is adaptive.
MFC after:	1 week
2017-01-11 01:15:55 +00:00
Mateusz Guzik
619ce4d72e Revert r309619 "ifndef atomic_cas_* in cddl code"
It was a temporary change to ease an import of native atomic_cas primitives.
Instead, atomic_fcmpset was devised with different semantics. See r311168.
2017-01-03 21:02:30 +00:00
Mark Johnston
91371de1fa Remove the "unused" DIF subroutine index left after r308582.
These indices are input to a build-time script that generates code to
validate subroutine names.
2017-01-03 00:24:12 +00:00
Mark Johnston
c71c814a97 Remove an obsolete pragma from dtrace.h.
It triggers a compiler warning and has been removed upstream.

MFC after:	1 week
2016-12-27 23:31:32 +00:00
George V. Neville-Neil
805e1842c8 Remove extra DOF_SEC_XLIMPORT from the DOF_SEC_ISLOADABLE macro
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
2016-12-16 20:44:14 +00:00
Alexander Motin
c5f74c4873 Revert r310023 for now.
After another look my new variable mapping was not exactly right.
2016-12-15 08:03:16 +00:00
Alexander Motin
d686b07132 Reduce diff from Illumos by better variables mapping. 2016-12-13 16:20:10 +00:00
Alexander Motin
2823b6467a Postpone ZVOL media/block size caching till first open.
At least on FreeBSD there are no legal way to access media or get its
size without opening device/provider first.  Postponing this caching
allows to skip several disk seeks per ZVOL/snapshot during import.

For HDD pool with 1 ZVOL in dev mode with 1000 snapshots this reduces
pool import time from 40 to 10 seconds.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2016-12-11 19:50:39 +00:00
Alexander Motin
2fb5d72d58 Add missed vfs.zfs.zfetch.max_idistance sysctl. 2016-12-10 21:19:27 +00:00
Mark Johnston
f99a517272 Don't create FBT probes for lock owner methods.
These functions may be called in DTrace probe context, so they cannot be
safely traced. Moreover, they are currently only used by DTrace, so their
corresponding FBT probes are not particularly useful.

MFC after:	2 weeks
2016-12-10 03:13:11 +00:00
Mark Johnston
8bb9b7f17a Consistently use fbt_excluded() on all architectures.
MFC after:	2 weeks
2016-12-10 03:11:05 +00:00
Alexander Motin
9373759d13 Fix spa_alloc_tree sorting by offset in r305331.
Original commit "7090 zfs should improve allocation order" declares alloc
queue sorted by time and offset.  But in practice io_offset is always zero,
so sorting happened only by time, while order of writes with equal time was
completely random.  On Illumos this did not affected much thanks to using
high resolution timestamps.  On FreeBSD due to using much faster but low
resolution timestamps it caused bad data placement on disks, affecting
further read performance.

This change switches zio_timestamp_compare() from comparing uninitialized
io_offset to really populated io_bookmark values.  I haven't decided yet
what to do with timestampts, but on simple tests this change gives the
same peformance results by just making code to work as declared.

MFC after:	1 week
2016-12-08 15:58:03 +00:00
George V. Neville-Neil
af463464cf Fix a kernel panic in DTrace's rw_iswriter subroutine.
On FreeBSD the sense of rw_write_held() and rw_iswriter() were reversed,
probably due to a cut and paste error. Using rw_iswriter() would cause
the kernel to panic.

Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D8718
2016-12-07 07:27:47 +00:00
Mateusz Guzik
ef32958e5d ifndef atomic_cas_* in cddl code in preparation for native implementations
This is a temporary change to not require all architectures to import at once.

Discussed with:	jhb
2016-12-06 14:08:49 +00:00
Andriy Gapon
0451d4e97b MFV r309249: 3821 Race in rollback, zil close, and zil flush
Note: there was a merge conflict resolved by me.

illumos/illumos-gate@43297f973a
43297f973a

https://www.illumos.org/issues/3821
  We recently had nodes with some of the latest zfs bits panic on us in a
  rollback-heavy environment. The following is from my preliminary analysis:
  Let's look at where we died:
  > $C
  ffffff01ea6b9a10 taskq_dispatch+0x3a(0, fffffffff7d20450, ffffff5551dea920, 1)
  ffffff01ea6b9a60 zil_clean+0xce(ffffff4b7106c080, 7e0f1)
  ffffff01ea6b9aa0 dsl_pool_sync_done+0x47(ffffff4313065680, 7e0f1)
  ffffff01ea6b9b70 spa_sync+0x55f(ffffff4310c1d040, 7e0f1)
  ffffff01ea6b9c20 txg_sync_thread+0x20f(ffffff4313065680)
  ffffff01ea6b9c30 thread_start+8()
  If we dig in we can find that this dataset corresponds to a zone:
  > ffffff4b7106c080::print zilog_t zl_os->os_dsl_dataset->ds_dir->dd_myname
  zl_os->os_dsl_dataset->ds_dir->dd_myname = [ "8ffce16a-13c2-4efa-a233-
  9e378e89877b" ]
  Okay so we have a null taskq pointer. That only happens during the calls to
  zil_open and zil_close. If we poke around we can see that we're actually in
  midst of a rollback:
  > ::pgrep zfs | ::printf "0x%x %s\\n" proc_t . p_user.u_psargs
  0xffffff43262800a0 zfs rollback zones/15714eb6-f5ea-469f-ac6d-
  4b8ab06213c2@marlin_init
  0xffffff54e22a1028 zfs rollback zones/8ffce16a-13c2-4efa-a233-
  9e378e89877b@marlin_init
  0xffffff4362f3a058 zfs rollback zones/0ddb8e49-ca7e-42e1-8fdc-
  4ac4ba8fe9f8@marlin_init
  0xffffff5748e8d020 zfs rollback zones/426357b5-832d-4430-953e-
  10cd45ff8e9f@marlin_init
  0xffffff436b867008 zfs rollback zones/8f36bf37-8a9c-4a44-995c-
  6d1b2751e6f5@marlin_init
  0xffffff4381ad4090 zfs rollback zones/6c8eca18-fbd6-46dd-ac24-
  2ed45cd0da70@marlin_init

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: George Wilson <george.wilson@delphix.com>

MFC after:	3 weeks
2016-11-28 15:14:31 +00:00
Andriy Gapon
69bac03666 MFV r308990: 7181 race between zfs_mount and zfs_ioc_rollback
illumos/illumos-gate@90f2c094b3
90f2c094b3

https://www.illumos.org/issues/7181
  zfsvfs_setup() is called in both zfs_mount and zfs_resume_fs paths.
  dmu_objset_set_user(zfsvfs->z_os, zfsvfs) is called early in zfsvfs_setup()
  before the setup is actually completed,
  thus an under-constructed zfsvfs becomes visible.
  Additionally, there is nothing to serialize the two call paths. As a result two
  threads can step on each other's toes.
  assertion failed: zilog->zl_clean_taskq == NULL, file:
  ../../common/fs/zfs/zil.c, line: 1772

  > $c
  vpanic()
  0xfffffffffbdf6928()
  zil_open+0x45(ffffff1bbc5dd000, fffffffff7993880)
  zfsvfs_setup+0x84(ffffffb378d77000, 0)
  zfs_resume_fs+0x132(ffffffb378d77000, ffffffb37ddcf000)
  zfs_ioc_rollback+0x96(ffffffb37ddcf000, ffffff01dcdc4cd0, ffffff01aa091000)
  zfsdev_ioctl+0x215(10a00000000, 5a19, 80465f8, 100003, ffffff01ab318368,
  ffffff0004b59e58)
  cdev_ioctl+0x39(10a00000000, 5a19, 80465f8, 100003, ffffff01ab318368,
  ffffff0004b59e58)
  spec_ioctl+0x60(ffffff0197737700, 5a19, 80465f8, 100003,
  ffffff01ab318368, ffffff0004b59e58)
  fop_ioctl+0x55(ffffff0197737700, 5a19, 80465f8, 100003,
  ffffff01ab318368, ffffff0004b59e58)
  ioctl+0x9b(7, 5a19, 80465f8)
  sys_syscall32+0x1f7()

  > ffffff1bbc5dd000::print objset_t os_zil
  os_zil = 0xffffff1c053cf7c0
  > 0xffffff1c053cf7c0::print zilog_t zl_clean_taskq

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Author: Andriy Gapon <andriy.gapon@clusterhq.com>

MFC after:	2 weeks
2016-11-24 10:34:42 +00:00
Andriy Gapon
b55ae64b50 MFV r308988: 7199, 7200 dsl_dataset_rollback_sync may try to free
already free blocks

7199 dsl_dataset_rollback_sync may try to free already free blocks
7200 no blocks must be born in a txg after a snaphot is created

illumos/illumos-gate@bfaed0b91e
bfaed0b91e

https://www.illumos.org/issues/7199
  dsl_dataset_rollback_sync may try to free already freed blocks when it calls
  dsl_destroy_head_sync_impl to destroy a temporary clone.
  That happens if a snapshot to which we are rolling back and from which the
  clone is created has some ZIL records.

https://www.illumos.org/issues/7200
  No new blocks must be born in a dataset in the same TXG after a snapshot of the
  dataset is taken.
  Those blocks would have the same blk_birth as the dataset's ds_prev_snap_txg
  and as such they would be presumed to belong o the snapshot while in fact they
  do not.
  All the datasets must be clean before sync tasks are run, so the described
  scenario may happen only if one of the sync tasks dirties the dataset and
  another sync task takes its snapshot.
  Then, there will be another sync pass because of the dirty data and the new
  blocks will be born in the same TXG when the data is written out.
  It seems that almost all of the existing sync tasks modify only MOS and do not
  dirty any objsets.
  The only exception that I've been able to identify so far is the rollback which
  can modify an objset when it zeroes out the objset's ZIL.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Author: Andriy Gapon <andriy.gapon@clusterhq.com>

MFC after:	3 weeks
2016-11-24 10:29:21 +00:00
Andriy Gapon
239c22b73d MFV r308987: 7180 potential race between zfs_suspend_fs+zfs_resume_fs
and zfs_ioc_rename

illumos/illumos-gate@690041b9ca
690041b9ca

https://www.illumos.org/issues/7180
  If a filesystem is not unmounted while the rename is being performed, then, for
  example, a concurrect zfs rollback may call zfs_suspend_fs followed by
  zfs_resume_fs on the same filesystem.
  The latter takes the filesystem's name as an argument. If the filesystem name
  changes as a result of the rename, then dmu_objset_hold(osname, zfsvfs, &os)
  call in zfs_resume_fs would fail resulting in a kernel panic.
  So far I have been able to reproduce this problem on FreeBSD where zfs rename
  has -u option that skips the unmounting before doing the renaming.
  But I think that in theory the same problem can occur on illumos as well,
  because the unmounting is done in userland before invoking the rename ioctl and
  there could be a race with, e.g., zfs mount.
  panic: solaris assert: dmu_objset_hold(osname, zfsvfs, &zfsvfs->z_os) == 0 (0x2
  == 0x0), file: /usr/devel/svn/head/sys/cddl/contrib/opensolaris/uts/common/fs/
  zfs/zfs_vfsops.c, line: 2210
  KDB: stack backtrace:
  db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe004df30710
  vpanic() at vpanic+0x182/frame 0xfffffe004df30790
  panic() at panic+0x43/frame 0xfffffe004df307f0
  assfail3() at assfail3+0x2c/frame 0xfffffe004df30810
  zfs_resume_fs() at zfs_resume_fs+0xb9/frame 0xfffffe004df30860
  zfs_ioc_rollback() at zfs_ioc_rollback+0x61/frame 0xfffffe004df308a0
  zfsdev_ioctl() at zfsdev_ioctl+0x65c/frame 0xfffffe004df30940
  devfs_ioctl_f() at devfs_ioctl_f+0x156/frame 0xfffffe004df309a0
  kern_ioctl() at kern_ioctl+0x246/frame 0xfffffe004df30a00
  sys_ioctl() at sys_ioctl+0x171/frame 0xfffffe004df30ae0
  amd64_syscall() at amd64_syscall+0x2db/frame 0xfffffe004df30bf0
  Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe004df30bf0

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

MFC after:	2 weeks
2016-11-24 10:21:22 +00:00
Andriy Gapon
d15b9428bb further fix zfs_lock() diagnostics
It was very wrong to look at the vnode and znode internals without
having locked the vnode first.

Reported by:	pho
Tested by:	pho
MFC after:	1 week
X-MFC with:	r308887
2016-11-24 09:00:51 +00:00
George V. Neville-Neil
cdaa8777f7 Add tunable to disable destructive dtrace
Submitted by:	Joerg Pernfuss <code.jpe@gmail.com>
Reviewed by:	rstone, markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D8624
2016-11-23 22:50:20 +00:00
Alan Cox
bba39b9ae3 Remove PG_CACHED-related fields from struct vmmeter, because they are no
longer used.  More precisely, they are always zero because the code that
decremented and incremented them no longer exists.

Bump __FreeBSD_version to mark this change.

Reviewed by:	kib, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8583
2016-11-22 18:13:46 +00:00
Andriy Gapon
17055fcda7 fix unsafe modification of zfs_vnodeops when DIAGNOSTIC is enabled
The idea was to avoid a false assertion in zfs_lock, but it was
implemented very dangerously and incorrectly.

Reported by:	pho
Tested by:	pho
MFC after:	1 week
2016-11-20 14:00:50 +00:00
Andriy Gapon
2ec31e84cc zfs: fix up after the removal of PG_CACHED pages in r308691
PR:		214629
Reported by:	mshirk@daemon-security.com
Reviewed by:	alc
Tested by:	Shawn Webb <shawn.webb@hardenedbsd.org>
X-MFC with:	308691
2016-11-19 08:12:57 +00:00
Mark Johnston
188011dbf2 Support fetching RFLAGS in fasttrap_getreg().
MFC after:	1 week
2016-11-18 03:11:11 +00:00
Alexander Motin
14b5719f6a After some ZIL changes 6 years ago zil_slog_limit got partially broken
due to zl_itx_list_sz not updated when async itx'es upgraded to sync.
Actually because of other changes about that time zl_itx_list_sz is not
really required to implement the functionality, so this patch removes
some unneeded broken code and variables.

Original idea of zil_slog_limit was to reduce chance of SLOG abuse by
single heavy logger, that increased latency for other (more latency critical)
loggers, by pushing heavy log out into the main pool instead of SLOG. Beside
huge latency increase for heavy writers, this implementation caused double
write of all data, since the log records were explicitly prepared for SLOG.
Since we now have I/O scheduler, I've found it can be much more efficient
to reduce priority of heavy logger SLOG writes from ZIO_PRIORITY_SYNC_WRITE
to ZIO_PRIORITY_ASYNC_WRITE, while still leave them on SLOG.

Existing ZIL implementation had problem with space efficiency when it
has to write large chunks of data into log blocks of limited size. In some
cases efficiency stopped to almost as low as 50%. In case of ZIL stored on
spinning rust, that also reduced log write speed in half, since head had to
uselessly fly over allocated but not written areas. This change improves
the situation by offloading problematic operations from z*_log_write() to
zil_lwb_commit(), which knows real situation of log blocks allocation and
can split large requests into pieces much more efficiently. Also as side
effect it removes one of two data copy operations done by ZIL code WR_COPIED
case.

While there, untangle and unify code of z*_log_write() functions.
Also zfs_log_write() alike to zvol_log_write() can now handle writes crossing
block boundary, that may also improve efficiency if ZPL is made to do that.

Sponsored by:	iXsystems, Inc.
2016-11-17 21:01:27 +00:00
Alexander Motin
eb9bfc257d Revert r307392: I've found a way to avoid big allocations completely. 2016-11-17 20:44:51 +00:00
Alan Cox
7667839a7e Remove most of the code for implementing PG_CACHED pages. (This change does
not remove user-space visible fields from vm_cnt or all of the references to
cached pages from comments.  Those changes will come later.)

Reviewed by:	kib, markj
Tested by:	pho
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8497
2016-11-15 18:22:50 +00:00
Mark Johnston
375c8b20dc Remove the DTrace printt and typeref actions.
These are FreeBSD-specific and were added in r178576 to provide the ability
to pretty-print instances of compound types. However, the print action has
long since been augmented to provide this functionality with a simpler
interface.

Discussed with:	gnn
Differential Revision:	https://reviews.freebsd.org/D8478
2016-11-12 19:26:12 +00:00
Bryan Drewery
28323add09 Fix improper use of "its".
Sponsored by:	Dell EMC Isilon
2016-11-08 23:59:41 +00:00
Oleksandr Tymoshenko
d30e308465 Fix include order as required post r308415 2016-11-07 20:02:18 +00:00
Alexander Motin
8acf168aab Fix ZIL records ordering when ZVOL opened both with and without FSYNC.
Before this an earlier writes to a ZVOL opened without FSYNC could get to
ZIL after later writes to the same ZVOL opened with FSYNC.  Fix this by
replicating functionality of ZPL (zv_sync_cnt equivalent to z_sync_cnt),
marking all log records sync if anybody opened the ZVOL with FSYNC.

MFC after:	2 weeks
2016-11-01 16:03:31 +00:00
Alexander Motin
2d1d8f4c8f Pass to zvol_log_truncate() same sync values as to zvol_log_write().
Surplus marking of TX_TRUNCATE records as sync could result in putting them
into ZIL before previous writes if ones were async.

MFC after:	2 weeks
2016-11-01 12:47:19 +00:00
Alexander Motin
74a148f46f Add sysctls for zfs_immediate_write_sz and zvol_immediate_write_sz. 2016-10-29 23:25:12 +00:00
Andriy Gapon
97371ba2a9 zfsbootcfg: a simple tool to set next boot (one time) options for zfsboot
(gpt)zfsboot will read one-time boot directives from a special ZFS pool
area.  The area was previously described as "Boot Block Header", but
currently it is know as Pad2, marked as reserved and is zeroed out on
pool creation.  The new code interprets data in this area, if any, using
the same format as boot.config.  The area is immediately wiped out.
Failure to parse the directives results in a reboot right after the
cleanup.  Otherwise the boot sequence proceeds as usual.

zfsbootcfg writes zfsboot arguments specified on its command line to the
Pad2 area of a disk identified by vfs.zfs.boot.primary_pool and
vfs.zfs.boot.primary_vdev kenv variables that are set by loader during
boot.  Please see the manual page for more.

Thanks to all who reviewed, contributed and made suggestions!  There are
many potential improvements to the feature, please see the review for
details.

Reviewed by:	wblock (docs)
Discussed with:	jhb, tsoome
MFC after:	3 weeks
Relnotes:	yes
Differential Revision: https://reviews.freebsd.org/D7612
2016-10-29 14:09:32 +00:00
Alexander Motin
471cf6ce7d Add vdev_reopening support to vdev_geom.
It allows to avoid extra GEOM providers flapping without significant need.
Since GEOM got resize support, we don't need to reopen provider to get new
size.  If provider was orphaned and no longer valid, ZFS should already
know that, and in such case reopen should be done in full as expected.

MFC after:	2 weeks
2016-10-28 17:05:14 +00:00
Alexander Motin
f106f43aa2 Matching GUIDs, handle possible race on vdev detach.
In case of vdev detach, causing top level mirror vdev destruction, leaf
vdev changes its GUID to one of the destroyed mirror, that creates race
condition when GUID in vdev label may not match one in the pool config.

This change replicates logic nuance of vdev_validate() by adding special
exception, matching the vdev GUID against the top level vdev GUID.
Since this exception is not completely reliable (may give false positives
if we fail to erase label on detached vdev), use it only as last resort.

Quick way to reproduce this scenario now is detach vdev from a pool with
enabled autoextend.  During vdev detach autoextend logic tries to reopen
remaining vdev, that always fails now since in-memory configuration is
already updated, while on-disk labels are not yet.

MFC after:	2 weeks
2016-10-28 16:21:31 +00:00
Alexander Motin
4be4cba048 Improve few debugging log messages. 2016-10-28 15:30:10 +00:00
Andriy Gapon
539fc86f2e 3746 ZRLs are racy
illumos/illumos-gate@260af64db7
260af64db7

https://www.illumos.org/issues/3746
  From the original change log:
  It was possible for a reference to be added even with the lock held, and
  for references added just after a lock release to be lost.
  This bug was also independently found and reported in wesunsolve.net
  issues 6985013 6995524.
  In zrl_add(), always use an atomic operation to update the refcount.
  The mutex in the ZRL only guarantees that wakeups occur for waiters on the
  lock. It offers no protection against concurrent updates of the refcount.
  The only refcount transition that is safe to perform without an atomic
  operation is from ZRL_LOCKED back to 0, since this can only be performed
  by the thread which has the ZRL locked.

Authored by: Will Andrews <will@freebsd.org>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Reviewed by: Pavel Zakharov <pavel.zakha@gmail.com>
Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Reviewed by: Justin T. Gibbs <gibbs@scsiguy.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author: Youzhong Yang <yyang@mathworks.com>
PR:		204037
MFC after:	1 week
2016-10-27 07:38:07 +00:00
Alexander Motin
f0cbbdecbc Fix panic after ZVOL renamed to name invalid for DEVFS.
MFC after:	2 weeks
2016-10-24 12:24:24 +00:00
Alexander Motin
9be66df1e1 Add vfs.zfs.zil_log_limit sysctl.
It is at least partially broken now, but that is another question.
2016-10-16 18:49:15 +00:00
Alexander Motin
a059d8ccbc Optimize ZIL itx memory allocation on FreeBSD.
These allocations can reach up to 128KB, while FreeBSD kernel allocator
can cache allocations only up to 64KB.  To avoid expensive allocations
for each large ZIL write use caching zio_buf_alloc() allocator instead.

To make it possible de-inline few instances of zil_itx_destroy().
2016-10-16 10:43:12 +00:00
Alexander Motin
1899e205d1 MFV r307314:
6988 spa_sync() spends half its time in dmu_objset_do_userquota_updates

Using a benchmark which creates 2 million files in one TXG, I observe
that the thread running spa_sync() is on CPU almost the entire time we
are syncing, and therefore can be a performance bottleneck. About 50% of
the time in spa_sync() is in dmu_objset_do_userquota_updates().

The problem is that dmu_objset_do_userquota_updates() calls
zap_increment_int(DMU_USERUSED_OBJECT) once for every file that was
modified (or created). In this benchmark, all the files are owned by the
same user/group, so all 2 million calls to zap_increment_int() are
modifying the same entry in the zap. The same issue exists for the
DMU_GROUPUSED_OBJECT.

We should keep an in-memory map from user to space delta while we are
syncing, and when we finish, iterate over the in-memory map and modify
the ZAP once per entry. This reduces the number of calls to
zap_increment_int() from "number of objects modified" to "number of
owners/groups of modified files".

This reduced the time spent in spa_sync() in the file create benchmark
by ~33%, from 11 seconds to 7 seconds.

Closes #107

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Jinshan Xiong <jinshan.xiong@intel.com>
Author: Matthew Ahrens <mahrens@delphix.com>

openzfs/openzfs@5fc46359c5
2016-10-14 12:03:04 +00:00
Alexander Motin
b3a8b04807 MFV r307313:
5120 zfs should allow large block/gzip/raidz boot pool (loader project)

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Toomas Soome <tsoome@me.com>

openzfs/openzfs@c8811bd3e2

FreeBSD still does not support booting from gzip-compressed datasets,
so keep one chunk of this commit out.
2016-10-14 12:01:33 +00:00
Konstantin Belousov
5975e53d40 Fix a race in vm_page_busy_sleep(9).
Suppose that we have an exclusively busy page, and a thread which can
accept shared-busy page.  In this case, typical code waiting for the
page xbusy state to pass is
again:
	VM_OBJECT_WLOCK(object);
	...
	if (vm_page_xbusied(m)) {
		vm_page_lock(m);
 		VM_OBJECT_WUNLOCK(object);    <---1
		vm_page_busy_sleep(p, "vmopax");
 		goto again;
	}

Suppose that the xbusy state owner locked the object, unbusied the
page and unlocked the object after we are at the line [1], but before we
executed the load of the busy_lock word in vm_page_busy_sleep().  If it
happens that there is still no waiters recorded for the busy state,
the xbusy owner did not acquired the page lock, so it proceeded.

More, suppose that some other thread happen to share-busy the page
after xbusy state was relinquished but before the m->busy_lock is read
in vm_page_busy_sleep().  Again, that thread only needs vm_object lock
to proceed.  Then, vm_page_busy_sleep() reads busy_lock value equal to
the VPB_SHARERS_WORD(1).

In this case, all tests in vm_page_busy_sleep(9) pass and we are going
to sleep, despite the page being share-busied.

Update check for m->busy_lock == VPB_UNBUSIED in vm_page_busy_sleep(9)
to also accept shared-busy state if we only wait for the xbusy state to
pass.

Merge sequential if()s with the same 'then' clause in
vm_page_busy_sleep().

Note that the current code does not share-busy pages from parallel
threads, the only way to have more that one sbusy owner is right now
is to recurse.

Reported and tested by:	pho (previous version)
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D8196
2016-10-13 14:41:05 +00:00
Konstantin Belousov
f71d08566c Limit scope of the optimization in r306608 to dounmount() caller only.
Other uses of cache_purgevfs() do rely on the cache purge for correct
operations, when paths are invalidated without unmount.

Reported and tested by:	jkim
Discussed with:	mjg
Sponsored by:	The FreeBSD Foundation
2016-10-07 11:38:28 +00:00
Andriy Gapon
6f98c83306 implement zfs_vptocnp() using z_parent property
This should allow vn_fullpath() to work even when vfs name cache is
disabled for zfs, which is the case when zfs properties like
casesensitivity and normalization are set non-default values.

The new code should be 100% reliable for directories and "mostly"
reliable for files, that is, when hardlinks across directories are
not used.

Reported by:	Frederic Chardon <chardon.frederic@gmail.com>
Reviewed by:	kib (vfs contract)
MFC after:	2 weeks
Differential Revision: https://reviews.freebsd.org/D8146
2016-10-07 06:29:24 +00:00
Andriy Gapon
9ba3abc30e zfs: fix a wrong assertion for extended attributes
For the extended attributes the order between z_teardown_lock and the
vnode lock is different.
The bug was triggered only with DIAGNOSTIC turned on.
This fix is developed in cooperation with avos.

PR:		213112
Reported by:	avos
Tested by:	avos
MFC after:	1 week
2016-10-04 08:09:25 +00:00
Mark Johnston
4538cee5bf Allow tracing of functions prefixed by "__".
This restriction was inherited from upstream but is not relevant on FreeBSD.
Furthermore, it hindered the tracing of locking primitive subroutines.

MFC after:	1 week
2016-10-02 00:35:00 +00:00
Alexander Motin
863ef2ca62 Add #ifdef _KERNEL around send_holes_without_birth_time sysctl.
Reported by:	avg@
2016-09-29 17:48:53 +00:00
Alexander Motin
226a11f81e MFV r306423: 7402 Create tunable to ignore hole_birth feature
Until we can resolve the numerous hole_birth bugs that have cropped up
recently, and come up with a way going forwards to protect users from
corruption, we should disable the hole_birth feature.  Using a tunable
allows those who are confident that their data is correct to continue to
take advantage of the feature.

Closes #188

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Author: Paul Dagnelie <pcd@delphix.com>
2016-09-29 00:00:37 +00:00
Alexander Motin
bb97118138 MFV r306422: 7254 ztest failed assertion in ztest_dataset_dirobj_verify: dirobjs + 1 == usedobjs
dsl_dataset_space is looking at the ds_bp's fill count while
dmu_objset_write_ready() is concurrently modifying it. This fix adds an
rrwlock to protect the ds_bp.

Closes #180

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Author: Paul Dagnelie <pcd@delphix.com>
2016-09-28 23:54:47 +00:00
Mark Johnston
9e579a58c3 Move implementations of uread() and uwrite() to the illumos compat layer.
MFC after:	1 week
2016-09-24 21:40:14 +00:00
Andriy Gapon
d26312a4e4 fix vnode lock assertion for extended attributes directory
Background.  In ZFS a file with extended attributes has a special
directory associated with it where each extended attribute is a file.
The attribute's name is a file name and its value is a file content.
When the ownership of a file with extended attributes is changed, ZFS
also changes ownership of the special directory.  This is where the bug
was hit.

The bug was introduced in r209158.

Nota bene.  ZFS vnode locks are typically acquired before
z_teardown_lock (i.e., before ZFS_ENTER).  But this is not the case for
the vnodes that represent the extended attribute directory and files.
Those are always locked after ZFS_ENTER.  This is confusing and fragile.

PR:		212702
Reported by:	Christian Fuss to FreeNAS
Tested by:	mav
MFC after:	1 week
2016-09-24 08:13:15 +00:00
Mark Johnston
36f5d07745 Re-check the systrace probe ID before calling dtrace_probe().
Otherwise there exists a narrow window during which a syscall probe can be
disabled and cause a concurrently-running thread to call dtrace_probe()
with an invalid probe ID.

Reported by:	ngie
MFC after:	1 week
Sponsored by:	Dell EMC Isilon
2016-09-22 23:22:53 +00:00
Allan Jude
c2b475d0ee MFV r268120:
4936 lz4 could theoretically overflow a pointer with a certain input

  illumos/illumos-gate@58d0718061

Reviewed by:	delphij
MFC after:	2 weeks
Sponsored by:	ScaleEngine Inc.
Differential Revision:	https://reviews.freebsd.org/D7850
2016-09-11 17:48:06 +00:00
Alexander Motin
20e45e033c Switch random_get_pseudo_bytes() shim to arc4rand().
Our shim for Solaris random_get_bytes() uses read_random(), that looks
reasonable, since it guaranties reliably seeded random data.  On the other
side Solaris random_get_pseudo_bytes() does not provide this guarantie,
and its original Solaris implementation is equivalent to our arc4rand(),
using software crypto without stressing slower hardware RNG.
2016-09-10 09:37:41 +00:00
Alexander Motin
4605bf63c4 MFV r305562: 7259 DS_FIELD_LARGE_BLOCKS is unused
The DS_FIELD_LARGE_BLOCKS macro has been unused since the integration of
this patch:

    commit ca0cc3918a1789fa839194af2a9245f801a06b1a
    Author: Matthew Ahrens <mahrens@delphix.com>
    Date:   Fri Jul 24 09:53:55 2015 -0700

        5959 clean up per-dataset feature count code
        Reviewed by: Toomas Soome <tsoome@me.com>
        Reviewed by: George Wilson <george@delphix.com>
        Reviewed by: Alex Reece <alex@delphix.com>
        Approved by: Richard Lowe <richlowe@richlowe.net>

This patch simply removes this macro from dsl_dataset.h.

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2016-09-07 20:09:24 +00:00
Alexander Motin
de1fdddeda MFV r305560: 7278 tuning zfs_arc_max does not impact arc_c_min
When changing zfs_arc_max (e.g. as zdb does), it may be set to less
than the default arc_c_min. arc_c_min should decrease to not be more than
arc_c_max, but it doesn't; therefore tuning of arc_c_max is ineffective.

Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Author: Matthew Ahrens <mahrens@delphix.com>

openzfs/openzfs@608764bead
2016-09-07 20:05:10 +00:00
Andriy Gapon
1a82707cd7 fix zfs pool creation accidentally broken by r305331
The upstream change introduced a new load state, SPA_LOAD_CREATE,
and vdev_geom code needs to be aware of it.

Tested by:	cy
MFC after:	1 week
X-MFC with:	r305331
2016-09-06 06:09:12 +00:00
Alexander Motin
9b9258a12a Missed FreeBSD-specific piece of r305338. 2016-09-03 11:17:33 +00:00
Alexander Motin
d7e781bda3 MFC r305337: 7004 dmu_tx_hold_zap() does dnode_hold() 7x on same object
Using a benchmark which has 32 threads creating 2 million files in the
same directory, on a machine with 16 CPU cores, I observed poor
performance. I noticed that dmu_tx_hold_zap() was using about 30% of
all CPU, and doing dnode_hold() 7 times on the same object (the ZAP
object that is being held).

dmu_tx_hold_zap() keeps a hold on the dnode_t the entire time it is
running, in dmu_tx_hold_t:txh_dnode, so it would be nice to use the
dnode_t that we already have in hand, rather than repeatedly calling
dnode_hold(). To do this, we need to pass the dnode_t down through
all the intermediate calls that dmu_tx_hold_zap() makes, making these
routines take the dnode_t* rather than an objset_t* and a uint64_t
object number. In particular, the following routines will need to have
analogous *_by_dnode() variants created:

dmu_buf_hold_noread()
dmu_buf_hold()
zap_lookup()
zap_lookup_norm()
zap_count_write()
zap_lockdir()
zap_count_write()

This can improve performance on the benchmark described above by 100%,
from 30,000 file creations per second to 60,000. (This improvement is on
top of that provided by working around the object allocation issue. Peak
performance of ~90,000 creations per second was observed with 8 CPUs;
adding CPUs past that decreased performance due to lock contention.) The
CPU used by dmu_tx_hold_zap() was reduced by 88%, from 340 CPU-seconds
to 40 CPU-seconds.

Sponsored by: Intel Corp.

Closes #109

Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Author: Matthew Ahrens <mahrens@delphix.com>

openzfs/openzfs@d3e523d489
2016-09-03 11:00:29 +00:00
Alexander Motin
4ad4b70e77 MFV r305336: 7247 zfs receive of deduplicated stream fails
This resolves two 'zfs recv' issues. First, when receiving into an
existing filesystem, a snapshot created during the receive process is
not added to the guid->dataset map for the stream, resulting in failed
lookups for deduped streams when a WRITE_BYREF record refers to a
snapshot received earlier in the stream. Second, the newly created
snapshot was also not set properly, referencing the snapshot before the
new receiving dataset rather than the existing filesystem.

Closes #159

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Author: Chris Williamson <chris.williamson@delphix.com>

openzfs/openzfs@b09697c8c1
2016-09-03 10:59:05 +00:00
Alexander Motin
070da3f779 MFV r305335: 7003 zap_lockdir() should tag hold
zap_lockdir() / zap_unlockdir() should take a "void *tag" argument which
tags the hold on the zap. This will help diagnose programming errors
which misuse the hold on the ZAP.

Sponsored by: Intel Corp.

Closes #108

Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Author: Matthew Ahrens <mahrens@delphix.com>

openzfs/openzfs@0780b3eab5
2016-09-03 10:58:14 +00:00
Alexander Motin
d3ec2cdb4a MFV r304157:
7230 add assertions to dmu_send_impl() to verify that stream includes BEGIN and END records

illumos/illumos-gate@12b90ee2d3
https://github.com/illumos/illumos-gate/commit/12b90ee2d3b10689fc45f4930d2392f5f
e1d9cfa

https://www.illumos.org/issues/7230
  A test failure occurred where a send stream had only a BEGIN record. This
  should not be possible if the send returns without error. Prevented this from
  happening in the future by adding an assertion to dmu_send_impl() to verify
  that if the function returns 0 (success) both a BEGIN and END record are
  present. Did this by adding flags to dmu_sendarg_t (indicating whether BEGIN o
r
  END records sent), having dump_record() set flags appropriately, adding VERIFY
  statement to dmu_send_impl().

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matt Krantz <matt.krantz@delphix.com>
2016-09-03 10:10:58 +00:00
Alexander Motin
7aafc9d4c8 MFV r304156: 7235 remove unused func dsl_dataset_set_blkptr
illumos/illumos-gate@bd56f80007
https://github.com/illumos/illumos-gate/commit/bd56f80007857b960e0981ed0797ad8ec
844a96b

https://www.illumos.org/issues/7235
  The function dsl_dataset_set_blkptr() is unused. We should remove it.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2016-09-03 10:09:23 +00:00
Alexander Motin
c9fa25c110 MFV r304155: 7090 zfs should improve allocation order and throttle allocations
illumos/illumos-gate@0f7643c737
https://github.com/illumos/illumos-gate/commit/0f7643c7376dd69a08acbfc9d1d7d548b
10c846a

https://www.illumos.org/issues/7090
  When write I/Os are issued, they are issued in block order but the ZIO pipelin
e
  will drive them asynchronously through the allocation stage which can result i
n
  blocks being allocated out-of-order. It would be nice to preserve as much of
  the logical order as possible.
  In addition, the allocations are equally scattered across all top-level VDEVs
  but not all top-level VDEVs are created equally. The pipeline should be able t
o
  detect devices that are more capable of handling allocations and should
  allocate more blocks to those devices. This allows for dynamic allocation
  distribution when devices are imbalanced as fuller devices will tend to be
  slower than empty devices.
  The change includes a new pool-wide allocation queue which would throttle and
  order allocations in the ZIO pipeline. The queue would be ordered by issued
  time and offset and would provide an initial amount of allocation of work to
  each top-level vdev. The allocation logic utilizes a reservation system to
  reserve allocations that will be performed by the allocator. Once an allocatio
n
  is successfully completed it's scheduled on a given top-level vdev. Each top-
  level vdev maintains a maximum number of allocations that it can handle
  (mg_alloc_queue_depth). The pool-wide reserved allocations (top-levels *
  mg_alloc_queue_depth) are distributed across the top-level vdevs metaslab
  groups and round robin across all eligible metaslab groups to distribute the
  work. As top-levels complete their work, they receive additional work from the
  pool-wide allocation queue until the allocation queue is emptied.

Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: George Wilson <george.wilson@delphix.com>
2016-09-03 10:04:37 +00:00
Alexander Motin
0b51a59fc7 MFV r303078:
7086 ztest attempts dva_get_dsize_sync on an embedded blockpointer

illumos/illumos-gate@926549256b
https://github.com/illumos/illumos-gate/commit/926549256b71acd595f69b236779ff6b7
8fa08ef

https://www.illumos.org/issues/7086
  In dbuf_dirty(), we need to grab the dn_struct_rwlock before looking at the
  db_blkptr, to prevent it from being changed by syncing context.
  Otherwise we may see that ztest got a segfault from this stack:
  libzpool.so.1`dva_get_dsize_sync+0x98(872f000, b32b240, fed7811b, 0, b4cda20,
0)
  libzpool.so.1`bp_get_dsize+0x60(872f000, b32b240, 0, 97cb780, 9d4c1a8, 0)
  libzpool.so.1`dbuf_dirty+0x9b3(ce0a100, 97cb780, 9, fecd2530)
  libzpool.so.1`dmu_buf_will_dirty+0xc3(ce0a100, 97cb780, ea293d6c, 1)
  libzpool.so.1`zap_lockdir+0x1a0(8aaa3c0, 1, 0, 97cb780, 1, 1)
  libzpool.so.1`zap_remove_norm+0x30(8aaa3c0, 1, 0, 8728b10, 0, 97cb780)
  libzpool.so.1`zap_remove+0x29(8aaa3c0, 1, 0, 8728b10, 97cb780, a)
  ztest_replay_remove+0x225(ea294588, 8728ae8, 0, 38010000, 0, 0)
  ztest_remove+0x9f(ea294588, ea293f50, 4, 3)
  ztest_object_init+0x78(ea294588, ea293f50, 4e0, 1)
  ztest_dmu_object_alloc_free+0x71(ea294588, 13)
  ztest_dmu_objset_create_destroy+0x224(80cef08, 13, 0, 805d36c, 9017ad44, 0)
  ztest_execute+0x89(a, 807c720, 13, 0)
  ztest_thread+0xea(13, 0, 0, 0)
  libc.so.1`_thrp_setup+0x88(f0983240)
  libc.so.1`_lwp_start(f0983240, 0, 0, 0, 0, 0)
  Looking into it a bit, we see that this is an embedded blockpointer, so
  BP_GET_NDVAS should have returned 0:
       b32b240::blkptr
  EMBEDDED [L0 ZAP_OTHER] et=0 LZ4 size=200L/4aP birth=80L
  Instead, it looks like another thread is modifying this blockpointer:
       b32b240::ugrep | ::whatis
  f47a0e0c is in [ stack tid=0x19f ]
  ebd6ec40 is in [ stack tid=0x226 ]
  ea293bd0 is in [ stack tid=0x244 ]
  ea293be4 is in [ stack tid=0x244 ]

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2016-09-03 08:43:43 +00:00
Alexander Motin
84c3781ac9 MFV r303077:
7072 zfs fails to expand if lun added when os is in shutdown state

illumos/illumos-gate@c39a2aae1e
c39a2aae1e

https://www.illumos.org/issues/7072
  upstream:
  38733 zfs fails to expand if lun added when os is in shutdown state
  DLPX-36910 spares and caches should not display expandable space
  DLPX-39262 vdev_disk_open spam zfs_dbgmsg buffer

Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: George Wilson <george.wilson@delphix.com>
2016-09-03 08:42:12 +00:00
Alexander Motin
efa0867fb0 MFV r302991: 6950 ARC should cache compressed data
illumos/illumos-gate@dcbf3bd6a1
dcbf3bd6a1

https://www.illumos.org/issues/6950
  When reading compressed data from disk, the ARC should keep the compressed
  block cached and only decompress it when consumers access the block. The
  uncompressed data should be short-lived allowing the ARC to cache a much larger
  amount of data. The DMU would also maintain a smaller cache of uncompressed
  blocks to minimize the impact of decompressing frequently accessed blocks.

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: George Wilson <george.wilson@delphix.com>
2016-09-03 08:30:51 +00:00
Alexander Motin
c543b519be MFV r304158:
7136 ESC_VDEV_REMOVE_AUX ought to always include vdev information

7115 6922 generates ESC_ZFS_VDEV_REMOVE_AUX a bit too often

illumos/illumos-gate@b72b6bb10a
https://github.com/illumos/illumos-gate/commit/b72b6bb10ad55121a1b352c6f68ebdc8e
20c9086

https://www.illumos.org/issues/7136
  6922 added ESC_ZFS_VDEV_REMOVE_AUX and ESC_ZFS_VDEV_REMOVE_DEV sysevents
  whenever an aux device gets removed from a pool. However, those sysevents will
  be created without the vdev_guid and vdev_path fields. It would be better to
  always populate those fields.

https://www.illumos.org/issues/7115
  The addition of spa_event_notify in vdev removal code (see #6922) causes event
s
  to be generated even if the spare failed to be removed with EBUSY.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Alan Somers <asomers@gmail.com>
2016-09-01 18:37:11 +00:00
Alexander Motin
25584d12e7 MFV r302993: 7104 increase indirect block size
illumos/illumos-gate@4b5c8e93ca
https://github.com/illumos/illumos-gate/commit/4b5c8e93cab28d3c65ba9d407fd8f46e3
be1db1c

https://www.illumos.org/issues/7104
  The current default indirect block size is 16KB. We can improve
  performance by increasing it to 128KB. This is especially helpful for
  any workload that needs to read most of the metadata, e.g.
  scrub/resilver, file deletion, filesystem deletion, and zfs send.
  We also need to fix a few space estimation errors to make the tests
  pass.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
2016-09-01 18:33:39 +00:00