Commit Graph

2251 Commits

Author SHA1 Message Date
John Baldwin
889ad0b890 Use a callout instead of timeout(9) for delayed zio's.
Reviewed by:	avg
Differential Revision:	https://reviews.freebsd.org/D22597
2019-12-13 19:27:51 +00:00
Mateusz Guzik
c8b29d1212 vfs: locking primitives which elide ->v_vnlock and shared locking disablement
Both of these features are not needed by many consumers and result in avoidable
reads which in turn put them on profiles due to cache-line ping-ponging.

On top of that the current lockmgr entry point is slower than necessary
single-threaded. As an attempted clean up preparing for other changes,
provide new routines which don't support any of the aforementioned features.

With these patches in place vop_stdlock and vop_stdunlock disappear from
flamegraphs during -j 104 buildkernel.

Reviewed by:	jeff (previous version)
Tested by:	pho
Differential Revision:	https://reviews.freebsd.org/D22665
2019-12-11 23:11:21 +00:00
Mateusz Guzik
abd80ddb94 vfs: introduce v_irflag and make v_type smaller
The current vnode layout is not smp-friendly by having frequently read data
avoidably sharing cachelines with very frequently modified fields. In
particular v_iflag inspected for VI_DOOMED can be found in the same line with
v_usecount. Instead make it available in the same cacheline as the v_op, v_data
and v_type which all get read all the time.

v_type is avoidably 4 bytes while the necessary data will easily fit in 1.
Shrinking it frees up 3 bytes, 2 of which get used here to introduce a new
flag field with a new value: VIRF_DOOMED.

Reviewed by:	kib, jeff
Differential Revision:	https://reviews.freebsd.org/D22715
2019-12-08 21:30:04 +00:00
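The layout change described in abd80ddb94 can be illustrated with a simplified sketch. The struct names, the reduced field set, and the flag value below are assumptions made for illustration and are not the real struct vnode. The point is only that a one-byte type plus a two-byte flag field fit in the space a four-byte enum used to occupy, next to the frequently read v_op and v_data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, heavily reduced stand-ins for the read-mostly vnode fields. */
enum vtype_sketch { VNON_SKETCH, VREG_SKETCH, VDIR_SKETCH };

struct vnode_before_sketch {
	enum vtype_sketch v_type;	/* usually 4 bytes, needs only 1 */
	void		*v_op;
	void		*v_data;
};

struct vnode_after_sketch {
	uint8_t		v_type;		/* type values fit in one byte */
	uint16_t	v_irflag;	/* read-mostly flags near v_op/v_data */
	void		*v_op;
	void		*v_data;
};

/* Assumed value; only the name VIRF_DOOMED comes from the commit message. */
#define	VIRF_DOOMED_SKETCH	0x0001

/* Check the doomed bit without touching frequently modified fields. */
int
vn_is_doomed_sketch(const struct vnode_after_sketch *vp)
{
	assert(vp != NULL);
	return ((vp->v_irflag & VIRF_DOOMED_SKETCH) != 0);
}
```

On common ABIs the "after" struct should be no larger than the "before" one, since the new flag field lands in what used to be part of the enum.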
Mark Johnston
bf10551606 Fix an inverted condition introduced in r353539.
This would have most likely resulted in read errors causing page leaks.

Submitted by:	jeff
2019-12-06 23:49:37 +00:00
Konstantin Belousov
fdc6b10d44 Add a VN_OPEN_INVFS flag.
vn_open_cred() assumes that it is called from the top-level of a VFS
syscall.  Writers must call bwillwrite() before locking any VFS
resource to wait for cleanup of dirty buffers.

ZFS getextattr() and setextattr() VOPs do call vn_open_cred(), which
results in wait for unrelated buffers while owning ZFS vnode lock (and
ZFS does not use buffer cache).  VN_OPEN_INVFS allows caller to skip
bwillwrite.

Note that ZFS is still incorrect there, because it starts write on an
mp and locks a vnode while holding another vnode lock.

Reported by:	Willem Jan Withagen <wjw@digiware.nl>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2019-11-29 14:02:32 +00:00
Alexander Motin
5008399c14 Fix use-after-free in case of L2ARC prefetch failure.
In case L2ARC read failed, l2arc_read_done() creates _different_ ZIO
to read data from the original storage device.  Unfortunately pointer
to the failed ZIO remains in hdr->b_l1hdr.b_acb->acb_zio_head, and if
some other read tries to bump the ZIO priority, it will crash.

The problem is reproducible by corrupting L2ARC content and reading
some data with prefetch if l2arc_noprefetch tunable is changed to 0.
With the default setting the issue is probably not reproducible now.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-11-28 18:28:35 +00:00
Andriy Gapon
8491540808 MFV r354383: 10592 misc. metaslab and vdev related ZoL bug fixes
illumos/illumos-gate@555d674d5d
555d674d5d

https://www.illumos.org/issues/10592
  This is a collection of recent fixes from ZoL:
  8eef997679 Error path in metaslab_load_impl() forgets to drop ms_sync_lock
  928e8ad47d Introduce auxiliary metaslab histograms
  425d3237ee Get rid of space_map_update() for ms_synced_length
  6c926f426a Simplify log vdev removal code
  21e7cf5da8 zdb -L should skip leak detection altogether
  df72b8bebe Rename range_tree_verify to range_tree_verify_not_present
  75058f3303 Remove unused vdev_t fields

Portions contributed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
MFC after:	4 weeks
2019-11-21 13:35:43 +00:00
Andriy Gapon
489912da7b MFV r354382,r354385: 10601 10757 Pool allocation classes
illumos/illumos-gate@663207adb1
663207adb1

10601 Pool allocation classes
https://www.illumos.org/issues/10601
  illumos port of ZoL Pool allocation classes. Includes at least these two
  commits:
  441709695 Pool allocation classes misplacing small file blocks
  cc99f275a Pool allocation classes

10757 Add -gLp to zpool subcommands for alt vdev names
https://www.illumos.org/issues/10757
  Port from ZoL of
  d2f3e292d Add -gLp to zpool subcommands for alt vdev names
  Note that a subsequent ZoL commit changed -p to -P
  a77f29f93 Change full path subcommand flag from -p to -P

Portions contributed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Portions contributed by: Håkan Johansson <f96hajo@chalmers.se>
Portions contributed by: Richard Yao <ryao@gentoo.org>
Portions contributed by: Chunwei Chen <david.chen@nutanix.com>
Portions contributed by: loli10K <ezomori.nozomu@gmail.com>
Author: Don Brady <don.brady@delphix.com>

11541 allocation_classes feature must be enabled to add log device

illumos/illumos-gate@c1064fd7ce
c1064fd7ce

https://www.illumos.org/issues/11541
  After the allocation_classes feature was integrated, one can no longer add a
  log device to a pool unless that feature is enabled. There is an explicit check
  for this, but it is unnecessary in the case of log devices, so we should handle
  this better instead of forcing the feature to be enabled.

Author: Jerry Jelinek <jerry.jelinek@joyent.com>

FreeBSD notes.
I faithfully added the new -g, -L, -P flags, but only -g does something:
vdev GUIDs are displayed instead of device names.  -L, resolve symlinks,
and -P, display full disk paths, do nothing at the moment.
The use of special vdevs is backward compatible for read-only access, so
root pools should be bootable, but exercise caution.

MFC after:	4 weeks
2019-11-21 08:20:05 +00:00
Andriy Gapon
a8c08e008a MFV r354378,r354379,r354386: 10499 Multi-modifier protection (MMP)
10499 Multi-modifier protection (MMP)
illumos/illumos-gate@e0f1c0afa4
e0f1c0afa4
https://www.illumos.org/issues/10499
  Port the following ZFS commits from ZoL to illumos.
  379ca9cf2 Multi-modifier protection (MMP)
  bbffb59ef Fix multihost stale cache file import
  0d398b256 Do not initiate MMP writes while pool is suspended

10701 Correct lock ASSERTs in vdev_label_read/write
illumos/illumos-gate@58447f688d
58447f688d
https://www.illumos.org/issues/10701
  Port of ZoL commit:
  0091d66f4e Correct lock ASSERTs in vdev_label_read/write
  At a minimum, this fixes a blown assert during an MMP test run when running on
  a DEBUG build.

11770 additional mmp fixes
illumos/illumos-gate@4348eb9012
4348eb9012
https://www.illumos.org/issues/11770
  Port a few additional MMP fixes from ZoL that came in after our
  initial MMP port.
  4ca457b065 ZTS: Fix mmp_interval failure
  ca95f70dff zpool import progress kstat
  (only minimal changes from above can be pulled in right now)
  060f0226e6 MMP interval and fail_intervals in uberblock

Note from the committer (me).
I do not have any use for this feature and I have not tested it.  I only
did smoke testing with multihost=off.
Please be aware.
I merged the code only to make future merges easier.

Portions contributed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Portions contributed by: Tim Chase <tim@chase2k.com>
Portions contributed by: sanjeevbagewadi <sanjeev.bagewadi@gmail.com>
Portions contributed by: John L. Hammond <john.hammond@intel.com>
Portions contributed by: Giuseppe Di Natale <dinatale2@llnl.gov>
Portions contributed by: Prakash Surya <surya1@llnl.gov>
Portions contributed by: Brian Behlendorf <behlendorf1@llnl.gov>
Author: Olaf Faaland <faaland1@llnl.gov>

MFC after:	4 weeks
2019-11-18 09:38:35 +00:00
Konstantin Belousov
a7af4a3e7d amd64: move GDT into PCPU area.
Reviewed by:	jhb, markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D22302
2019-11-12 15:51:47 +00:00
Andriy Gapon
930db3e338 MFV r354377: 10554 Implemented zpool sync command
illumos/illumos-gate@9c2acf00e2
9c2acf00e2

https://www.illumos.org/issues/10554
  During the port of MMP (illumos bug 10499) from ZoL, I found this
  earlier ZoL project is a prerequisite. Here is the original
  description.  This addition will enable us to sync an open TXG to the
  main pool on demand. The functionality is similar to 'sync(2)' but
  'zpool sync' will return when data has hit the main storage instead of
  potentially just the ZIL as is the case with the 'sync(2)' cmd.

Portions contributed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Author: Alek Pinchuk <apinchuk@datto.com>
MFC after:	3 weeks
Relnotes:	possibly
2019-11-07 11:18:28 +00:00
Alexander Motin
4cd20c3b08 Add vfs.zfs.zio.taskq_batch_pct tunable.
MFC after:	1 week
2019-11-05 15:19:05 +00:00
Andriy Gapon
ec03988887 fix up r354333, make zfsproc visible to dtrace, rename to system_proc
I overlooked the fact that zfsproc is required by dtrace modules that
use illumos compatible taskq KPI.  So, move the symbol definition to
the opensolaris module that provides compatibility support for both ZFS
and DTrace.  Also, rename zfsproc to system_proc to reflect that it is
not specific to ZFS.

Reported by:	ae
MFC after:	5 weeks
X-MFC with:	ae
2019-11-05 14:34:59 +00:00
Andriy Gapon
eb819923ec zfs: enable SPA_PROCESS on the kernel side
The purpose of this change is to group kernel threads specific to a
particular ZFS pool under a kernel process.  There can be many dozens of
threads per pool.  This change improves observability of those threads.

This change consists of several subchanges:
1. illumos taskq_create_proc can now pass its process parameter to
taskqueue.  Also, use zfsproc instead of NULL for taskq_create.  Caveat:
zfsproc might not be initialized yet.  But in that case it is still NULL,
so not worse than before.

2. illumos sys/proc.h: kthread id is stored in t_did field, not t_tid.

3. zfs: enable SPA_PROCESS on the kernel side.  The change is a bit hairy
as newproc() is implemented privately to spa.c.  I couldn't think of a
better way to populate process name than to poke inside the argument for
the process routine.

4. illumos thread_create: allow assigning thread to process other than
zfsproc.

5. zfs: expose spa_proc to other users, assign sync and quiesce threads
to it.

Pool-specific threads created using (relatively new) zthr mechanism are
still assigned to the zfskern process rather than to a respective
zpool-xxx process.  I am going to address this a bit later.

Reviewed by:	no one
MFC after:	5 weeks
Relnotes:	perhaps
Differential Revision: https://reviews.freebsd.org/D9720
2019-11-04 13:30:37 +00:00
Toomas Soome
79a4bf8975 loader: factor out label and uberblock load from vdev_probe, add MMP checks
Clean up the label read.
2019-11-03 21:19:52 +00:00
Toomas Soome
0c0a882c7a loader: we do not support booting from pool with log device
If the pool has a log device, stop there and report it.
2019-11-03 13:25:47 +00:00
Toomas Soome
abca0bd501 loader: calculate physical vdev psize from asize
Since physical device asize is calculated from psize and the asize is stored
in pool label, we can use asize to set the value of psize, which is used to
calculate the location of the pool labels.

MFC after:	1 week
2019-11-03 11:09:06 +00:00
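The asize-to-psize relationship in abca0bd501 can be sketched as follows. The constants reflect the ZFS label layout as I understand it (four 256 KiB labels, two at each end of the device, plus a 3.5 MiB boot area after the front pair); treat them and all the _SKETCH names as assumptions, not as the loader's actual definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed on-disk constants; names mimic ZFS conventions with a suffix. */
#define	VDEV_LABEL_SIZE_SKETCH	(256ULL << 10)	/* one label */
#define	VDEV_BOOT_SIZE_SKETCH	(7ULL << 19)	/* 3.5 MiB boot area */
#define	VDEV_LABELS_SKETCH	4
#define	VDEV_LABEL_START_SKETCH	\
	(2 * VDEV_LABEL_SIZE_SKETCH + VDEV_BOOT_SIZE_SKETCH)
#define	VDEV_LABEL_END_SKETCH	(2 * VDEV_LABEL_SIZE_SKETCH)

/* Recover the physical size from the allocatable size kept in the label. */
uint64_t
psize_from_asize_sketch(uint64_t asize)
{
	return (asize + VDEV_LABEL_START_SKETCH + VDEV_LABEL_END_SKETCH);
}

/* Byte offset of label l: labels 0 and 1 at the front, 2 and 3 at the end. */
uint64_t
label_offset_sketch(uint64_t psize, int l)
{
	assert(l >= 0 && l < VDEV_LABELS_SKETCH);
	return ((uint64_t)l * VDEV_LABEL_SIZE_SKETCH +
	    (l < VDEV_LABELS_SKETCH / 2 ? 0 :
	    psize - VDEV_LABELS_SKETCH * VDEV_LABEL_SIZE_SKETCH));
}
```

Because the trailing labels are addressed relative to psize, an accurate psize is what lets the loader find labels 2 and 3.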
Toomas Soome
24e1a7ac77 r354253 did miss the fact that libzpool is built as fake kernel
We build libzpool kernel-like, so use the _FAKE_KERNEL check to include
the kernel API in libzpool.
2019-11-02 21:02:54 +00:00
Toomas Soome
25cf531ecd r354253 did miss lz4.c from sys/cddl/boot/zfs.
2019-11-02 15:08:19 +00:00
Toomas Soome
e499793e76 Remove duplicate lz4 implementations
Port illumos change: https://www.illumos.org/issues/11667

Move lz4.c out of zfs tree to opensolaris/common/lz4, adjust it to be
usable from kernel/stand/userland builds, so we can use just one single
source. Add lz4.h to declare lz4_compress() and lz4_decompress().

MFC after:	1 month
Differential Revision:	https://reviews.freebsd.org/D22037
2019-11-02 12:28:04 +00:00
Alexander Motin
a4d5fcadd8 FreeBSD'fy ZFS zlib zalloc/zfree callbacks.
The previous code came from OpenSolaris, which in my understanding requires
allocation size to be known to free memory.  To store that size previous
code allocated additional 8 byte header.  But I have noticed that zlib
with present settings allocates 64KB context buffers for each call, that
could be efficiently cached by UMA, but addition of those 8 bytes makes
them fall back to physical RAM allocations, that cause huge overhead and
lock congestion on small blocks.  Since FreeBSD's free() does not have
the size argument, switching to it solves the problem, increasing write
speed to ZVOLs with 4KB block size and GZIP compression on my 40-threads
test system from ~60MB/s to ~600MB/s.

MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2019-10-29 21:25:19 +00:00
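The allocator change described in a4d5fcadd8 can be sketched in userland terms. This is not the committed code: malloc() and free() stand in for the kernel allocator, and every function name below is hypothetical. The header variant shows why a 64KB request turns into a slightly larger one; the plain variant passes the request through unchanged.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Bytes requested from the allocator when a size header is prepended. */
size_t
hdr_request_size_sketch(unsigned int items, unsigned int size)
{
	return ((size_t)items * size + sizeof(uint64_t));
}

/* Old style: remember the size in an 8-byte header for a sized free. */
void *
zalloc_hdr_sketch(void *opaque, unsigned int items, unsigned int size)
{
	uint64_t *hdr;

	(void)opaque;
	hdr = malloc(hdr_request_size_sketch(items, size));
	if (hdr == NULL)
		return (NULL);
	*hdr = hdr_request_size_sketch(items, size);
	return (hdr + 1);	/* caller sees the memory after the header */
}

void
zfree_hdr_sketch(void *opaque, void *ptr)
{
	(void)opaque;
	assert(ptr != NULL);
	free((uint64_t *)ptr - 1);	/* step back to the header */
}

/* New style: free() needs no size, so no header is required. */
void *
zalloc_plain_sketch(void *opaque, unsigned int items, unsigned int size)
{
	(void)opaque;
	return (malloc((size_t)items * size));
}

void
zfree_plain_sketch(void *opaque, void *ptr)
{
	(void)opaque;
	free(ptr);
}
```

With the header, a 65536-byte context buffer becomes a 65544-byte request, which is the kind of size that no longer matches a cached allocation bucket.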
Toomas Soome
903fe2b762 loader: zio_checksum_verify should check byteswap
We do have both native and byteswap checksum callbacks in place but the
selection is not wired.

MFC after:	1 week
2019-10-27 08:35:29 +00:00
Alan Somers
1af3a11218 MFZoL: Avoid retrieving unused snapshot props
This patch modifies the zfs_ioc_snapshot_list_next() ioctl to enable it
to take input parameters that alter the way looping through the list of
snapshots is performed. The idea here is to restrict functions that
throw away some of the snapshots returned by the ioctl to a range of
snapshots that these functions actually use. This improves efficiency
and execution speed for some rollback and send operations.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes #8077
zfsonlinux/zfs@4c0883fb4a

MFC after:	2 weeks
2019-10-26 17:11:02 +00:00
Konstantin Belousov
5b87ecc643 Assert that vnode_pager_setsize() is called with the vnode exclusively locked
except for filesystems that set the MNTK_VMSETSIZE_BUG,  Set the flag for ZFS.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
Differential revision:	https://reviews.freebsd.org/D21883
2019-10-22 16:21:24 +00:00
Andriy Gapon
b6528d546f MFV r353637: 10844 Serialize ZTHR operations to eliminate races
illumos/illumos-gate@6a316e1f6d
6a316e1f6d

https://www.illumos.org/issues/10844
  ZoL 61c3391acc Serialize ZTHR operations to eliminate races

Portions contributed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
Obtained from:	illumos, ZoL
MFC after:	3 weeks
2019-10-16 09:29:01 +00:00
Andriy Gapon
428c45f156 MFV r353630: 10809 Performance optimization of AVL tree comparator functions
illumos/illumos-gate@c4ab0d3f46
c4ab0d3f46

https://www.illumos.org/issues/10809
  Port ZoL ee36c709c3 Performance optimization of AVL tree comparator functions

This is a followup to r337567 that imported the ZoL commit directly into
FreeBSD.  It seems that at the time we did not have some of the earlier
changes, so some pieces of the ZoL change were not applicable.  Also,
the illumos version got a few style cleanups.  Some changes were missed
or incorrectly merged (e.g., vdev_cache_lastused_compare and
metaslab_rangesize_compare).

Obtained from:	ZoL, illumos
MFC after:	25 days
X-MFC after:	r353634
2019-10-16 09:20:08 +00:00
Andriy Gapon
786c532a8f MFV r348596: 9689 zfs range lock code should not be zpl-specific
illumos/illumos-gate@7931524763

FreeBSD note: some tweaking was needed to avoid a conflict with
sys/rangelock.h.

Author:	Matthew Ahrens <mahrens@delphix.com>
Obtained from:	illumos
MFC after:	3 weeks
2019-10-16 09:04:53 +00:00
Andriy Gapon
f6a4b91c75 MFV r353628:
10842 Mutex leak in dsl_dataset_hold_obj()

illumos/illumos-gate@ad027c0ff9
ad027c0ff9

https://www.illumos.org/issues/10842
  ZoL d10b2f1d35 Mutex leak in dsl_dataset_hold_obj()

Portions contributed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Author: Jorgen Lundman <lundman@lundman.net>
Obtained from:	illumos, ZoL
MFC after:	15 days
2019-10-16 07:57:58 +00:00
Andriy Gapon
0c4f60b734 MFV r353619: 9691 fat zap should prefetch when iterating
illumos/illumos-gate@52abb70e07
52abb70e07

https://www.illumos.org/issues/9691
  When iterating over a ZAP object, we're almost always certain to
  iterate over the entire object. If there are multiple leaf blocks, we
  can realize a performance win by issuing reads for all the leaf blocks
  in parallel when the iteration begins.
  For example, if we have 10,000 snapshots, "zfs destroy -nv
  pool/fs@1%9999" can take 30 minutes when the cache is cold. This
  change provides a >3x performance improvement, by issuing the reads
  for all ~64 blocks of each ZAP object in parallel.

Author: Matthew Ahrens <mahrens@delphix.com>
Obtained from:	illumos
MFC after:	2 weeks
2019-10-16 07:09:00 +00:00
Andriy Gapon
9efb961d9a MFV r353617: 9425 allow channel programs to be stopped via signals
illumos/illumos-gate@d0cb1fb926
d0cb1fb926

https://www.illumos.org/issues/9425
  Problem Statement
  ZFS Channel program scripts currently require a timeout, so that hung
  or long-running scripts return a timeout error instead of causing ZFS
  to get wedged.  This limit can currently be set up to 100 million Lua
  instructions. Even with a limit in place, it would be desirable to
  have a sys admin (support engineer) be able to cancel a script that is
  taking a long time.

  Proposed Solution
  Make it possible to abort a channel program by sending an interrupt
  signal. In the underlying txg_wait_sync function, switch the cv_wait to
  a cv_wait_sig to catch the signal. Once a signal is encountered, the
  dsl_sync_task function can install a Lua hook that will get called
  before the Lua interpreter executes a new line of code. The
  dsl_sync_task can resume with a standard txg_wait_sync call and wait
  for the txg to complete. Meanwhile, the hook will abort the script and
  indicate that the channel program was canceled. The kernel returns
  EINTR to indicate that the channel program run was canceled.

FreeBSD note: the return value of cv_wait_sig() has inverted meaning
between us and illumos.

Author: Don Brady <don.brady@delphix.com>
Obtained from:	illumos
MFC after:	4 weeks
2019-10-16 07:00:18 +00:00
Andriy Gapon
179e6dab09 MFV r353615: 9485 Optimize possible split block search space
illumos/illumos-gate@a21fe34979
a21fe34979

https://www.illumos.org/issues/9485
  Port this commit from ZoL:
  4589f3ae4c

Author: Brian Behlendorf <behlendorf1@llnl.gov>
Obtained from:	illumos, ZoL
MFC after:	3 weeks
2019-10-16 06:43:22 +00:00
Andriy Gapon
7149963e95 MFV r353613: 10731 zfs: NULL pointer errors
FreeBSD already had these changes locally.
This commit removes a small formatting difference.

MFC after:	1 week
2019-10-16 06:38:05 +00:00
Andriy Gapon
6cb9ab2bad MFV r353611: 10330 merge recent ZoL vdev and metaslab changes
illumos/illumos-gate@a0b03b161c
a0b03b161c

https://www.illumos.org/issues/10330
  3 recent ZoL changes in the vdev and metaslab code which we can pull over:
  PR 8324 c853f382db 8324 Change target size of metaslabs from 256GB to 16GB
  PR 8290 b194fab0fb 8290 Factor metaslab_load_wait() in metaslab_load()
  PR 8286 419ba59145 8286 Update vdev_is_spacemap_addressable() for new spacemap
  encoding

Author: Serapheim Dimitropoulos <serapheimd@gmail.com>
Obtained from:	illumos, ZoL
MFC after:	2 weeks
2019-10-16 06:26:51 +00:00
Andriy Gapon
b399ca755a MFV r353608: 10165 libzpool: passing argument 1 to restrict-qualified parameter
illumos/illumos-gate@f91fcf59ac
f91fcf59ac

https://www.illumos.org/issues/10165

Author: Toomas Soome <tsoome@me.com>
MFC after:	10 days
2019-10-16 06:09:00 +00:00
Andriy Gapon
67f8ab8ebb fix up r353565, somehow a few files did not get committed
MFC after:	3 weeks
X-MFC with:	r353565
2019-10-15 15:52:01 +00:00
Andriy Gapon
6f2721b907 MFV r353561: 10343 ZoL: Prefix all refcount functions with zfs_
illumos/illumos-gate@e914ace2e9
e914ace2e9

https://www.illumos.org/issues/10343
  On the openzfs feature/porting matrix, this is listed as:
  prefix to refcount funcs/types
  Having these changes will make it easier to share other work across the
  different ZFS operating systems.
  PR 7963 424fd7c3e Prefix all refcount functions with zfs_
  PR 7885 & 7932 c13060e47 Linux 4.19-rc3+ compat: Remove refcount_t compat
  PR 5823 & 5842 4859fe796 Linux 4.11 compat: avoid refcount_t name conflict

Author: Tim Schumacher <timschumi@gmx.de>
Obtained from:	illumos, ZoL
MFC after:	3 weeks
2019-10-15 15:09:36 +00:00
Andriy Gapon
563db1a947 MFV r353558: 10572 10579 Fix race in dnode_check_slots_free()
illumos/illumos-gate@aa02ea0194
aa02ea0194

10572 Fix race in dnode_check_slots_free()
https://www.illumos.org/issues/10572
  The Fix from ZoL:
  Currently, dnode_check_slots_free() works by checking dn->dn_type
  in the dnode to determine if the dnode is reclaimable. However,
  there is a small window of time between dnode_free_sync() in the
  first call to dsl_dataset_sync() and when the useraccounting code
  is run when the type is set DMU_OT_NONE, but the dnode is not yet
  evictable, leading to crashes. This patch adds the ability for
  dnodes to track which txg they were last dirtied in and adds a
  check for this before performing the reclaim.

  This patch also corrects several instances when dn_dirty_link was
  treated as a list_node_t when it is technically a multilist_node_t.

10579 Don't allow dnode allocation if dn_holds != 0
https://www.illumos.org/issues/10579
  The fix from ZoL:
  This patch simply fixes a small bug where dnode_hold_impl() could
  attempt to allocate a dnode that was in the process of being freed,
  but which still had active references. This patch simply adds the
  required check.

Author: Tom Caputi <tcaputi@datto.com>
Reported by:	delphij
MFC after:	2 weeks
X-MFC with:	r353176
2019-10-15 14:29:18 +00:00
Andriy Gapon
4368589338 MFV r353551: 10452 ZoL: merge in large dnode feature fixes
illumos/illumos-gate@946342a260
946342a260

https://www.illumos.org/issues/10452
  illumos is missing a few small follow up ZoL bug fixes for the large dnode
  feature. We should pull those in.
  Those commits are in the ZoL tree as (newest to oldest):
  PR 8435 - 75d6b7ddca - Add missing copyright
  notice to large_dnode tests
  PR 7433 - e14a32b1c8 - Fix object reclaim when
  using large dnodes
  PR 6616 - 48fbb9ddbf - Free objects when
  receiving full stream as clone
  PR 6695 - 39f56627ae - receive_freeobjects()
  skips freeing some object

Portions contributed by: Ned Bass <bass6@llnl.gov>
Portions contributed by: Tom Caputi <tcaputi@datto.com>
Author: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Obtained from:	illumos, ZoL
MFC after:	2 weeks
X-MFC with:	r353176
2019-10-15 14:20:11 +00:00
Jeff Roberson
0012f373e4 (4/6) Protect page valid with the busy lock.
Atomics are used for page busy and valid state when the shared busy is
held.  The details of the locking protocol and valid and dirty
synchronization are in the updated vm_page.h comments.

Reviewed by:    kib, markj
Tested by:      pho
Sponsored by:   Netflix, Intel
Differential Revision:        https://reviews.freebsd.org/D21594
2019-10-15 03:45:41 +00:00
Mateusz Guzik
8fd727827c zfs: use MNTK_NOMSYNC
Reviewed by:	kib
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D22009
2019-10-13 15:41:30 +00:00
Andriy Gapon
d0a9542494 fix up r353340, don't assume that fcmpset has strong semantics
fcmpset can have two kinds of semantics, weak and strong.
For practical purposes, strong semantics means that if fcmpset fails
then the reported current value is always different from the expected
value.  Weak semantics means that the reported current value may be the
same as the expected value even though fcmpset failed.  That's a so
called "sporadic" failure.

I originally implemented atomic_cas expecting strong semantics, but many
platforms actually have the weak one.

Reported by:	pkubaj (not confirmed if same issue)
Discussed with:	kib, mjg
MFC after:	19 days
X-MFC with:	r353340
2019-10-11 17:01:02 +00:00
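The weak versus strong distinction described in d0a9542494 can be sketched with C11 atomics. Here atomic_compare_exchange_weak stands in for FreeBSD's atomic_fcmpset_32, and cas32_sketch is a hypothetical name; this is an illustration of the retry idea under those assumptions, not the committed code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compare-and-swap that returns the value seen in *target before the
 * operation, built on a compare-exchange that is allowed to fail
 * sporadically.
 */
uint32_t
cas32_sketch(_Atomic uint32_t *target, uint32_t cmp, uint32_t newval)
{
	uint32_t expect = cmp;

	assert(target != NULL);
	/*
	 * After a sporadic failure the reported value still equals cmp,
	 * so the only safe reaction is to retry.  Stop on success or when
	 * the reported value genuinely differs from cmp.
	 */
	while (!atomic_compare_exchange_weak(target, &expect, newval)) {
		if (expect != cmp)
			break;
	}
	return (expect);
}
```

A caller that assumed strong semantics would treat any failure as "the value differed" and could return cmp for an update that never happened; the loop avoids that.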
Alan Somers
34e9a37f4d MFZoL: Fix performance of "zfs recv" with many deletions
This patch fixes 2 issues with the DMU free throttle implemented
in dmu_free_long_range(). The first issue is that get_next_chunk()
was calculating the number of L1 blocks the free would dirty
incorrectly. In some cases involving extremely large files, this
code would greatly overestimate the number of affected L1 blocks,
causing excessive calls to txg_wait_open(). This patch corrects
the calculation.

The second issue is that the free throttle uses the total number
of free'd blocks in all (open, quiescing, and syncing) txgs to
determine whether to throttle. This causes large frees (such as
those created by the first issue) to cause 4 txg syncs before
any further frees were allowed to proceed. This patch ensures
that the accounting is done entirely in a per-txg fashion, so
that frees from a given txg don't affect those that immediately
follow it.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
zfsonlinux/zfs@f4c594da94

Freeing throttle should account for holes

Deletion throttle currently does not account for holes in a file.
This means that it can activate when it shouldn't.
To fix it we switch the throttle to be based on the number of
L1 blocks we will have to dirty when freeing.

Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
zfsonlinux/zfs@65282ee9e0

Submitted by:	Alek Pinchuk <pinchuk.alek@gmail.com>
Reviewed by:	allanjude
MFC after:	2 weeks
Sponsored by:	Axcient
Differential Revision:	https://reviews.freebsd.org/D21895
2019-10-11 14:59:28 +00:00
Andriy Gapon
d0c0856f63 emulate illumos membar_producer with atomic_thread_fence_rel
membar_producer is supposed to be a store-store barrier.
Also, in the code that FreeBSD has ported from illumos membar_producer
is used only with regular stores to regular memory (with respect to
caching).

We do not have an MI primitive for the store-store barrier, so
atomic_thread_fence_rel is the closest we have as it provides
(load | store) -> store barrier.

Previously, membar_producer was an empty function call on all 32-bit
arm-s, 32-bit powerpc, riscv and all mips variants.  I think that it was
inadequate.
On other platforms, such as amd64, arm64, i386, powerpc64, sparc64,
membar_producer was implemented using stronger primitives than required
for a store-store barrier with respect to regular memory access.
For example, it used sfence on amd64 and lock-ed nop in i386 (despite TSO).
On powerpc64 we now use recommended lwsync instead of eieio.
On sparc64 FreeBSD uses TSO mode.
On arm64/aarch64 we now use dmb sy instead of dmb ish.  Not sure if this
is an improvement, actually.

After this change we can drop opensolaris_atomic.S for aarch64, amd64,
powerpc64 and sparc64 as all required atomic operations have either
direct or light-weight mapping to FreeBSD native atomic operations.

Discussed with:	kib
MFC after:	4 weeks
2019-10-10 07:39:41 +00:00
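The barrier mapping described in d0c0856f63 can be sketched with C11 atomics, where atomic_thread_fence(memory_order_release) plays the role of FreeBSD's atomic_thread_fence_rel. The mailbox structure and all _sketch names are hypothetical and exist only to show where a producer-side barrier sits.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Store-store barrier emulated with a (load | store) -> store fence. */
static inline void
membar_producer_sketch(void)
{
	atomic_thread_fence(memory_order_release);
}

struct mailbox_sketch {
	uint64_t	payload;	/* regular data */
	atomic_int	ready;		/* flag a consumer polls */
};

/* Publish the payload, then raise the flag, in that order. */
void
publish_sketch(struct mailbox_sketch *mb, uint64_t value)
{
	assert(mb != NULL);
	mb->payload = value;
	membar_producer_sketch();	/* payload store ordered before flag */
	atomic_store_explicit(&mb->ready, 1, memory_order_relaxed);
}
```

A consumer would need a matching acquire-side barrier after observing the flag; that half is omitted here because the commit only concerns the producer side.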
Andriy Gapon
f5c4c7209b cleanup of illumos compatibility atomics
atomic_cas_32 is implemented using atomic_fcmpset_32 on all platforms.
Ditto for atomic_cas_64 and atomic_fcmpset_64 on platforms that have it.
The only exception is sparc64 that provides MD atomic_cas_32 and
atomic_cas_64.
This is slightly inefficient as fcmpset reports whether the operation
updated the target and that information is not needed for cas.
Nevertheless, there is less code to maintain and to add for new platforms.
Also, the operations are done inline now as opposed to function calls before.

atomic_add_64_nv is implemented using atomic_fetchadd_64 on platforms
that provide it.

casptr, cas32, atomic_or_8, atomic_or_8_nv are completely removed as they
have no users.

atomic_mtx that is used to emulate 64-bit atomics on platforms that lack
them is defined only on those platforms.

As a result, platform specific opensolaris_atomic.S files have lost most of
their code.  The only exception is i386 where the compat+contrib code
provides 64-bit atomics for userland use.  That code assumes availability of
cmpxchg8b instruction.  FreeBSD does not have that assumption for i386
userland and does not provide 64-bit atomics.  Hopefully, this can and will
be fixed.

MFC after:	3 weeks
2019-10-09 11:26:36 +00:00
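The atomic_add_64_nv mapping mentioned in f5c4c7209b comes down to one adjustment: a fetch-and-add returns the old value, while the illumos-style "_nv" operation returns the new one. A minimal sketch, with C11 atomic_fetch_add standing in for atomic_fetchadd_64 and a hypothetical function name:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Add delta atomically and return the resulting (new) value. */
uint64_t
add64_nv_sketch(_Atomic uint64_t *target, int64_t delta)
{
	assert(target != NULL);
	/* fetch_add yields the old value; add delta again to get the new. */
	return (atomic_fetch_add(target, (uint64_t)delta) + (uint64_t)delta);
}
```

Negative deltas work through unsigned wraparound, so the same helper covers decrements.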
Andriy Gapon
ac99b25298 zfs: use atomic_load_64 to read atomic variable in dmu_object_alloc_impl
As long as we support ZFS on 32-bit platforms we should do this for all
64-bit variables that are modified in a lockless fashion using atomic
operations.  Otherwise, there is a risk of reading a torn value.

Here is a rationale for why I am doing this in dmu_object_alloc_impl:
- it's very recent code
- the code deals with object IDs and a number of objects in a file
  system can overflow 32 bits
- incorrect allocation of an object ID may result in hard to debug
  problems
- fixing all plain reads of 64-bit atomic variables is not a trivial
  undertaking to do in one shot, so I chose to do it incrementally

MFC after:	3 weeks
X-MFC after:	r353301, r353176
2019-10-08 11:27:48 +00:00
Andriy Gapon
3251c5ae51 fix up r353168, add atomic_swap_64 to i386 version of opensolaris_atomic.S
The compatibility code for the atomic operations in ZFS code is a bit
messy.  In some cases the native definitions are directly made
available, in some cases there are emulated operations in
opensolaris_atomic.c and in yet other cases there are atomic operations
implemented in assembly that were obtained from OpenSolaris / illumos.

This commit adds atomic_swap_64 for use with i386 userland.
The code is copied from illumos.

I am not sure why FreeBSD does not provide that operation natively.
Maybe because we try (or pretend) to support processors that did not
have the necessary instructions.

While here I also added atomic_load_64 for the same reasons.
This is original code based on illumos atomic_swap_64 and FreeBSD
atomic_load_acq_64_i586.

Pointyhat to:	avg
MFC after:	1 week
2019-10-07 12:53:27 +00:00
Andriy Gapon
862c20fd89 MFV r350898, r351075: 8423 8199 7432 Implement large_dnode pool feature
8423 8199 7432 Implement large_dnode pool feature

7432 Large dnode pool feature
8199 multi-threaded dmu_object_alloc()
8423 Implement large_dnode pool feature
10406 large_dnode changes broke zfs recv of legacy stream

illumos/illumos-gate@54811da5ac
54811da5ac
https://www.illumos.org/issues/8423
https://www.illumos.org/issues/8199
https://www.illumos.org/issues/7432

illumos/illumos-gate@811964cd9f
811964cd9f
https://www.illumos.org/issues/10406

  ZoL issues:
  Improved dnode allocation #6564
  Clean up large dnode code #6262
  Fix dnode_hold() freeing dnode behavior #8172
  Fix dnode allocation race #6414, #6439
  Partial: Raw sends must be able to decrease nlevels #6821, #6864
  Remove unnecessary txg syncs from receive_object() Closes #7197

This updates FreeBSD large_dnode code (that was imported from ZoL) to a
version that was committed to illumos.  It has some cleanups,
improvements and fixes comparing to what we have in FreeBSD now.
I think that the most significant update is 8199 multi-threaded
dmu_object_alloc().

This commit reverts r351077 that was a revert of r351074 and r351076 and
restores those changes.  Required atomic operations should be available
now on all platforms where we build ZFS.

Obtained from:	illumos
MFC after:	3 weeks
2019-10-07 08:14:45 +00:00
Andriy Gapon
f1fd3654e7 ZFS: unconditionally use atomic_swap_64
Previously, the code used a plain store on platforms that lacked
atomic_swap_64 and possibly some other platforms as the condition worked
only if atomic_swap_64 was a macro.

MFC after:	1 week
X-MFC after:	r353166, r353167
2019-10-07 08:00:54 +00:00
Andriy Gapon
cf78c4beae ZFS: add emulation of atomic_swap_64 and atomic_load_64
Some 32-bit platforms do not provide 64-bit atomic operations that ZFS
requires, either in userland or at all.  We emulate those operations for
those platforms using a mutex.  That is not entirely correct and it's
not very efficient.  Besides, the loads are plain loads, so torn values are
possible.

Nevertheless, the emulation seems to work for some definition of work.

This change adds atomic_swap_64, which is already used in ZFS code, and
atomic_load_64 that can be used to prevent torn reads.
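
A minimal userland sketch of the mutex-based emulation described above.
The emul_* names and the single global pthread mutex are assumptions made
for illustration; they are not the identifiers used by this change.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: one global lock serializes every emulated operation. */
static pthread_mutex_t emul_lock = PTHREAD_MUTEX_INITIALIZER;

/* Store 'value' into *p and return the previous contents, under the lock. */
static uint64_t
emul_atomic_swap_64(volatile uint64_t *p, uint64_t value)
{
	uint64_t old;

	assert(p != NULL);
	pthread_mutex_lock(&emul_lock);
	old = *p;
	*p = value;
	pthread_mutex_unlock(&emul_lock);
	return (old);
}

/* Read *p under the same lock so a concurrent swap cannot tear the value. */
static uint64_t
emul_atomic_load_64(volatile uint64_t *p)
{
	uint64_t v;

	assert(p != NULL);
	pthread_mutex_lock(&emul_lock);
	v = *p;
	pthread_mutex_unlock(&emul_lock);
	return (v);
}
```

The lock only protects against other callers of these two functions; a
plain 64-bit load or store elsewhere can still observe or produce a torn
value on a 32-bit platform.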

MFC after:	1 week
2019-10-07 07:54:34 +00:00
Mateusz Guzik
818631b634 zfs: add root vnode caching
This replaces the approach added in r338927.

See r353150.

Sponsored by:	The FreeBSD Foundation
2019-10-06 22:16:00 +00:00
Mariusz Zaborski
5eb65c4ce5 dtrace: 64-bits registers support
The registers in illumos and FreeBSD are numbered differently.
In illumos, the last 32-bit register defined is SS, and in FreeBSD it is GS.
While translating a register we should compare it to the highest one.

PR:             240358
Reported by:    lwhsu@
MFC after:      2 weeks
2019-10-04 16:17:00 +00:00
Andriy Gapon
912c3fe715 ZFS: add bookmark renaming
The feature is implemented as an extension of the existing
ZFS_IOC_RENAME ioctl.  Both the userland and the DSL interfaces support
renaming only a single bookmark at a time.  As of now, there is no ZCP
interface to the new functionality.  I am going to add it once the DSL
interface passes a test of time.

This change picks up support for zfs_ioc_namecheck_t::ENTITY_NAME that
was added to ZoL as part of Redacted Send/Receive feature by Paul
Dagnelie <pcd@delphix.com>.  This is needed to allow a bookmark name in
zc_name.

Discussed with:	mahrens
Reviewed by:	bcr (man page)
Sponsored by:	CyberSecure
Differential Revision: https://reviews.freebsd.org/D21795
2019-10-03 11:08:45 +00:00
Alexander Motin
e9f4580d92 Improve latency of synchronous 128KB writes.
Before my ZIL space optimization a few years ago, 128KB writes were logged
as two 64KB+ records in two 128KB log blocks.  After that change it became
~124KB+/4KB+ in two 128KB log blocks to free space in the second block
for another record.  Unfortunately, in the case of 128KB-only writes, when
space in the second block remained unused, that change increased write
latency by unbalancing checksum computation time between parallel threads.

This change introduces a new 68KB log block size, used for both writes below
67KB and writes of exactly 128KB.  Writes of 68-127KB still use one 128KB
block so as not to increase processing overhead.  Writes above 131KB still
use full 128KB blocks, since the possible saving there is small.  Mixed loads
will likely also fall back to the previous 128KB, since the code uses the
maximum of the last 10 requested block sizes.

On a simple 128KB write test with queue depth of 1 this change demonstrates
~15-20% performance improvement.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-10-01 20:09:25 +00:00
Mark Johnston
9093dd9a66 Implement x86 dtrace_invop_(un)init() in C.
There is no reason for these routines to be written in assembly.  In
the ports of DTrace to other platforms, they are already written in C.
No functional change intended.

MFC after:	1 week
Sponsored by:	Netflix
2019-09-23 15:08:17 +00:00
Andriy Gapon
38a1def12f MFZoL: Retire send space estimation via ZFS_IOC_SEND
Add a small wrapper around libzfs_core's lzc_send_space() to libzfs so
that every legacy ZFS_IOC_SEND consumer, along with their userland
counterpart estimate_ioctl(), can leverage ZFS_IOC_SEND_SPACE to
request send space estimation.

The legacy functionality in zfs_ioc_send() is left untouched for
compatibility purposes.

Obtained from:	ZoL
Obtained from:	zfsonlinux/zfs@cf7684bc8d
Author:		loli10K <ezomori.nozomu@gmail.com>
MFC after:	2 weeks
2019-09-22 08:44:41 +00:00
Andriy Gapon
6caa629e73 fix dsl_scan_ds_clone_swapped logic
It was incorrect with respect to swapping dataset IDs both in the
on-disk ZAP object and the in-memory queue.

In both cases, if only ds1 was already present, then it would be first
replaced with ds2 and then ds2 would be replaced back with ds1.  Also,
both cases did not properly handle a situation where both ds1 and ds2
are already queued.  A duplicate insertion would be attempted and its
failure would result in a panic.
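
One way to avoid both problems is to decide which case applies from the
state before any modification.  The sketch below uses a small array-backed
set as a stand-in for the queue; struct idset and the idset_* helpers are
hypothetical and are not the ZAP or AVL interfaces touched by this change.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define	QUEUE_MAX	8

/* Hypothetical stand-in for the scan queue: a small set of dataset IDs. */
struct idset {
	uint64_t ids[QUEUE_MAX];
	size_t n;
};

static bool
idset_contains(const struct idset *s, uint64_t id)
{
	for (size_t i = 0; i < s->n; i++)
		if (s->ids[i] == id)
			return (true);
	return (false);
}

/* Replace 'from' with 'to' in place; 'to' must not already be present. */
static void
idset_replace(struct idset *s, uint64_t from, uint64_t to)
{
	assert(!idset_contains(s, to));
	for (size_t i = 0; i < s->n; i++)
		if (s->ids[i] == from)
			s->ids[i] = to;
}

/* Swap the roles of ds1 and ds2, testing membership only once up front. */
static void
idset_clone_swapped(struct idset *s, uint64_t ds1, uint64_t ds2)
{
	bool has1 = idset_contains(s, ds1);
	bool has2 = idset_contains(s, ds2);

	if (has1 == has2)
		return;		/* both or neither queued: set is unchanged */
	if (has1)
		idset_replace(s, ds1, ds2);
	else
		idset_replace(s, ds2, ds1);
}
```

Because membership is sampled before either replacement, ds1 is never
swapped to ds2 and back again, and no duplicate insertion is attempted
when both IDs are already queued.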

This change has also been submitted to ZoL as zfsonlinux/zfs@dd262c9

PR:		239566
Reported by:	pascal.guitierrez@gmail.com
MFC after:	4 days
Sponsored by:	CyberSecure
2019-09-19 09:43:56 +00:00
Alexander Motin
f4897c94dd Fix typo, setting hidden flag instead of reparse.
Submitted by:	Ryan Moeller <ryan@ixsystems.com>
MFC after:	3 days
Sponsored by:	iXsystems, Inc.
2019-09-18 19:33:08 +00:00
Mateusz Guzik
a8c8e44bf0 vfs: manage mnt_ref with atomics
A new primitive is introduced to denote sections which can operate locklessly
on aspects of struct mount, but which can also be disabled if necessary.
This provides an opportunity to start scaling common case modifications
while providing stable state of the struct when facing unmount, write
suspension or other events.

mnt_ref is the first counter to start being managed in this manner with
the intent to make it per-cpu.

Reviewed by:	kib, jeff
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21425
2019-09-16 21:31:02 +00:00
Mark Johnston
e8bcf6966b Revert r352406, which contained changes I didn't intend to commit.
2019-09-16 15:04:45 +00:00
Mark Johnston
41fd4b9422 Fix a couple of nits in r352110.
- Remove a dead variable from the amd64 pmap_extract_and_hold().
- Fix grammar in the vm_page_wire man page.

Reported by:	alc
Reviewed by:	alc, kib
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21639
2019-09-16 15:03:12 +00:00
Jeff Roberson
c75757481f Replace redundant code with a few new vm_page_grab facilities:
- VM_ALLOC_NOCREAT will grab without creating a page.
- vm_page_grab_valid() will grab and page in if necessary.
- vm_page_busy_acquire() automates some busy acquire loops.

Discussed with:	alc, kib, markj
Tested by:	pho (part of larger branch)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21546
2019-09-10 19:08:01 +00:00
Jeff Roberson
4cdea4a853 Use the sleepq lock rather than the page lock to protect against wakeup
races with page busy state.  The object lock is still used as an interlock
to ensure that the identity stays valid.  Most callers should use
vm_page_sleep_if_busy() to handle the locking particulars.

Reviewed by:	alc, kib, markj
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21255
2019-09-10 18:27:45 +00:00
Mark Johnston
fee2a2fa39 Change synchronization rules for vm_page reference counting.
There are several mechanisms by which a vm_page reference is held,
preventing the page from being freed back to the page allocator.  In
particular, holding the page's object lock is sufficient to prevent the
page from being freed; holding the busy lock or a wiring is sufficient as
well.  These references are protected by the page lock, which must
therefore be acquired for many per-page operations.  This results in
false sharing since the page locks are external to the vm_page
structures themselves and each lock protects multiple structures.

Transition to using an atomically updated per-page reference counter.
The object's reference is counted using a flag bit in the counter.  A
second flag bit is used to atomically block new references via
pmap_extract_and_hold() while removing managed mappings of a page.
Thus, the reference count of a page is guaranteed not to increase if the
page is unbusied, unmapped, and the object's write lock is held.  As
a consequence of this, the page lock no longer protects a page's
identity; operations which move pages between objects are now
synchronized solely by the objects' locks.

The vm_page_wire() and vm_page_unwire() KPIs are changed.  The former
requires that either the object lock or the busy lock is held.  The
latter no longer has a return value and may free the page if it releases
the last reference to that page.  vm_page_unwire_noq() behaves the same
as before; the caller is responsible for checking its return value and
freeing or enqueuing the page as appropriate.  vm_page_wire_mapped() is
introduced for use in pmap_extract_and_hold().  It fails if the page is
concurrently being unmapped, typically triggering a fallback to the
fault handler.  vm_page_wire() no longer requires the page lock and
vm_page_unwire() now internally acquires the page lock when releasing
the last wiring of a page (since the page lock still protects a page's
queue state).  In particular, synchronization details are no longer
leaked into the caller.

The change excises the page lock from several frequently executed code
paths.  In particular, vm_object_terminate() no longer bounces between
page locks as it releases an object's pages, and direct I/O and
sendfile(SF_NOCACHE) completions no longer require the page lock.  In
these latter cases we now get linear scalability in the common scenario
where different threads are operating on different files.

__FreeBSD_version is bumped.  The DRM ports have been updated to
accommodate the KPI changes.

Reviewed by:	jeff (earlier version)
Tested by:	gallatin (earlier version), pho
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D20486
2019-09-09 21:32:42 +00:00
Andriy Gapon
b539c9bfbd ZFS: Always refuse receiving non-resume stream when resume state exists
This fixes a hole in the situation where the resume state is left from
receiving a new dataset and, so, the state is set on the dataset itself
(as opposed to %recv child).

Additionally, distinguish incremental and resume streams in error
messages.

This was also committed to ZoL:
zfsonlinux/zfs@ebeb6f23bf

MFC after:	2 weeks
Sponsored by:	CyberSecure
2019-09-04 07:33:22 +00:00
Mateusz Guzik
e3c3248cc7 vfs: implement usecount implying holdcnt
vnodes have 2 reference counts - holdcnt to keep the vnode itself from getting
freed and usecount to denote it is actively used.

Previously all operations bumping usecount would also bump holdcnt, which is
not necessary. We can detect if usecount is already > 1 (in which case holdcnt
is also > 1) and utilize it to avoid bumping holdcnt on our own. This saves
on atomic ops.

Reviewed by:	kib
Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D21471
2019-09-03 15:42:11 +00:00
Mark Johnston
08cfa56ea3 Extend uma_reclaim() to permit different reclamation targets.
The page daemon periodically invokes uma_reclaim() to reclaim cached
items from each zone when the system is under memory pressure.  This
is important since the size of these caches is unbounded by default.
However it also results in bursts of high latency when allocating from
heavily used zones as threads miss in the per-CPU caches and must
access the keg in order to allocate new items.

With r340405 we maintain an estimate of each zone's usage of its
(per-NUMA domain) cache of full buckets.  Start making use of this
estimate to avoid reclaiming the entire cache when under memory
pressure.  In particular, introduce TRIM, DRAIN and DRAIN_CPU
verbs for uma_reclaim() and uma_zone_reclaim().  When trimming, only
items in excess of the estimate are reclaimed.  Draining a zone
reclaims all of the cached full buckets (the previous behaviour of
uma_reclaim()), and may further drain the per-CPU caches in extreme
cases.
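
The difference between trimming and draining can be sketched as a small
function.  The enum and function names are hypothetical, chosen for
illustration only, and the counts are in cached items rather than buckets.

```c
#include <stdint.h>

/* Hypothetical request names; the message above calls them TRIM and DRAIN. */
enum reclaim_req {
	RECLAIM_TRIM,
	RECLAIM_DRAIN
};

/*
 * Return how many cached items a request would reclaim, given the current
 * cache size and the estimate of the zone's working set.
 */
static uint64_t
reclaim_count(enum reclaim_req req, uint64_t cached, uint64_t estimate)
{
	if (req == RECLAIM_DRAIN)
		return (cached);	/* everything in the cache goes */
	/* TRIM: only the excess over the estimate is reclaimed. */
	return (cached > estimate ? cached - estimate : 0);
}
```

A heavily used zone whose cache is at or below its estimate loses nothing
on a trim, which is what avoids the burst of cache misses afterwards.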

Now, when under memory pressure, the page daemon will trim zones
rather than draining them.  As a result, heavily used zones do not incur
bursts of bucket cache misses following reclamation, but large, unused
caches will be reclaimed as before.

Reviewed by:	jeff
Tested by:	pho (an earlier version)
MFC after:	2 months
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D16667
2019-09-01 22:22:43 +00:00
Mateusz Guzik
e9fff745a7 zfs: fix snapshot dir destruction after introduction of VOP_NEED_INACTIVE
Reported by:	lwhsu
PR:		240221
Sponsored by:	The FreeBSD Foundation
2019-08-31 13:24:22 +00:00
Konstantin Belousov
6470c8d3db Rework v_object lifecycle for vnodes.
The current implementation of vnode_create_vobject() and
vnode_destroy_vobject() is written so that it is prepared to handle the
vm object destruction for a live vnode.  Practically, no filesystems use
this, except for some remnants that were present in UFS till today.
One of the consequences of that model is that each filesystem must
call vnode_destroy_vobject() in VOP_RECLAIM() or earlier; as a result,
all of them get rid of the v_object in reclaim.

Move the call to vnode_destroy_vobject() to vgonel() before
VOP_RECLAIM().  This makes v_object stable: either the object is NULL,
or it is a valid vm object until the vnode reclamation.  Remove code from
vnode_create_vobject() to handle races with the parallel destruction.

Reviewed by:	markj
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D21412
2019-08-29 07:50:25 +00:00
Andriy Gapon
4b4d1b818e zfs_ioc_snapshot: check user-prop permissions on snapshotted datasets
Previously, the permissions were checked on the pool which was obviously
incorrect.

After this change, zfs_check_userprops() only validates the properties
without any permission checks.  The permissions are checked individually
for each snapshotted dataset.

This was also committed to ZoL: zfsonlinux/zfs@e6203d2

Reported by:	CyberSecure
MFC after:	1 week
Sponsored by:	CyberSecure
2019-08-29 07:19:06 +00:00
Mark Johnston
772dd133c6 Avoid direct accesses of the vm_page wire_count field.
No functional change intended.

Sponsored by:	Netflix
2019-08-28 18:01:54 +00:00
Jeff Roberson
cf27e0d125 Use an atomic reference count for paging in progress so that callers do not
require the object lock.

Reviewed by:	markj
Tested by:	pho (as part of a larger branch)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D21311
2019-08-19 23:09:38 +00:00
Andriy Gapon
97b342a310 zfs_vget: fix vnode reference count leak in error path
If vn_lock() failed, then the function returned the error but the vnode
obtained via zfs_zget() was never released.

MFC after:	10 days
Sponsored by:	Panzura
2019-08-17 09:23:03 +00:00
Andriy Gapon
5a75f51a1f Revert r351076 and r351074 because of atomic_swap_64 on 32-bit platforms
Trying to sort it out.
2019-08-15 15:27:58 +00:00
Andriy Gapon
30f7381b8e MFV r351075: 10406 large_dnode changes broke zfs recv of legacy stream
illumos/illumos-gate@811964cd9f
811964cd9f

https://www.illumos.org/issues/10406
  The large dnode changes from 8423 caused problems in zfs recv for a legacy
  stream. This manifests when attempting to mount the received stream, but the
  problem is in the receive code. We missed the following commit from ZoL which
  fixes this.
  commit da2feb42fb
  Author: Tom Caputi <tcaputi@datto.com>
  Date: Thu Jun 28 17:55:11 2018 -0400
  Fix 'zfs recv' of non large_dnode send streams
  Currently, there is a bug where older send streams without the
      DMU_BACKUP_FEATURE_LARGE_DNODE flag are not handled correctly.
      The code in receive_object() fails to handle cases where
      drro->drr_dn_slots is set to 0, which is always the case when the
      sending code does not support this feature flag. This patch fixes
      the issue by ensuring that a value of 0 is treated as
      DNODE_MIN_SLOTS.

Author: Tom Caputi <tcaputi@datto.com>

MFC after:	3 weeks
X-MFC after:	r351074
2019-08-15 15:11:20 +00:00
Andriy Gapon
93132b76cd MFV r350898: 8423 8199 7432 Implement large_dnode pool feature
8423 8199 7432 Implement large_dnode pool feature

8423 Implement large_dnode pool feature
8199 multi-threaded dmu_object_alloc()
7432 Large dnode pool feature

illumos/illumos-gate@54811da5ac
54811da5ac
https://www.illumos.org/issues/8423
https://www.illumos.org/issues/8199
https://www.illumos.org/issues/7432

  ZoL issues:
  Improved dnode allocation #6564
  Clean up large dnode code #6262
  Fix dnode_hold() freeing dnode behavior #8172
  Fix dnode allocation race #6414, #6439
  Partial: Raw sends must be able to decrease nlevels #6821, #6864
  Remove unnecessary txg syncs from receive_object() Closes #7197

This updates FreeBSD large_dnode code (that was imported from ZoL) to a version
that was committed to illumos.  It has some cleanups, improvements and fixes
compared to what we have in FreeBSD now.  I think that the most significant
update is 8199 multi-threaded dmu_object_alloc().

Obtained from:	illumos
MFC after:	3 weeks
2019-08-15 14:57:27 +00:00
Andriy Gapon
4139761bb5 MFV r350896: 6585 sha512, skein, and edonr have an unenforced dependency on extensible dataset
illumos/illumos-gate@892586e8a1
892586e8a1

https://www.illumos.org/issues/6585
  In any pool without the extensible dataset feature flag already enabled,
  creating a dataset with dedup set to use one of the new checksums would result
  in the following panic as soon as any data was added:
  panic[cpu0]/thread=ffffff0006761c40: feature_get_refcount(spa, feature,
  &refcount) != 48 (0x30 != 0x30), file: ../../common/fs/zfs/zfeature.c line 390

  ffffff0006761830 fffffffffba8fbdd ()
  ffffff0006761890 zfs:feature_do_action+11a ()
  ffffff00067618c0 zfs:spa_feature_incr+1e ()
  ffffff0006761920 zfs:dmu_object_zapify+b7 ()
  ffffff00067619b0 zfs:dsl_dataset_activate_feature+97 ()
  ffffff0006761a20 zfs:dsl_dataset_sync+ba ()
  ffffff0006761ab0 zfs:dsl_pool_sync+153 ()
  ffffff0006761b70 zfs:spa_sync+26e ()
  ffffff0006761c20 zfs:txg_sync_thread+227 ()
  ffffff0006761c30 unix:thread_start+8 ()
  Inspection showed that feature->fi_feature was 7, which is the value of
  SPA_FEATURE_EXTENSIBLE_DATASET in the spa_feature enum.
  Testing shows that the panic can be prevented by explicitly setting extensible
  dataset as a dependency for the sha512, edonr, and skein feature flags.
  Alternatively, the new checksums code could possibly be changed to obviate the
  need for the dependency.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: ilovezfs <ilovezfs@icloud.com>

Note that FreeBSD does not support edonr yet.

MFC after:	2 weeks
2019-08-12 11:42:16 +00:00
Andriy Gapon
cfe94339f2 a stop gap fix for a race between dnode_hold and dnode_sync_free
The race was introduced in r337669, the large dnode feature import from
ZoL.  The problem was debugged by ZoL developers and then,
independently, on FreeBSD.

The fix is an early proposal by Brian Behlendorf:
50f32ed74e
This fix never went into ZoL.  A larger change that was committed later
included a different solution because of the re-worked code.

Ideally, we want to revert this fix and re-synchronize FreeBSD large
dnode code with that in illumos (or newer ZoL).  illumos has a later
import of the feature from ZoL that does not have the bug.

PR:		236480
Obtained from:	Brian Behlendorf <behlendorf1@llnl.gov>
Submitted by:	ncrogers@gmail.com (patch adaptation)
Reported by:	ncrogers@gmail.com
Tested by:	ncrogers@gmail.com,
		Dennis Noordsij <dennis.noordsij@alumni.helsinki.fi>,
		Julien Cigar <julien@perdition.city>
MFC after:	10 days
2019-08-12 10:30:00 +00:00
Toomas Soome
b1b9326846 loader: support com.delphix:removing
We should support removing a vdev from the boot pool. Update loader zfs reader
to support com.delphix:removing.

Reviewed by:	allanjude
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D18901
2019-08-08 18:08:13 +00:00
Xin LI
0ed1d6fb00 Allow Kernel to link in both legacy libkern/zlib and new sys/contrib/zlib,
with an eventual goal to convert all legacy zlib callers to the new zlib
version:

 * Move generic zlib shims that are not specific to zlib 1.0.4 to
   sys/dev/zlib.
 * Connect new zlib (1.2.11) to the zlib kernel module, currently built
   with Z_SOLO.
 * Prefix the legacy zlib (1.0.4) with 'zlib104_' namespace.
 * Convert sys/opencrypto/cryptodeflate.c to use new zlib.
 * Remove bundled zlib 1.2.3 from ZFS and adapt it to new zlib and make
   it depend on the zlib module.
 * Fix Z_SOLO build of new zlib.

PR:		229763
Submitted by:	Yoshihiro Ota <ota j email ne jp>
Reviewed by:	markm (sys/dev/zlib/zlib_kmod.c)
Relnotes:	yes
Differential Revision:	https://reviews.freebsd.org/D19706
2019-08-01 06:35:33 +00:00
Mark Johnston
61f2f0bae6 Fix FASTTRAPIOC_GETINSTR.
This ioctl is used when a breakpoint is encountered while disassembling
a symbol in the target process.  Since only one DTrace consumer can
toggle or enumerate fasttrap probes from a given process at a time, this
ioctl does not appear to be used in practice.
2019-07-17 16:38:29 +00:00
Mark Johnston
eeacb3b02f Merge the vm_page hold and wire mechanisms.
The hold_count and wire_count fields of struct vm_page are separate
reference counters with similar semantics.  The remaining essential
differences are that holds are not counted as a reference with respect
to LRU, and holds have an implicit free-on-last unhold semantic whereas
vm_page_unwire() callers must explicitly determine whether to free the
page once the last reference to the page is released.

This change removes the KPIs which directly manipulate hold_count.
Functions such as vm_fault_quick_hold_pages() now return wired pages
instead.  Since r328977 the overhead of maintaining LRU for wired pages
is lower, and in many cases vm_fault_quick_hold_pages() callers would
swap holds for wirings on the returned pages anyway, so with this change
we remove a number of page lock acquisitions.

No functional change is intended.  __FreeBSD_version is bumped.

Reviewed by:	alc, kib
Discussed with:	jeff
Discussed with:	jhb, np (cxgbe)
Tested by:	pho (previous version)
Sponsored by:	Netflix
Differential Revision:	https://reviews.freebsd.org/D19247
2019-07-08 19:46:20 +00:00
Alexander Motin
419110374a Avoid extra taskq_dispatch() calls by DMU.
DMU sync code calls taskq_dispatch() for each sublist of os_dirty_dnodes
and os_synced_dnodes.  Since the number of sublists by default is equal
to the number of CPUs, it will dispatch an equal, potentially large, number
of tasks, waking up many CPUs to handle them, even if only one or a few of
the sublists actually have any work to do.

This change adds a check for empty sublists to avoid this.
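
The check can be illustrated with a generic dispatch loop.  This is only a
sketch: dispatch_nonempty(), the dispatch_fn callback type and count_task()
are hypothetical stand-ins, not the multilist or taskq interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback standing in for queueing one task per sublist. */
typedef void (*dispatch_fn)(size_t sublist_index, void *arg);

/* Example callback: just counts how many tasks were queued. */
static void
count_task(size_t sublist_index, void *arg)
{
	(void)sublist_index;
	(*(size_t *)arg)++;
}

/*
 * Queue a task only for sublists that hold at least one element and
 * return the number of tasks queued.
 */
static size_t
dispatch_nonempty(const size_t *sublist_len, size_t nsublists,
    dispatch_fn dispatch, void *arg)
{
	size_t dispatched = 0;

	assert(dispatch != NULL);
	for (size_t i = 0; i < nsublists; i++) {
		if (sublist_len[i] == 0)
			continue;	/* nothing to sync: wake no CPU */
		dispatch(i, arg);
		dispatched++;
	}
	return (dispatched);
}
```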
2019-06-25 18:35:23 +00:00
Alexander Motin
3bae917061 Minimize aggsum_compare(&arc_size, arc_c) calls.
For a busy ARC, having arc_size close to arc_c is the desired state.  But
then it is quite likely that aggsum_compare(&arc_size, arc_c) will need
to flush per-CPU buckets to find the exact comparison result.  Doing that
often in a hot path defeats the whole idea of aggsum usage there, since it
replaces a few simple atomic additions with dozens of lock acquisitions.

Replacing aggsum_compare() with aggsum_upper_bound() in the code increasing
arc_p when ARC is growing (arc_size < arc_c) saves, according to PMC
profiles, ~5% of CPU time in aggsum code during sequential writes
to 12 ZVOLs with 16KB block size on a large dual-socket system.

I suppose there is some minor arc_p behavior change due to the lower
precision of the new code, but I don't think it is a big deal, since it
should affect only a very small window in time (aggsum buckets are flushed
every second) and in ARC size (buckets are limited to 10 average ARC blocks
per CPU).

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-06-14 20:04:28 +00:00
Alexander Motin
b3b3aa2e29 Like ZoL, disable metaslab allocation tracing code.
It is too costly to collect debug traces in production that can only
be read with a kernel debugger.  Illumos includes special code in their
mdb debugger to read it; we don't.

MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2019-06-14 19:57:32 +00:00
Alexander Motin
284e53a401 Properly align struct multilist_sublist to cache line.
Manual Illumos alignment does not fit us due to a different kmutex_t size.

MFC after:	1 week
Sponsored by:	iXsystems, Inc.
2019-06-14 17:09:39 +00:00
Alexander Motin
913095dc56 Move write aggregation memory copy out of vq_lock.
A memory copy is too heavy an operation to do under the congested lock.
Moving it out reduces congestion many times over, to almost invisible.
Since the original zio is removed from the queue, and the child zio is
not executed yet, I don't see why the copy would need protection.
My guess is that it just remained like this from the time when the lock was
not dropped here, which was added later to fix a lock ordering issue.

Multi-threaded sequential write tests with both HDD and SSD pools
with ZVOL block sizes of 4KB, 16KB, 64KB and 128KB all show major
reduction of lock congestion, saving from 15% to 35% of CPU time
and increasing throughput from 10% to 40%.

Reviewed by:	ahrens, behlendorf, ryao
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-06-13 01:21:32 +00:00
Alexander Motin
35251e9c28 Fix comparison signedness in arc_is_overflowing().
When ARC size is very small, aggsum_lower_bound(&arc_size) may return
negative values, which due to the unsigned comparison caused delays, waiting
for arc_adjust() to "fix" it by calling aggsum_value(&arc_size).  Use
of a signed comparison there fixes the problem.
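
The effect of the comparison's signedness can be shown in isolation.  The
function names below are hypothetical and the limit is an arbitrary
example value; this is not the code of arc_is_overflowing().

```c
#include <stdbool.h>
#include <stdint.h>

/* Unsigned comparison: a negative lower bound wraps to a huge value. */
static bool
overflowing_unsigned(int64_t lower_bound, uint64_t limit)
{
	return ((uint64_t)lower_bound >= limit);
}

/* Signed comparison: a negative lower bound stays below the limit. */
static bool
overflowing_signed(int64_t lower_bound, uint64_t limit)
{
	return (lower_bound >= (int64_t)limit);
}
```

With a lower bound of -1 and a limit of 1024, the unsigned variant reports
an overflow while the signed variant does not.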

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2019-06-07 20:59:24 +00:00
Alexander Motin
61586dd647 Explicitly start ARC adjustment on limits change.
While formally it is not necessary, the sooner it starts, the sooner it
finishes, and supposedly the less disturbing for the workload it will be.

MFC after:	2 weeks
2019-06-07 19:03:17 +00:00
Andriy Gapon
d8b12a2162 Restore ARC MFU/MRU pressure
Before r305323 (MFV r302991: 6950 ARC should cache compressed data)
arc_read() code did this for access to a ghost buffer:
 arc_adapt() (from arc_get_data_buf())
 arc_access(hdr, hash_lock)
I.e., we first checked access to the MFU ghost/MRU ghost buffer and
adapted MFU/MRU sizes (in arc_adapt()) and then moved the buffer from the
ghost state to regular.

After r305323 the sequence is different:
 arc_access(hdr, hash_lock);
 arc_hdr_alloc_pabd(hdr);
I.e., we first move the buffer from the ghost state in arc_access() and
then we check access to buffer in ghost state (in arc_hdr_alloc_pabd()
-> arc_get_data_abd() -> arc_get_data_impl() -> arc_adapt()).  This is
incorrect: arc_adapt() never sees access to the ghost buffer because
arc_access() already migrated the buffer from the ghost state to
regular.

So, the fix is to restore a call to arc_adapt() before arc_access() and
to suppress the call to arc_adapt() after arc_access().

Submitted by:	Slawa Olhovchenkov <slw@zxy.spb.ru>
MFC after:	2 weeks
Sponsored by:	Integros [integros.com]
Differential Revision: https://reviews.freebsd.org/D19094
2019-06-07 06:35:42 +00:00
Mark Johnston
c080655467 Fix a race between fasttrap and the user breakpoint handler.
When disabling the last enabled userspace probe, fasttrap clears the
function pointers which hook in to the breakpoint handler.  If a traced
thread hit a fasttrap breakpoint before it was removed, we must ensure
that it is able to call the hook; otherwise fasttrap will not consume
the trap and SIGTRAP will be delivered to the thread.  Synchronize
with such threads by ensuring that they load the hook pointer with
interrupts disabled, and by completing an SMP rendezvous after removing
breakpoints and before clearing the pointers.

Reported by:	Alexander Alexeev <Alexander.Alexeev@dell.com>
Tested by:	Alexander Alexeev (earlier version)
Reviewed by:	cem, kib
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D20526
2019-06-06 16:03:25 +00:00
Mariusz Zaborski
8da024d941 dtrace: 64-bits registers support
The registers in illumos and FreeBSD are numbered differently.
In illumos, the last 32-bit register defined is SS, and in FreeBSD it is GS.
This off-by-one caused the uregs array to return the wrong 64-bit register
on amd64.

Reviewed by:	markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D20363
2019-06-05 22:29:05 +00:00
Alexander Motin
0b5319dda0 MFV r348585: 9683 Allow bypassing devid in vdev_disk_open()
illumos/illumos-gate@6fe4f3002c

Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author:     Pavel Zakharov <pavel.zakharov@delphix.com>

This is irrelevant to FreeBSD, just to reduce divergence.
2019-06-03 20:55:52 +00:00
Alexander Motin
9b048dd219 MFV r348583: 9847 leaking dd_clones (DMU_OT_DSL_CLONES) objects
illumos/illumos-gate@17fb938fd6

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author:     Matthew Ahrens <mahrens@delphix.com>
2019-06-03 20:49:20 +00:00
Alexander Motin
07a5c938c9 MFV r348578: 9962 zil_commit should omit cache thrash
illumos/illumos-gate@cab3a55e15

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Approved by: Joshua M. Clulow <josh@sysmgr.org>
Author:     Prakash Surya <prakash.surya@delphix.com>
2019-06-03 20:24:40 +00:00
Alexander Motin
a66a7143d4 MFV r348576: 9963 Separate tunable for disabling ZIL vdev flush
illumos/illumos-gate@f8fdf68125

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author:     Prakash Surya <prakash.surya@delphix.com>
2019-06-03 20:05:43 +00:00
Alexander Motin
c9719c9a6d MFV r348573: 9993 zil writes can get delayed in zio pipeline
illumos/illumos-gate@2258ad0b75

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author:     George Wilson <george.wilson@delphix.com>
2019-06-03 19:25:53 +00:00
Alexander Motin
4d6afba5e0 MFV r348555: 9690 metaslab of vdev with no space maps was flushed during removal
illumos/illumos-gate@4e75ba6826

Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author:     Serapheim Dimitropoulos <serapheim@delphix.com>
2019-06-03 19:03:24 +00:00
Alexander Motin
1b61262505 MFC r348554: 9688 aggsum_fini leaks memory
illumos/illumos-gate@29bf2d68be

Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Jorgen Lundman <lundman@lundman.net>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author:     Paul Dagnelie <pcd@delphix.com>
2019-06-03 19:00:24 +00:00
Alexander Motin
677ef2563d MFV r348553: 9681 ztest failure in spa_history_log_internal due to spa_rename()
illumos/illumos-gate@6aee0ad769

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author:     Matthew Ahrens <mahrens@delphix.com>
2019-06-03 18:32:56 +00:00
Alexander Motin
74f7070445 MFV r348552: 9682 page fault in dsl_async_clone_destroy() while opening pool
illumos/illumos-gate@ade2c82828

Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author:     Serapheim Dimitropoulos <serapheim@delphix.com>
2019-06-03 17:56:44 +00:00