Commit Graph

2055 Commits

Author SHA1 Message Date
mav
e9a5feab20 Revert r340096: 9952 Block size change during zfs receive drops spill block
It was reported, and I easily reproduced it, that this change triggers panic
when receiving replication stream with enabled embedded blocks, when short
file compressing into one embedded block changes its block size.  I am not
sure that the problem is in this particuler patch, not just triggered by it,
but since investigation and fix will take some time, I've decided to revert
this for now.

PR:		198457, 233277
2018-11-21 18:18:57 +00:00
markj
746f5464d4 Use taskqueue_quiesce(9) to implement taskq_wait().
PR:		227784
Reviewed by:	cem
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D17975
2018-11-21 17:19:08 +00:00
jhibbits
fa1f71dac0 DTrace/powerpc: Fix FBT return probes
The FBT fuction boundary prober was setting one return probe marker value,
but the dtrace handler was expecting another.  This causes a hang when
tracing return probes.
2018-11-21 16:47:11 +00:00
kib
7d33ec3750 Add d_off support for multiple filesystems.
The d_off field has been added to the dirent structure recently.
Currently filesystems don't support this feature.  Support has been
added and tested for zfs, ufs, ext2fs, fdescfs, msdosfs and unionfs.
A stub implementation is available for cd9660, nandfs, udf and
pseudofs but hasn't been tested.

Motivation for this feature: our usecase is for a userspace nfs server
(nfs-ganesha) with zfs.  At the moment we cache direntry offsets by
calling lseek once per entry, with this patch we can get the offset
directly from getdirentries(2) calls which provides a significant
speedup.

Submitted by:	Jack Halford <jack@gandi.net>
Reviewed by:	mckusick, pfg, rmacklem (previous versions)
Sponsored by:	Gandi.net
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17917
2018-11-14 14:18:35 +00:00
mav
7ddbc411f2 Do not ignore arc_adjust() return value.
This covers scenario when ARC may not shrink as fast as it could:
1. arc_size < arc_c and arc_adjust() does not evict anything, returning
   zero to arc_reclaim_thread();
2. arc_available_memory() reports memory pressure, which can not be
   satisfied by arc_kmem_reap_now();
3. arc_shrink() reduces arc_c and calls arc_adjust(), return of which is
   ignored;
4. even if the last arc_adjust() could not satisfy arc_size < arc_c,
   arc_reclaim_thread() will still go to sleep, since the first one
   returned zero.

Reviewed by:	allanjude, markj, sef
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
Differential Revision:	https://reviews.freebsd.org/D17927
2018-11-10 01:58:37 +00:00
mav
514db89a54 9952 Block size change during zfs receive drops spill block
Replication code in receive_object() falsely assumes that if received
object block size is different from local, then it must be a new object
and calls dmu_object_reclaim() to wipe it out. In most cases it is not a
problem, since all dnode, bonus buffer and data block(s) are immediately
rewritten any way, but the problem is that spill block (if used) is not.
This means loss of ACLs, extended attributes, etc.

This issue can be triggered in very simple way:
1. create 4KB file with 10+ ACL entries;
2. take snapshot and send it to different dataset;
3. append another 4KB to the file;
4. take another snapshot and send incrementally;
5. witness ACL loss on receive side.

PR:		198457
Discussed with:	mahrens
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2018-11-03 03:10:06 +00:00
brooks
7698ed5201 Make vop_symlink take a const target path.
This will enable callers to take const paths as part of syscall
decleration improvements.

Where doing so is easy and non-distruptive carry the const through
implementations. In UFS the value is passed to an interface that must
take non-const values. In ZFS, const poisoning would touch code shared
with upstream and it's not worth adding diffs.

Bump __FreeBSD_version for external API consumers.

Reviewed by:	kib (prior version)
Obtained from:	CheriBSD
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D17805
2018-11-02 14:42:36 +00:00
kib
de9d57cf38 Implement O_BENEATH and AT_BENEATH.
Flags prevent open(2) and *at(2) vfs syscalls name lookup from
escaping the starting directory.  Supposedly the interface is similar
to the same proposed Linux flags.

Reviewed by:	jilles (code, previous version of manpages), 0mp (manpages)
Discussed with:	allanjude, emaste, jonathan
Sponsored by:	The FreeBSD Foundation
Differential revision:	https://reviews.freebsd.org/D17547
2018-10-25 22:16:34 +00:00
jhibbits
ffc1dda83c powerpc/dtrace: Use explicit bit numbers to mask out PSL_EE
There seems to be a race in CI, such that dtrace_asm.S might be assembled
before the genassym is completed.  This causes a build failure when PSL_EE
doesn't exist, and is read as 0.  Get around this by explicitly specifying
the bits in the mask instead.
2018-10-21 02:08:57 +00:00
mav
714a10a768 Skip VDEV_IO_DONE stage only for ZIO_TYPE_FREE.
Device removal code uses zio_vdev_child_io() with ZIO_TYPE_NULL parent,
that never happened before.  It confused FreeBSD-specific TRIM code,
which does not use VDEV_IO_DONE for logical ZIO_TYPE_FREE ZIOs.  As
result of that stage being skipped device removal ZIOs leaked references
and memory that supposed to be freed by VDEV_IO_DONE, making it stuck.

It is a quick patch rather then a nice fix, but hopefully we'll be able
to drop it all together when alternative TRIM implementation finally get
landed.

PR:		228750, 229007
Discussed with:	allanjude, avg, smh
Approved by:	re (delphij)
MFC after:	5 days
Sponsored by:	iXsystems, Inc.
2018-10-15 21:59:24 +00:00
jhb
c741b8babc Various fixes for TLB management on RISC-V.
- Remove the arm64-specific cpu_*cache* and cpu_tlb_flush* functions.
  Instead, add RISC-V specific inline functions in cpufunc.h for the
  fence.i and sfence.vma instructions.
- Catch up to changes in the arm64 pmap and remove all the cpu_dcache_*
  calls, pmap_is_current, pmap_l3_valid_cacheable, and PTE_NEXT bits from
  pmap.
- Remove references to the unimplemented riscv_setttb().
- Remove unused cpu_nullop.
- Add a link to the SBI doc to sbi.h.
- Add support for a 4th argument in SBI calls.  It's not documented but
  it seems implied for the asid argument to SBI_REMOVE_SFENCE_VMA_ASID.
- Pass the arguments from sbi_remote_sfence*() to the SEE.  BBL ignores
  them so this is just cosmetic.
- Flush icaches on other CPUs when they resume from kdb in case the
  debugger wrote any breakpoints while the CPUs were paused in the IPI_STOP
  handler.
- Add SMP vs UP versions of pmap_invalidate_* similar to amd64.  The
  UP versions just use simple fences.  The SMP versions use the
  sbi_remove_sfence*() functions to perform TLB shootdowns.  Since we
  don't have a valid pm_active field in the riscv pmap, just IPI all
  CPUs for all invalidations for now.
- Remove an extraneous TLB flush from the end of pmap_bootstrap().
- Don't do a TLB flush when writing new mappings in pmap_enter(), only if
  modifying an existing mapping.  Note that for COW faults a TLB flush is
  only performed after explicitly clearing the old mapping as is done in
  other pmaps.
- Sync the i-cache on all harts before updating the PTE for executable
  mappings in pmap_enter and pmap_enter_quick.  Previously the i-cache was
  only sync'd after updating the PTE in pmap_enter.
- Use sbi_remote_fence() instead of smp_rendezvous in pmap_sync_icache().

Reviewed by:	markj
Approved by:	re (gjb, kib)
Sponsored by:	DARPA
Differential Revision:	https://reviews.freebsd.org/D17414
2018-10-15 18:56:54 +00:00
mjg
01b5ac964b zfs: fix a panic after failed mount
r338927("zfs: depessimize zfs_root with rmlocks") failed to error check
the mount before caching root vnode.

Results in crashes in rrw_enter_read_impl tracing back to zfs_mount.

Reported by:	Mike Tancsa
Tested by:	allanjude
Approved by:	re (kib)
2018-10-14 16:14:01 +00:00
mav
94097c608e Avoid zero-sized kmem_alloc() in vdev_compact_children().
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children.  Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.

This is a part of ZoL port of device removal patch:

commit a1d477c24c7badc89c60955995fd84d311938486
Author: Matthew Ahrens <mahrens@delphix.com>
Ported-by: Tim Chase <tim@chase2k.com>

Approved by:	re (kib)
MFC after:	1 week
2018-10-12 16:55:28 +00:00
mav
aa54e1afb1 Add ZIO_TYPE_FREE support for indirect vdevs.
Upstream code expects only ZIO_TYPE_READ and some ZIO_TYPE_WRITE
requests to removed (indirect) vdevs, while on FreeBSD there is also
ZIO_TYPE_FREE (TRIM).  ZIO_TYPE_FREE requests do not have the data
buffers, so don't need the pointer adjustment.

PR:		228750, 229007
Reviewed by:	allanjude, sef
Approved by:	re (kib)
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D17523
2018-10-12 15:14:22 +00:00
allanjude
5d8d02785b Pull in a follow-on commit to resolve a deadlock in ZFS sequential
resilver (r334844)

MFV/ZoL: Fix deadlock in IO pipeline

commit a76f3d0437e5e974f0f748f8735af3539443b388
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Fri Mar 16 16:46:06 2018 -0700

    Fix deadlock in IO pipeline

    In vdev_queue_aggregate() the zio_execute() bypass should not be
    called under the vdev queue lock.  This can result in a deadlock
    as shown in the stack traces below.

    Drop the vdev queue lock then walk the parents of the aggregate IO
    to determine the list of component IOs to be bypassed.  This can
    be done safely without holding the io_lock since the new aggregate
    IO has not yet been returned and its parents cannot change.

    ---  THREAD 1 ---
    arc_read()
      zio_nowait()
        zio_vdev_io_start()
          vdev_queue_io() <--- mutex_enter(vq->vq_lock)
            vdev_queue_io_to_issue()
              vdev_queue_aggregate()
                zio_execute()
            vdev_queue_io_to_issue()
              vdev_queue_aggregate()
                zio_execute()
                  zio_vdev_io_assess()
                    zio_wait_for_children() <- mutex_enter(zio->io_lock)

    --- THREAD 2 --- (inverse order)
    arc_read()
      zio_change_priority() <- mutex_enter(zio->zio_lock)
        vdev_queue_change_io_priority() <- mutex_enter(vq->vq_lock)

    Reviewed-by: Tom Caputi <tcaputi@datto.com>
    Reviewed-by: Don Brady <don.brady@delphix.com>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Reported by:	ZFS Leadership Meeting
Reviewed by:	mav
Approved by:	re (kib)
Obtained from:	ZFS-on-Linux
MFC after:	2 weeks
Sponsored by:	Klara Systems
Differential Revision:	https://reviews.freebsd.org/D17495
2018-10-10 22:59:15 +00:00
allanjude
56bd3acf23 Add missing sysctls for tuning vdev queue depths for new I/O types
This connects new tunables that were added but not exposed in:
r329502 (zpool remove)
r337007 (zpool initialize)

Reviewed by:	avg
Approved by:	re (kib)
MFC after:	2 weeks
Sponsored by:	Klara Systems
Differential Revision:	https://reviews.freebsd.org/D17494
2018-10-10 22:55:31 +00:00
allanjude
8c87ae0952 Resolve a hang in ZFS during vnode reclaimation
This is caused by a deadlock between zil_commit() and zfs_zget()

Add a way for zfs_zget() to break out of the retry loop in the common case

PR:		229614
Reported by:	grembo, Andreas Sommer, many others
Tested by:	Andreas Sommer, Vicki Pfau
Reviewed by:	avg (no objection)
Approved by:	re (gjb)
MFC after:	2 months
Sponsored by:	Klara Systems
Differential Revision:	https://reviews.freebsd.org/D17460
2018-10-10 19:39:47 +00:00
mav
07cf75ae92 Remove extra thread_exit() call left after r329802.
spa_condense_indirect_thread() is no longer a thread function, but just
a callback for new zthr KPI.

Submitted by:	allanjude
Approved by:	re (gjb)
MFC after:	3 days
2018-10-10 16:34:53 +00:00
mav
1bf3ea6b7d Fix r336951 mismerge -- use of uninitialized variable.
Reported by:	tsoome
Approved by:	re (gjb)
MFC after:	3 days
2018-10-08 15:19:03 +00:00
mav
dd429b1c32 Add sysctls for dbuf metadata cache variables added in r336959.
Approved by:	re (gjb)
MFC after:	1 week
2018-10-05 16:05:59 +00:00
allanjude
6b71fbd3ab Avoid panic when adjusting priority of a read in the face of an IO error
PR:		231516
Reported by:	sbruno
Approved by:	re (rgrimes)
Obtained from:	ZFS-on-Linux
X-MFC-with:	334844
Sponsored by:	Klara Systems

MFV/ZoL:	Fix zio->io_priority failed (7 < 6) assert

commit c26cf0966d131b722c32f8ccecfe5791a789d975
Author: Tony Hutter <hutter2@llnl.gov>
Date:   Tue May 29 18:13:48 2018 -0700

  Fix zio->io_priority failed (7 < 6) assert

  This fixes an assert in vdev_queue_change_io_priority():

    VERIFY3(zio->io_priority < ZIO_PRIORITY_NUM_QUEUEABLE) failed (7 < 6)
    PANIC at vdev_queue.c:832:vdev_queue_change_io_priority()

  Reviewed-by: Tom Caputi <tcaputi@datto.com>
  Reviewed-by: George Melikov <mail@gmelikov.ru>
  Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
  Signed-off-by: Tony Hutter <hutter2@llnl.gov>
2018-09-29 01:26:07 +00:00
mjg
32bf729b0a zfs: depessimize zfs_root with rmlocks
Currently vfs calls the root method on each absolute lookup and when
crossing mount points.

zfs_root ends up looking up the inode internally as if it was not
instantianted which results in significant lock contention on systems
like EPYC.

Store the vnode in the mount point and protect the access with rmlocks.
This is a temporary hack for 12.0.

Sample result:

before:
make -s -j 128 buildkernel 2778.09s user 3319.45s system 8370% cpu 1:12.85 total

after:
make -s -j 128 buildkernel 3199.57s user 1772.78s system 8232% cpu 1:00.40 total

Tested by:	pho (zfs mount/unmount tests)
Reviewed by:	kib, mav, sef (different parts)
Approved by:	re (gjb)
Differential Revision:	https://reviews.freebsd.org/D17233
2018-09-25 17:58:06 +00:00
mav
f9be2a2d63 MFV r338866: 9700 ZFS resilvered mirror does not balance reads
illumos/illumos-gate@82f63c3c2b

Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author:     Jerry Jelinek <jerry.jelinek@joyent.com>

Approved by:	re (delphij)
2018-09-21 21:56:00 +00:00
kib
d59d610b2e Fix ZFS VFS op quotactl to follow busy protocol.
Reviewed by:	avg, mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
Approved by:	re (gjb)
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D17208
2018-09-19 14:38:01 +00:00
markj
290cf1ad13 Fix an nvpair leak in vdev_geom_read_config().
Also change the behaviour slightly: instead of freeing "config" if the
last nvlist doesn't pass the tests, return the last config that did pass
those tests.  This matches the comment at the beginning of the function.

PR:		230704
Diagnosed by:	avg
Reviewed by:	asomers, avg
Tested by:	Mark Martinec <Mark.Martinec@ijs.si>
Approved by:	re (gjb)
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
Differential revision:  https://reviews.freebsd.org/D17202
2018-09-17 16:16:57 +00:00
vangyzen
4d88bbba13 Set zfs_arc_meta_strategy to metadata only
The previous default of "balanced" appears to have caused pathological
behavior, including very poor performance and 100% CPU load in the
arc_reclaim_thread.

The symptoms appeared when the daily periodic run started.
With this change, the system--and the ARC in particular--behaved
normally during a manual daily periodic run.

From Mark Johnston:  The port of the balanced strategy is incomplete,
since arc_prune_async() is a no-op on FreeBSD.  (This also seems
to imply that r337653 is a no-op.)  After 12 is branched we can
port the remaining bits and consider changing the default back.

Submitted by:	markj (essentially)
Reviewed by:	markj
Approved by:	re (gjb)
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D17156
2018-09-13 17:56:48 +00:00
br
639c5f49ef Add support for 'C'-compressed ISA extension to DTrace FBT provider.
Approved by:	re (kib)
Sponsored by:	DARPA, AFRL
2018-09-03 14:34:09 +00:00
markj
0f81d204ed Fix the hash table lookup in fbt_destroy().
Reported and tested by:	pho
Approved by:	re (kib)
X-MFC with:	r338359
2018-09-02 17:02:13 +00:00
markj
6423d74a2e Re-compute the ARC size before computing the MFU target.
This fixes an upstream regression introduced in r331404, causing overly
aggressive reclamation of the ARC when under pressure.

Diagnosed by:	Paul <devgs@ukr.net>
Approved by:	re (gjb)
MFC after:	3 days
2018-08-31 21:45:05 +00:00
markj
e0e589a554 Re-add kstat.zfs.misc.arcstats.other_size under COMPAT_FREEBSD11.
It is used by a number of applications, notably top(1).

Reported by:	netchild
Reviewed by:	allanjude
Approved by:	re (delphij)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16943
2018-08-30 13:42:01 +00:00
kib
acc0761238 Remove {max/min}_offset() macros, use vm_map_{max/min}() inlines.
Exposing max_offset and min_offset defines in public headers is
causing clashes with variable names, for example when building QEMU.

Based on the submission by:	royger
Reviewed by:	alc, markj (previous version)
Sponsored by:	The FreeBSD Foundation (kib)
MFC after:	1 week
Approved by:	re (marius)
Differential revision:	https://reviews.freebsd.org/D16881
2018-08-29 12:24:19 +00:00
markj
d4ef0a5d5f Add a sysctl for the ZFS abd_scatter_enabled setting.
Submitted by:	Yamagi Burmeister <lists@yamagi.org> (original version)
Approved by:	re (rgrimes)
MFC after:	3 days
2018-08-29 02:49:18 +00:00
markj
651480fae0 Allow multiple FBT probes to share a tracepoint.
With GNU ifuncs, multiple FBT probes may correspond to the same
instruction.  fbt_invop() assumed that this could not happen and
would return after the first probe found in the global FBT hash
table, which might not be the one that's enabled.  Fix the problem
on x86 by linking probes that share a tracepoint and having each
linked probe fire when the tracepoint is hit.

PR:		230846
Approved by:	re (gjb)
Sponsored by:	The FreeBSD Foundation
Differential Revision:	https://reviews.freebsd.org/D16921
2018-08-28 20:21:36 +00:00
mav
6dafcae1dd Unblock speculative prefetcher also on pool creation.
Fix at r331950 appeared to be incomplete, fixing only case of pool
import, but not pool creation, leaving prefetcher still blocked for
newly created pools.

Approved by:	re (gjb)
MFC after:	1 week
2018-08-24 01:59:25 +00:00
mav
8e549d31c8 Add dmu_tx_assign() error handling in zfs_unlinked_drain().
The error handling got lost during r334810, while according to the report
error there may happen in case of dataset being over quota.  In such case
just leave the node in the unlinked list to be freed sometimes later.

PR:		229887
Sponsored by:	iXsystems, Inc.
2018-08-22 16:32:53 +00:00
mav
c1c7ce6e54 Create separate taskqueue to call zfs_unlinked_drain().
r334810 introduced zfs_unlinked_drain() dispatch to taskqueue on every
deletion of a file with extended attributes.  Using system_taskq for that
with its multiple threads in case of multiple files deletion caused all
available CPU threads to uselessly spin on busy locks, completely blocking
the system.

Use of single dedicated taskqueue is the only easy solution I've found,
while in would be great if we could specify that some task should be
executed only once at a time, but never in parallel, while many tasks
could use different threads same time.

Sponsored by:	iXsystems, Inc.
2018-08-22 16:27:24 +00:00
markj
34a02309fa Set arc_kmem_cache_reap_retry_ms to 0 and make it configurable.
r329759 introduced this parameter, which controls the rate at which ZFS
UMA zones are drained when the ARC reclaim thread is shrinking the ARC.
The reclamation target is derived from the global free page count, and
arc_shrink() only frees buffers back to UMA, so the free page count is
not updated until the zones are drained.  Thus, back-to-back calls to
arc_shrink() within the arc_kmem_cache_reap_retry_ms interval do not
provide immediate feedback to the arc_reclaim control loop, so we may
free more of the ARC than needed to address a transient page shortage.

As we do not implement the asynchronous zone draining added in r329759,
disable the retry interval, restoring pre-r329759 behaviour.  That is,
we will drain the ZFS UMA zones before each attempt to shrink the ARC.

Reviewed by:	mav
MFC after:	1 week
Sponsored by:	The FreeBSD Foundation
2018-08-21 16:37:37 +00:00
mmacy
d654586e4d Make dnode definition uniform on !x86
gcc4 requires -fms-extensions to accept anonymous union members
2018-08-21 03:45:09 +00:00
mav
8253837ac3 9751 Allocation throttling misplacing ditto blocks
Relax allocation throttling for ditto blocks.  Due to random imbalances
in allocation it tends to push block copies to one vdev, that looks
slightly better at the moment.  Slightly less strict policy allows both
improve data security and surprisingly write performance, since we don't
need to touch extra metaslabs on each vdev to respect the min distance.

Sponsored by:	iXsystems, Inc.
2018-08-17 15:17:09 +00:00
mav
63e7138971 9738 Fix third block copy allocations, broken at 9112.
Use METASLAB_WEIGHT_CLAIM weight to allocate tertiary blocks.
Previous use of METASLAB_WEIGHT_SECONDARY for that caused errors
later on metaslab_activate_allocator() call, leading to massive
load of unneeded metaslabs and write freezes.

Reviewed by:	Paul Dagnelie <pcd@delphix.com>
2018-08-17 15:00:41 +00:00
mav
ba79aac952 Make vfs.zfs.zio.dva_throttle_enabled sysctl writable.
Not sure what I thought originally, but as I see now runtime changes are
working fine, and the code seems like even designed for this.
2018-08-16 18:44:50 +00:00
jamie
94a36bb7c1 Put jail(2) under COMPAT_FREEBSD11. It has been the "old" way of creating
jails since FreeBSD 7.

Along with the system call, put the various security.jail.allow_foo and
security.jail.foo_allowed sysctls partly under COMPAT_FREEBSD11 (or
BURN_BRIDGES).  These sysctls had two disparate uses: on the system side,
they were global permissions for jails created via jail(2) which lacked
fine-grained permission controls; inside a jail, they're read-only
descriptions of what the current jail is allowed to do.  The first use
is obsolete along with jail(2), but keep them for the second-read-only use.

Differential Revision:	D14791
2018-08-16 18:40:16 +00:00
mav
0bd4cf2fc5 Add couple tunables/sysctl, missed in r336949. 2018-08-16 00:50:14 +00:00
mav
e200af5dd2 Fix mismerge in r337196.
ZoL did the same mistake, and fixed it with separate commit 863522b1f9:

dsl_scan_scrub_cb: don't double-account non-embedded blocks

We were doing count_block() twice inside this function, once
unconditionally at the beginning (intended to catch the embedded block
case) and once near the end after processing the block.

The double-accounting caused the "zpool scrub" progress statistics in
"zpool status" to climb from 0% to 200% instead of 0% to 100%, and
showed double the I/O rate it was actually seeing.

This was apparently a regression introduced in commit 00c405b4b5e8,
which was an incorrect port of this OpenZFS commit:

    https://github.com/openzfs/openzfs/commit/d8a447a7

Reviewed by: Thomas Caputi <tcaputi@datto.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Steven Noonan <steven@uplinklabs.net>
Closes #7720
Closes #7738

Reported by:	sef
2018-08-15 21:01:57 +00:00
mmacy
6238400c36 MFV/ZoL: Add dbuf hash and dbuf cache kstats
TODO: KSTAT_TYPE_NAMED support

commit 5e021f56d3437d3523904652fe3cc23ea1f4cb70
Author: Giuseppe Di Natale <dinatale2@users.noreply.github.com>
Date:   Mon Jan 29 10:24:52 2018 -0800

    Add dbuf hash and dbuf cache kstats

    Introduce kstats about the dbuf hash and dbuf cache
    to make it easier to inspect state. This should help
    with debugging and understanding of these portions
    of the codebase.

    Correct format of dbuf kstat file.

    Introduce a dbc column to dbufs kstat to indicate if
    a dbuf is in the dbuf cache.

    Introduce field filtering in the dbufstat python script.

    Introduce a no header option to the dbufstat python script.

    Introduce a test case to test basic mru->mfu list movement
    in the ARC.

    Reviewed-by: Tony Hutter <hutter2@llnl.gov>
    Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Signed-off-by: Giuseppe Di Natale <dinatale2@llnl.gov>
    Closes #6906
2018-08-12 03:15:30 +00:00
mmacy
62770aebd3 MFV/ZoL: Fix stack dbuf_hold_impl()
commit fc5bb51f08a6c91ff9ad3559d0266eeeab0b1f61
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Thu Aug 26 10:52:00 2010 -0700

    Fix stack dbuf_hold_impl()

    This commit preserves the recursive function dbuf_hold_impl() but moves
    the local variables and function arguments to the heap to minimize
    the stack frame size.  Enough space is initially allocated on the
    stack for 20 levels of recursion.  This technique was based on commit
    34229a2f2ac07363f64ddd63e014964fff2f0671 which reduced stack usage of
    traverse_visitbp().

    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2018-08-12 02:24:18 +00:00
mmacy
1e157ffa37 fix build DN_MAX_BONUSLEN -> DN_OLD_MAX_BONUSLEN 2018-08-12 02:12:44 +00:00
mmacy
2fda9a5191 Restore legacy dnode_phys layout on tier 2 arches
Evidently gcc4 doesn't support anonymous union members
2018-08-12 02:09:06 +00:00
mmacy
41d5d46062 MFV/ZoL: Fix stack noinline
commit 60948de1ef976aabaa3630707bcc8b5867508507
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Date:   Thu Aug 26 10:58:36 2010 -0700

    Fix stack noinline

    Certain function must never be automatically inlined by gcc because
    they are stack heavy or called recursively.  This patch flags all
    such functions I've found as 'noinline' to prevent gcc from making
    the optimization.

    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2018-08-12 01:29:30 +00:00
mmacy
0264e64dd1 MFV/ZoL: Fix PANIC: metaslab_free_dva(): bad DVA X:Y:Z
commit 81edd3e83409218879e7af293daa86b0c40eb015
Author: Peng <peng.hse@xtaotech.com>
Date:   Wed Jun 8 15:22:07 2016 +0800

    Fix PANIC: metaslab_free_dva(): bad DVA X:Y:Z

    The following scenario can result in garbage in the dn_spill field.
    The db->db_blkptr must be set to NULL when DNODE_FLAG_SPILL_BLKPTR
    is clear to ensure the dn_spill field is cleared.

    Current txg = A.
    * A new spill buffer is created. Its dbuf is initialized with
      db_blkptr = NULL and it's dirtied.

    Current txg = B.
    * The spill buffer is modified. It's marked as dirty in this txg.
    * Additional changes make the spill buffer unnecessary because the
      xattr fits into the bonus buffer, so it's removed. The dbuf is
      undirtied in this txg, but it's still referenced and cannot be
      destroyed.

    Current txg = C.
    * Starts syncing of txg A
    * dbuf_sync_leaf() is called for the spill buffer. Since db_blkptr
      is NULL, dbuf_check_blkptr() is called.
    * The dbuf starts being written and it reaches the ready state
      (not done yet).
    * A new change makes the spill buffer necessary again.
      sa_build_layouts() ends up calling dbuf_find() to locate the
      dbuf.  It finds the old dbuf because it has not been destroyed yet
      (it will be destroyed when the previous write is done and there
      are no more references). The old dbuf has db_blkptr != NULL.
    * txg A write is complete and the dbuf released. However it's still
      referenced, so it's not destroyed.

    Current txg = D.
    * Starts syncing of txg B
    * dbuf_sync_leaf() is called for the bonus buffer. Its contents are
      directly copied into the dnode, overwriting the blkptr area because,
      in txg B, the bonus buffer was big enough to hold the entire xattr.
    * At this point, the db_blkptr of the spill buffer used in txg C
      gets corrupted.

    Signed-off-by: Peng <peng.hse@xtaotech.com>
    Signed-off-by: Tim Chase <tim@chase2k.com>
    Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
    Closes #3937
2018-08-12 01:17:32 +00:00