Commit Graph

1646 Commits

Author SHA1 Message Date
Josh Paetzel
f2be81e92c MFV 312436
6569 large file delete can starve out write ops

  illumos/illumos-gate@ff5177ee8b
  ff5177ee8b

  https://www.illumos.org/issues/6569
    The core issue I've found is that there is no throttle for how many
    deletes get assigned to one TXG. As a results when deleting large files
    we end up filling consecutive TXGs with deletes/frees, then write
    throttling other (more important) ops.

    There is an easy test case for this problem. Try deleting several
    large files (at least 1/2 TB) while you do write ops on the same
    pool. What we've seen is performance of these write ops (let's
    call it sideload I/O) would drop to zero.

    More specifically the problem is that dmu_free_long_range_impl()
    can/will fill up all of the dirty data in the pool "instantly",
    before many of the sideload ops can get in. So sideload
    performance will be impacted until all the files are freed.

    The solution we have tested at Nexenta (with positive results)
    creates a relatively simple throttle for how many "free" ops we let
    into one TXG.

    However this solution exposes other problems that should also be
    addressed. If we are to slow down freeing of data that means one
    has to wait even longer (assuming vnode ref count of 1) to get shell
    back after an rm or for NFS thread to finish the free-ing op.
    To avoid this the proposed solution is to call zfs_inactive() async
    for "large" files. Async freeing then begs for the reclaimed space
    to be accounted for in the zpool's "freeing" prop.

    The other issue with having a longer delete is the inability to
    export/unmount for a longer period of time. The proposed solution
    is to interrupt freeing of blocks when a fs is unmounted.

  Author: Alek Pinchuk <alek@nexenta.com>
  Reviewed by: Matt Ahrens <mahrens@delphix.com>
  Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
  Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
  Approved by: Dan McDonald <danmcd@omniti.com>

Reviewed by:	avg
Differential Revision:	D9008
2017-01-20 15:01:04 +00:00
Andrew Turner
ae69172343 Use the kernel stack in the ARM FBT DTrace provider. This is used to find
the fifth argument to functions being traced, however there was an error
where the userspace stack was being used. This may be invalid leading to
a kernel panic if this address is unmapped.

Submitted by:	Graeme Jenkinson <graeme.jenkinson@cl.cam.ac.uk>
MFC after:	1 week
Differential Revision:	https://reviews.freebsd.org/D9229
2017-01-18 13:27:24 +00:00
Mark Johnston
d01e6ad41b Have DTrace handle faults when dereferencing a lock object pointer.
MFC after:	1 week
2017-01-11 01:18:06 +00:00
Mark Johnston
4153c9b932 Ignore LC_SLEEPABLE when testing whether a mutex is adaptive.
MFC after:	1 week
2017-01-11 01:15:55 +00:00
Mateusz Guzik
619ce4d72e Revert r309619 "ifndef atomic_cas_* in cddl code"
It was a temporary change to ease an import of native atomic_cas primitives.
Instead, atomic_fcmpset was devised with different semantics. See r311168.
2017-01-03 21:02:30 +00:00
Mark Johnston
91371de1fa Remove the "unused" DIF subroutine index left after r308582.
These indices are input to a build-time script that generates code to
validate subroutine names.
2017-01-03 00:24:12 +00:00
Mark Johnston
c71c814a97 Remove an obsolete pragma from dtrace.h.
It triggers a compiler warning and has been removed upstream.

MFC after:	1 week
2016-12-27 23:31:32 +00:00
George V. Neville-Neil
805e1842c8 Remove extra DOF_SEC_XLIMPORT from the DOF_SEC_ISLOADABLE macro
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
2016-12-16 20:44:14 +00:00
Alexander Motin
c5f74c4873 Revert r310023 for now.
After another look my new variable mapping was not exactly right.
2016-12-15 08:03:16 +00:00
Alexander Motin
d686b07132 Reduce diff from Illumos by better variables mapping. 2016-12-13 16:20:10 +00:00
Alexander Motin
2823b6467a Postpone ZVOL media/block size caching till first open.
At least on FreeBSD there are no legal way to access media or get its
size without opening device/provider first.  Postponing this caching
allows to skip several disk seeks per ZVOL/snapshot during import.

For HDD pool with 1 ZVOL in dev mode with 1000 snapshots this reduces
pool import time from 40 to 10 seconds.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2016-12-11 19:50:39 +00:00
Alexander Motin
2fb5d72d58 Add missed vfs.zfs.zfetch.max_idistance sysctl. 2016-12-10 21:19:27 +00:00
Mark Johnston
f99a517272 Don't create FBT probes for lock owner methods.
These functions may be called in DTrace probe context, so they cannot be
safely traced. Moreover, they are currently only used by DTrace, so their
corresponding FBT probes are not particularly useful.

MFC after:	2 weeks
2016-12-10 03:13:11 +00:00
Mark Johnston
8bb9b7f17a Consistently use fbt_excluded() on all architectures.
MFC after:	2 weeks
2016-12-10 03:11:05 +00:00
Alexander Motin
9373759d13 Fix spa_alloc_tree sorting by offset in r305331.
Original commit "7090 zfs should improve allocation order" declares alloc
queue sorted by time and offset.  But in practice io_offset is always zero,
so sorting happened only by time, while order of writes with equal time was
completely random.  On Illumos this did not affected much thanks to using
high resolution timestamps.  On FreeBSD due to using much faster but low
resolution timestamps it caused bad data placement on disks, affecting
further read performance.

This change switches zio_timestamp_compare() from comparing uninitialized
io_offset to really populated io_bookmark values.  I haven't decided yet
what to do with timestampts, but on simple tests this change gives the
same peformance results by just making code to work as declared.

MFC after:	1 week
2016-12-08 15:58:03 +00:00
George V. Neville-Neil
af463464cf Fix a kernel panic in DTrace's rw_iswriter subroutine.
On FreeBSD the sense of rw_write_held() and rw_iswriter() were reversed,
probably due to a cut and paste error. Using rw_iswriter() would cause
the kernel to panic.

Reviewed by:	markj
MFC after:	2 weeks
Sponsored by:	DARPA, AFRL
Differential Revision:	https://reviews.freebsd.org/D8718
2016-12-07 07:27:47 +00:00
Mateusz Guzik
ef32958e5d ifndef atomic_cas_* in cddl code in preparation for native implementations
This is a temporary change to not require all architectures to import at once.

Discussed with:	jhb
2016-12-06 14:08:49 +00:00
Andriy Gapon
0451d4e97b MFV r309249: 3821 Race in rollback, zil close, and zil flush
Note: there was a merge conflict resolved by me.

illumos/illumos-gate@43297f973a
43297f973a

https://www.illumos.org/issues/3821
  We recently had nodes with some of the latest zfs bits panic on us in a
  rollback-heavy environment. The following is from my preliminary analysis:
  Let's look at where we died:
  > $C
  ffffff01ea6b9a10 taskq_dispatch+0x3a(0, fffffffff7d20450, ffffff5551dea920, 1)
  ffffff01ea6b9a60 zil_clean+0xce(ffffff4b7106c080, 7e0f1)
  ffffff01ea6b9aa0 dsl_pool_sync_done+0x47(ffffff4313065680, 7e0f1)
  ffffff01ea6b9b70 spa_sync+0x55f(ffffff4310c1d040, 7e0f1)
  ffffff01ea6b9c20 txg_sync_thread+0x20f(ffffff4313065680)
  ffffff01ea6b9c30 thread_start+8()
  If we dig in we can find that this dataset corresponds to a zone:
  > ffffff4b7106c080::print zilog_t zl_os->os_dsl_dataset->ds_dir->dd_myname
  zl_os->os_dsl_dataset->ds_dir->dd_myname = [ "8ffce16a-13c2-4efa-a233-
  9e378e89877b" ]
  Okay so we have a null taskq pointer. That only happens during the calls to
  zil_open and zil_close. If we poke around we can see that we're actually in
  midst of a rollback:
  > ::pgrep zfs | ::printf "0x%x %s\\n" proc_t . p_user.u_psargs
  0xffffff43262800a0 zfs rollback zones/15714eb6-f5ea-469f-ac6d-
  4b8ab06213c2@marlin_init
  0xffffff54e22a1028 zfs rollback zones/8ffce16a-13c2-4efa-a233-
  9e378e89877b@marlin_init
  0xffffff4362f3a058 zfs rollback zones/0ddb8e49-ca7e-42e1-8fdc-
  4ac4ba8fe9f8@marlin_init
  0xffffff5748e8d020 zfs rollback zones/426357b5-832d-4430-953e-
  10cd45ff8e9f@marlin_init
  0xffffff436b867008 zfs rollback zones/8f36bf37-8a9c-4a44-995c-
  6d1b2751e6f5@marlin_init
  0xffffff4381ad4090 zfs rollback zones/6c8eca18-fbd6-46dd-ac24-
  2ed45cd0da70@marlin_init

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: George Wilson <george.wilson@delphix.com>

MFC after:	3 weeks
2016-11-28 15:14:31 +00:00
Andriy Gapon
69bac03666 MFV r308990: 7181 race between zfs_mount and zfs_ioc_rollback
illumos/illumos-gate@90f2c094b3
90f2c094b3

https://www.illumos.org/issues/7181
  zfsvfs_setup() is called in both zfs_mount and zfs_resume_fs paths.
  dmu_objset_set_user(zfsvfs->z_os, zfsvfs) is called early in zfsvfs_setup()
  before the setup is actually completed,
  thus an under-constructed zfsvfs becomes visible.
  Additionally, there is nothing to serialize the two call paths. As a result two
  threads can step on each other's toes.
  assertion failed: zilog->zl_clean_taskq == NULL, file:
  ../../common/fs/zfs/zil.c, line: 1772

  > $c
  vpanic()
  0xfffffffffbdf6928()
  zil_open+0x45(ffffff1bbc5dd000, fffffffff7993880)
  zfsvfs_setup+0x84(ffffffb378d77000, 0)
  zfs_resume_fs+0x132(ffffffb378d77000, ffffffb37ddcf000)
  zfs_ioc_rollback+0x96(ffffffb37ddcf000, ffffff01dcdc4cd0, ffffff01aa091000)
  zfsdev_ioctl+0x215(10a00000000, 5a19, 80465f8, 100003, ffffff01ab318368,
  ffffff0004b59e58)
  cdev_ioctl+0x39(10a00000000, 5a19, 80465f8, 100003, ffffff01ab318368,
  ffffff0004b59e58)
  spec_ioctl+0x60(ffffff0197737700, 5a19, 80465f8, 100003,
  ffffff01ab318368, ffffff0004b59e58)
  fop_ioctl+0x55(ffffff0197737700, 5a19, 80465f8, 100003,
  ffffff01ab318368, ffffff0004b59e58)
  ioctl+0x9b(7, 5a19, 80465f8)
  sys_syscall32+0x1f7()

  > ffffff1bbc5dd000::print objset_t os_zil
  os_zil = 0xffffff1c053cf7c0
  > 0xffffff1c053cf7c0::print zilog_t zl_clean_taskq

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Author: Andriy Gapon <andriy.gapon@clusterhq.com>

MFC after:	2 weeks
2016-11-24 10:34:42 +00:00
Andriy Gapon
b55ae64b50 MFV r308988: 7199, 7200 dsl_dataset_rollback_sync may try to free
already free blocks

7199 dsl_dataset_rollback_sync may try to free already free blocks
7200 no blocks must be born in a txg after a snaphot is created

illumos/illumos-gate@bfaed0b91e
bfaed0b91e

https://www.illumos.org/issues/7199
  dsl_dataset_rollback_sync may try to free already freed blocks when it calls
  dsl_destroy_head_sync_impl to destroy a temporary clone.
  That happens if a snapshot to which we are rolling back and from which the
  clone is created has some ZIL records.

https://www.illumos.org/issues/7200
  No new blocks must be born in a dataset in the same TXG after a snapshot of the
  dataset is taken.
  Those blocks would have the same blk_birth as the dataset's ds_prev_snap_txg
  and as such they would be presumed to belong o the snapshot while in fact they
  do not.
  All the datasets must be clean before sync tasks are run, so the described
  scenario may happen only if one of the sync tasks dirties the dataset and
  another sync task takes its snapshot.
  Then, there will be another sync pass because of the dirty data and the new
  blocks will be born in the same TXG when the data is written out.
  It seems that almost all of the existing sync tasks modify only MOS and do not
  dirty any objsets.
  The only exception that I've been able to identify so far is the rollback which
  can modify an objset when it zeroes out the objset's ZIL.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Author: Andriy Gapon <andriy.gapon@clusterhq.com>

MFC after:	3 weeks
2016-11-24 10:29:21 +00:00
Andriy Gapon
239c22b73d MFV r308987: 7180 potential race between zfs_suspend_fs+zfs_resume_fs
and zfs_ioc_rename

illumos/illumos-gate@690041b9ca
690041b9ca

https://www.illumos.org/issues/7180
  If a filesystem is not unmounted while the rename is being performed, then, for
  example, a concurrect zfs rollback may call zfs_suspend_fs followed by
  zfs_resume_fs on the same filesystem.
  The latter takes the filesystem's name as an argument. If the filesystem name
  changes as a result of the rename, then dmu_objset_hold(osname, zfsvfs, &os)
  call in zfs_resume_fs would fail resulting in a kernel panic.
  So far I have been able to reproduce this problem on FreeBSD where zfs rename
  has -u option that skips the unmounting before doing the renaming.
  But I think that in theory the same problem can occur on illumos as well,
  because the unmounting is done in userland before invoking the rename ioctl and
  there could be a race with, e.g., zfs mount.
  panic: solaris assert: dmu_objset_hold(osname, zfsvfs, &zfsvfs->z_os) == 0 (0x2
  == 0x0), file: /usr/devel/svn/head/sys/cddl/contrib/opensolaris/uts/common/fs/
  zfs/zfs_vfsops.c, line: 2210
  KDB: stack backtrace:
  db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe004df30710
  vpanic() at vpanic+0x182/frame 0xfffffe004df30790
  panic() at panic+0x43/frame 0xfffffe004df307f0
  assfail3() at assfail3+0x2c/frame 0xfffffe004df30810
  zfs_resume_fs() at zfs_resume_fs+0xb9/frame 0xfffffe004df30860
  zfs_ioc_rollback() at zfs_ioc_rollback+0x61/frame 0xfffffe004df308a0
  zfsdev_ioctl() at zfsdev_ioctl+0x65c/frame 0xfffffe004df30940
  devfs_ioctl_f() at devfs_ioctl_f+0x156/frame 0xfffffe004df309a0
  kern_ioctl() at kern_ioctl+0x246/frame 0xfffffe004df30a00
  sys_ioctl() at sys_ioctl+0x171/frame 0xfffffe004df30ae0
  amd64_syscall() at amd64_syscall+0x2db/frame 0xfffffe004df30bf0
  Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe004df30bf0

Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>

MFC after:	2 weeks
2016-11-24 10:21:22 +00:00
Andriy Gapon
d15b9428bb further fix zfs_lock() diagnostics
It was very wrong to look at the vnode and znode internals without
having locked the vnode first.

Reported by:	pho
Tested by:	pho
MFC after:	1 week
X-MFC with:	r308887
2016-11-24 09:00:51 +00:00
George V. Neville-Neil
cdaa8777f7 Add tunable to disable destructive dtrace
Submitted by:	Joerg Pernfuss <code.jpe@gmail.com>
Reviewed by:	rstone, markj
MFC after:	2 weeks
Differential Revision:	https://reviews.freebsd.org/D8624
2016-11-23 22:50:20 +00:00
Alan Cox
bba39b9ae3 Remove PG_CACHED-related fields from struct vmmeter, because they are no
longer used.  More precisely, they are always zero because the code that
decremented and incremented them no longer exists.

Bump __FreeBSD_version to mark this change.

Reviewed by:	kib, markj
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8583
2016-11-22 18:13:46 +00:00
Andriy Gapon
17055fcda7 fix unsafe modification of zfs_vnodeops when DIAGNOSTIC is enabled
The idea was to avoid a false assertion in zfs_lock, but it was
implemented very dangerously and incorrectly.

Reported by:	pho
Tested by:	pho
MFC after:	1 week
2016-11-20 14:00:50 +00:00
Andriy Gapon
2ec31e84cc zfs: fix up after the removal of PG_CACHED pages in r308691
PR:		214629
Reported by:	mshirk@daemon-security.com
Reviewed by:	alc
Tested by:	Shawn Webb <shawn.webb@hardenedbsd.org>
X-MFC with:	308691
2016-11-19 08:12:57 +00:00
Mark Johnston
188011dbf2 Support fetching RFLAGS in fasttrap_getreg().
MFC after:	1 week
2016-11-18 03:11:11 +00:00
Alexander Motin
14b5719f6a After some ZIL changes 6 years ago zil_slog_limit got partially broken
due to zl_itx_list_sz not updated when async itx'es upgraded to sync.
Actually because of other changes about that time zl_itx_list_sz is not
really required to implement the functionality, so this patch removes
some unneeded broken code and variables.

Original idea of zil_slog_limit was to reduce chance of SLOG abuse by
single heavy logger, that increased latency for other (more latency critical)
loggers, by pushing heavy log out into the main pool instead of SLOG. Beside
huge latency increase for heavy writers, this implementation caused double
write of all data, since the log records were explicitly prepared for SLOG.
Since we now have I/O scheduler, I've found it can be much more efficient
to reduce priority of heavy logger SLOG writes from ZIO_PRIORITY_SYNC_WRITE
to ZIO_PRIORITY_ASYNC_WRITE, while still leave them on SLOG.

Existing ZIL implementation had problem with space efficiency when it
has to write large chunks of data into log blocks of limited size. In some
cases efficiency stopped to almost as low as 50%. In case of ZIL stored on
spinning rust, that also reduced log write speed in half, since head had to
uselessly fly over allocated but not written areas. This change improves
the situation by offloading problematic operations from z*_log_write() to
zil_lwb_commit(), which knows real situation of log blocks allocation and
can split large requests into pieces much more efficiently. Also as side
effect it removes one of two data copy operations done by ZIL code WR_COPIED
case.

While there, untangle and unify code of z*_log_write() functions.
Also zfs_log_write() alike to zvol_log_write() can now handle writes crossing
block boundary, that may also improve efficiency if ZPL is made to do that.

Sponsored by:	iXsystems, Inc.
2016-11-17 21:01:27 +00:00
Alexander Motin
eb9bfc257d Revert r307392: I've found a way to avoid big allocations completely. 2016-11-17 20:44:51 +00:00
Alan Cox
7667839a7e Remove most of the code for implementing PG_CACHED pages. (This change does
not remove user-space visible fields from vm_cnt or all of the references to
cached pages from comments.  Those changes will come later.)

Reviewed by:	kib, markj
Tested by:	pho
Sponsored by:	Dell EMC Isilon
Differential Revision:	https://reviews.freebsd.org/D8497
2016-11-15 18:22:50 +00:00
Mark Johnston
375c8b20dc Remove the DTrace printt and typeref actions.
These are FreeBSD-specific and were added in r178576 to provide the ability
to pretty-print instances of compound types. However, the print action has
long since been augmented to provide this functionality with a simpler
interface.

Discussed with:	gnn
Differential Revision:	https://reviews.freebsd.org/D8478
2016-11-12 19:26:12 +00:00
Bryan Drewery
28323add09 Fix improper use of "its".
Sponsored by:	Dell EMC Isilon
2016-11-08 23:59:41 +00:00
Oleksandr Tymoshenko
d30e308465 Fix include order as required post r308415 2016-11-07 20:02:18 +00:00
Alexander Motin
8acf168aab Fix ZIL records ordering when ZVOL opened both with and without FSYNC.
Before this an earlier writes to a ZVOL opened without FSYNC could get to
ZIL after later writes to the same ZVOL opened with FSYNC.  Fix this by
replicating functionality of ZPL (zv_sync_cnt equivalent to z_sync_cnt),
marking all log records sync if anybody opened the ZVOL with FSYNC.

MFC after:	2 weeks
2016-11-01 16:03:31 +00:00
Alexander Motin
2d1d8f4c8f Pass to zvol_log_truncate() same sync values as to zvol_log_write().
Surplus marking of TX_TRUNCATE records as sync could result in putting them
into ZIL before previous writes if ones were async.

MFC after:	2 weeks
2016-11-01 12:47:19 +00:00
Alexander Motin
74a148f46f Add sysctls for zfs_immediate_write_sz and zvol_immediate_write_sz. 2016-10-29 23:25:12 +00:00
Andriy Gapon
97371ba2a9 zfsbootcfg: a simple tool to set next boot (one time) options for zfsboot
(gpt)zfsboot will read one-time boot directives from a special ZFS pool
area.  The area was previously described as "Boot Block Header", but
currently it is know as Pad2, marked as reserved and is zeroed out on
pool creation.  The new code interprets data in this area, if any, using
the same format as boot.config.  The area is immediately wiped out.
Failure to parse the directives results in a reboot right after the
cleanup.  Otherwise the boot sequence proceeds as usual.

zfsbootcfg writes zfsboot arguments specified on its command line to the
Pad2 area of a disk identified by vfs.zfs.boot.primary_pool and
vfs.zfs.boot.primary_vdev kenv variables that are set by loader during
boot.  Please see the manual page for more.

Thanks to all who reviewed, contributed and made suggestions!  There are
many potential improvements to the feature, please see the review for
details.

Reviewed by:	wblock (docs)
Discussed with:	jhb, tsoome
MFC after:	3 weeks
Relnotes:	yes
Differential Revision: https://reviews.freebsd.org/D7612
2016-10-29 14:09:32 +00:00
Alexander Motin
471cf6ce7d Add vdev_reopening support to vdev_geom.
It allows to avoid extra GEOM providers flapping without significant need.
Since GEOM got resize support, we don't need to reopen provider to get new
size.  If provider was orphaned and no longer valid, ZFS should already
know that, and in such case reopen should be done in full as expected.

MFC after:	2 weeks
2016-10-28 17:05:14 +00:00
Alexander Motin
f106f43aa2 Matching GUIDs, handle possible race on vdev detach.
In case of vdev detach, causing top level mirror vdev destruction, leaf
vdev changes its GUID to one of the destroyed mirror, that creates race
condition when GUID in vdev label may not match one in the pool config.

This change replicates logic nuance of vdev_validate() by adding special
exception, matching the vdev GUID against the top level vdev GUID.
Since this exception is not completely reliable (may give false positives
if we fail to erase label on detached vdev), use it only as last resort.

Quick way to reproduce this scenario now is detach vdev from a pool with
enabled autoextend.  During vdev detach autoextend logic tries to reopen
remaining vdev, that always fails now since in-memory configuration is
already updated, while on-disk labels are not yet.

MFC after:	2 weeks
2016-10-28 16:21:31 +00:00
Alexander Motin
4be4cba048 Improve few debugging log messages. 2016-10-28 15:30:10 +00:00
Andriy Gapon
539fc86f2e 3746 ZRLs are racy
illumos/illumos-gate@260af64db7
260af64db7

https://www.illumos.org/issues/3746
  From the original change log:
  It was possible for a reference to be added even with the lock held, and
  for references added just after a lock release to be lost.
  This bug was also independently found and reported in wesunsolve.net
  issues 6985013 6995524.
  In zrl_add(), always use an atomic operation to update the refcount.
  The mutex in the ZRL only guarantees that wakeups occur for waiters on the
  lock. It offers no protection against concurrent updates of the refcount.
  The only refcount transition that is safe to perform without an atomic
  operation is from ZRL_LOCKED back to 0, since this can only be performed
  by the thread which has the ZRL locked.

Authored by: Will Andrews <will@freebsd.org>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Reviewed by: Pavel Zakharov <pavel.zakha@gmail.com>
Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Reviewed by: Justin T. Gibbs <gibbs@scsiguy.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author: Youzhong Yang <yyang@mathworks.com>
PR:		204037
MFC after:	1 week
2016-10-27 07:38:07 +00:00
Alexander Motin
f0cbbdecbc Fix panic after ZVOL renamed to name invalid for DEVFS.
MFC after:	2 weeks
2016-10-24 12:24:24 +00:00
Alexander Motin
9be66df1e1 Add vfs.zfs.zil_log_limit sysctl.
It is at least partially broken now, but that is another question.
2016-10-16 18:49:15 +00:00
Alexander Motin
a059d8ccbc Optimize ZIL itx memory allocation on FreeBSD.
These allocations can reach up to 128KB, while FreeBSD kernel allocator
can cache allocations only up to 64KB.  To avoid expensive allocations
for each large ZIL write use caching zio_buf_alloc() allocator instead.

To make it possible de-inline few instances of zil_itx_destroy().
2016-10-16 10:43:12 +00:00
Alexander Motin
1899e205d1 MFV r307314:
6988 spa_sync() spends half its time in dmu_objset_do_userquota_updates

Using a benchmark which creates 2 million files in one TXG, I observe
that the thread running spa_sync() is on CPU almost the entire time we
are syncing, and therefore can be a performance bottleneck. About 50% of
the time in spa_sync() is in dmu_objset_do_userquota_updates().

The problem is that dmu_objset_do_userquota_updates() calls
zap_increment_int(DMU_USERUSED_OBJECT) once for every file that was
modified (or created). In this benchmark, all the files are owned by the
same user/group, so all 2 million calls to zap_increment_int() are
modifying the same entry in the zap. The same issue exists for the
DMU_GROUPUSED_OBJECT.

We should keep an in-memory map from user to space delta while we are
syncing, and when we finish, iterate over the in-memory map and modify
the ZAP once per entry. This reduces the number of calls to
zap_increment_int() from "number of objects modified" to "number of
owners/groups of modified files".

This reduced the time spent in spa_sync() in the file create benchmark
by ~33%, from 11 seconds to 7 seconds.

Closes #107

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Jinshan Xiong <jinshan.xiong@intel.com>
Author: Matthew Ahrens <mahrens@delphix.com>

openzfs/openzfs@5fc46359c5
2016-10-14 12:03:04 +00:00
Alexander Motin
b3a8b04807 MFV r307313:
5120 zfs should allow large block/gzip/raidz boot pool (loader project)

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Toomas Soome <tsoome@me.com>

openzfs/openzfs@c8811bd3e2

FreeBSD still does not support booting from gzip-compressed datasets,
so keep one chunk of this commit out.
2016-10-14 12:01:33 +00:00
Konstantin Belousov
5975e53d40 Fix a race in vm_page_busy_sleep(9).
Suppose that we have an exclusively busy page, and a thread which can
accept shared-busy page.  In this case, typical code waiting for the
page xbusy state to pass is
again:
	VM_OBJECT_WLOCK(object);
	...
	if (vm_page_xbusied(m)) {
		vm_page_lock(m);
 		VM_OBJECT_WUNLOCK(object);    <---1
		vm_page_busy_sleep(p, "vmopax");
 		goto again;
	}

Suppose that the xbusy state owner locked the object, unbusied the
page and unlocked the object after we are at the line [1], but before we
executed the load of the busy_lock word in vm_page_busy_sleep().  If it
happens that there is still no waiters recorded for the busy state,
the xbusy owner did not acquired the page lock, so it proceeded.

More, suppose that some other thread happen to share-busy the page
after xbusy state was relinquished but before the m->busy_lock is read
in vm_page_busy_sleep().  Again, that thread only needs vm_object lock
to proceed.  Then, vm_page_busy_sleep() reads busy_lock value equal to
the VPB_SHARERS_WORD(1).

In this case, all tests in vm_page_busy_sleep(9) pass and we are going
to sleep, despite the page being share-busied.

Update check for m->busy_lock == VPB_UNBUSIED in vm_page_busy_sleep(9)
to also accept shared-busy state if we only wait for the xbusy state to
pass.

Merge sequential if()s with the same 'then' clause in
vm_page_busy_sleep().

Note that the current code does not share-busy pages from parallel
threads, the only way to have more that one sbusy owner is right now
is to recurse.

Reported and tested by:	pho (previous version)
Reviewed by:	alc, markj
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
Differential revision:	https://reviews.freebsd.org/D8196
2016-10-13 14:41:05 +00:00
Konstantin Belousov
f71d08566c Limit scope of the optimization in r306608 to dounmount() caller only.
Other uses of cache_purgevfs() do rely on the cache purge for correct
operations, when paths are invalidated without unmount.

Reported and tested by:	jkim
Discussed with:	mjg
Sponsored by:	The FreeBSD Foundation
2016-10-07 11:38:28 +00:00
Andriy Gapon
6f98c83306 implement zfs_vptocnp() using z_parent property
This should allow vn_fullpath() to work even when vfs name cache is
disabled for zfs, which is the case when zfs properties like
casesensitivity and normalization are set non-default values.

The new code should be 100% reliable for directories and "mostly"
reliable for files, that is, when hardlinks across directories are
not used.

Reported by:	Frederic Chardon <chardon.frederic@gmail.com>
Reviewed by:	kib (vfs contract)
MFC after:	2 weeks
Differential Revision: https://reviews.freebsd.org/D8146
2016-10-07 06:29:24 +00:00
Andriy Gapon
9ba3abc30e zfs: fix a wrong assertion for extended attributes
For the extended attributes the order between z_teardown_lock and the
vnode lock is different.
The bug was triggered only with DIAGNOSTIC turned on.
This fix is developed in cooperation with avos.

PR:		213112
Reported by:	avos
Tested by:	avos
MFC after:	1 week
2016-10-04 08:09:25 +00:00