7500 Simplify dbuf_free_range by removing dn_unlisted_l0_blkid
illumos/illumos-gate@653af1b809653af1b809https://www.illumos.org/issues/7500
With the integration of:
commit 0f6d88aded0d165f5954688a9b13bac76c38da84
Author: Alex Reece <alex@delphix.com>
Date: Sat Jul 26 13:40:04 2014 -0800
4873 zvol unmap calls can take a very long time for larger datasets
the dnode's dn_bufs field was changed from a list to a tree. As a result,
the dn_unlisted_l0_blkid field is no longer necessary.
Author: Stephen Blinick <stephen.blinick@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
* In zfs_freebsd_setattr, if the caller wants to set the birthtime,
set the bits that zfs_settattr expects
* In zfs_setattr, if XAT_CREATETIME is set, set xoa_createtime,
expected by zfs_xvattr_set. The two levels of indirection seem
excessive, but it minimizes diffs vs OpenZFS.
* In zfs_setattr, check for overflow of va_birthtime (from delphij)
* Remove red herring in zfs_getattr
sys/cddl/contrib/opensolaris/uts/common/sys/vnode.h
* Un-booby-trap some macros
New tests are under review at https://github.com/pjd/pjdfstest/pull/6
Reviewed by: avg
MFC after: 3 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D9353
It is an ASCII encoding of a hexadecimal representation of the DOF file
used to enable anonymous tracing, so its length should always be even.
MFC after: 1 week
When recording probe site addresses in the output DOF file, dtrace -G
needs to emit relocations for the .SUNW_dof section in order to obtain
the addresses of functions containing probe sites. DTrace expects the
addresses to be relative to the base address of the final ELF file,
and the amd64 USDT implementation was relying on some unspecified and
incorrect behaviour in the base system GNU ld to achieve this.
This change reimplements the probe site relocation handling to allow
USDT to be used with lld and newer GNU binutils. Specifically, it
makes use of R_X86_64_PC64/R_386_PC32 relocations to obtain the
probe site address relative to the DOF file address, and adds and uses a
new DOF relocation type which computes the final probe site address using
these relative offsets.
Reported by and discussed with: Rafael Espíndola
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D9374
low-quality random numbers with a modern implementation (xoroshiro128+)
that is capable of generating better quality randomness without compromising performance.
Submitted by: Graeme Jenkinson
Reviewed by: markj
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D9051
This corresponds to the following illumos issues:
5755 want support for Intel FMA instrs
5756 want support for Intel BMI1 instrs
5757 want support for Intel BMI2 instrs
5758 want support for Intel AVX2 instrs
7204 Want broadwell rdseed and adx support
7208 Want stac/clac disasm support
7733 Need SHA Instruction dis support
7756 dis can't handle x86 SSE 3 instructions
7757 want avx2 disasm tests
7758 want SSE 4.1 disasm tests
MFC after: 2 weeks
FASTTRAP_MAX_INSTR_SIZE is the largest valid value of a tracepoint, so
correct the assertion accordingly. This limit was hit with a 15-byte NOP.
Reported by: bdrewery
MFC after: 1 week
Sponsored by: Dell EMC Isilon
This ioctl has been considered legacy by upstream since the DTrace code
was first imported, and is unused. The removal also allows some
simplification of dtrace_helper_slurp().
Also remove a bogus copyout in the DTRACEHIOC_ADDDOF handler. Due to a
bug, it would overwrite an in-memory copy of the DOF header rather than
the passed-in DOF helper. Moreover, DTRACEHIOC_ADDDOF already copies the
helper back out automatically since its argument has the IOC_OUT attribute.
6569 large file delete can starve out write ops
illumos/illumos-gate@ff5177ee8bff5177ee8bhttps://www.illumos.org/issues/6569
The core issue I've found is that there is no throttle for how many
deletes get assigned to one TXG. As a results when deleting large files
we end up filling consecutive TXGs with deletes/frees, then write
throttling other (more important) ops.
There is an easy test case for this problem. Try deleting several
large files (at least 1/2 TB) while you do write ops on the same
pool. What we've seen is performance of these write ops (let's
call it sideload I/O) would drop to zero.
More specifically the problem is that dmu_free_long_range_impl()
can/will fill up all of the dirty data in the pool "instantly",
before many of the sideload ops can get in. So sideload
performance will be impacted until all the files are freed.
The solution we have tested at Nexenta (with positive results)
creates a relatively simple throttle for how many "free" ops we let
into one TXG.
However this solution exposes other problems that should also be
addressed. If we are to slow down freeing of data that means one
has to wait even longer (assuming vnode ref count of 1) to get shell
back after an rm or for NFS thread to finish the free-ing op.
To avoid this the proposed solution is to call zfs_inactive() async
for "large" files. Async freeing then begs for the reclaimed space
to be accounted for in the zpool's "freeing" prop.
The other issue with having a longer delete is the inability to
export/unmount for a longer period of time. The proposed solution
is to interrupt freeing of blocks when a fs is unmounted.
Author: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Reviewed by: avg
Differential Revision: D9008
the fifth argument to functions being traced, however there was an error
where the userspace stack was being used. This may be invalid leading to
a kernel panic if this address is unmapped.
Submitted by: Graeme Jenkinson <graeme.jenkinson@cl.cam.ac.uk>
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D9229
It was a temporary change to ease an import of native atomic_cas primitives.
Instead, atomic_fcmpset was devised with different semantics. See r311168.
At least on FreeBSD there are no legal way to access media or get its
size without opening device/provider first. Postponing this caching
allows to skip several disk seeks per ZVOL/snapshot during import.
For HDD pool with 1 ZVOL in dev mode with 1000 snapshots this reduces
pool import time from 40 to 10 seconds.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
These functions may be called in DTrace probe context, so they cannot be
safely traced. Moreover, they are currently only used by DTrace, so their
corresponding FBT probes are not particularly useful.
MFC after: 2 weeks
Original commit "7090 zfs should improve allocation order" declares alloc
queue sorted by time and offset. But in practice io_offset is always zero,
so sorting happened only by time, while order of writes with equal time was
completely random. On Illumos this did not affected much thanks to using
high resolution timestamps. On FreeBSD due to using much faster but low
resolution timestamps it caused bad data placement on disks, affecting
further read performance.
This change switches zio_timestamp_compare() from comparing uninitialized
io_offset to really populated io_bookmark values. I haven't decided yet
what to do with timestampts, but on simple tests this change gives the
same peformance results by just making code to work as declared.
MFC after: 1 week
On FreeBSD the sense of rw_write_held() and rw_iswriter() were reversed,
probably due to a cut and paste error. Using rw_iswriter() would cause
the kernel to panic.
Reviewed by: markj
MFC after: 2 weeks
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D8718
Note: there was a merge conflict resolved by me.
illumos/illumos-gate@43297f973a43297f973ahttps://www.illumos.org/issues/3821
We recently had nodes with some of the latest zfs bits panic on us in a
rollback-heavy environment. The following is from my preliminary analysis:
Let's look at where we died:
> $C
ffffff01ea6b9a10 taskq_dispatch+0x3a(0, fffffffff7d20450, ffffff5551dea920, 1)
ffffff01ea6b9a60 zil_clean+0xce(ffffff4b7106c080, 7e0f1)
ffffff01ea6b9aa0 dsl_pool_sync_done+0x47(ffffff4313065680, 7e0f1)
ffffff01ea6b9b70 spa_sync+0x55f(ffffff4310c1d040, 7e0f1)
ffffff01ea6b9c20 txg_sync_thread+0x20f(ffffff4313065680)
ffffff01ea6b9c30 thread_start+8()
If we dig in we can find that this dataset corresponds to a zone:
> ffffff4b7106c080::print zilog_t zl_os->os_dsl_dataset->ds_dir->dd_myname
zl_os->os_dsl_dataset->ds_dir->dd_myname = [ "8ffce16a-13c2-4efa-a233-
9e378e89877b" ]
Okay so we have a null taskq pointer. That only happens during the calls to
zil_open and zil_close. If we poke around we can see that we're actually in
midst of a rollback:
> ::pgrep zfs | ::printf "0x%x %s\\n" proc_t . p_user.u_psargs
0xffffff43262800a0 zfs rollback zones/15714eb6-f5ea-469f-ac6d-
4b8ab06213c2@marlin_init
0xffffff54e22a1028 zfs rollback zones/8ffce16a-13c2-4efa-a233-
9e378e89877b@marlin_init
0xffffff4362f3a058 zfs rollback zones/0ddb8e49-ca7e-42e1-8fdc-
4ac4ba8fe9f8@marlin_init
0xffffff5748e8d020 zfs rollback zones/426357b5-832d-4430-953e-
10cd45ff8e9f@marlin_init
0xffffff436b867008 zfs rollback zones/8f36bf37-8a9c-4a44-995c-
6d1b2751e6f5@marlin_init
0xffffff4381ad4090 zfs rollback zones/6c8eca18-fbd6-46dd-ac24-
2ed45cd0da70@marlin_init
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: George Wilson <george.wilson@delphix.com>
MFC after: 3 weeks
illumos/illumos-gate@90f2c094b390f2c094b3https://www.illumos.org/issues/7181
zfsvfs_setup() is called in both zfs_mount and zfs_resume_fs paths.
dmu_objset_set_user(zfsvfs->z_os, zfsvfs) is called early in zfsvfs_setup()
before the setup is actually completed,
thus an under-constructed zfsvfs becomes visible.
Additionally, there is nothing to serialize the two call paths. As a result two
threads can step on each other's toes.
assertion failed: zilog->zl_clean_taskq == NULL, file:
../../common/fs/zfs/zil.c, line: 1772
> $c
vpanic()
0xfffffffffbdf6928()
zil_open+0x45(ffffff1bbc5dd000, fffffffff7993880)
zfsvfs_setup+0x84(ffffffb378d77000, 0)
zfs_resume_fs+0x132(ffffffb378d77000, ffffffb37ddcf000)
zfs_ioc_rollback+0x96(ffffffb37ddcf000, ffffff01dcdc4cd0, ffffff01aa091000)
zfsdev_ioctl+0x215(10a00000000, 5a19, 80465f8, 100003, ffffff01ab318368,
ffffff0004b59e58)
cdev_ioctl+0x39(10a00000000, 5a19, 80465f8, 100003, ffffff01ab318368,
ffffff0004b59e58)
spec_ioctl+0x60(ffffff0197737700, 5a19, 80465f8, 100003,
ffffff01ab318368, ffffff0004b59e58)
fop_ioctl+0x55(ffffff0197737700, 5a19, 80465f8, 100003,
ffffff01ab318368, ffffff0004b59e58)
ioctl+0x9b(7, 5a19, 80465f8)
sys_syscall32+0x1f7()
> ffffff1bbc5dd000::print objset_t os_zil
os_zil = 0xffffff1c053cf7c0
> 0xffffff1c053cf7c0::print zilog_t zl_clean_taskq
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Author: Andriy Gapon <andriy.gapon@clusterhq.com>
MFC after: 2 weeks
already free blocks
7199 dsl_dataset_rollback_sync may try to free already free blocks
7200 no blocks must be born in a txg after a snaphot is created
illumos/illumos-gate@bfaed0b91ebfaed0b91ehttps://www.illumos.org/issues/7199
dsl_dataset_rollback_sync may try to free already freed blocks when it calls
dsl_destroy_head_sync_impl to destroy a temporary clone.
That happens if a snapshot to which we are rolling back and from which the
clone is created has some ZIL records.
https://www.illumos.org/issues/7200
No new blocks must be born in a dataset in the same TXG after a snapshot of the
dataset is taken.
Those blocks would have the same blk_birth as the dataset's ds_prev_snap_txg
and as such they would be presumed to belong o the snapshot while in fact they
do not.
All the datasets must be clean before sync tasks are run, so the described
scenario may happen only if one of the sync tasks dirties the dataset and
another sync task takes its snapshot.
Then, there will be another sync pass because of the dirty data and the new
blocks will be born in the same TXG when the data is written out.
It seems that almost all of the existing sync tasks modify only MOS and do not
dirty any objsets.
The only exception that I've been able to identify so far is the rollback which
can modify an objset when it zeroes out the objset's ZIL.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Author: Andriy Gapon <andriy.gapon@clusterhq.com>
MFC after: 3 weeks
and zfs_ioc_rename
illumos/illumos-gate@690041b9ca690041b9cahttps://www.illumos.org/issues/7180
If a filesystem is not unmounted while the rename is being performed, then, for
example, a concurrect zfs rollback may call zfs_suspend_fs followed by
zfs_resume_fs on the same filesystem.
The latter takes the filesystem's name as an argument. If the filesystem name
changes as a result of the rename, then dmu_objset_hold(osname, zfsvfs, &os)
call in zfs_resume_fs would fail resulting in a kernel panic.
So far I have been able to reproduce this problem on FreeBSD where zfs rename
has -u option that skips the unmounting before doing the renaming.
But I think that in theory the same problem can occur on illumos as well,
because the unmounting is done in userland before invoking the rename ioctl and
there could be a race with, e.g., zfs mount.
panic: solaris assert: dmu_objset_hold(osname, zfsvfs, &zfsvfs->z_os) == 0 (0x2
== 0x0), file: /usr/devel/svn/head/sys/cddl/contrib/opensolaris/uts/common/fs/
zfs/zfs_vfsops.c, line: 2210
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe004df30710
vpanic() at vpanic+0x182/frame 0xfffffe004df30790
panic() at panic+0x43/frame 0xfffffe004df307f0
assfail3() at assfail3+0x2c/frame 0xfffffe004df30810
zfs_resume_fs() at zfs_resume_fs+0xb9/frame 0xfffffe004df30860
zfs_ioc_rollback() at zfs_ioc_rollback+0x61/frame 0xfffffe004df308a0
zfsdev_ioctl() at zfsdev_ioctl+0x65c/frame 0xfffffe004df30940
devfs_ioctl_f() at devfs_ioctl_f+0x156/frame 0xfffffe004df309a0
kern_ioctl() at kern_ioctl+0x246/frame 0xfffffe004df30a00
sys_ioctl() at sys_ioctl+0x171/frame 0xfffffe004df30ae0
amd64_syscall() at amd64_syscall+0x2db/frame 0xfffffe004df30bf0
Xfast_syscall() at Xfast_syscall+0xfb/frame 0xfffffe004df30bf0
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
MFC after: 2 weeks
It was very wrong to look at the vnode and znode internals without
having locked the vnode first.
Reported by: pho
Tested by: pho
MFC after: 1 week
X-MFC with: r308887
longer used. More precisely, they are always zero because the code that
decremented and incremented them no longer exists.
Bump __FreeBSD_version to mark this change.
Reviewed by: kib, markj
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8583
The idea was to avoid a false assertion in zfs_lock, but it was
implemented very dangerously and incorrectly.
Reported by: pho
Tested by: pho
MFC after: 1 week
due to zl_itx_list_sz not updated when async itx'es upgraded to sync.
Actually because of other changes about that time zl_itx_list_sz is not
really required to implement the functionality, so this patch removes
some unneeded broken code and variables.
Original idea of zil_slog_limit was to reduce chance of SLOG abuse by
single heavy logger, that increased latency for other (more latency critical)
loggers, by pushing heavy log out into the main pool instead of SLOG. Beside
huge latency increase for heavy writers, this implementation caused double
write of all data, since the log records were explicitly prepared for SLOG.
Since we now have I/O scheduler, I've found it can be much more efficient
to reduce priority of heavy logger SLOG writes from ZIO_PRIORITY_SYNC_WRITE
to ZIO_PRIORITY_ASYNC_WRITE, while still leave them on SLOG.
Existing ZIL implementation had problem with space efficiency when it
has to write large chunks of data into log blocks of limited size. In some
cases efficiency stopped to almost as low as 50%. In case of ZIL stored on
spinning rust, that also reduced log write speed in half, since head had to
uselessly fly over allocated but not written areas. This change improves
the situation by offloading problematic operations from z*_log_write() to
zil_lwb_commit(), which knows real situation of log blocks allocation and
can split large requests into pieces much more efficiently. Also as side
effect it removes one of two data copy operations done by ZIL code WR_COPIED
case.
While there, untangle and unify code of z*_log_write() functions.
Also zfs_log_write() alike to zvol_log_write() can now handle writes crossing
block boundary, that may also improve efficiency if ZPL is made to do that.
Sponsored by: iXsystems, Inc.
not remove user-space visible fields from vm_cnt or all of the references to
cached pages from comments. Those changes will come later.)
Reviewed by: kib, markj
Tested by: pho
Sponsored by: Dell EMC Isilon
Differential Revision: https://reviews.freebsd.org/D8497
These are FreeBSD-specific and were added in r178576 to provide the ability
to pretty-print instances of compound types. However, the print action has
long since been augmented to provide this functionality with a simpler
interface.
Discussed with: gnn
Differential Revision: https://reviews.freebsd.org/D8478
Before this an earlier writes to a ZVOL opened without FSYNC could get to
ZIL after later writes to the same ZVOL opened with FSYNC. Fix this by
replicating functionality of ZPL (zv_sync_cnt equivalent to z_sync_cnt),
marking all log records sync if anybody opened the ZVOL with FSYNC.
MFC after: 2 weeks
(gpt)zfsboot will read one-time boot directives from a special ZFS pool
area. The area was previously described as "Boot Block Header", but
currently it is know as Pad2, marked as reserved and is zeroed out on
pool creation. The new code interprets data in this area, if any, using
the same format as boot.config. The area is immediately wiped out.
Failure to parse the directives results in a reboot right after the
cleanup. Otherwise the boot sequence proceeds as usual.
zfsbootcfg writes zfsboot arguments specified on its command line to the
Pad2 area of a disk identified by vfs.zfs.boot.primary_pool and
vfs.zfs.boot.primary_vdev kenv variables that are set by loader during
boot. Please see the manual page for more.
Thanks to all who reviewed, contributed and made suggestions! There are
many potential improvements to the feature, please see the review for
details.
Reviewed by: wblock (docs)
Discussed with: jhb, tsoome
MFC after: 3 weeks
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D7612
It allows to avoid extra GEOM providers flapping without significant need.
Since GEOM got resize support, we don't need to reopen provider to get new
size. If provider was orphaned and no longer valid, ZFS should already
know that, and in such case reopen should be done in full as expected.
MFC after: 2 weeks
In case of vdev detach, causing top level mirror vdev destruction, leaf
vdev changes its GUID to one of the destroyed mirror, that creates race
condition when GUID in vdev label may not match one in the pool config.
This change replicates logic nuance of vdev_validate() by adding special
exception, matching the vdev GUID against the top level vdev GUID.
Since this exception is not completely reliable (may give false positives
if we fail to erase label on detached vdev), use it only as last resort.
Quick way to reproduce this scenario now is detach vdev from a pool with
enabled autoextend. During vdev detach autoextend logic tries to reopen
remaining vdev, that always fails now since in-memory configuration is
already updated, while on-disk labels are not yet.
MFC after: 2 weeks
illumos/illumos-gate@260af64db7260af64db7https://www.illumos.org/issues/3746
From the original change log:
It was possible for a reference to be added even with the lock held, and
for references added just after a lock release to be lost.
This bug was also independently found and reported in wesunsolve.net
issues 6985013 6995524.
In zrl_add(), always use an atomic operation to update the refcount.
The mutex in the ZRL only guarantees that wakeups occur for waiters on the
lock. It offers no protection against concurrent updates of the refcount.
The only refcount transition that is safe to perform without an atomic
operation is from ZRL_LOCKED back to 0, since this can only be performed
by the thread which has the ZRL locked.
Authored by: Will Andrews <will@freebsd.org>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Reviewed by: Pavel Zakharov <pavel.zakha@gmail.com>
Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Reviewed by: Justin T. Gibbs <gibbs@scsiguy.com>
Approved by: Matt Ahrens <mahrens@delphix.com>
Author: Youzhong Yang <yyang@mathworks.com>
PR: 204037
MFC after: 1 week
These allocations can reach up to 128KB, while FreeBSD kernel allocator
can cache allocations only up to 64KB. To avoid expensive allocations
for each large ZIL write use caching zio_buf_alloc() allocator instead.
To make it possible de-inline few instances of zil_itx_destroy().
6988 spa_sync() spends half its time in dmu_objset_do_userquota_updates
Using a benchmark which creates 2 million files in one TXG, I observe
that the thread running spa_sync() is on CPU almost the entire time we
are syncing, and therefore can be a performance bottleneck. About 50% of
the time in spa_sync() is in dmu_objset_do_userquota_updates().
The problem is that dmu_objset_do_userquota_updates() calls
zap_increment_int(DMU_USERUSED_OBJECT) once for every file that was
modified (or created). In this benchmark, all the files are owned by the
same user/group, so all 2 million calls to zap_increment_int() are
modifying the same entry in the zap. The same issue exists for the
DMU_GROUPUSED_OBJECT.
We should keep an in-memory map from user to space delta while we are
syncing, and when we finish, iterate over the in-memory map and modify
the ZAP once per entry. This reduces the number of calls to
zap_increment_int() from "number of objects modified" to "number of
owners/groups of modified files".
This reduced the time spent in spa_sync() in the file create benchmark
by ~33%, from 11 seconds to 7 seconds.
Closes#107
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Jinshan Xiong <jinshan.xiong@intel.com>
Author: Matthew Ahrens <mahrens@delphix.com>
openzfs/openzfs@5fc46359c5
5120 zfs should allow large block/gzip/raidz boot pool (loader project)
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Toomas Soome <tsoome@me.com>
openzfs/openzfs@c8811bd3e2
FreeBSD still does not support booting from gzip-compressed datasets,
so keep one chunk of this commit out.
Suppose that we have an exclusively busy page, and a thread which can
accept shared-busy page. In this case, typical code waiting for the
page xbusy state to pass is
again:
VM_OBJECT_WLOCK(object);
...
if (vm_page_xbusied(m)) {
vm_page_lock(m);
VM_OBJECT_WUNLOCK(object); <---1
vm_page_busy_sleep(p, "vmopax");
goto again;
}
Suppose that the xbusy state owner locked the object, unbusied the
page and unlocked the object after we are at the line [1], but before we
executed the load of the busy_lock word in vm_page_busy_sleep(). If it
happens that there is still no waiters recorded for the busy state,
the xbusy owner did not acquired the page lock, so it proceeded.
More, suppose that some other thread happen to share-busy the page
after xbusy state was relinquished but before the m->busy_lock is read
in vm_page_busy_sleep(). Again, that thread only needs vm_object lock
to proceed. Then, vm_page_busy_sleep() reads busy_lock value equal to
the VPB_SHARERS_WORD(1).
In this case, all tests in vm_page_busy_sleep(9) pass and we are going
to sleep, despite the page being share-busied.
Update check for m->busy_lock == VPB_UNBUSIED in vm_page_busy_sleep(9)
to also accept shared-busy state if we only wait for the xbusy state to
pass.
Merge sequential if()s with the same 'then' clause in
vm_page_busy_sleep().
Note that the current code does not share-busy pages from parallel
threads, the only way to have more that one sbusy owner is right now
is to recurse.
Reported and tested by: pho (previous version)
Reviewed by: alc, markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D8196
Other uses of cache_purgevfs() do rely on the cache purge for correct
operations, when paths are invalidated without unmount.
Reported and tested by: jkim
Discussed with: mjg
Sponsored by: The FreeBSD Foundation
This should allow vn_fullpath() to work even when vfs name cache is
disabled for zfs, which is the case when zfs properties like
casesensitivity and normalization are set non-default values.
The new code should be 100% reliable for directories and "mostly"
reliable for files, that is, when hardlinks across directories are
not used.
Reported by: Frederic Chardon <chardon.frederic@gmail.com>
Reviewed by: kib (vfs contract)
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D8146
For the extended attributes the order between z_teardown_lock and the
vnode lock is different.
The bug was triggered only with DIAGNOSTIC turned on.
This fix is developed in cooperation with avos.
PR: 213112
Reported by: avos
Tested by: avos
MFC after: 1 week
This restriction was inherited from upstream but is not relevant on FreeBSD.
Furthermore, it hindered the tracing of locking primitive subroutines.
MFC after: 1 week
Until we can resolve the numerous hole_birth bugs that have cropped up
recently, and come up with a way going forwards to protect users from
corruption, we should disable the hole_birth feature. Using a tunable
allows those who are confident that their data is correct to continue to
take advantage of the feature.
Closes#188
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Author: Paul Dagnelie <pcd@delphix.com>
dsl_dataset_space is looking at the ds_bp's fill count while
dmu_objset_write_ready() is concurrently modifying it. This fix adds an
rrwlock to protect the ds_bp.
Closes#180
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Author: Paul Dagnelie <pcd@delphix.com>
Background. In ZFS a file with extended attributes has a special
directory associated with it where each extended attribute is a file.
The attribute's name is a file name and its value is a file content.
When the ownership of a file with extended attributes is changed, ZFS
also changes ownership of the special directory. This is where the bug
was hit.
The bug was introduced in r209158.
Nota bene. ZFS vnode locks are typically acquired before
z_teardown_lock (i.e., before ZFS_ENTER). But this is not the case for
the vnodes that represent the extended attribute directory and files.
Those are always locked after ZFS_ENTER. This is confusing and fragile.
PR: 212702
Reported by: Christian Fuss to FreeNAS
Tested by: mav
MFC after: 1 week
Otherwise there exists a narrow window during which a syscall probe can be
disabled and cause a concurrently-running thread to call dtrace_probe()
with an invalid probe ID.
Reported by: ngie
MFC after: 1 week
Sponsored by: Dell EMC Isilon
Our shim for Solaris random_get_bytes() uses read_random(), that looks
reasonable, since it guaranties reliably seeded random data. On the other
side Solaris random_get_pseudo_bytes() does not provide this guarantie,
and its original Solaris implementation is equivalent to our arc4rand(),
using software crypto without stressing slower hardware RNG.
The DS_FIELD_LARGE_BLOCKS macro has been unused since the integration of
this patch:
commit ca0cc3918a1789fa839194af2a9245f801a06b1a
Author: Matthew Ahrens <mahrens@delphix.com>
Date: Fri Jul 24 09:53:55 2015 -0700
5959 clean up per-dataset feature count code
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
This patch simply removes this macro from dsl_dataset.h.
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Author: Matthew Ahrens <mahrens@delphix.com>
When changing zfs_arc_max (e.g. as zdb does), it may be set to less
than the default arc_c_min. arc_c_min should decrease to not be more than
arc_c_max, but it doesn't; therefore tuning of arc_c_max is ineffective.
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Author: Matthew Ahrens <mahrens@delphix.com>
openzfs/openzfs@608764bead
The upstream change introduced a new load state, SPA_LOAD_CREATE,
and vdev_geom code needs to be aware of it.
Tested by: cy
MFC after: 1 week
X-MFC with: r305331
Using a benchmark which has 32 threads creating 2 million files in the
same directory, on a machine with 16 CPU cores, I observed poor
performance. I noticed that dmu_tx_hold_zap() was using about 30% of
all CPU, and doing dnode_hold() 7 times on the same object (the ZAP
object that is being held).
dmu_tx_hold_zap() keeps a hold on the dnode_t the entire time it is
running, in dmu_tx_hold_t:txh_dnode, so it would be nice to use the
dnode_t that we already have in hand, rather than repeatedly calling
dnode_hold(). To do this, we need to pass the dnode_t down through
all the intermediate calls that dmu_tx_hold_zap() makes, making these
routines take the dnode_t* rather than an objset_t* and a uint64_t
object number. In particular, the following routines will need to have
analogous *_by_dnode() variants created:
dmu_buf_hold_noread()
dmu_buf_hold()
zap_lookup()
zap_lookup_norm()
zap_count_write()
zap_lockdir()
zap_count_write()
This can improve performance on the benchmark described above by 100%,
from 30,000 file creations per second to 60,000. (This improvement is on
top of that provided by working around the object allocation issue. Peak
performance of ~90,000 creations per second was observed with 8 CPUs;
adding CPUs past that decreased performance due to lock contention.) The
CPU used by dmu_tx_hold_zap() was reduced by 88%, from 340 CPU-seconds
to 40 CPU-seconds.
Sponsored by: Intel Corp.
Closes#109
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Author: Matthew Ahrens <mahrens@delphix.com>
openzfs/openzfs@d3e523d489
This resolves two 'zfs recv' issues. First, when receiving into an
existing filesystem, a snapshot created during the receive process is
not added to the guid->dataset map for the stream, resulting in failed
lookups for deduped streams when a WRITE_BYREF record refers to a
snapshot received earlier in the stream. Second, the newly created
snapshot was also not set properly, referencing the snapshot before the
new receiving dataset rather than the existing filesystem.
Closes#159
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Author: Chris Williamson <chris.williamson@delphix.com>
openzfs/openzfs@b09697c8c1
zap_lockdir() / zap_unlockdir() should take a "void *tag" argument which
tags the hold on the zap. This will help diagnose programming errors
which misuse the hold on the ZAP.
Sponsored by: Intel Corp.
Closes#108
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Author: Matthew Ahrens <mahrens@delphix.com>
openzfs/openzfs@0780b3eab5
7230 add assertions to dmu_send_impl() to verify that stream includes BEGIN and END records
illumos/illumos-gate@12b90ee2d3https://github.com/illumos/illumos-gate/commit/12b90ee2d3b10689fc45f4930d2392f5f
e1d9cfa
https://www.illumos.org/issues/7230
A test failure occurred where a send stream had only a BEGIN record. This
should not be possible if the send returns without error. Prevented this from
happening in the future by adding an assertion to dmu_send_impl() to verify
that if the function returns 0 (success) both a BEGIN and END record are
present. Did this by adding flags to dmu_sendarg_t (indicating whether BEGIN o
r
END records sent), having dump_record() set flags appropriately, adding VERIFY
statement to dmu_send_impl().
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matt Krantz <matt.krantz@delphix.com>
illumos/illumos-gate@0f7643c737https://github.com/illumos/illumos-gate/commit/0f7643c7376dd69a08acbfc9d1d7d548b
10c846a
https://www.illumos.org/issues/7090
When write I/Os are issued, they are issued in block order but the ZIO pipelin
e
will drive them asynchronously through the allocation stage which can result i
n
blocks being allocated out-of-order. It would be nice to preserve as much of
the logical order as possible.
In addition, the allocations are equally scattered across all top-level VDEVs
but not all top-level VDEVs are created equally. The pipeline should be able t
o
detect devices that are more capable of handling allocations and should
allocate more blocks to those devices. This allows for dynamic allocation
distribution when devices are imbalanced as fuller devices will tend to be
slower than empty devices.
The change includes a new pool-wide allocation queue which would throttle and
order allocations in the ZIO pipeline. The queue would be ordered by issued
time and offset and would provide an initial amount of allocation of work to
each top-level vdev. The allocation logic utilizes a reservation system to
reserve allocations that will be performed by the allocator. Once an allocatio
n
is successfully completed it's scheduled on a given top-level vdev. Each top-
level vdev maintains a maximum number of allocations that it can handle
(mg_alloc_queue_depth). The pool-wide reserved allocations (top-levels *
mg_alloc_queue_depth) are distributed across the top-level vdevs metaslab
groups and round robin across all eligible metaslab groups to distribute the
work. As top-levels complete their work, they receive additional work from the
pool-wide allocation queue until the allocation queue is emptied.
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: George Wilson <george.wilson@delphix.com>
7086 ztest attempts dva_get_dsize_sync on an embedded blockpointer
illumos/illumos-gate@926549256bhttps://github.com/illumos/illumos-gate/commit/926549256b71acd595f69b236779ff6b7
8fa08ef
https://www.illumos.org/issues/7086
In dbuf_dirty(), we need to grab the dn_struct_rwlock before looking at the
db_blkptr, to prevent it from being changed by syncing context.
Otherwise we may see that ztest got a segfault from this stack:
libzpool.so.1`dva_get_dsize_sync+0x98(872f000, b32b240, fed7811b, 0, b4cda20,
0)
libzpool.so.1`bp_get_dsize+0x60(872f000, b32b240, 0, 97cb780, 9d4c1a8, 0)
libzpool.so.1`dbuf_dirty+0x9b3(ce0a100, 97cb780, 9, fecd2530)
libzpool.so.1`dmu_buf_will_dirty+0xc3(ce0a100, 97cb780, ea293d6c, 1)
libzpool.so.1`zap_lockdir+0x1a0(8aaa3c0, 1, 0, 97cb780, 1, 1)
libzpool.so.1`zap_remove_norm+0x30(8aaa3c0, 1, 0, 8728b10, 0, 97cb780)
libzpool.so.1`zap_remove+0x29(8aaa3c0, 1, 0, 8728b10, 97cb780, a)
ztest_replay_remove+0x225(ea294588, 8728ae8, 0, 38010000, 0, 0)
ztest_remove+0x9f(ea294588, ea293f50, 4, 3)
ztest_object_init+0x78(ea294588, ea293f50, 4e0, 1)
ztest_dmu_object_alloc_free+0x71(ea294588, 13)
ztest_dmu_objset_create_destroy+0x224(80cef08, 13, 0, 805d36c, 9017ad44, 0)
ztest_execute+0x89(a, 807c720, 13, 0)
ztest_thread+0xea(13, 0, 0, 0)
libc.so.1`_thrp_setup+0x88(f0983240)
libc.so.1`_lwp_start(f0983240, 0, 0, 0, 0, 0)
Looking into it a bit, we see that this is an embedded blockpointer, so
BP_GET_NDVAS should have returned 0:
b32b240::blkptr
EMBEDDED [L0 ZAP_OTHER] et=0 LZ4 size=200L/4aP birth=80L
Instead, it looks like another thread is modifying this blockpointer:
b32b240::ugrep | ::whatis
f47a0e0c is in [ stack tid=0x19f ]
ebd6ec40 is in [ stack tid=0x226 ]
ea293bd0 is in [ stack tid=0x244 ]
ea293be4 is in [ stack tid=0x244 ]
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
7072 zfs fails to expand if lun added when os is in shutdown state
illumos/illumos-gate@c39a2aae1ec39a2aae1ehttps://www.illumos.org/issues/7072
upstream:
38733 zfs fails to expand if lun added when os is in shutdown state
DLPX-36910 spares and caches should not display expandable space
DLPX-39262 vdev_disk_open spam zfs_dbgmsg buffer
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: George Wilson <george.wilson@delphix.com>
illumos/illumos-gate@dcbf3bd6a1dcbf3bd6a1https://www.illumos.org/issues/6950
When reading compressed data from disk, the ARC should keep the compressed
block cached and only decompress it when consumers access the block. The
uncompressed data should be short-lived allowing the ARC to cache a much larger
amount of data. The DMU would also maintain a smaller cache of uncompressed
blocks to minimize the impact of decompressing frequently accessed blocks.
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: George Wilson <george.wilson@delphix.com>
7136 ESC_VDEV_REMOVE_AUX ought to always include vdev information
7115 6922 generates ESC_ZFS_VDEV_REMOVE_AUX a bit too often
illumos/illumos-gate@b72b6bb10ahttps://github.com/illumos/illumos-gate/commit/b72b6bb10ad55121a1b352c6f68ebdc8e
20c9086
https://www.illumos.org/issues/7136
6922 added ESC_ZFS_VDEV_REMOVE_AUX and ESC_ZFS_VDEV_REMOVE_DEV sysevents
whenever an aux device gets removed from a pool. However, those sysevents will
be created without the vdev_guid and vdev_path fields. It would be better to
always populate those fields.
https://www.illumos.org/issues/7115
The addition of spa_event_notify in vdev removal code (see #6922) causes event
s
to be generated even if the spare failed to be removed with EBUSY.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Alan Somers <asomers@gmail.com>
illumos/illumos-gate@4b5c8e93cahttps://github.com/illumos/illumos-gate/commit/4b5c8e93cab28d3c65ba9d407fd8f46e3
be1db1c
https://www.illumos.org/issues/7104
The current default indirect block size is 16KB. We can improve
performance by increasing it to 128KB. This is especially helpful for
any workload that needs to read most of the metadata, e.g.
scrub/resilver, file deletion, filesystem deletion, and zfs send.
We also need to fix a few space estimation errors to make the tests
pass.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@759e89be35https://github.com/illumos/illumos-gate/commit/759e89be359f2af635e4122d147df56bc
e948773
https://www.illumos.org/issues/6447
I got a patch from someone who uses nvpair code outside of illumos. It fixes a
couple of gcc warnings/bugs for him.
1. silence uninitialized use warnings
2. add parentheses around assignment used as truth value
3. fix printf format specifier (ll is for integers only)
4. strstr, strspn, strcspn, and strcmp are declared in string.h, not
strings.h.
5. avoid scanning integer into boolean variable
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed by: Andy Stormont <astormont@racktopsystems.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Steve Dougherty <sdougherty@barracuda.com>
illumos/illumos-gate@9adfa60d48https://github.com/illumos/illumos-gate/commit/9adfa60d484ce2435f5af77cc99dcd4e6
92b6660
https://www.illumos.org/issues/6314
Callers of dsl_dataset_name pass a buffer of size ZFS_MAXNAMELEN, but
dsl_dataset_name copies the datasets' name PLUS the snapshot name to it,
resulting in a max of 2 * ZFS_MAXNAMELEN + '@'.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
Note that the bulk of the upstream change is not applicable to FreeBSD
and the affected files are not even in the vendor area.
illumos/illumos-gate@45b174751545b1747515https://www.illumos.org/issues/7019
Currently zfsdev_ioctl, when confronted by a request with the FKIOCTL flag set,
skips all processing of secpolicy functions. This means that ZFS is not doing
any kind of verification of the credentials or access rights of the caller and
assuming that (as it is an in-kernel client) all such checks have already been
done.
This turns out to be quite a dangerous assumption, especially with respect to
sdev. In general I don't think it's particularly reasonable to offload this
enforcement of access rights onto other kernel subsystems when ZFS has some
particular local semantics in this area (delegated datasets etc) and does not
provide any kind of API to allow other subsystems to avoid code duplication
when doing it. ZFS should apply its normal access policy to requests from
within the kernel, and callers should take care to give it the correct
credentials and call it from the correct context in order to get the results
they need.
You can observe the currently unfortunate consequences of this bug in any non-
global zone that has access to /dev/zvol or any subset of it via sdev profiles.
In particular, a zone used to contain a KVM or similar which has a single zvol
passed through to it using a <device match= block in its zone XML.
Even though sdev makes something of an attempt to control for whether the
caller should have access to nodes in /dev/zvol, it doesn't do this correctly,
or really at all in the lookup call path. So, if we have a zone that's been
given access to any part of /dev/zvol, it can simply look up the full path to
any other zvol on the entire system, and the node will appear and be able to be
used.
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Alex Wilson <alex.wilson@joyent.com>
illumos/illumos-gate@63364b0ee2https://github.com/illumos/illumos-gate/commit/63364b0ee2604783e7a55f84258888677
68eafa4
https://www.illumos.org/issues/6922
ZFS does not do a config_sync after removing an aux (spare, log, or cache)
device. AFAICT this isn't being done because it is slow and was deemed
unnecessary. However, it should be such a rare operation that speed doesn't
matter, and not doing it results in two problems:
1) It is theoretically possible to remove an aux device from one pool and
attach it to another, then lose power. When power is restored, both pools woul
d
think that they own the aux device.
2) Removal of the aux device doesn't send any useful sysevents to userland.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Alan Somers <asomers@gmail.com>