d4a72f2386
During scans (scrubs or resilvers), it sorts the blocks in each transaction
group by block offset; the result can be a significant improvement. (On my
test system just now, which I put some effort to introduce fragmentation into
the pool since I set it up yesterday, a scrub went from 1h2m to 33.5m with the
changes.) I've seen similar rations on production systems.
Approved by: Alexander Motin
Obtained from: ZFS On Linux
Relnotes: Yes (improved scrub performance, with tunables)
Differential Revision: https://reviews.freebsd.org/D15562
When we're at our vnode limit, getnewvnode will call into the vnode LRU
cache to free up vnodes. If the vnode we try to recycle is a ZFS vnode we
end up, eventually, in zfs_rmnode. If the ZFS vnode we're recycling
represents something with extended attributes, zfs_rmnode will call
zfs_zget which will attempt to allocate another vnode. If the next vnode we
try to recycle is also a ZFS vnode representing something with extended
attributes we can recurse further. This ends up being unbounded and can end
up overflowing the stack.
In order to avoid this, restructure zfs_rmnode to simply add the extended
attribute directory's object ID to the unlinked set, thus not requiring the
allocation of a vnode. We then schedule a task that calls zfs_unlinked_drain
which will do the work of properly marking the vnodes for unlinking.
zfs_unlinked_drain is also called on mount so these will be cleaned up
there.
Reviewed by: avg, mav
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D15342
Even though there do appear to be more artificial frames, with 12, stack
traces no longer list at all. Revert until a better, more stable value can
be determined.
The 64-bit atomics defined for i386 are currently only available in
the kernel space.
Found by: cy@
MFC after: 1 week
Sponsored by: Mellanox Technologies
While at it add missing _acq_ and _rel_ variants for 64-bit atomic
operations under i386.
Reviewed by: kib @
MFC after: 1 week
Sponsored by: Mellanox Technologies
It turns out that sendfile_swapin() has an optimization where it may
insert pointers to bogus_page into the page array that it passes to
VOP_GETPAGES. That happens to work with buffer cache, because it
extensively uses bogus_page internally, so it has the necessary checks.
However, ZFS did not expect bogus_page as VOP_GETPAGES(9) does not
document such a (ab)use of bogus_page.
So, this commit adds checks and handling of bogus_page.
I expect that use of bogus_page with VOP_GETPAGES will get documented
sooner rather than later.
Reported by: Andrew Reilly <areilly@bigpond.net.au>, delphij
Tested by: Andrew Reilly <areilly@bigpond.net.au>
Requested by: many
MFC after: 1 week
Creating a pool with a temporary name fails when we also specify custom
dataset properties: this is because we mistakenly call
zfs_set_prop_nvlist() on the "real" pool name which, as expected,
cannot be found because the SPA is present in the namespace with the
temporary name.
Fix this by specifying the correct pool name when setting the dataset
properties.
Author: loli10K <ezomori.nozomu@gmail.com>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Obtained from: ZFS on Linux, zfsonlinux/zfs@4ceb8dd6fd
MFC after: 1 week
unwind_frame() may be instrumented by FBT, leading to recursion into
dtrace_probe(). Manually inline unwind_frame() as we do with stack
unwinding code for other architectures.
Submitted by: Domagoj Stolfa
Reviewed by: manu
MFC after: 1 week
Sponsored by: DARPA / AFRL
Differential Revision: https://reviews.freebsd.org/D15359
- Add macros to allow preinitialization of cap_rights_t.
- Convert most commonly used code paths to use preinitialized cap_rights_t.
A 3.6% speedup in fstat was measured with this change.
Reported by: mjg
Reviewed by: oshogbo
Approved by: sbruno
MFC after: 1 month
by doing most of the work in a new function prison_add_vfs in kern_jail.c
Now a jail-enabled filesystem need only mark itself with VFCF_JAIL, and
the rest is taken care of. This includes adding a jail parameter like
allow.mount.foofs, and a sysctl like security.jail.mount_foofs_allowed.
Both of these used to be a static list of known filesystems, with
predefined permission bits.
Reviewed by: kib
Differential Revision: D14681
While here predict false.
Note the code is wrong (regardless of this change). Dereference of the
pointer can race with module unload. A fix would set the probe to a
nop stub instead of NULL.
Each malloc/free was testing dtrace_malloc_enabled and forcing
extra reads from the malloc type struct to see if perhaps a
dtmalloc probe was on.
Treat it like lockstat and sdt: have a global bolean.
When the compressed ARC feature was added in commit d3c2ae1
the method of reference counting in the ARC was modified. As
part of this accounting change the arc_buf_add_ref() function
was removed entirely.
This would have be fine but the arc_buf_add_ref() function
served a second undocumented purpose of updating the ARC access
information when taking a hold on a dbuf. Without this logic
in place a cached dbuf would not migrate its associated
arc_buf_hdr_t to the MFU list. This would negatively impact
the ARC hit rate, particularly on systems with a small ARC.
This change reinstates the missing call to arc_access() from
dbuf_hold() by implementing a new arc_buf_access() function.
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Tim Chase <tim@chase2k.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The change adds -t <name> option to zpool create and -t option to zpool
import in its form with an old name and a new name. This allows to
import (or create) a pool under a name that's different from its real,
permanent name without affecting that name. This is useful when working
with VM images or images of other physical systems if they happen to
have a ZFS pool with the same name as the host system.
The changes come from ZoL with some small tweaks.
The porting has been done by julian.
The change is being submitted to OpenZFS:
https://github.com/openzfs/openzfs/pull/600
Submitted by: julian
Reviewed by: smh
MFC after: 2 weeks
Sponsored by: Panzura (porting)
Differential Revision: https://reviews.freebsd.org/D14972
Page daemon output is now regulated by a PID controller with a setpoint
of v_free_target. Moreover, the page daemon now wakes up regularly
rather than waiting for a wakeup from another thread. This means that
the free page count is unlikely to drop below the old
zfs_arc_free_target value, and as a result the ARC was not readily
freeing pages under memory pressure. Address the immediate problem by
updating zfs_arc_free_target to match the page daemon's new behaviour.
Reported and tested by: truckman
Discussed with: jeff
X-MFC with: r329882
Differential Revision: https://reviews.freebsd.org/D14994
This helps catch cases where an instrumented function is called while
in probe context.
Submitted by: Domagoj Stolfa <domagoj.stolfa@gmail.com>
MFC after: 2 weeks
Sponsored by: DARPA/AFRL
Differential Revision: https://reviews.freebsd.org/D14863
Device removal code does not set spa_indirect_vdevs_loaded for pools
that never experienced device removal. At least one visual consequence
of it is completely blocked speculative prefetcher. This patch sets
the variable in such situations.
9280 Assertion failure while running removal_with_ganging test with 4K devices
illumos/illumos-gate@243952c7ee
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Matt Ahrens <Matt.Ahrens@delphix.com>
9188 increase size of dbuf cache to reduce indirect block decompression
illumos/illumos-gate@268bbb2a2f
With compressed ARC (6950) we use up to 25% of our CPU to decompress indirect
blocks, under a workload of random cached reads. To reduce this decompression
cost, we would like to increase the size of the dbuf cache so that more
indirect blocks can be stored uncompressed.
If we are caching entire large files of recordsize=8K, the indirect blocks
use 1/64th as much memory as the data blocks (assuming they have the same
compression ratio). We suggest making the dbuf cache be 1/32nd of all memory,
so that in this scenario we should be able to keep all the indirect blocks
decompressed in the dbuf cache. (We want it to be more than the 1/64th that
the indirect blocks would use because we need to cache other stuff in the
dbuf cache as well.)
In real world workloads, this won't help as dramatically as the example
above, but we think it's still worth it because the risk of decreasing
performance is low. The potential negative performance impact is that we
will be slightly reducing the size of the ARC (by ~3%).
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Allan Jude <allanjude@freebsd.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: George Wilson <george.wilson@delphix.com>
9321 arc_loan_compressed_buf() can increment arc_loaned_bytes by the wrong value
illumos/illumos-gate@9be12bd737
arc_loan_compressed_buf() increments arc_loaned_bytes by psize unconditionally
In the case of zfs_compressed_arc_enabled=0, when the buf is returned via
arc_return_buf(), if ARC_BUF_COMPRESSED(buf) is false, then arc_loaned_bytes
is decremented by lsize, not psize.
Switch to using arc_buf_size(buf), instead of psize, which will return
psize or lsize, depending on the result of ARC_BUF_COMPRESSED(buf).
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Allan Jude <allanjude@freebsd.org>
9235 rename zpool_rewind_policy_t to zpool_load_policy_t
illumos/illumos-gate@5dafeea3eb
We want to be able to pass various settings during import/open of a pool,
which are not only related to rewind. Instead of adding a new policy and
duplicate a bunch of code, we should just rename rewind_policy to a more
generic term like load_policy.
For instance, we'd like to set spa->spa_import_flags from the nvlist,
rather from a flags parameter passed to spa_import as in some cases we want
those flags not only for the import case, but also for the open case. One
such flag could be ZFS_IMPORT_MISSING_LOG (as used in zdb) which would
allow zfs to open a pool when logs are missing.
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
9191 dump vdev tree to zfs_dbgmsg when spa load fails due to missing log devices
illumos/illumos-gate@ccef24b493
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
9187 racing condition between vdev label and spa_last_synced_txg in vdev_validate
illumos/illumos-gate@d1de72cfa2
ztest failed with uncorrectable IO error despite having the fix for #7163.
Both sides of the mirror have CANT_OPEN_BAD_LABEL, which also distinguishes
it from that issue.
Definitely seems like a racing condition between the vdev_validate and spa_sync:
1. Thread A (spa_sync): vdev label is updated to latest txg
2. Thread B (vdev_validate): vdev label's txg is compared to spa_last_synced_txg and is ahead.
3. Thread A (spa_sync): spa_last_synced_txg is updated to latest txg.
Solution: do not check txg in vdev_validate unless config lock is held.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
illumos/illumos-gate@8671400134
The idea of Storage Pool Checkpoint (aka zpool checkpoint) deals with
exactly that. It can be thought of as a “pool-wide snapshot” (or a
variation of extreme rewind that doesn’t corrupt your data). It remembers
the entire state of the pool at the point that it was taken and the user
can revert back to it later or discard it. Its generic use case is an
administrator that is about to perform a set of destructive actions to ZFS
as part of a critical procedure. She takes a checkpoint of the pool before
performing the actions, then rewinds back to it if one of them fails or puts
the pool into an unexpected state. Otherwise, she discards it. With the
assumption that no one else is making modifications to ZFS, she basically
wraps all these actions into a “high-level transaction”.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
It's not sufficient nor required to use the vnode interlock when
checking if we are going to drop the last use count as the code in
vputx() uses refcount (atomic) operations for both checking and
decrementing the use code. Apply the same method to vn_rele_async().
While here, remove vn_rele_inactive(), a wrapper around vrele() that
didn't add any value.
Also, the change required making vfs_refcount_release_if_not_last()
public. I've made vfs_refcount_acquire_if_not_zero() public as well.
They are in sys/refcount.h now. While making the move I've dropped the
vfs_ prefix.
Reviewed by: mjg
MFC after: 2 weeks
Sponsored by: Panzura
Differential Revision: https://reviews.freebsd.org/D14869
These have been supplanted by the MI signal information codes in
<sys/signal.h> since 7.0. The FPE_*_TRAP ones were deprecated even
earlier in 1999.
PR: 226579 (exp-run)
Reviewed by: kib
Differential Revision: https://reviews.freebsd.org/D14637
vdev_dbgmsg_print_tree printed vdev_id of uint64_t type with %u format
specifier. That caused subsequent parameters to be incorrectly read
from the stack and lead to a crash when a wrong value was interpreted as
a string pointer.
This should be upstreamed.
Reported by: pho
MFC after: 3 days
illumos/illumos-gate@edc8ef7d92
Reviewed by: C Fraire <cfraire@me.com>
Reviewed by: Andy Fiddaman <omnios@citrus-it.co.uk>
Approved by: Joshua M. Clulow <josh@sysmgr.org>
Author: Toomas Soome <tsoome@me.com>
In pursuit of improving performance on multi-core systems, we should
implements fanned out counters and use them to improve the performance of
some of the arc statistics. These stats are updated extremely frequently,
and can consume a significant amount of CPU time.
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Paul Dagnelie <pcd@delphix.com>
assym is only to be included by other .s files, and should never
actually be assembled by itself.
Reviewed by: imp, bdrewery (earlier)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D14180
illumos/illumos-gate@5f5913bb835f5913bb83https://www.illumos.org/issues/9164
This issue has been reported by Alan Somers as
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=225877
dmu_objset_refresh_ownership() first disowns a dataset (and releases
it) and then owns it again. There is an assert that the new dataset
object is the same as the old dataset object. When running ZFS Test
Suite on FreeBSD we see this panic from zpool_upgrade_007_pos test:
panic: solaris assert: newds == os->os_dsl_dataset (0xfffff80045f4c000
== 0xfffff80021ab4800)
I see that the old dataset has dsl_dataset_evict_async() pending in
ds_dbu.dbu_tqent and its ds_dbuf is NULL.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Don Brady <don.brady@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Andriy Gapon <avg@FreeBSD.org>
PR: 225877
Reported by: asomers
MFC after: 1 week
Update the ZFS TRIM code to ensure it respects VTOC8 partition headers as
documented by the ZFS On-Disk Specification section 1.3
Before this a zpool create on a VTOC8 partitioned device would overwrite the
partition metadata.
Reported by: marius
Reviewed by: marius agv
MFC after: 1 week
Sponsored by: Multiplay
illumos/illumos-gate@e9bacc6d1ae9bacc6d1ahttps://www.illumos.org/issues/8984
Consider a directory configured as:
drwx-ws---+ 2 henson cpp 3 Jan 23 12:35 dropbox/
user:henson:rwxpdDaARWcC--:f-i----:allow
owner@:--------------:f-i----:allow
group@:--------------:f-i----:allow
everyone@:--------------:f-i----:allow
owner@:rwxpdDaARWcC--:-di----:allow
group:cpp:-wx-----------:-------:allow
owner@:rwxpdDaARWcC--:-------:allow
A new file created in this directory ends up looking like:
rw-r--r-+ 1 astudent cpp 0 Jan 23 12:39 testfile
user:henson:rw-pdDaARWcC--:------I:allow
owner@:--------------:------I:allow
group@:--------------:------I:allow
everyone@:--------------:------I:allow
owner@:rw-p--aARWcCos:-------:allow
group@:r-----a-R-c--s:-------:allow
everyone@:r-----a-R-c--s:-------:allow
with extraneous group@ and everyone@ entries allowing read access that
shouldn't exist.
Per Albert Lee on the zfs mailing list:
"aclinherit=passthrough/passthrough-x should still
ignore the requested mode when an inheritable ACE for owner@ group@,
or everyone@ is present in the parent directory.
It appears there was an oversight in my fix for
https://www.illumos.org/issues/6764 which made calling zfs_acl_chmod
from zfs_acl_inherit unconditional. I think the parent ACL check for
aclinherit=passthrough needs to be reintroduced in zfs_acl_inherit."
We have a large number of faculty who use dropbox directories like the example
to have students submit projects. All of these directories are now allowing
Reviewed by: Sam Zaydel <szaydel@racktopsystems.com>
Reviewed by: Paul B. Henson <henson@acm.org>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Dominik Hassler <hadfl@omniosce.org>
PR: 216886
MFC after: 2 weeks
Those operations, zfsctl_snapdir_readdir and zfsctl_snapdir_getattr,
access the filesystem's objset and it can be unstable during operations
like receive and rollback.
MFC after: 2 weeks
It gets called by dmu_buf_init_user, which is inline but not static. So it
needs global linkage itself.
Reported by: GCC-6
MFC after: 17 days
X-MFC-With: 329722
This change is designed to account for yet another difference between
illumos and FreeBSD VFS. In FreeBSD a filesystem driver is supposed to
clean up mnt_data in its VFS_UNMOUNT method because it's the last call
into the driver before a struct mount object is destroyed. The VFS
drains all references to the object before destroying it, but for the
driver it's already as good as gone.
In contrast, illumos VFS provides another method, VFS_FREEVFS, that is
called when all references are drained. So, the driver can keep its
data after VFS_UNMOUNT and clean it up in VFS_FREEVFS after all
references are gone. This is what ZFS does on illumos.
So there a reference to a filesystem is sufficient to guarantee that the
ZFS specific data, aka zfsvfs_t, stays around (even if the filesystem
gets unmounted). In FreeBSD we need to vfs_busy the filesystem to get
the same guarantee. vfs_ref guarantees only that the struct mount is
kept.
The following rules should be observed in getzfsvfs / getzfsvfs_impl on
FreeBSD:
- if we need access to zfsvfs_t then we must use vfs_busy
- if only we need to access struct mount (aka vfs_t), then vfs_ref is
enough
- when illumos code actually needs only the vfs_t, they still can pass
the zfsvfs_t and get the vfs_t from it; that can work in FreeBSD if
the filesystem is busied, but when it's just referenced then we have
to pass the vfs_t explicitly
- we cannot call vfs_busy while holding a dataset because that creates a
LOR with dp_config_rwlock
As a result:
- getzfsvfs_impl now only references the filesystem, same as in illumos,
but unlike illumos it has to return the vfs_t
- the consumers are updated to account for the change
- getzfsvfs busies the filesystem (and drops the reference from
getzfsvfs_impl)
Also, zfs_unmount_snap() now gets a busied a filesystem, references it
and then unbusies it essentially reverting actions done in getzfsvfs.
This is needed because the code may perform some checks that require the
zfsvfs_t. So, those are done before the unbusying.
MFC after: 2 weeks
vrele() acquires the vnode lock only if the hold count drops to zero.
In other scenarios it needs only the interlock. So,
zfsctl_snapdir_lookup() can race with vfs_mount_destroy() -> vrele()
such that the lookup adds a new reference and then vrele() drops the
mountpoint's reference and only then we check the reference count.
It would be just one in this case.
In fact, the assert should have been removed in r323483 when the code
learned how to deal with the uncovered vnode.
PR: 225795
MFC after: 4 days
X-MFC with: r329556