- Add
nvlist_{add,get,take,move,exists,free}_{number,bool,string,nvlist,
descriptor} functions.
- Add support for (un)packing arrays.
- Add the nvl_array_next field to the nvlist structure.
If an array is added by the nvlist_{move,add}_nvlist_array function
this field will contains next element in the array.
- Add the nitems field to the nvpair and nvpair_header structure.
This field contains number of elements in the array.
- Add special flag (NV_FLAG_IN_ARRAY) which is set if nvlist is a part of
an array.
- Add special type (NV_TYPE_NVLIST_ARRAY_NEXT).This type is used only
on packing/unpacking.
- Add new API for traversing arrays (nvlist_get_array_next).
- Add the nvlist_get_pararr function which combines the
nvlist_get_array_next and nvlist_get_parent functions. If nvlist is in
the array it will return next element from array. If nvlist is last
element in array or it isn't in array it will return his
container (parent). This function should simplify traveling over nvlist.
- Add tests for new features.
- Add documentation for new functions.
- Add my copyright.
- Regenerate the sys/cddl/compat/opensolaris/sys/nvpair.h file.
PR: 191083
Reviewed by: allanjude (doc)
Approved by: pjd (mentor)
Reviewed by: Jason King <jason.brian.king@gmail.com>
Reviewed by: Josef Sipek <jeffpc@josefsipek.net>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Bart Coddens <bart.coddens@gmail.com>
illumos/illumos-gate@fc98fea58e
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Prakash Surya <prakash.surya@delphix.com>
illumos/illumos-gate@2fd872a734
As a way to make it more difficult to introduce bugs into the ARC, and to
make it easier to diagnose issues when bugs do creep in, it would be
beneficial to change the type of the arc_state_t's arcs_size field to be
a refcount_t instead of a uint64_t. This would allow us to make stricter
checks when incrementing and decrementing the value with debugging enabled,
but still fallback to simple, fast atomic operations when debugging is
disabled.
when adjusting MFU size.
illumos/illumos-gate@31c46cf23chttps://www.illumos.org/issues/6033
When we're looking for the list containing oldest buffer we never
actually look at the MFU lists even when we try to evict from MFU.
looks like a copy paste error, the fix is here:
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Xin Li <delphij@delphij.net>
Reviewed by: Prakash Surya <me@prakashsurya.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Alek Pinchuk <alek@nexenta.com>
Obtained from: illumos
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Prakash Surya <prakash.surya@delphix.com>
illumos/illumos-gate@244781f10d
This patch attempts to reduce lock contention on the current arc_state_t
mutexes. These mutexes are used liberally to protect the number of LRU
lists within the ARC (e.g. ARC_mru, ARC_mfu, etc). The granularity at
which these locks are acquired has been shown to greatly affect the
performance of highly concurrent, cached workloads.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Gordon Ross <gordon.w.ross@gmail.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
illumos/illumos-gate@8f5190a540
6093 zfsctl_shares_lookup should only VN_RELE() on zfs_zget() success
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Dan McDonald <danmcd@omniti.com>
illumos/illumos-gate@0f92170f1e
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@ca0cc3918a
A ZFS feature flags (large blocks) tracks its refcounts as the number of
datasets that have ever used the feature. Several features of this type
are planned to be added (new checksum functions). This code should be made
common infrastructure rather than duplicating the code for each feature.
5925 zfs receive -o origin=
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Author: Paul Dagnelie <pcd@delphix.com>
While running 'zfs recv' we noticed that every 128th 8K block required a
read. We were seeing that restore_write() was calling dmu_tx_hold_write()
and the indirect block was not cached. We should prefetch upcoming indirect
blocks to avoid having to go to disk and blocking the restore_write().
Allow an incremental send stream to be received as a clone, even if the
stream does not mark it as a clone.
illumos/illumos-gate@1d3f896f54https://www.illumos.org/issues/5981
When dmu_objset_find_dp gets called with a read lock held, it fans out
the work to the task queue. Each task in turn acquires its own read
lock before calling the callback. If during this process anyone tries
to a acquire a write lock, it will stall all read lock requests.Thus
the tasks will never finish, the read lock of the caller will never
get freed and the write lock never acquired. deadlock.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Arne Jansen <jansen@webgods.de>
illumos/illumos-gate@12380e1e70https://www.illumos.org/issues/5269
When importing a pool (at boot or with zpool import) with many
filesystem, the process can take minutes. It doesn't matter whether
the pool has been exported cleanly or uncleanly. The problem is that
each dataset has its own log chain. On import, all datasets have to be
checked if there are logs to replay. The idea is to speed up this
process by paralellizing it.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Arne Jansen <jansen@webgods.de>
lzc_send_space when source is a bookmark
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Albert Lee <trisk@nexenta.com>
Author: Max Grossman <max.grossman@delphix.com>
illumos/illumos-gate@643da460c8
illumos/illumos-gate@70163ac57ehttps://www.illumos.org/issues/5695
In dmu_sync_ready(), a hole block pointer will have it's logical size
explicitly set as it's necessary for replay purposes. To "undo" this,
dmu_sync_done() will zero out any hole that it finds. This becomes a
problem when using the "hole_birth" feature, as this will also wipe out
any birth time that might have happened to be set on the hole.
...
As a fix, the logic to zero out a hole is only applied to old style
holes with a birth time of zero. Holes created with the "hole_birth"
feature enabled will have a non-zero birth time, and will be skipped
(thus preserving the ltime, type, and level information as well).
In addition, zdb was updated to also print the ltime, type, and level
information for these new style holes. Previously, only the logical
birth time would be printed.
Author: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
5376 arc_kmem_reap_now() should not result in clearing arc_no_grow
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@2ec99e3e98
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@8df173054c
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@732885fca0
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@23367a2f2c
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Matthew Ahrens <mahrens@delphix.com>
illumod/illumos-gate@34e8acef00
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Albert Lee <trisk@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@98110f08fa
Reviewed by: Will Andrews <willa@spectralogic.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Justin Gibbs <justing@spectralogic.com>
illumos/illumos-gate@bc9014e6a8
arc_state_t stats and differentiate between "data" and "metadata"
Reviewed by: Basil Crow <basil.crow@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Bayard Bell <bayard.bell@nexenta.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>
illumos/illumos-gate@4076b1bf41
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Don Brady <dev.fs.zfs@gmail.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Chris Williamson <Chris.Williamson@delphix.com>
illumos/illumos-gate@89c86e3229
Currently, every buffer cached in the L2ARC is accompanied by a 240-byte
header in memory, leading to very high memory consumption when using very
large cache devices. These changes significantly reduce this overhead.
Currently:
L1-only header = 176 bytes
L1 + L2 or L2-only header = 176 bytes + 32 byte checksum + 32 byte l2hdr
= 240 bytes
Memory-optimized:
L1-only header = 176 bytes
L1 + L2 header = 176 bytes + 32 byte checksum = 208 bytes
L2-only header = 96 bytes + 32 byte checksum = 128 bytes
So overall:
Trunk Optimized
+-----------------+
L1-only | 176 B | 176 B | (same)
+-----------------+
L1 & L2 | 240 B | 208 B | (saved 32 bytes)
+-----------------+
L2-only | 240 B | 128 B | (saved 116 bytes)
+-----------------+
For an average blocksize of 8KB, this means that for the L2ARC, the ratio
of metadata to data has gone down from about 2.92% to 1.56%. For a
'storage optimized' EC2 instance with 1600GB of SSD and 60GB of RAM, this
means that we expect a completely full L2ARC to use (1600 GB * 0.0156) /
60GB = 41% of the available memory, down from 78%.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: George Wilson <george.wilson@delphix.com>
illumos/illumos-gate@34d7ce052c
5693 ztest fails in dbuf_verify: buf[i] == 0, due to dedup and bp_override
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@7f7ace3700
5661 ZFS: "compression = on" should use lz4 if feature is enabled
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed by: Xin LI <delphij@freebsd.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Justin T. Gibbs <justing@spectralogic.com>
illumos/illumos-gate@db1741f555
5630 stale bonus buffer in recycled dnode_t leads to data corruption
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Justin T. Gibbs <justing@spectralogic.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Robert Mustacchi <rm@joyent.com>
illumos/illumos-gate@9d47dec048
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Justin T. Gibbs <justing@spectralogic.com>
illumos/illumos-gate@e57a022b8f
The check introduced by r285946 failed to add the dependency on
opt_kstack_pages.h which meant the default value for the platform instead
of the customised options KSTACK_PAGES=X was being tested.
Also wrap in #ifdef __FreeBSD__ for portability.
MFC after: 3 days
Sponsored by: Multiplay
When a process is exiting, there is a narrow window where p_cred may be
NULL while its threads are still executing. Specifically, the last thread
to exit a process sets the process state to PRS_ZOMBIE with the proc
spinlock held and then calls thread_exit(). thread_exit() drops the spin
lock, permitting the process to be reaped and thus causing its cred struct
to be released. However, the exiting thread may still cause DTrace probes
to fire by calling sched_throw(), resulting in a double fault if such a
probe enabling attempts to access the GID or UID DIF variables.
The thread's cred reference is not susceptible to this race since it is not
released until after the thread has exited.
MFC after: 1 week
As ZFS requires a more kernel stack pages than is the default on some
architectures e.g. i386, warn if KSTACK_PAGES is less than
ZFS_MIN_KSTACK_PAGES (which is 4 at the time of writing).
MFC after: 3 days
Sponsored by: Multiplay
This change refactors the existing create_thread() function to be more
generic. It replaces almost all of its arguments by a callback that can
be used to extract the thread ID and copy it out to the right place, but
also to perform additional initialization steps, such as setting the
trapframe. This also makes the difference between thr_new() and
thr_create() more clear in my opinion.
This function is going to be used by the CloudABI compatibility layer.
It looks like the OpenSolaris compatibility framework already provides a
function called thread_create(). Rename this function to
do_thread_create() and use a macro to deal with the namespacing
conflict. A similar approach is already used for thread_exit().
MFC after: 1 month
in lockstat.ko. This means that lockstat probes now have typed arguments and
will utilize SDT probe hot-patching support when it arrives.
Reviewed by: gnn
Differential Revision: https://reviews.freebsd.org/D2993
enabled. The cost of a timecounter read can be quite significant, and the
problem became more apparent after r284297, since that change resulted in
a call to lockstat_nsecs() for each acquisition of an rwlock read lock.
PR: 201642
Reviewed by: avg
Tested by: Jason Unovitch
MFC after: 3 days
Differential Revision: https://reviews.freebsd.org/D3073
belongs to the kernel stack address range for the thread. Right now,
code checks that new frame is not farther then KSTACK_PAGES pages from
the current frame, which allows the address to point past the top of
the stack.
Reviewed by: andrew, emaste, markj
Differential revision: https://reviews.freebsd.org/D3108
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Transitions 0->1 and 1->0 (which decide e.g. on putting the vnode on the free
list) of either counter are still guarded with vnode interlock.
Reviewed by: kib (earlier version)
Tested by: pho
Previously several places were doing it on its own, partially
incorrectly (e.g. without the filedesc locked) or even actively harmful
by populating jdir or assigning rootvnode without vrefing it.
Reviewed by: kib
macros on amd64 and i386. Move the definition to machine/param.h.
kgdb defines INKERNEL() too, the conflict is resolved by renaming kgdb
version to PINKERNEL().
On i386, correct the lowest kernel address. After the shared page was
introduced, USRSTACK no longer points to the last user address + 1 [*]
Submitted by: Oliver Pinter [*]
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
We now take z_teardown_lock as a writer to ensure that there is no I/O
while the filesystem state is in a flux. Also, zfs_suspend_fs() ->
zfsvfs_teardown() call zfs_unregister_callbacks() and zfs_resume_fs() ->
zfsvfs_setup() call zfs_unregister_callbacks(). Previously there was no
synchronization between those calls and the calls in the re-mounting
case. That could lead to concurrent execution and a crash.
PR: 180060
Differential Revision: https://reviews.freebsd.org/D2865
Suggested by: mahrens
Reviewed by: delphij, pho, mahrens, will
MFC after: 13 days
Sponsored by: ClusterHQ
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@46e1baa6cfhttps://www.illumos.org/issues/5911
Sometimes ZFS appears to hang while deleting a file. It is actually
making slow progress at the file deletion, but other operations
(administrative and writes via the data path) "hang" until the file
removal completes, which can take a long time if the file has many
blocks. The deletion (or most of it) happens in a single txg, and the
sync thread spends most of its time reading indirect blocks via this
stack trace:
swtch+0x141()
cv_wait+0x70()
zio_wait+0x5b()
dbuf_read+0x2c0()
free_children+0x50()
free_children+0x12a()
free_children+0x12a()
free_children+0x12a()
dnode_sync_free_range_impl+0xdf()
dnode_sync_free_range+0x52()
range_tree_vacate+0x65()
dnode_sync+0x1d8()
dmu_objset_sync_dnodes+0x77()
dmu_objset_sync+0x19f()
dsl_dataset_sync+0x51()
dsl_pool_sync+0x9a()
spa_sync+0x2ff()
txg_sync_thread+0x21f()
thread_start+8()
One way to reproduce the problem is if we are over the arc_meta_limit,
e.g. because lots of indirect blocks are pinned because we have L0
dbufs under them. It could be that most of the L1 indirects are cached,
in which case when dmu_free_long_range_impl() calls dmu_tx_hold_free(),
it will complete very quickly. This allows dmu_free_long_range_impl() to
put many (perhaps all of its) transactions in the same TXG. However,
dmu_free_long_range_impl() calls dnode_evict_dbufs (and
dnode_free_range()), which removes the L0 dbufs, thus reducing the hold
count on the L1 indirect blocks above it, allowing them to be evicted.
Because we are over the arc_meta_limit(), these L1 blocks will be
evicted ASAP. Thus when we get to syncing context, the L1 indirects are
no longer cached and must be read in.
Obtained from: illumos
MFC after: 15 days
Do that only when when fast inline versions are available.
At the moment that can be the case only in the kernel and not for all
platforms.
The original code uses the binary search and that's kept as a fallback.
This is a micro optimization.
Differential Revision: https://reviews.freebsd.org/D2839
Reviewed by: delphij, mahrens, mav
MFC after: 17 days
o Provide an extensive set of assertions for input array of pages.
o Remove now duplicate assertions from different pagers.
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
The problem is that when inline versions of flsl and flsll are not
available, then libkern.h must be included for their declarations
in kernel sources.
The fix would be trivial, but I would like to figure out first if
it even makes sense to use the libkern provided implementations.
Reported by: bz
Pointyhat to: avg
b_size is a logical size of a buffer in memory, b_asize is its physical
size that accounts for possible compression.
Currently the latter is the best approximation for the allocated, on-disk
size.
L2ARC TRIM support was committed a few weeks before L2ARC compression
was imported, so originally the code was correct, because b_size was
the size.
Further thoughts. Given that the cache device is being overwritten
in a circular fashion it is not clear if a TRIM per each evicted L2ARC
buffer has any benefits.
Maybe it would be sufficient to issue a single trim request for the whole
device when it is loaded, e.g. after a bootup, or when it is unloaded, e.g.
before a shutdown. At least as long as L2ARC is not persistent across
reboots.
Discussed with: smh
MFC after: 19 says
This is a micro optimization.
The upstream code uses the binary search.
Differential Revision: https://reviews.freebsd.org/D2839
Reviewed by: delphij, mav
MFC after: 15 days
5946 zfs_ioc_space_snaps must check that firstsnap and lastsnap refer to snapshots
5945 zfs_ioc_send_space must ensure that fromsnap refers to a snapshot
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
illumos/illumos-gate@24218bebb4
Note that the upstream commit is modified during MFV: in the upstream
the check is done by inspecting ds_is_snapshot field while in FreeBSD
we call dsl_dataset_is_snapshot().
This is because illumos/illumos-gate@bc9014e6a8
(r277428 in vendor-sys/illumos) is not MFV-ed yet.
MFC after: 10 days
Currently there is no good reason to mangle the userland API.
The change was introduced in eac1d566b4,
r279437. Also see https://reviews.freebsd.org/D1881.
I am still convinced that nv should not have introduced intentionally
conflicting API.
Discussed with: rstone
X-MFC with: r279437
Sponsored by: ClusterHQ
Nothing stops a parallel unmount to suceed before the given call to
dounmount() checks and locks the covered vnode. Prevent dounmount()
from acting on the freed (although type-stable) memory by changing the
interface to require the mount point to be referenced. dounmount()
consumes the reference on return, regardless of the sucessfull or
erronous result.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
- this should fail early unless the force flag is set
- if the force flag is set then any local modifications including
snapshots should be undone
See: https://www.illumos.org/issues/5912
See: https://reviews.csiden.org/r/220/
Reviewed by: mahrens, Paul Dagnelie <pcd@delphix.com>
MFC after: 15 days
Sponsored by: ClusterHQ
years for head. However, it is continuously misused as the mpsafe argument
for callout_init(9). Deprecate the flag and clean up callout_init() calls
to make them more consistent.
Differential Revision: https://reviews.freebsd.org/D2613
Reviewed by: jhb
MFC after: 2 weeks
This allows to ensure that we do not write to a file that was opened
for reading only or vice versa.
Also, use the correct capability in in zfs_ioc_send_new().
Differential Revision: https://reviews.freebsd.org/D2382
Reviewed by: delphij
MFC after: 17 days
Sponsored by: ClusterHQ
At present zfs_domount() acquires a reference on the filesystem's root vnode
and that reference is kept until zfs_umount.
The latter calls vflush(rootrefs = 1) to dispose of the extra reference.
There is no explanation of why that reference is kept - what problem it
solves or what behavior it improves.
Also, that logic is FreeBSD specific.
There is one real problem with that reference, though.
zfs recv -F may receive a full, non-incremental stream to a mounted filesystem.
In that case the received root object is likely to have a different z_gen
attribute value. Because of that, zfs_rezget will leave the previous root znode
and vnode disassociated from the actual object (z_sa_hdl == NULL).
Thus, future calls to VFS_ROOT() -> zfs_root() will produce a new vnode-znode
pair, while the old one will be kept alive by the outstanding reference.
So, the outstanding reference will not actually be for the new root vnode
(or, more precisely, vnodes - because a root vnode may be recycled and a newer
one can be created).
As a result, when vflush(rootrefs = 1) s called there will be two problems:
- a leaked reference on the old root vnode preventing a graceful unmount
- insufficient references on the actual root vnode leading to a crash upon
access to the vnode after it is destroyed by vgone() + vdrop()
The second issue will actually override the first one.
Differential Revision: https://reviews.freebsd.org/D2353
Reviewed by: delphij, kib, smh
MFC after: 17 days
In traverse_visitbp(), the input argument dnp is modified in the middle
to point to a temporary buffer. Originally this doesn't matter, because
no user of TRAVERSE_POST dereferences it. However, in fbeddd6 a piece of
code is added dereferencing dnp after the modification, creating a possible
bug.
We fix this by creating a new local variable cdnp for the DMU_OT_DNODE case,
so we don't modify the input argument. Also we introduce different local
variables in the DMU_OT_OBJSET case to prevent confusion between the input
argument.
Obtained from: zfsonlinux (a585f2f844ed3d4270221fed88f5e494eb55d932)
MFC after: 2 weeks
Sponsored by: Multiplay
The old comment in zfs_rezget explains what situation the code handles,
the new comment also describes how the situation can arise.
Also, re-join a line that became sufficiently shorti some time ago.
Differential Revision: https://reviews.freebsd.org/D2352
Reviewed by: delphij, smh
MFC after: 12 days
/dev/zfs always has per-open data, so when it is missing the file
descriptor is for some other file. Returning ENOENT in this case
is confusing as a variety of other conditions (like a missing dataset)
may result in the same error. It's better to consistently return
EBADF for any problems with the file descriptor.
Note that zfs_onexit_fd_hold() is used with 'automatic cleanup fd'
- when that fd is closed, typically because a process is terminated,
some cleanup action is taken by ZFS driver. E.g. a temporary
snapshot hold is released.
Perhaps, it would even be worthwhile changing devfs_get_cdevpriv()
to return EBADF if there is no associated data.
Differential Revision: https://reviews.freebsd.org/D2370
Reviewed by: delphij, smh
MFC after: 12 days
skip 10, rather than 9, frames. This appears to work quite well in
practice on the BeagleBone Black, so remove a comment about the value
being bogus and replace it with a slightly less negative one. However,
the number of frames to skip is quite sensitive to details of the timer
and interrupt handling paths, so this is necessarily fragile -- but no
more so than on x86.
Sponsored by: DARPA, AFRL
It would previously call into some unfinished Solaris compatibility code and
return without actually calling panic(9). The compatibility code is
unneeded, however, so just remove it and have dtrace_panic() call vpanic(9)
directly.
Differential Revision: https://reviews.freebsd.org/D2349
Reviewed by: avg
MFC after: 2 weeks
Sponsored by: EMC / Isilon Storage Division
Passing "-x lazyload" to dtrace -G during compilation causes dtrace(1) to
not link drti.o into the output object file, so the USDT probes are not created
during process startup. Instead, dtrace(1) will automatically discover and
create probes on the process' behalf when attaching.
Differential Revision: https://reviews.freebsd.org/D2203
Reviewed by: rpaulo
MFC after: 1 month
Even on Illumos, with its much larger KVA, ZFS ARC steps back if KVA usage
reaches certain threshold (3/4 on i386 or 16/17 otherwise). FreeBSD has
even less KVA, but had no such limit on archs with direct map as amd64.
As result, on machines with a lot of RAM, during load with very small user-
space memory pressure, such as `zfs send`, it was possible to reach state,
when there is enough both physical RAM and KVA (I've seen up to 25-30%),
but no continuous KVA range to allocate even single 128KB I/O request.
Address this situation from two sides:
- restore KVA usage limitations in a way the most close to Illumos;
- introduce new requirement for KVA fragmentation, specifying that we
should have at least one sequential KVA range of zfs_max_recordsize bytes.
Experiments show that first limitation done alone is not sufficient. On
machine with 64GB of RAM it is sometimes needed to drop up to half of ARC
size to get at leats one 1MB KVA chunk. Statically limiting ARC to half
of KVA/RAM is too strict, so second limitation makes it to work in cycles:
accumulate trash up to certain critical mass, do massive spring-cleaning,
and then start littering again. :)
MFC after: 1 month
This adds an upper bound, dtrace_ustackdepth_max, to the number of frames
traversed when computing the userland stack depth. Some programs - notably
firefox - are otherwise able to trigger an infinite loop in
dtrace_getustack_common(), causing a panic.
MFC after: 1 week
traps do appear in the regular call stack, rather than only in a special
trap frame, so we don't need to inject the trap-frame $pc into a returned
stack trace in DTrace.
MFC after: 3 days
Sponsored by: DARPA, AFRL
skip using the DTrace 'profile' provider on ARM. This causes stack traces
to skip various driver-and callout-related things as they do on x86, where
the likewise arbitrary values are '6' (32-bit) and '10' (64-bit) for
similar sorts of reasons.
MFC after: 3 days
Sponsored by: DARPA, AFRL
If zvol_geom_start is called with a BIO_DELETE from a thread which can
sleep it queues it for later processing by the zvol_geom_worker. The
zvol_geom_worker didn't have a delete case so would simply loose the bio
hence preventing the original caller from every completing. In addition
an other unknown types would suffer the same fate.
Allow zvol_geom_worker to process BIO_DELETE's via zvol_strategy and
return unsupported for all unknown bio types.
MFC after: 2 weeks
Sponsored by: Multiplay
emulate the instructions used in function entry and exit.
For function entry ARM will use a push instruction to push up to 16
registers to the stack. While we don't expect all 16 to be used we need to
handle any combination the compiler may generate, even if it doesn't make
sense (e.g. pushing the program counter).
On function return we will either have a pop or branch instruction. The
former is similar to the push instruction, but with care to make sure we
update the stack pointer and program counter correctly in the cases they
are either in the list of registers or not. For branch we need to take the
24-bit offset, sign-extend it, and add that number of 4-byte words to the
program counter. Care needs to be taken as, due to historical reasons, the
address the branch is relative to is not the current instruction, but 8
bytes later.
This allows us to use the following probes on ARM boards:
dtrace -n 'fbt::malloc:entry { stack() }'
and
dtrace -n 'fbt:🆓return { stack() }'
Differential Revision: https://reviews.freebsd.org/D2007
Reviewed by: gnn, rpaulo
Sponsored by: ABT Systems Ltd
expected to return the data in the memory location pointed at by target
after the operation. The FreeBSD atomic functions previously used return
either 0 or 1 to indicate if the comparison succeeded or not respectively.
With this change these functions only support ARMv6 and later are supported
by these functions.
Sponsored by: ABT Systems Ltd
dtrace is able to display a stack trace similar to the one below.
# dtrace -p 603 -n 'tcp:kernel::receive { stack(); }'
0 70 :receive
kernel`ip_input+0x140
kernel`netisr_dispatch_src+0xb8
kernel`ether_demux+0x1c4
kernel`ether_nh_input+0x3a8
kernel`netisr_dispatch_src+0xb8
kernel`ether_input+0x60
kernel`cpsw_intr_rx+0xac
kernel`intr_event_execute_handlers+0x128
kernel`ithread_loop+0xb4
kernel`fork_exit+0x84
kernel`swi_exit
kernel`swi_exit
Tested by: gnn
Sponsored by: ABT Systems Ltd
cannot happen on FreeBSD. r278136 overlooked the fact that a destructor
registered with devfs_set_cdevpriv(9) is invoked even in the case of an
error.
X-MFC-With: r278136
This allows dtrace to monitor the calls to txg_quiesce which can be really
helpful.
Also standardise __noinline order for arc_kmem_reap_now.
Sponsored by: Multiplay
ZFS already commits outstanding data every zfs_txg_timeout seconds, so these
syncs are unnecessarily intrusive.
Submitted by: gibbs
Sponsored by: Spectra Logic
MFSpectraBSD: 1105759 on 2014/12/11
Since allow_mounted is a FreeBSD-specific change, default to B_TRUE, then
locally check for the magic bit. Unconditionally check allow_mounted below.
Convert the setting of allow_mounted to an explicit boolean.
MFC after: 1 week
Sponsored by: Spectra Logic
MFSpectraBSD: 672578 (in part) on 2013/07/19
This is primarily for developer/debugging use; it enables built-in tagged
tracking of refcounts inside ZFS. It can only be enabled from the loader,
since it modifies how in-core state is managed. Default remains disabled.
MFC after: 1 week
Sponsored by: Spectra Logic
Remove the unnecessary #ifdef _KERNEL, which did not differ in the true or
false cases. Actually set the value of to_free before using it.
MFC after: 1 week
Sponsored by: Spectra Logic
dsl_dir_transfer_space() is mostly called after dsl_dir_diduse_space(),
which already calls dmu_buf_will_dirty() for the same dbuf and tx, so
its duplicate call in those cases will change nothing, only spend time.
Skipping this call by four times reduces time spent in dbuf_write_done()
and descendants, updating dataset statistics with several congested lock
acquisitions. When rewriting 8K zvol blocks at 1GB/s rate, this reduces
CPU time spent inside dbuf_write_done(), according to profiling, from 45%
of 683K samples to 18% of 422K.
MFC after: 2 weeks
A number of entries that can be present in the spa config shouldn't be saved
to disk so add a method to ensure this is case. Without this if the last
caller to vdev_config_generate requested stats then we can end up in the
cache file.
Also only skip a none writable pool in the cache file generation if its
active. This prevents unavailable pools incorrectly getting removed from
cache file.
Tested by: delphij
MFC after: 2 weeks
Sponsored by: Multiplay
Since the upstream for cddl code is now illumos not sun, mechanically
convert all sun #ifdef's to illumos #ifdef's which have been used in all
newer code for some time.
Also do a manual pass to correct the use if #ifdef comments as per style(9)
as well as few uses of #if defined(__FreeBSD__) vs #ifndef illumos.
MFC after: 1 month
Sponsored by: Multiplay
Previous throttling implementation approached problem from the wrong side.
It significantly limited useful delaying of TRIM requests and aggregation
potential, while not so much controlled TRIM burstiness under heavy load.
With this change random 4K write benchmarks (probably the worst case for
TRIM) show me IOPS increase by 20%, average latency reduction by 30%, peak
TRIM bursts reduction by 3 times and same peak TRIM map size (memory usage).
Also the new logic does not force map size down so heavily, really allowing
to keep deleted data for 32 TXG or 30 seconds under moderate load. It was
practically impossible with old throttling logic, which pushed map down to
only 64 segments.
Reviewed by: smh
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
New segment at the list head may block all TRIM requests until txg of that
segment can be processed. On my random I/O tests this change reduce peak
TRIM list length from 650 to 450 segments. Hopefully it should reduce TRIM
burstiness when list processing is unblocked.
MFC after: 2 weeks
On FreeBSD gethrtime() implemented via getnanouptime(), that has 1ms (1/hz)
precision. It makes primary sort key (timestamp) collision very possible.
In such situations sorting by secondary key of LBA is much more reasonable
then by totally meaningless zio pointer value.
With this change on multi-threaded synchronous ZVOL read I've measured 10%
throughput increase and average latency reduction.
MFC after: 2 weeks
When pools are unavailable their vdevs are also unavailable which means
that vdev_max_asize remains at the default zero. This default was being
used to calculate vs_esize resulting in a negative number as vdev_asize >
vdev_max_asize, which caused zpool list -v to display 16.0E for EXPANDSZ
of these vdevs.
It's redundant at the moment since it can be obtained from the trapframe
on the architectures where DTrace is supported, but this won't be the case
with ARM.