illumos/illumos-gate@11ceac77ea
https://www.illumos.org/issues/6844
dnode_next_offset is used in a variety of places to iterate over the holes or
allocated blocks in a dnode. It operates under the premise that it can iterate
over the blockpointers of a dnode in open context while holding only the
dn_struct_rwlock as reader. Unfortunately, this premise does not hold.
When we create the zio for a dbuf, we pass in the actual block pointer in the
indirect block above that dbuf. When we later zero the bp in
zio_write_compress, we are directly modifying the bp. The state of the bp is
now inconsistent from the perspective of dnode_next_offset: the bp will appear
to be a hole until zio_dva_allocate finally finishes filling it in. In the
meantime, dnode_next_offset can detect a hole in the dnode when none exists.
I was able to experimentally demonstrate this behavior with the following
setup:
1. Create a file with 1 million dbufs.
2. Create a thread that randomly dirties L2 blocks by writing to the first L0
block under them.
3. Observe dnode_next_offset, waiting for it to skip over a hole in the middle
of a file.
4. Do dnode_next_offset in a loop until we skip over such a non-existent hole.
The fix is to ensure that it is valid to iterate over the indirect blocks in a
dnode while holding the dn_struct_rwlock by passing the zio a copy of the BP
and updating the actual BP in dbuf_write_ready while holding the lock.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Alex Reece <alex@delphix.com>
MFC after: 3 weeks
The change has been undone in r301275 on the assumption that it was no
longer required. But that was incorrect, because in this case (and only
in this case) the snapshot root vnode is looked up before z_parent is
fixed up.
MFC after: 5 days
Predicates are DIF objects whose return value is compared with zero to
determine whether the corresponding probe body is to be executed. The return
value itself is the contents of a 64-bit DIF register, but it was being
truncated to an int before the comparison. This meant that a predicate such
as /0x100000000/ would evaluate to false.
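The truncation can be illustrated with a small userland sketch. The function
names below are hypothetical and are not the DTrace kernel code; the sketch
only assumes the usual ABIs, where narrowing a 64-bit value to int keeps the
low 32 bits.

```c
#include <stdint.h>

/*
 * Buggy shape: the 64-bit register value is narrowed to int before the
 * comparison. On the usual ABIs this keeps only the low 32 bits.
 */
static int
predicate_truncated(uint64_t rval)
{
	int narrowed = (int)rval;

	return (narrowed != 0);
}

/* Fixed shape: the full 64-bit value is compared with zero. */
static int
predicate_full(uint64_t rval)
{
	return (rval != 0);
}
```

With rval = 0x100000000 the low 32 bits are all zero, so the truncated variant
reports false while the full-width variant reports true.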
Reported by: rwatson
MFC after: 3 days
Due to ARC initial configuration not being done and kmem information
not being available we need to blindly set zfs_arc_max and zfs_arc_min
when configured via the tunable.
This fixes vfs.zfs.arc_(min|max) configuration via loader.conf broken by
r302265.
Approved by: re(gjb)
MFC after: 1 week
mp_maxid or CPU_FOREACH() as appropriate. This fixes a number of places in
the kernel that assumed CPU IDs are dense in [0, mp_ncpus) and would try,
for example, to run tasks on CPUs that did not exist or to allocate too
few buffers on systems with sparse CPU IDs in which there are holes in the
range and mp_maxid > mp_ncpus. Such circumstances generally occur on
systems with SMT, but on which SMT is disabled. This patch restores system
operation at least on POWER8 systems configured in this way.
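A minimal userland sketch of the difference between the two iteration styles
follows. The struct and function names are hypothetical stand-ins (ncpus for
mp_ncpus, maxid for mp_maxid, the presence check for what CPU_FOREACH()
filters on); this is not the kernel code.

```c
#include <stdbool.h>

#define	SKETCH_MAXCPU	16	/* hypothetical upper bound for the sketch */

struct cpu_topology {
	bool present[SKETCH_MAXCPU];	/* which CPU IDs actually exist */
	int ncpus;			/* stands in for mp_ncpus */
	int maxid;			/* stands in for mp_maxid */
};

/*
 * Dense assumption: walk IDs 0 .. ncpus - 1.
 * Returns how many of the visited IDs do not exist.
 */
static int
absent_cpus_visited_dense(const struct cpu_topology *t)
{
	int bad = 0;

	for (int i = 0; i < t->ncpus; i++) {
		if (!t->present[i])
			bad++;
	}
	return (bad);
}

/*
 * Sparse-safe walk: visit IDs 0 .. maxid and skip the holes.
 * Returns how many existing CPUs were visited.
 */
static int
present_cpus_visited_sparse(const struct cpu_topology *t)
{
	int n = 0;

	for (int i = 0; i <= t->maxid; i++) {
		if (!t->present[i])
			continue;
		n++;
	}
	return (n);
}
```

For a topology with only CPUs 0 and 8 present (ncpus = 2, maxid = 8), the
dense walk touches the non-existent CPU 1 and never reaches CPU 8, while the
sparse-safe walk visits exactly the two CPUs that exist.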
There are a number of other places in the kernel with potential problems
in these situations, but where sparse CPU IDs are not currently known
to occur, mostly in the ARM machine-dependent code. These will be fixed
in a follow-up commit after the stable/11 branch.
PR: kern/210106
Reviewed by: jhb
Approved by: re (glebius)
Those changes were found to confuse the FreeBSD libc ACL code, which doesn't
differentiate between ACLs for directories and files, and which reports
ACLs for all directories created after those patches as non-trivial. On
the other hand, these changes were considered wrong from the POSIX and
NFSv4 points of view. Until further investigation is done upstream, revert
those changes locally in preparation for the FreeBSD 11.0 release.
Approved by: re (hrs)
Prior to this change ZFS ARC min / max could only be changed using
boot time tunables. This change allows the values to be tuned at runtime
using the sysctls:
* vfs.zfs.arc_max
* vfs.zfs.arc_min
When adjusting the ZFS ARC minimum, the memory used will only shrink
to the new minimum under memory pressure.
Reviewed by: allanjude
Approved by: re (gjb)
MFC after: 2 weeks
Relnotes: yes
Sponsored by: Multiplay
Differential Revision: https://reviews.freebsd.org/D5907
getzfsvfs() called vfs_busy() in the waiting mode while having a hold on
a pool (via a call to dmu_objset_hold). In other words,
dp_config_rwlock was held in the shared mode while a thread could be
sleeping in vfs_busy().
The pool's txg sync thread needs to take dp_config_rwlock in the
exclusive mode for some actions, e.g., for executing sync tasks. If the
sync thread gets blocked, then any thread waiting for its sync task to
get executed is also blocked. Which, in turn, could mean that
vfs_busy() will keep waiting indefinitely.
The solution is to use vfs_ref() in the locked section and to call
vfs_busy() only after dropping other locks.
Note that a reference on a struct mount object does not prevent an
associated zfsvfs_t object from being destroyed. So, we have to be
careful to operate only on the struct mount object until we successfully
vfs_busy it.
Approved by: re (gjb)
MFC after: 2 weeks
This apparently puts ARC back under the limits after the vnode pressure
rework in r291244, in particular due to the kmem exhaustion.
Based on patch by: mckusick
Reviewed by: avg, mckusick
Tested by: allanjude, madpilot
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
The change is in arc_buf_l2_cdata_free().
Without this we can trip the assertion in arc_hdr_realloc()
if INVARIANTS option is enabled.
Approved by: re (kib)
MFC after: 1 week
This is a followup to r300131.
A filesystem's root vnode can be reached not only through VFS_ROOT, but
by other means as well. For example, via a dot-dot lookup.
Also, a root vnode can get reclaimed and then re-created. For these
reasons it was insufficient to clear VV_ROOT flag from a root vnode of a
snapshot mounted under .zfs in zfsctl_snapdir_lookup().
So, now we set the flag in zfs_znode_sa_init() only if a vnode
represents a root of a filesystem or a standalone snapshot.
That is, the flag is not set for snapshots mounted under .zfs.
MFC after: 2 weeks
It could happen in an unlikely case that we fail to lock the root vnode
with requested flags (which appear to never include LK_NOWAIT).
MFC after: 1 week
sys/cddl/contrib/opensolaris/uts/common/sys/acl.h:
Improve the English in a comment. No functional changes.
Submitted by: gibbs
MFC after: 4 weeks
Sponsored by: Spectra Logic Corp
Support for the new hashing algorithms in ZFS was introduced in r289422.
However, it was disconnected because FreeBSD lacked implementations of
SHA-512 (truncated to 256 bits) and Skein.
These implementations were introduced in r300921 and r300966 respectively.
This commit connects them to ZFS and enables these new checksum algorithms.
These new algorithms are not supported by the boot blocks, so do not use them
on your root dataset if you boot from ZFS.
Relnotes: yes
Sponsored by: ScaleEngine Inc.
Add zfsd, which deals with hard drive faults in ZFS pools. It manages
hotspares and replacements in drive slots that publish physical paths.
cddl/usr.sbin/zfsd
Add zfsd(8) and its unit tests
cddl/usr.sbin/Makefile
Add zfsd to the build
lib/libdevdctl
A C++ library that helps devd clients process events
lib/Makefile
share/mk/bsd.libnames.mk
share/mk/src.libnames.mk
Add libdevdctl to the build. It's a private library, unusable by
out-of-tree software.
etc/defaults/rc.conf
By default, set zfsd_enable to NO
etc/mtree/BSD.include.dist
Add a directory for libdevdctl's include files
etc/mtree/BSD.tests.dist
Add a directory for zfsd's unit tests
etc/mtree/BSD.var.dist
Add /var/db/zfsd/cases, where zfsd stores case files while it's shut
down.
etc/rc.d/Makefile
etc/rc.d/zfsd
Add zfsd's rc script
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c
Fix the resource.fs.zfs.statechange message. It had a number of
problems:
It was only being emitted on a transition to the HEALTHY state.
That made it impossible for zfsd to take actions based on drives
getting sicker.
It compared the new state to vdev_prevstate, which is the state that
the vdev had the last time it was opened. That doesn't make sense,
because a vdev can change state multiple times without being
reopened.
vdev_set_state contains logic that will change the device's new
state based on various conditions. However, the statechange event
was being posted _before_ that logic took effect. Now it's being
posted after.
Submitted by: gibbs, asomers, mav, allanjude
Reviewed by: mav, delphij
Relnotes: yes
Sponsored by: Spectra Logic Corp, iX Systems
Differential Revision: https://reviews.freebsd.org/D6564
The sys/types.h fix I proposed was only tested with zfs(4), not with
libzpool, which is where the build failure actually existed
Remove vm/vm_pageout.h from arc.c and zfs_vnops.c because they're both
unneeded
MFC after: 1 week
X-MFC with: r300865, r300870
In collaboration with: kib
Submitted by: alc
Sponsored by: EMC / Isilon Storage Division
ZFS's configuration needs to be updated whenever the physical path for a
device changes, but not when a new device is introduced. This is because new
devices necessarily cause config updates, but only if they are actually
accepted into the pool.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
Split vdev_geom_set_physpath out of vdev_geom_attrchanged. When
setting the vdev's physical path, only request a config update if
the physical path has changed. Don't request it when opening a
device for the first time, because the config sync will happen
anyway upstack.
sys/geom/geom_dev.c
Split g_dev_set_physpath and g_dev_set_media out of
g_dev_attrchanged
Submitted by: will, asomers
MFC after: 4 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D6428
vm/vm_pageout.h grew a dependency on the bool typedef in r300865
arc.c didn't include sys/types.h, which included the definition for the typedef
Other items (ofed, drm2) might need to be chased for this commit.
X-MFC with: r300865
MFC after: 1 week
Pointyhat to: alc
Sponsored by: EMC / Isilon Storage Division
former return the current status for the latter to use. Without this we
could enable interrupts when they shouldn't be.
It's still not quite right as it should only update the bits we care about,
but should be good enough until the correct fix can be tested.
PR: 204270
Obtained from: ABT Systems Ltd
Sponsored by: The FreeBSD Foundation
That was both redundant as zfs_znode_sa_init() already does the job and
insufficient as the root vnode can be reached via other means.
MFC after: 1 week
gfs code is (almost) completely agnostic of FreeBSD VFS locking, so it
does not handle doomed but not yet dead vnodes and may return them.
Check for those vnodes here and retry a lookup.
Note that ZFS and gfs have additional protections that ensure that a
parent vnode of the current vnode is never doomed.
The fixed problem is an occasional failure to look up the 'snapshot' or
'shares' directory under .zfs.
Note that for the above reason all uses of zfsctl_root_lookup() would
be better replaced with VOP_LOOKUP.
MFC after: 5 weeks
Speedup is hard to measure because the only time vdev_geom_open_by_guids
gets called on many drives at the same time is during boot. But with
vdev_geom_open hacked to always call vdev_geom_open_by_guids, operations
like "zpool create" speed up by 65%.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
* Read all of a vdev's labels in parallel instead of sequentially.
* In vdev_geom_read_config, don't read the entire label, including
the uberblock. That's a waste of RAM. Just read the vdev config
nvlist. Reduces the IO and RAM involved with tasting from 1MB to
448KB.
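The 1MB and 448KB figures can be reproduced with a short calculation. This
sketch assumes the usual on-disk layout of four labels per vdev, each 256KB
in size and each holding a 112KB config nvlist region; those constants and
the names below are not taken from this commit message.

```c
#include <stddef.h>

/* Assumed layout constants, named locally for the sketch. */
#define	SKETCH_VDEV_LABELS	4
#define	SKETCH_LABEL_SIZE	((size_t)256 * 1024)	/* whole label */
#define	SKETCH_VDEV_PHYS_SIZE	((size_t)112 * 1024)	/* config nvlist region */

/* Bytes read while tasting one vdev, given the bytes read per label. */
static size_t
taste_bytes(size_t per_label)
{
	return (SKETCH_VDEV_LABELS * per_label);
}
```

Under those assumptions, taste_bytes(SKETCH_LABEL_SIZE) gives 1MB and
taste_bytes(SKETCH_VDEV_PHYS_SIZE) gives 448KB.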
Reviewed by: avg
MFC after: 4 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D6153
FreeBSD zfs_ioc_rename() has an option, not present upstream, that
allows renaming snapshots without unmounting them first. I am not sure
what the rationale for that option is, but its actual behavior was the
opposite of the intended behavior. That is, by default the snapshots
were not unmounted.
The option was introduced as part of a large update from upstream in
r248498.
One of the consequences was havoc under .zfs/snapshot after the rename.
The snapshots got new names but were mounted on top of directories with
old names, so readdir would list the new names, but lookup would still
find the old mounts.
PR: 209093
Reported by: Frédéric VANNIÈRE <f.vanniere@planet-work.com>
MFC after: 5 days
That was just wrong. In fact, we can safely keep this static entry when
it's inactive.
Now the destructive action is moved to the reclaim method and the
function is renamed from zfsctl_snapdir_inactive() to
zfsctl_snapdir_reclaim().
Also, we can use gfs_vop_reclaim() instead of gfs_dir_inactive() +
kmem_free().
Lastly, we can just assert that the node does not have any children when it
is reclaimed, even on the force unmount. That's because zfs_umount()
does an extra vflush() pass which should destroy all snapshot-mountpoint
vnodes that are the snapdir's children.
MFC after: 5 weeks
Those vnodes should not linger. "Stale" nodes may get out of
synchronization with actual snapshots. For example if we destroy a
snapshot and create a new one with the same name. Or when we rename a
snapshot.
While there fix the argument type for zfsctl_snapshot_reclaim().
Also, its original argument can be passed to gfs_vop_reclaim() directly.
Bug 209093 could be related although I have not specifically verified
that. Referencing just in case.
PR: 209093
MFC after: 5 weeks
Dropping the root vnode's lock after VFS_ROOT() didn't really help the
fact that we acquired the lock while holding its child's, .zfs, lock
while performing the operation.
So, directly use zfs_zget() to get the root vnode.
While there simplify the code in zfsctl_freebsd_root_lookup.
We know that .zfs is always exclusively locked.
We know that there is already a reference on *vpp, so no need for an
extra one.
Account for the fact that .. lookup may ask for a different lock type,
not necessarily LK_EXCLUSIVE. And handle a possible failure to acquire
the lock given the lock flags.
MFC after: 5 weeks
In fact, that was dangerous. For example, zfsctl_snapshot_reclaim()
calls gfs_dir_lookup() on ".." path and that ends up calling
gfs_lookup_dot() which violated locking order by acquiring the parent's
directory vnode lock after the child's vnode lock.
Also, the previous behavior was inconsistent as gfs_dir_lookup()
returned a locked vnode for . and .. lookups, but not for any other.
Now gfs_lookup_dot() just references a resulting vnode and the locking
is done in its consumers, where necessary.
Note that we do not enable shared locking support for any gfs / zfsctl
vnodes.
This commit partially reverts r273641.
MFC after: 5 weeks
The former acquired a snap vnode lock while holding sd_lock while the
latter does the opposite.
The solution is to drop sd_lock before acquiring the vnode lock. That
should be okay as we are still holding a lock on the 'snapshot'
directory in the exclusive mode. That lock ensures that there are no
concurrent lookups in the directory and thus no concurrent mount attempts.
But now we have to account for the possibility that the snap vnode
might get reclaimed after we drop sd_lock and before we can get
the node lock. So, check for that case and retry.
MFC after: 5 weeks
This commit partially reverts r273641 which introduced the leak.
It did so to accommodate some consumers of traverse() that expected
the starting vnode to stay as-is. But that introduced the leak in the
case when a mounted filesystem was found and its root vnode was
returned.
r299914 removed the troublesome consumers and now there is no reason to
keep the starting vnode. So, now the new rules are:
- if there is no mounted filesystem, then nothing is changed
- otherwise the starting vnode is always released
- the root vnode of the mounted filesystem is returned locked and
referenced in the case of success
MFC after: 5 weeks
X-MFC after: r299914
We pretend that snapshots mounted under .zfs are part of the original
filesystem and we try very hard to hide vnodes on top of which the snapshots
are mounted. Given that I believe that the removed operations should
never be called. They might have been called previously because
of issues fixed in r299906, r299908 and r299913.
MFC after: 5 weeks
Previously we did that only if the snapshot was mounted earlier, its
root vnode got recycled and then we accessed it again.
We never cleared the flag for a freshly mounted snapshot.
That was very inconsistent and probably a source of some bugs.
Or maybe that painted over some bugs which might get revealed now.
We should consistently clear the flag because we try very hard to
pretend that snapshots auto-mounted under .zfs are part of their
original filesystem. In other words, we try to hide the fact that they
are different filesystems / mountpoints.
MFC after: 5 weeks
The logic is similar to that already present in zfs_dirlook() to handle
a dot-dot lookup on a root vnode of a snapshot mounted under
.zfs/snapshot/.
illumos does not have an equivalent of vop_vptocnp, so there only the
lookup had to be patched up.
MFC after: 4 weeks
* Remove excessive references on a snapshot mountpoint vnode.
zfsctl_snapdir_lookup() called VN_HOLD() on a vnode returned from
zfsctl_snapshot_mknode() and the latter also had a call to VN_HOLD()
on the same vnode.
On top of that gfs_dir_create() already returns the vnode with the
use count of 1 (set in getnewvnode).
So there were 3 references on the vnode.
* mount_snapshot() should keep a reference to a covered vnode.
That reference is owned by the mountpoint (mounted snapshot filesystem).
* Remove cryptic manipulations of a covered vnode in zfs_umount().
FreeBSD dounmount() already does the right thing and releases the covered
vnode.
PR: 207464
Reported by: dustinwenz@ebureau.com
Tested by: Howard Powell <hpowell@lighthouseinstruments.com>
MFC after: 3 weeks
Currently, Application Processors (non-boot CPUs) are started by
MD code at SI_SUB_CPU, but they are kept waiting in a "pen" until
SI_SUB_SMP at which point they are released to run kernel threads.
SI_SUB_SMP is one of the last SYSINIT levels, so APs don't enter
the scheduler and start running threads until fairly late in the
boot.
This change moves SI_SUB_SMP up to just before software interrupt
threads are created allowing the APs to start executing kernel
threads much sooner (before any devices are probed). This allows
several initialization routines that need to perform initialization
on all CPUs to now perform that initialization in one step rather
than having to defer the AP initialization to a second SYSINIT run
at SI_SUB_SMP. It also permits all CPUs to be available for
handling interrupts before any devices are probed.
This last feature fixes a problem with interrupt vector exhaustion.
Specifically, in the old model all device interrupts were routed
onto the boot CPU during boot. Later after the APs were released at
SI_SUB_SMP, interrupts were redistributed across all CPUs.
However, several drivers for multiqueue hardware allocate N interrupts
per CPU in the system. In a system with many CPUs, just a few drivers
doing this could exhaust the available pool of interrupt vectors on
the boot CPU as each driver was allocating N * mp_ncpu vectors on the
boot CPU. Now, drivers will allocate interrupts on their desired CPUs
during boot meaning that only N interrupts are allocated from the boot
CPU instead of N * mp_ncpu.
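The arithmetic behind the exhaustion can be sketched as follows. The function
names are hypothetical, and the sketch models a single driver that wants N
vectors per CPU.

```c
/*
 * Old model: every vector is first allocated on the boot CPU, so one
 * driver consumes n_per_cpu * ncpus vectors there.
 */
static unsigned
boot_cpu_vectors_old(unsigned n_per_cpu, unsigned ncpus)
{
	return (n_per_cpu * ncpus);
}

/*
 * New model: vectors are allocated on their target CPUs, so the boot
 * CPU only carries its own n_per_cpu share.
 */
static unsigned
boot_cpu_vectors_new(unsigned n_per_cpu, unsigned ncpus)
{
	(void)ncpus;	/* the boot CPU's share no longer depends on it */
	return (n_per_cpu);
}
```

For example, with N = 2 and 64 CPUs a single driver takes 128 vectors from the
boot CPU in the old model and 2 in the new one.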
Some other bits of code can also be simplified as smp_started is
now true much earlier and will now always be true for these bits of
code. This removes the need to treat the single-CPU boot environment
as a special case.
As a transition aid, the new behavior is available under a new kernel
option (EARLY_AP_STARTUP). This will allow the option to be turned off
if need be during initial testing. I plan to enable this on x86 by
default in a followup commit in the next few days and to have all
platforms moved over before 11.0. Once the transition is complete,
the option will be removed along with the !EARLY_AP_STARTUP code.
These changes have only been tested on x86. Other platform maintainers
are encouraged to port their architectures over as well. The main
things to check for are any uses of smp_started in MD code that can be
simplified and SI_SUB_SMP SYSINITs in MD code that can be removed in
the EARLY_AP_STARTUP case (e.g. the interrupt shuffling).
PR: kern/199321
Reviewed by: markj, gnn, kib
Sponsored by: Netflix
delete permissions for ACLs
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Author: Kevin Crowe <kevin.crowe@nexenta.com>
openzfs/openzfs@a40149b935
aclmode=passthrough
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Author: Albert Lee <trisk@nexenta.com>
openzfs/openzfs@1bcf0d240b
perms (groupmask)
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Author: Albert Lee <trisk@nexenta.com>
openzfs/openzfs@eebb483d0c
some additional considerations
Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Author: Kevin Crowe <kevin.crowe@nexenta.com>
openzfs/openzfs@d316fffc9c
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Joe Stein <joe.stein@delphix.com>
openzfs/openzfs@215198a6ad
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Tim Chase <tim@chase2k.com>
openzfs/openzfs@445e67805d
clear vd->vdev_tsd in vdev_geom_close_locked instead of vdev_geom_detach.
In the latter function, it would fail to happen in certain circumstances
where cp->private was unset. Ideally, the latter should never happen, but
it can happen when vdev open fails, or where spares are involved.
MFC after: 4 weeks
X-MFC-With: 298786
Sponsored by: Spectra Logic Corp
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
Move checks for provider's sectorsize and mediasize into a single
location in vdev_geom_attach. Remove the zfs::vdev::taste class;
it's ok to use the regular vdev class for tasting. Consolidate guid
checks into a single location in vdev_attach_ok. Consolidate some
error handling code from vdev_geom_attach into vdev_geom_detach,
closing a resource leak of geom consumers in the process.
Reviewed by: avg
MFC after: 4 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D5974
This allows for the long function components encountered in www/firefox.
This constant is part of DTrace's userland ABI, so this change may not be
MFC'ed.
PR: 207735
Without this change, DTrace will refuse to load a DOF section if the
function component of any of its probes exceeds DTRACE_FUNCNAMELEN (128).
Probes in C++ programs can have very long function components. Rather than
rejecting all probes if a single probe exceeds the limit, simply skip the
invalid probe and emit a warning. This ensures that valid probes are
instantiated.
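The skip-and-warn behaviour can be sketched in userland as below. The names
are hypothetical and this is not the DOF loading code; treating the limit as
a buffer size that includes the terminating NUL is an assumption of the
sketch.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define	SKETCH_FUNCNAMELEN	128	/* mirrors DTRACE_FUNCNAMELEN above */

/*
 * Walk the function components of a set of probes. An overlong name is
 * skipped with a warning instead of failing the whole set.
 * Returns the number of probes accepted.
 */
static size_t
instantiate_probes(const char *const funcs[], size_t nfuncs)
{
	size_t accepted = 0;

	for (size_t i = 0; i < nfuncs; i++) {
		if (strlen(funcs[i]) >= SKETCH_FUNCNAMELEN) {
			fprintf(stderr,
			    "warning: skipping probe %zu: function name too long\n",
			    i);
			continue;
		}
		accepted++;
	}
	return (accepted);
}
```

Given two short names and one 200-character name, the sketch accepts the two
short ones and warns about the third.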
PR: 207735
MFC after: 2 weeks
When this flag is turned on, DOF and DIF validation errors are printed to
the kernel message buffer. This is useful for debugging.
Also remove the debug.dtrace.debug sysctl, which has no effect.
While the instructions were not included into the original instruction
set, their support can be indicated by a special feature bit.
For example:
CPU: AMD Phenom(tm) II X4 955 Processor (3214.71-MHz K8-class CPU)
...
AMD Features2=0x37ff<LAHF, ...>
Clang 3.8 uses lahf/sahf as a faster alternative to pushf/popf where
possible.
MFC after: 2 weeks
illumos/illumos-gate@26455f9efc
https://www.illumos.org/issues/6052
At the moment type parameter of lzc_create() is of dmu_objset_type_t type.
That exposes an implementation detail and requires sys/fs/zfs.h to be included
in libzfs_core.h creating unnecessary coupling between libzfs_core interface
and ZFS internals.
I think that dmu_objset_type_t should be replaced with a libzfs_core
enumeration of supported dataset types.
For ABI reasons the new enumeration could be bit-compatible with
dmu_objset_type_t.
For example:
typedef enum {
	LZC_DST_ZFS = 2,
	LZC_DST_ZVOL
} lzc_dataset_type_t;
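A sketch of how such bit-compatibility could be checked at compile time
follows. The sketch_dmu_objset_type_t enum is a local stand-in that assumes
the ZFS and ZVOL object set types have the values 2 and 3; it is not the real
header, and lzc_type_to_ost() is a hypothetical helper.

```c
/* Local stand-in for the relevant dmu_objset_type_t values (assumed). */
typedef enum {
	SKETCH_DMU_OST_NONE,
	SKETCH_DMU_OST_META,
	SKETCH_DMU_OST_ZFS,
	SKETCH_DMU_OST_ZVOL
} sketch_dmu_objset_type_t;

typedef enum {
	LZC_DST_ZFS = 2,
	LZC_DST_ZVOL
} lzc_dataset_type_t;

/* Fail the build if the two enumerations ever drift apart. */
_Static_assert((int)LZC_DST_ZFS == (int)SKETCH_DMU_OST_ZFS,
    "LZC_DST_ZFS must match the ZFS object set type");
_Static_assert((int)LZC_DST_ZVOL == (int)SKETCH_DMU_OST_ZVOL,
    "LZC_DST_ZVOL must match the ZVOL object set type");

/* With matching values the conversion is a plain cast. */
static sketch_dmu_objset_type_t
lzc_type_to_ost(lzc_dataset_type_t type)
{
	return ((sketch_dmu_objset_type_t)type);
}
```

Keeping the values identical means existing binaries that pass the old
constants keep working, while the public header no longer needs the internal
type.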
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Andriy Gapon <andriy.gapon@clusterhq.com>
MFC after: 2 weeks
Sponsored by: ClusterHQ
Currently this argument is a pointer into the stack which is used by FBT
to fetch the first five probe arguments. On all non-x86 architectures it's
simply the trapframe address, so this change has no functional impact. On
amd64 it's a pointer into the trapframe such that stack[1 .. 5] gives the
first five argument registers, which are deliberately grouped together in
the amd64 trapframe definition.
A trapframe argument simplifies the invop handlers on !x86 and makes the
x86 FBT invop handler easier to understand. Moreover, it allows for invop
handlers that may want to modify the register set of the interrupted thread.
Note that now we have to account for possible partial writes
in dmu_write_uio_dbuf(). It seems that on illumos either all or none
of the data are expected to be written. But the partial writes are
quite expected when vn_io_fault support is enabled.
Reviewed by: kib
MFC after: 7 weeks
Differential Revision: https://reviews.freebsd.org/D2790
Prior to this change, vdev_geom_open_by_path would call vdev_geom_attach
prior to verifying the device's GUIDs. vdev_geom_attach calls
vdev_geom_attrchanged to set the physpath in the vdev object. The result is
that if the disk could not be found, then the labels for other disks in the
same TLD would overwrite the missing disk's physpath with the physpath of
whichever disk currently has the same devname as the missing one used to
have.
MFC after: 4 weeks
Sponsored by: Spectra Logic Corp
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
Don't drop the g_topology_lock before freeing old_physpath. That
opens up a race where one thread can call vdev_geom_attrchanged,
set old_physpath, drop the g_topology_lock, then block trying to
acquire the SCL_STATE lock. Then another thread can come into
vdev_geom_attrchanged, set old_physpath to the same value, and
proceed to free it. When the first thread resumes, it will free
the same location.
It turns out that the SCL_STATE lock isn't needed. It was
originally added by gibbs to protect vd->vdev_physpath while
updating the same. However, the update process subsequently was
switched to an atomic operation (a pointer swap). Now, there is
no need for the SCL_STATE lock, and hence no need to drop the
g_topology_lock.
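Why a pointer swap avoids the double free can be sketched with C11 atomics as
a userland stand-in. The variable and function names are hypothetical, and
the kernel uses its own atomic primitives rather than <stdatomic.h>.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Stand-in for vd->vdev_physpath; starts out as NULL. */
static _Atomic(char *) sketch_physpath;

/*
 * Atomically install newpath and return the previous pointer.
 * Each old pointer is handed back to exactly one caller, so it can be
 * freed exactly once without holding a lock across the update.
 */
static char *
physpath_swap(char *newpath)
{
	return (atomic_exchange(&sketch_physpath, newpath));
}
```

Two threads racing through physpath_swap() can never both receive the same
old pointer, which is the property the old code lost when it re-read
old_physpath after dropping the g_topology_lock.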
Reviewed by: delphij
MFC after: 4 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D5413
Previously uncompressed buffers did not obey that rule.
Type of b_asize is changed to uint64_t for consistency,
given that this is a zettabyte filesystem.
l2arc_compress_buf is renamed to l2arc_transform_buf to better reflect
its new utility. Now we not only ensure that a compressed buffer has
a size aligned to ashift, but also allocate a properly sized
temporary buffer if the original buffer is not compressed and it has
an odd size. This ensures that all I/O to the cache device is always
ashift-aligned, in terms of both a request offset and a request size.
If the aligned data is larger than the original data, then we have to use
a temporary buffer when reading it as well.
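The alignment rule can be sketched as a round-up to a multiple of
1 << ashift. The helper names are hypothetical and are not the functions
changed by this commit.

```c
#include <stdint.h>

/* Round size up to the next multiple of (1 << ashift). */
static uint64_t
ashift_roundup(uint64_t size, unsigned ashift)
{
	uint64_t align = UINT64_C(1) << ashift;

	return ((size + align - 1) & ~(align - 1));
}

/*
 * A temporary buffer is needed whenever the aligned size differs from
 * the original size, for writes and for the later reads alike.
 */
static int
needs_temp_buffer(uint64_t size, unsigned ashift)
{
	return (ashift_roundup(size, ashift) != size);
}
```

For example, a 1000-byte buffer on an ashift=9 device is padded to 1024 bytes
and needs a temporary buffer, while a 4096-byte buffer on an ashift=12 device
is used as is.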
Also, enhance physical zio alignment checks using vdev_logical_ashift.
On FreeBSD we have this information, so we can make stricter assertions.
Reviewed by: smh, mav
MFC after: 1 month
Sponsored by: ClusterHQ
Differential Revision: https://reviews.freebsd.org/D2789
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Author: Alexander Motin <mav@FreeBSD.org>
Improve speculative prefetch of indirect blocks.
Scalability of many operations on wide ZFS pools can be limited by
the requirement to prefetch indirect blocks first. The recently added
asynchronous indirect block read helped partially, but did not
solve the problem completely. This patch extends the existing prefetcher
functionality to explicitly work with indirect blocks.
Before this change the prefetcher issued reads for up to 8MB of data in
advance. With this change it also issues indirect block reads
for up to 64MB of data in advance, so that when it is time to
actually read that data, it can be done immediately. A similar effect
can be achieved by just increasing the maximal data prefetch distance,
but at a higher memory cost.
This change also introduces indirect block prefetch for rewrite
operations, which was never done before. Previously an ARC miss for
indirect blocks regularly blocked rewrites, converting perfectly
aligned asynchronous operations into synchronous read-write pairs and
significantly reducing maximal rewrite speed.
While here, this issue was also fixed:
- prefetch was always done, even if caching for the dataset was
completely disabled.
Testing on FreeBSD with a zvol on top of a 6x striped 2x mirrored pool
of 12 assorted HDDs showed these performance numbers:
------- BEFORE --------
Write    491363677 bytes/sec
Read     312430631 bytes/sec
Rewrite   97680464 bytes/sec
-------- AFTER --------
Write    493524146 bytes/sec
Read     438598079 bytes/sec
Rewrite  277506044 bytes/sec
Closes #65
Closes #80
openzfs/openzfs@792fd28ac0
Only include sysctl in kernel builds fixing warning about implicit
declaration of function 'sysctl_handle_int'.
PR: 204140
MFC after: 1 week
X-MFC-With: r297813
Sponsored by: Multiplay
At the moment no ZFS buffers are included into a crash dump unless
ZFS_DEBUG (or INVARIANTS) kernel option is enabled. That's not very
helpful for debugging of ZFS problems, because important information
often resides in metadata buffers.
This change switches the dumping behavior when UMA is used from the
illumos behavior to a more useful behavior that we have on FreeBSD
when ZFS buffers are allocated via malloc.
Reviewed by: smh, mav
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D5892
This allows one to enable DTrace probes relatively early during boot,
during SI_SUB_DTRACE_ANON, before dtrace(1) can be invoked. The desired
enabling is created using dtrace -A, which writes a /boot/dtrace.dof
file and uses nextboot(8) to ensure that DTrace kernel modules are loaded
and that the DOF file describing the enabling is loaded by loader(8)
during the subsequent boot. The trace output can then be fetched with
dtrace -a.
With this commit, boot-time DTrace is only functional on i386 and amd64: on
other architectures, the high-resolution timer frequency is initialized
during SI_SUB_CLOCKS and is thus not available when the anonymous
tracing state is initialized. On x86, the TSC is used and is thus available
earlier.
MFC after: 1 month
Relnotes: yes
This allows the hrtimer to be used earlier during boot. This is required
for boot-time DTrace: anonymous enablings are created during
SI_SUB_DTRACE_ANON, which runs before APs are started. In particular,
the DTrace deadman timer requires that the hrtimer be functional.
MFC after: 2 weeks
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Author: Will Andrews <will@firepipe.net>
Closes #83
Closes #32
openzfs/openzfs@9663688425
FreeBSD already had `zpool labelclear` functionality, so this is mostly
just a diff reduction.
MFC after: 1 month
This is because they might do data compression which is quite CPU
expensive. The original code is correct for illumos, because there
a higher priority corresponds to a greater number.
MFC after: 2 weeks
This made it impossible to open a spare disk by its known path, which kind of worked
only because the same fix was applied to vdev_geom_attach_by_guids() in
r293708.
MFC after: 1 week
for limiting disk (actually filesystem) IO.
Note that in some cases these limits are not quite precise. It's ok,
as long as it's within some reasonable bounds.
Testing - and review of the code, in particular the VFS and VM parts - is
very welcome.
MFC after: 1 month
Relnotes: yes
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D5080
Allow using DTRACE for performance analysis of userspace
applications - the function call stack can be captured.
This is almost an exact copy of the AMD64 solution.
Obtained from: Semihalf
Sponsored by: Cavium
Reviewed by: emaste, gnn, jhibbits
Differential Revision: https://reviews.freebsd.org/D5779
On FreeBSD VFS_HOLD/VFS_RELE were mapped to MNT_REF/MNT_REL that
manipulate mnt_ref. But the job of properly maintaining the reference
count is already automatically performed by insmntque(9) and
delmntque(9). So, in effect all ZFS vnodes referenced the corresponding
mountpoint twice.
That was completely harmless, but we want to be very explicit about what
FreeBSD VFS APIs are used, because illumos VFS_HOLD and FreeBSD MNT_REF
provide quite different guarantees with respect to the held vfs_t /
mountpoint. On illumos VFS_HOLD is sufficient to guarantee that
vfs_t.vfs_data stays valid. On the other hand, on FreeBSD MNT_REF does
*not* provide the same guarantee about mnt_data. We have to use
vfs_busy() to get that guarantee.
Thus, the calls to VFS_HOLD/VFS_RELE on vnode init and fini are removed.
VFS_HOLD calls are replaced with vfs_busy in the ioctl handlers.
And because vfs_busy has a richer interface that can not be dumbed down
in all cases it's better to explicitly use it rather than trying to mask
it behind VFS_HOLD.
This change fixes a panic that could result from a race between
zfs_umount() and zfs_ioc_rollback(). We observed a case where
zfsvfs_free() tried to destroy data that zfsvfs_teardown() was still
using. That happened because there was nothing to prevent unmounting of
a ZFS filesystem that was in between zfs_suspend_fs() and
zfs_resume_fs().
Reviewed by: kib, smh
MFC after: 3 weeks
Sponsored by: ClusterHQ
Differential Revision: https://reviews.freebsd.org/D2794
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Eli Rosenthal <eli.rosenthal@delphix.com>
illumos/illumos-gate@c20404ff77
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Alex Wilson <alex.wilson@joyent.com>
illumos/illumos-gate@d09e4475f6
Unlike illumos, FreeBSD has the concept of a logical ashift, which specifies
the true minimum vdev block size that can be accessed. This knowledge
makes it possible to properly pad physical I/O and correctly assert its
alignment. This change fixes L2ARC write errors when a device has a logical
sector size above 512 bytes.
MFC after: 1 month
This fixes creation of zvol devices for snapshots during zfs receive,
that previously failed with "ZFS WARNING: Unable to create ZVOL" message.
This solution is not perfect, but IMHO better than it was before.
MFC after: 2 weeks
If a device has a stripe size bigger than the maximal sector size supported by
ZFS, there is nothing that can be done to avoid read-modify-write cycles.
Taking that stripe size into account will only reduce space efficiency
and pointlessly bother the user with warnings that cannot be fixed.
Discussed with: smh
Use of misaligned or non-power-of-2 stripes is not really useful for ZFS,
since an increased ashift won't help to avoid read-modify-write cycles and
will only reduce pool space efficiency and compression rates.
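The stripe-size rules above can be sketched as a small model. This is only an illustration, not the vdev_geom code: the function name, the parameter names, and the MAX_ASHIFT value are assumptions made for the example.

```python
# Simplified model of the stripe-size rules described above.
# MAX_ASHIFT and physical_ashift() are assumptions for illustration only.
MAX_ASHIFT = 13  # assumed upper bound, i.e. 8 KiB blocks


def physical_ashift(sector_size, stripe_size, stripe_offset):
    """Return log2 of the block size worth aligning to for this device."""
    size = sector_size
    usable = (
        stripe_size > sector_size
        and stripe_size & (stripe_size - 1) == 0  # power of two
        and stripe_offset == 0                    # not misaligned
        and stripe_size <= (1 << MAX_ASHIFT)      # ZFS can actually use it
    )
    if usable:
        size = stripe_size
    # sector_size is assumed to be a power of two, so this is an exact log2
    return size.bit_length() - 1
```

A stripe that fails any of the checks is simply ignored, so the result falls back to the sector size and no unfixable warning has to be raised.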
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Stefan Ring <stefanrin@gmail.com>
Reviewed by: Steven Burgess <sburgess@datto.com>
Reviewed by: Arne Jansen <sensille@gmx.net>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Paul Dagnelie <pcd@delphix.com>
In certain circumstances, "zfs send -i" (incremental send) can produce a
stream which will result in incorrect sparse file contents on the
target.
The problem manifests as regions of the received file that should be
sparse (and read as zero-filled) actually contain data from a file that
was deleted (and which happened to share this file's object ID).
Note: this can happen only with filesystems (not zvols, because they do
not free (and thus can not reuse) object IDs).
Note: This can happen only if, since the incremental source (FromSnap),
a file was deleted and then another file was created, and the new file
is sparse (i.e. has areas that were never written to and should be
implicitly zero-filled).
We suspect that this was introduced by 4370 (applies only if hole_birth
feature is enabled), and made worse by 5243 (applies if hole_birth
feature is disabled, and we never send any holes).
The bug is caused by the hole birth feature. When an object is deleted
and replaced, all the holes in the object have birth time zero. However,
zfs send cannot tell that the holes are new since the file was replaced,
so it doesn't send them in an incremental. As a result, you can end up
with invalid data when you receive incremental send streams. As a
short-term fix, we can always send holes with birth time 0 (unless it's
a zvol or a dataset where we can guarantee that no objects have been
reused).
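The short-term rule can be sketched as a predicate. This is a minimal model, assuming a hypothetical helper name and a boolean flag standing in for "zvol or dataset where object reuse is impossible"; it is not the actual send code.

```python
# Minimal model of the short-term fix described above.
# must_send_hole() and its parameters are hypothetical names.
def must_send_hole(hole_birth_txg, from_snap_txg, objects_may_be_reused=True):
    """Decide whether an incremental stream has to include this hole."""
    if hole_birth_txg == 0 and objects_may_be_reused:
        # A birth time of zero cannot prove the hole predates the
        # incremental source, so the hole is always sent.
        return True
    # Otherwise only holes born after the incremental source are sent.
    return hole_birth_txg > from_snap_txg
```

The cost of this rule is a larger stream for sparse files; the benefit is that a reused object ID can no longer leak stale data into the received file.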
Closes #37
openzfs/openzfs@adef853162
6672 arc_reclaim_thread() should use gethrtime() instead of ddi_get_lbolt()
6673 want a macro to convert seconds to nanoseconds and vice-versa
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Eli Rosenthal <eli.rosenthal@delphix.com>
illumos/illumos-gate@a8f6344fa0
in the dedup property value
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: ilovezfs <ilovezfs@icloud.com>
illumos/illumos-gate@971640e6aa
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Dan McDonald <danmcd@omniti.com>
illumos/illumos-gate@5f7a8e6d75
after the scrub started
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@38d6103674
Reviewed by: Steve Gonczi <gonczi@comcast.net>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Gary Mills <gary_mills@fastmail.fm>
illumos/illumos-gate@8c04a1fa3f
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Prakash Surya <prakash.surya@delphix.com>
illumos/illumos-gate@97e8130957
- Handle the case where no DOF helper is provided. This occurs with the
currently-unused DTRACEHIOC_ADD ioctl.
- Fix some checks that prevented the loading of DOF in the (non-default)
lazyload mode.
Upstream, tracepoints are protected by per-CPU mutexes. An unlinked
tracepoint may be freed once all the tracepoint mutexes have been acquired
and released - this is done in fasttrap_mod_barrier(). This mechanism was
not properly ported: in some places, the proc lock is used in place of a
tracepoint lock, and in others the locking is omitted entirely. This change
implements tracepoint locking with an rmlock, where the read lock is used
in fasttrap probe context. As a side effect, this fixes a recursion on the
proc lock when the raise action is used from a userland probe.
MFC after: 1 month
need to include it explicitly when <vm/vm_param.h> is already included.
Suggested by: alc
Reviewed by: alc
Differential Revision: https://reviews.freebsd.org/D5379
for all struct bio you get back from g_{new,alloc}_bio. Temporary
bios that you create on the stack or elsewhere should use this before
first use of the bio, and between uses of the bio. At the moment, it
is nothing more than a wrapper around bzero, but that may change in
the future. The wrapper also removes one place where we encode the
size of struct bio in the KBI.
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Tim Chase <tim@chase2k.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Andriy Gapon <avg@icyb.net.ua>
illumos/illumos-gate@e7e978b1f7
During the update process in sa_modify_attrs(), the sizes of existing
variably-sized SA entries are obtained from sa_lengths[]. The case where
a variably-sized SA was being replaced neglected to increment the index
into sa_lengths[], so subsequent variable-length SAs would be rewritten
with the wrong length. This patch adds the missing increment operation
so all variably-sized SA entries are stored with their correct lengths.
Another problem was that index into attr_desc[] was increased even when
an attribute was removed. If that attribute was not the last attribute,
then the last attribute was lost.
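Both index problems can be illustrated with a small model of the rewrite loop. The data layout and the function name below are assumptions chosen for the example; the real sa_modify_attrs() works on packed buffers rather than Python tuples.

```python
# Illustrative model of the two index fixes described above.
# modify_attrs() and the tuple layout are hypothetical.
def modify_attrs(attrs, var_lengths, target, new_value=None):
    """Rewrite an attribute list, replacing or removing `target`.

    attrs:       list of (attr_id, fixed_size, data); fixed_size == 0
                 marks a variably-sized attribute.
    var_lengths: sizes of the variably-sized attributes, in order
                 (stands in for sa_lengths[]).
    new_value:   replacement bytes, or None to remove `target`.
    """
    out = []
    length_idx = 0
    for attr_id, fixed_size, data in attrs:
        if fixed_size == 0:
            length = var_lengths[length_idx]
            # Advance for every variably-sized attribute, including the
            # one being replaced; skipping this was the first bug.
            length_idx += 1
        else:
            length = fixed_size
        if attr_id == target:
            if new_value is None:
                # Removed attribute: no output slot is consumed, so later
                # attributes are not pushed past the end (second bug).
                continue
            out.append((attr_id, len(new_value), new_value))
        else:
            out.append((attr_id, length, data[:length]))
    return out
```

Here the output position is implicit in `append`, which is the behavior the fix restores: an output slot is used only when an attribute is actually written.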
Change 294329 removed the ability to build ZFS pools that are backed by
zvols, because having that ability (even if it's not used) leads to
deadlocks. By popular demand, I'm adding an off-by-default sysctl to
reenable that ability.
Reviewed by: lidl, delphij
MFC after: Never
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D4998
This is the final step required to compile and run the RISC-V
kernel and userland from HEAD.
RISC-V is a completely open ISA that is freely available to academia
and industry.
Thanks to all the people involved! Special thanks to Andrew Turner,
David Chisnall, Ed Maste, Konstantin Belousov, John Baldwin and
Arun Thomas for their help.
Thanks to Robert Watson for organizing this project.
This project is sponsored by the UK Higher Education Innovation Fund (HEIF5)
and the DARPA CTSRD project at the University of Cambridge Computer Laboratory.
FreeBSD/RISC-V project home: https://wiki.freebsd.org/riscv
Reviewed by: andrew, emaste, kib
Relnotes: Yes
Sponsored by: DARPA, AFRL
Sponsored by: HEIF5
Differential Revision: https://reviews.freebsd.org/D4982
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Albert Lee <trisk@omniti.com>
Author: Steven Hartland <steven.hartland@multiplay.co.uk>
illumos/illumos-gate@2bad22584d
exceeds refquota
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
Author: Dan McDonald <danmcd@omniti.com>
illumos/illumos-gate@5878fad70d
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Paul Dagnelie <pcd@delphix.com>
illumos/illumos-gate@68ecb2ec93
This allows one to do a full (non-incremental) send and receive it as a clone
of an existing dataset. It can leverage nopwrite to share blocks with the
origin. This can be used to change the relationship of datasets on the
target. For example, maybe on the source you have:
A ---- B ---- C
And you have sent to the target a full of B, and the incremental B->C:
B ---- C
You later realize that you want to have A on the target. You will have to
do a full send of A, but nopwrite can save you space on the target if you
receive it as a clone of B, assuming that A and B have some blocks in
common:
B ---- C
 \
  A
Reviewed-by: Ned Bass <bass6@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Andriy Gapon <avg@freebsd.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: James Pan <jiaming.pan@yahoo.com>
illumos/illumos-gate@3502ed6e7c
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Will Andrews <will@firepipe.net>
illumos/illumos-gate@eb5bb58421
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Richard Yao <ryao@gentoo.org>
illumos/illumos-gate@c71c00bbe8
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Richard Yao <ryao@gentoo.org>
illumos/illumos-gate@5bdd995ddb
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Simon Klinkert <simon.klinkert@gmail.com>
illumos/illumos-gate@6575bca013
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Andriy Gapon <avg@freebsd.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Richard Yao <ryao@gentoo.org>
illumos/illumos-gate@eaef6a96de
6292 exporting a pool while an async destroy is running can leave entries
in the deferred tree
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Fabian Keil <fk@fabiankeil.de>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
illumos/illumos-gate@a443cc80c7
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
illumos/illumos-gate@b39b744be7
This is a revert of 5693.
called from zfs_write() - one of them, through dmu_write(), was handled
correctly; the other wasn't.
Reviewed by: avg@
MFC after: 1 month
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D4923
The return value doesn't need to be checked, because nvlist_get_guid's
callers check the returned values of the guids.
Coverity CID: 1341869
MFC after: 1 week
X-MFC-With: 292066
Sponsored by: Spectra Logic Corp
Using zvols as backing devices for ZFS pools is fraught with panics and
deadlocks. For example, attempting to online a missing device in the
presence of a zvol can cause a panic when vdev_geom tastes the zvol. Better
to completely disable vdev_geom from ever opening a zvol. The solution
relies on setting a thread-local variable during vdev_geom_open, and
returning EOPNOTSUPP during zvol_open if that thread-local variable is set.
Remove the check for MUTEX_HELD(&zfsdev_state_lock) in zvol_open. Its intent
was to prevent a recursive mutex acquisition panic. However, the new check
for the thread-local variable also fixes that problem.
Also, fix a panic in vdev_geom_taste_orphan. For an unknown reason, this
function was set to panic. But it can occur that a device disappears during
tasting, and it causes no problems to ignore this departure.
Reviewed by: delphij
MFC after: 1 week
Relnotes: yes
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D4986
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Gordon Ross <gwr@nexenta.com>
illumos/illumos-gate@2bd7a8d078
This fixes erroneous double increments of the 'check' variable in a loop
in spa_prop_validate(). I ran into this in the clang380-import branch,
where clang 3.8.0 warns about it. (It is already fixed there.)
MFC after: 3 days
When a ZFS drive disappears, ZFS sends a resource.fs.zfs.removed event to
userland. A userland program like zfsd(8) can use that event, for example to
activate a hotspare. The current code contains a race condition: vdev_geom
sends the sysevent _before_ spa.c updates the vdev's status,
causing userland processes to see pool state that does not reflect the
device removal. This change moves the sysevent to spa.c, closing the race.
Reviewed by: delphij, Sean Eric Fagan
MFC after: 4 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D4902
After r292066, vdev_geom verifies both the vdev and pool guids of device
labels during open. However, spare and l2arc devices don't have pool guids,
so opening them by guid will fail (opening by path, when the pathname is
known, still succeeds). This change allows a vdev to be opened by guid if
the label contains no pool_guid, which is the case for inactive spares and
l2arc devices.
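The relaxed matching rule can be sketched like this. The label is modeled as a dictionary and the function name is hypothetical; only the rule itself comes from the description above.

```python
# Sketch of the relaxed GUID match described above.
# label_matches() and the dict-based label are assumptions.
def label_matches(label, want_pool_guid, want_vdev_guid):
    """Return True if a device label is acceptable for this vdev."""
    if label.get("guid") != want_vdev_guid:
        return False
    pool_guid = label.get("pool_guid")
    # Inactive spares and l2arc devices carry no pool_guid, so a missing
    # value is accepted; a present value must still match.
    return pool_guid is None or pool_guid == want_pool_guid
```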
PR: 292066
Reported by: delphij
Reviewed by: delphij, smh
MFC after: 2 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D4861
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c:
If available, record the physical path of a vdev in ZFS meta-data.
Do this both when opening the vdev, and when receiving an attribute
change notification from GEOM.
Make vdev_geom_close() synchronous instead of deferring its work to
a GEOM event handler. There is no benefit to deferring the work and
this prevents a future open call from referencing a consumer that is
scheduled for destruction. The close followed by an immediate open
will occur during a vdev reprobe triggered by any type of I/O error.
Consolidate vdev_geom_close() and vdev_geom_detach() into
vdev_geom_close() and vdev_geom_close_locked(). This also moves the
cross linking operations between vdev and GEOM consumer into a
single place (linking in vdev_geom_attach() and unlinking in
vdev_geom_close_locked()).
Submitted by: gibbs, asomers
MFC after: 4 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D4524
Fix const conversion warning in lz4_decompress which shows when warnings
are enabled (to be done later).
MFC after: 2 weeks
X-MFC-With: r293268
Sponsored by: Multiplay
cperciva's libmd implementation is 5-30% faster
The same was done for SHA256 previously in r263218
cperciva's implementation was lacking SHA-384 which I implemented, validated against OpenSSL and the NIST documentation
Extend sbin/md5 to create sha384(1)
Chase dependencies on sys/crypto/sha2/sha2.{c,h} and replace them with sha512{c.c,.h}
Reviewed by: cperciva, des, delphij
Approved by: secteam, bapt (mentor)
MFC after: 2 weeks
Sponsored by: ScaleEngine Inc.
Differential Revision: https://reviews.freebsd.org/D3929
first instruction to see if it's either a pushm with lr, or a sub with sp.
The former is the common case, with the latter used with va_args.
This removes 12 probes. These are all hand-written assembly, with a few C
functions with no stack usage.
Submitted by: Howard Su <howard0su@gmail.com>
Differential Revision: https://reviews.freebsd.org/D4419
Rather than pushing all eight possible arguments into dtrace_probe()'s
stack frame, make the syscall_args struct for the current syscall available
via the current thread. Using a custom getargval method for the systrace
provider, this allows any syscall argument to be fetched, even in kernels
that have modified the maximum number of system call arguments.
Sponsored by: EMC / Isilon Storage Division
With the new VOP_GETPAGES() KPI the "count" argument counts pages already,
and doesn't need to be translated from bytes to pages.
While here make it consistent that *rbehind and *rahead are updated only
if we don't return an error.
Pointy hat to: glebius
o With new KPI consumers can request contiguous ranges of pages, and
unlike before, all pages will be kept busied on return, like it was
done before with the 'reqpage' only. Now the reqpage goes away. With
new interface it is easier to implement code protected from race
conditions.
Such arrayed requests for now should be preceded by a call to
vm_pager_haspage() to make sure that request is possible. This
could be improved later, making vm_pager_haspage() obsolete.
Strengthening the promises on the busy state of the array of pages
allows us to remove such hacks as swp_pager_free_nrpage() and
vm_pager_free_nonreq().
o New KPI accepts two integer pointers that may optionally point at
values for read ahead and read behind, that a pager may do, if it
can. These pages are completely owned by pager, and not controlled
by the caller.
This shifts the UFS-specific readahead logic from vm_fault.c, which
should be file system agnostic, into vnode_pager.c. It also removes
one VOP_BMAP() request per hard fault.
Discussed with: kib, alc, jeff, scottl
Sponsored by: Nginx, Inc.
Sponsored by: Netflix
except during split, add, or create operations. This fixes a bug where the
wrong disk could be returned, and higher layers of ZFS would immediately
eject it again.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c:
o When opening by GUID, require both the pool and vdev GUIDs to
match. While it is highly unlikely for two vdevs to have the same
vdev GUIDs, the ZFS storage pool allocator only guarantees they
are unique within a pool.
o Modify the open behavior to:
- If we are opening a vdev that hasn't previously been opened,
open by path without checking GUIDs.
- Otherwise, open by path and verify GUIDs.
- If that fails, search all geom providers for a device with
matching GUIDs.
- If that fails, return ENOENT.
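The open sequence listed above can be sketched as a small model. Providers are represented as a path-to-label dictionary and the vdev as a plain dictionary; these structures and the function names are assumptions for the example, not the GEOM API.

```python
import errno


def guids_match(label, vdev):
    """Both the pool GUID and the vdev GUID must match."""
    return (label["pool_guid"] == vdev["pool_guid"]
            and label["guid"] == vdev["guid"])


def open_vdev(vdev, providers):
    """Return (error, path) following the open order described above.

    vdev:      dict with "path", "pool_guid", "guid", "previously_opened".
    providers: dict mapping device path -> label dict.
    """
    label = providers.get(vdev["path"])
    if not vdev["previously_opened"]:
        # First open (split, add, create): trust the path, skip GUID checks.
        if label is not None:
            return 0, vdev["path"]
        return errno.ENOENT, None
    if label is not None and guids_match(label, vdev):
        return 0, vdev["path"]
    # The path is stale or points at the wrong disk: search every provider.
    for path, candidate in providers.items():
        if guids_match(candidate, vdev):
            return 0, path
    return errno.ENOENT, None
```

Requiring both GUIDs in the search step is what keeps a disk from another pool, which may legitimately share a vdev GUID, from being returned.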
Submitted by: gibbs, asomers
Reviewed by: smh
MFC after: 4 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D4486
r281257 added support for lazyload mode by allowing dtrace(1) to register
a DOF section on behalf of a traced process. This was implemented by
having libdtrace copy the DOF section into a heap-allocated buffer and
passing its address to the ioctl handler. However, DTrace uses the DOF
section address as a lookup key in certain cases, so the ioctl handler
should be given the target process' DOF section address instead. This
change modifies the ADDDOF handler to copy the DOF section in from the
target process, rather than from dtrace(1).
These helper functions can be used to read in or write a buffer from or to
an arbitrary process' address space. Without them, this can only be done
using proc_rwmem(), which requires the caller to fill out a uio. This is
onerous and results in code duplication; the new functions provide a simpler
interface which is sufficient for most existing callers of proc_rwmem().
This change also adds a manual page for proc_rwmem() and the new functions.
Reviewed by: jhb, kib
Differential Revision: https://reviews.freebsd.org/D4245
While here update for armv6 to a tested value.
Submitted by: Howard Su <howard0su@gmail.com>
Reviewed by: stat
Differential Revision: https://reviews.freebsd.org/D4315
Boundary Trace to assembly to reduce the overhead of these checks.
Submitted by: Howard Su <howard0su@gmail.com>
Relnotes: Yes
Differential Revision: https://reviews.freebsd.org/D4266
tunable via sysctl or kernel tunables.
illumos allows these parameters to be changed via the fasttrap.conf configuration
file, but FreeBSD code hardcoded the parameters. Expose them under
the kern.dtrace.fasttrap sysctl tree.
MFC after: 2 weeks
stack, take into account the copy of rsi pushed between the breakpoint
trapframe and the dtrace_invop frame. Prior to r287644, this was covered
by the fact that sizeof(struct amd64_frame) was 24 rather than 16.
Reported by: smh
As reported by Coverity, a null pointer dereference panic would be triggered
when zfs_recover was set, so switch to a straight panic as it can never be
recovered.
Reported by: Coverity Scan
MFC after: 1
X-MFC-With: r290401
Sponsored by: Multiplay
wrong value in the comparison, leading to incorrectly setting the new
value.
This has been observed in the ZFS code. Without this we can lose track of
the reference count in a zrlock object.
We should move to use the generic atomic functions, however as this has
been observed I would prefer to have this working, then move to the generic
functions.
PR: 204037
Sponsored by: ABT Systems Ltd
b_asize can be zero if the block is compressed into an empty block
(ZIO_COMPRESS_EMPTY) and the trim code asserts that meaningless
zero-sized trimming is not attempted.
The logic for calling trim_map_free() is extracted into a new function
l2arc_trim() to minimize code duplication.
PR: 203473
Reported by: Willem Jan Withagen <wjw@digiware.nl>
Tested by: Willem Jan Withagen <wjw@digiware.nl>
MFC after: 11 days
linux_syscallnames[] from linux_* to linux32_* to avoid conflicts with
linux64.ko. While here, add support for linux64 binaries to systrace.
- Update NOPROTO entries in amd64/linux/syscalls.master to match the
main table to fix systrace build.
- Add a special case for union l_semun arguments to the systrace
generation.
- The systrace_linux32 module now only builds the systrace_linux32.ko
module on amd64.
- Add a new systrace_linux module that builds on both i386 and amd64.
For i386 it builds the existing systrace_linux.ko. For amd64 it
builds a systrace_linux.ko for 64-bit binaries.
Reviewed by: markj
Differential Revision: https://reviews.freebsd.org/D3954
5561 support root pools on EFI/GPT partitioned disks
5125 update zpool/libzfs to manage bootable whole disk pools (EFI/GPT labeled disks)
Reviewed by: Jean McCormack <jean.mccormack@nexenta.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Hans Rosenfeld <hans.rosenfeld@nexenta.com>
illumos/illumos-gate@1a902ef862
These are NOP changes for FreeBSD.
4185 add new cryptographic checksums to ZFS: SHA-512, Skein, Edon-R
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@45818ee124
This is only a partial merge of respective ZFS infrastructure changes.
At the moment the FreeBSD kernel does not have those crypto algorithms, so the
parts of the code to enable them are commented out. When they are
implemented, it will be trivial to plug them in.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Xin Li <delphij@freebsd.org>
Reviewed by: Arne Jansen <sensille@gmx.net>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@9c3fd1216f
For more info, see:
- slides http://www.slideshare.net/MatthewAhrens/openzfs-send-and-receive
- video https://www.youtube.com/watch?v=iY44jPMvxog
- manpage changes (for zfs receive -s and zfs send -t)
- upcoming talk at the OpenZFS Developer Summit
The TL;DR is:
Use "zfs receive -s" to save the partially received state on failure.
On failure, get the receive token with "zfs get receive_resume_token <fs>"
Resume the send with "zfs send -t <token_value>"
Relnotes: yes
Reviewed by: Richard Yao <ryao@gentoo.org>
Reviewed by: Xin LI <delphij@freebsd.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Justin T. Gibbs <gibbs@FreeBSD.org>
illumos/illumos-gate@d2058105c6
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@8fe00bfb87
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@6de9bb5603
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Justin Gibbs <gibbs@scsiguy.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@0f2e7d03b8
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Alexander Motin <mav@freebsd.org>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Justin Gibbs <gibbs@scsiguy.com>
Reviewed by: Xin Li <delphij@freebsd.org>
Approved by: Gordon Ross <gordon.ross@nexenta.com>
Author: George Wilson <george.wilson@delphix.com>
illumos/illumos-gate@632802744e
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Reviewed by: Albert Lee <trisk@omniti.com>
Reviewed by: Xin Li <delphij@freebsd.org>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: George Wilson <george.wilson@delphix.com>
illumos/illumos-gate@139510fb6e
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Albert Lee <trisk@omniti.com>
Reviewed by: Xin Li <delphij@freebsd.org>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: George Wilson <george.wilson@delphix.com>
illumos/illumos-gate@b10bba7246
Before r278702 prefetch was blocked for I/Os > 1MB, after -- >= 1MB.
1MB I/Os are used for bulk operations in CTL (XCOPY, VERIFY), and disabling
prefetch for them reduced the performance.
This is a temporary local patch that should be replaced when upstreamed.
Discussed with: mahrens
MFC after: 3 days
otherwise DTRACE_ANCHORED() returns false and that makes stack()
insert a bogus frame at the top.
For example:
dtrace -n 'test:dtrace_test::sdttest { stack(); }'
This change is not really a solution, but just a work-around.
The real solution is to record the probe's call site and to use
that for resolving a function name.
PR: 195222
MFC after: 22 days
A change to a property on a dataset must be propagated to its descendants
in case that property is inherited. For datasets whose information is
not currently loaded into memory (e.g. a snapshot that isn't currently
mounted), there is nothing to do; the property change will take effect
the next time that dataset is loaded. To handle updates to datasets that
are in-core, ZFS registers a callback entry for each property of each
loaded dataset with the dsl directory that holds that dataset. There
is a dsl directory associated with each live dataset that references
both the live dataset and any snapshots of the live dataset. A property
change is effected by doing a traversal of the tree of dsl directories
for a pool, starting at the directory sourcing the change, and invoking
these callbacks.
The current implementation both registers and de-registers properties
individually for each loaded dataset. While registration for a property is
O(1) (insert into a list), de-registration is O(n) (search list and then
remove). The 'n' for de-registration, however, is not limited to the size
(number of snapshots + 1) of the dsl directory. The eviction portion
of the life cycle for the in core state of datasets is asynchronous,
which allows multiple copies of the dataset information to be in-core
at once. Only one of these copies is active at any time with the rest
going through tear down processing, but all copies contribute to the
cost of performing a dsl_prop_unregister().
One way to create multiple, in-flight copies of dataset information
is by performing "zfs list" operations from multiple threads
concurrently. In-core dataset information is loaded on demand and then
evicted when the reference count drops to zero. For datasets that are not
mounted, there is no persistent reference count to keep them resident.
So, a list operation will load them, compute the information required to
do the list operation, and then evict them. When performing this operation
from multiple threads it is possible that some of the in-core dataset
information will be reused, but also possible to lose the race and load
the dataset again, even while the same information is being torn down.
Compounding the performance issue further is a change made for illumos
issue 5056 which made dataset eviction single threaded. In environments
using automation to manage ZFS datasets, it is now possible to create
enough of a backlog of dataset evictions to consume excessive amounts
of kernel memory and to bog down the system.
The fix employed here is to make property de-registration O(1). With this
change in place, it is hoped that a single thread is more than sufficient
to handle eviction processing. If it isn't, the problem can be solved
by increasing the number of threads devoted to the eviction taskq.
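The idea of O(1) de-registration can be sketched with intrusive doubly linked
lists. This is a simplified illustration, not the code in dsl_prop.c: every
type and function name below is hypothetical, locking is omitted, and the
per-property grouping is left out. Each callback record is linked into both
its directory's list and its dataset's list, so removing all records of a
dataset never requires a search.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Intrusive doubly linked list node: unlinking needs no list search. */
typedef struct node {
	struct node *prev, *next;
} node_t;

static inline void list_init(node_t *head) { head->prev = head->next = head; }

static inline void
list_insert(node_t *head, node_t *n)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

static inline void
list_unlink(node_t *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = n;
}

/* O(n) helper used only to inspect the lists in examples. */
static inline int
list_len(const node_t *head)
{
	int len = 0;
	for (const node_t *n = head->next; n != head; n = n->next)
		len++;
	return (len);
}

/* Hypothetical callback record, linked into two lists at once. */
typedef struct cb_record {
	node_t by_dir;		/* link in the directory's list */
	node_t by_dataset;	/* link in the owning dataset's list */
	const char *propname;
} cb_record_t;

typedef struct dir { node_t records; } dir_t;
typedef struct dataset { node_t records; dir_t *dir; } dataset_t;

static inline void dir_init(dir_t *dd) { list_init(&dd->records); }

static inline void
dataset_init(dataset_t *ds, dir_t *dd)
{
	list_init(&ds->records);
	ds->dir = dd;
}

/* O(1) registration: insert the record at the head of both lists. */
static inline int
prop_register(dataset_t *ds, const char *propname)
{
	cb_record_t *cbr = malloc(sizeof (*cbr));
	if (cbr == NULL)
		return (-1);
	cbr->propname = propname;
	list_insert(&ds->dir->records, &cbr->by_dir);
	list_insert(&ds->records, &cbr->by_dataset);
	return (0);
}

/* Remove every record of a dataset; each removal is O(1). */
static inline int
prop_unregister_all(dataset_t *ds)
{
	int removed = 0;
	while (ds->records.next != &ds->records) {
		node_t *n = ds->records.next;
		cb_record_t *cbr = (cb_record_t *)
		    ((char *)n - offsetof(cb_record_t, by_dataset));
		list_unlink(&cbr->by_dataset);
		list_unlink(&cbr->by_dir);	/* no walk of the dir list */
		free(cbr);
		removed++;
	}
	return (removed);
}
```

The cost of prop_unregister_all() in this sketch depends only on the number
of records owned by the dataset being torn down, not on how many other
copies are linked into the same directory.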
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dataset.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dir.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_prop.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_dataset.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_dir.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_prop.h:
Associate dsl property callback records with both the
dsl directory and the dsl dataset that is registering the
callback. Both connections are protected by the dsl directory's
"dd_lock".
When linking callbacks into a dsl directory, group them by
the property type. This helps reduce the space penalty for the
double association (the property name pointer is stored once
per dsl_dir instead of in each record) and reduces the number of
strcmp() calls required to do callback processing when updating
a single property. Property types are stored in a linked list
since currently ZFS registers a maximum of 10 property types
for each dataset.
Note that the property buckets/records associated with a dsl
directory are created on demand, but only freed when the dsl
directory is freed. Given the static nature of property types
and their small number, there is no benefit to freeing the few
bytes of memory used to represent the property record earlier.
When a property record becomes empty, the dsl directory is either
going to become unreferenced a little later in this thread of
execution, or there is a high chance that another dataset is
going to be loaded that would recreate the bucket anyway.
Replace dsl_prop_unregister() with dsl_prop_unregister_all().
All callers of dsl_prop_unregister() are trying to remove
all property registrations for a given dsl dataset anyway. By
changing the API, we can avoid doing any lookups of callbacks
by property type and just traverse the list of all callbacks
for the dataset and free each one.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c:
Replace use of dsl_prop_unregister() with the new
dsl_prop_unregister_all() API.
illumos/illumos-gate@03bad06fbb
Author: Justin Gibbs <gibbs@scsiguy.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Illumos issue:
6171 dsl_prop_unregister() slows down dataset eviction
https://www.illumos.org/issues/6171
MFC after: 2 weeks
c546f36aa8
https://www.illumos.org/issues/6220
5408 introduced a memory leak in l2arc: the member b_thawed is leaked when
an arc_hdr is reallocated from full to l2only.
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: George Wilson <george@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Arne Jansen <sensille@gmx.net>
updated
ZFS already supports storing the vdev FRU in a vdev property. There
is code in libzfs to work with this property, and there is code in
the zfs-retire FMA module that looks for that information. But there
is no code actually setting or updating the FRU.
To address this, ZFS is changed to send a handful of new events
whenever a vdev is added, attached, cleared, or onlined, as well
as when a pool is created or imported.
Note that syseventd is not currently available on FreeBSD, so more
work is needed (e.g. in zfsd) before the new ZFS events can actually
be used. This changeset is mostly a diff reduction from upstream.
illumos/illumos-gate@1437283407
Illumos issues:
5997 FRU field not set during pool creation and never updated
https://www.illumos.org/issues/5997
In r286570 (MFV of r277426) an unprotected write to b_flags to
set the compression mode was introduced. This would open a race
window where data is partially decompressed, modified, checksummed
and written to the pool, resulting in pool corruption due to the
partial decompression.
Prevent this by reintroducing b_compress.
illumos/illumos-gate@d4cd038c92
Illumos issues:
6214 zpools going south
https://www.illumos.org/issues/6214
Rewrite the ZFS prefetch code to detect only forward, sequential
streams.
The following kstats have been added:
kstat.zfs.misc.arcstats.sync_wait_for_async
How many sync reads have waited for an async read
to complete. (less is better)
kstat.zfs.misc.arcstats.demand_hit_predictive_prefetch
How many demand reads didn't have to wait for I/O
because of predictive prefetch. (more is better)
zfetch kstats have been simplified to hits, misses, and max_streams,
with max_streams counting the times when we were not able to create
a new stream because we already have the maximum number of sequences
for a file.
The sysctl variable/loader tunable vfs.zfs.zfetch.block_cap has been
replaced by vfs.zfs.zfetch.max_distance, which controls the maximum
number of bytes to prefetch per stream.
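A forward-only stream detector of the kind described above can be sketched as
follows. This is a simplified illustration under stated assumptions, not the
dmu_zfetch.c implementation: MAX_STREAMS, the structure layout, and all
function names are hypothetical, and the choice to count a miss before
checking the stream limit is an assumption of the sketch. An access is a hit
only when it starts exactly where an existing stream expects its next block;
anything else, including a backward access, is a miss.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_STREAMS 4	/* hypothetical per-file stream limit */

typedef struct stream {
	uint64_t next_blkid;	/* block where this stream should continue */
} stream_t;

typedef struct fetcher {
	stream_t streams[MAX_STREAMS];
	int nstreams;
	uint64_t hits, misses, max_streams;	/* mirrors the three kstats */
} fetcher_t;

static inline void
fetcher_init(fetcher_t *f)
{
	memset(f, 0, sizeof (*f));
}

/* Returns 1 if the access continued a forward sequential stream, else 0. */
static inline int
fetcher_access(fetcher_t *f, uint64_t blkid, uint64_t nblks)
{
	assert(nblks > 0);
	for (int i = 0; i < f->nstreams; i++) {
		if (f->streams[i].next_blkid == blkid) {
			f->streams[i].next_blkid = blkid + nblks;
			f->hits++;
			return (1);
		}
	}
	f->misses++;
	if (f->nstreams == MAX_STREAMS) {
		/* No room for a new stream: count it and give up. */
		f->max_streams++;
		return (0);
	}
	f->streams[f->nstreams++].next_blkid = blkid + nblks;
	return (0);
}

/* Cap the number of blocks to prefetch by a max_distance given in bytes. */
static inline uint64_t
prefetch_blocks(uint64_t wanted_blks, uint64_t blksz, uint64_t max_distance)
{
	assert(blksz > 0);
	uint64_t cap = max_distance / blksz;
	return (wanted_blks < cap ? wanted_blks : cap);
}
```

In this sketch a per-stream byte limit plays the role of
vfs.zfs.zfetch.max_distance: the prefetch depth is expressed in bytes and
converted to blocks using the file's block size.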
illumos/illumos-gate@cf6106c8a0
Illumos ZFS issues:
5987 zfs prefetch code needs work
https://www.illumos.org/issues/5987
since on amd64 the first argument to a function is generally not on the
stack.
Revert an old DTrace bug fix to some code that assumed that
sizeof(struct amd64_frame) == 16.
Reviewed by: jhb, kib
Sponsored by: EMC / Isilon Storage Division
Differential Revision: https://reviews.freebsd.org/D3255