OpenZFS uses feature flags instead of a zpool version number to track
features since the split from Oracle. In addition to avoiding confusion
on ZFS vs OpenZFS version numbers, this also allows features to be added
to different operating systems that use OpenZFS in different order.
The previous ZFS boot code (gptzfsboot) and loader (zfsloader) blindly
tried to read the pool and, if that failed, provided only a vague error
message.
With this change, both the boot code and loader check the MOS features
list in the ZFS label and compare it against the list of features that
the loader supports. If any unsupported feature is active, the pool is
not considered as a candidate for booting, and a helpful diagnostic
message is printed to the screen. Features that are merely enabled via
zpool upgrade, but not in use, do not block booting from the pool.
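The check described above can be sketched roughly as follows. This is an
illustration only, not the loader's code: the supported_features table,
struct pool_feature, and first_unsupported_active() are hypothetical
names, and the sketch assumes that a feature counts as "active" when its
reference count is non-zero.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical list of features the loader can read; not the real table. */
static const char *const supported_features[] = {
	"org.illumos:lz4_compress",
	"com.delphix:hole_birth",
	"com.delphix:embedded_data",
};

/* One entry of the pool's feature list: name plus reference count. */
struct pool_feature {
	const char *name;
	uint64_t refcount;	/* assumed: 0 means enabled but not in use */
};

static bool
feature_is_supported(const char *name)
{
	size_t n = sizeof(supported_features) / sizeof(supported_features[0]);

	for (size_t i = 0; i < n; i++)
		if (strcmp(supported_features[i], name) == 0)
			return (true);
	return (false);
}

/*
 * Return NULL if every active feature is supported, otherwise the name
 * of the first active feature that is not in the supported list.
 */
const char *
first_unsupported_active(const struct pool_feature *f, size_t nfeat)
{
	for (size_t i = 0; i < nfeat; i++) {
		if (f[i].refcount == 0)
			continue;	/* merely enabled: does not block */
		if (!feature_is_supported(f[i].name))
			return (f[i].name);
	}
	return (NULL);
}
```

A caller would skip the pool as a boot candidate and print the returned
name whenever the result is not NULL.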
Submitted by: Toomas Soome <tsoome@me.com>
Reviewed by: delphij, mav
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D6857
Also reduce the diff between us and upstream: the input data model will
always be DATAMODEL_NATIVE because of a bug (p_model is never set but is
always initialized to 0), so we don't need to override the caller anyway.
This change is also necessary to support the pid provider for 32-bit
processes on amd64.
MFC after: 2 weeks
illumos/illumos-gate@1825bc56e5
https://www.illumos.org/issues/6878
Summary of changes:
* Replace generic "scan done" message with "scan aborted, restarting",
"scan cancelled", or "scan done"
* Log number of errors using spa_get_errlog_size
* Refactor scan restarting check into static function
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Nav Ravindranath <nav@delphix.com>
MFC after: 2 weeks
illumos/illumos-gate@99189164df
https://www.illumos.org/issues/6940
Similar to #6334, but this time with empty directories:
$ zfs create tank/quota
$ zfs set quota=10M tank/quota
$ zfs snapshot tank/quota@snap1
$ zfs set mountpoint=/mnt/tank/quota tank/quota
$ mkdir /mnt/tank/quota/dir # create an empty directory
$ mkfile 11M /mnt/tank/quota/11M
/mnt/tank/quota/11M: initialized 9830400 of 11534336 bytes: Disc quota exceeded
$ rmdir /mnt/tank/quota/dir # now unlink the empty directory
rmdir: directory "/mnt/tank/quota/dir": Disc quota exceeded
From a user's perspective, I would expect ZFS to always be able to remove
files and directories, even when the quota is exceeded.
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Simon Klinkert <simon.klinkert@gmail.com>
MFC after: 2 weeks
illumos/illumos-gate@8df0bcf0df
https://www.illumos.org/issues/6513
If a ZFS object contains a hole at level one, and then a data block is created
at level 0 underneath that l1 block, l0 holes will be created. However, these
l0 holes do not have the birth time property set; as a result, incremental
sends will not send those holes.
The fix is to modify the dbuf_read code to fill in birth time data.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Paul Dagnelie <pcd@delphix.com>
MFC after: 3 weeks
illumos/illumos-gate@11ceac77ea
https://www.illumos.org/issues/6844
dnode_next_offset is used in a variety of places to iterate over the holes or
allocated blocks in a dnode. It operates under the premise that it can iterate
over the blockpointers of a dnode in open context while holding only the
dn_struct_rwlock as reader. Unfortunately, this premise does not hold.
When we create the zio for a dbuf, we pass in the actual block pointer in the
indirect block above that dbuf. When we later zero the bp in
zio_write_compress, we are directly modifying the bp. The state of the bp is
now inconsistent from the perspective of dnode_next_offset: the bp will appear
to be a hole until zio_dva_allocate finally finishes filling it in. In the
meantime, dnode_next_offset can detect a hole in the dnode when none exists.
I was able to experimentally demonstrate this behavior with the following
setup:
1. Create a file with 1 million dbufs.
2. Create a thread that randomly dirties L2 blocks by writing to the first L0
block under them.
3. Observe dnode_next_offset, waiting for it to skip over a hole in the middle
of a file.
4. Do dnode_next_offset in a loop until we skip over such a non-existent hole.
The fix is to ensure that it is valid to iterate over the indirect blocks in a
dnode while holding the dn_struct_rwlock by passing the zio a copy of the BP
and updating the actual BP in dbuf_write_ready while holding the lock.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Boris Protopopov <bprotopopov@hotmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Alex Reece <alex@delphix.com>
MFC after: 3 weeks
The change has been undone in r301275 on the assumption that it was no
longer required. But that was incorrect, because in this case (and only
in this case) the snapshot root vnode is looked up before z_parent is
fixed up.
MFC after: 5 days
Predicates are DIF objects whose return value is compared with zero to
determine whether the corresponding probe body is to be executed. The return
value itself is the contents of a 64-bit DIF register, but it was being
truncated to an int before the comparison. This meant that a predicate such
as /0x100000000/ would evaluate to false.
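The effect of the truncation can be shown with a small sketch. The
function names are hypothetical and this is not the DTrace code itself.
It assumes a 32-bit int and models the truncation with an explicit
uint32_t cast so that the behaviour is well defined.

```c
#include <stdbool.h>
#include <stdint.h>

/* Old behaviour: keep only the low 32 bits, then compare with zero. */
bool
predicate_truncated(uint64_t rval)
{
	return ((uint32_t)rval != 0);
}

/* Fixed behaviour: compare the full 64-bit register value with zero. */
bool
predicate_full(uint64_t rval)
{
	return (rval != 0);
}
```

With rval set to 0x100000000, the low 32 bits are all zero, so the first
function reports false while the second reports true.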
Reported by: rwatson
MFC after: 3 days
Because the initial ARC configuration has not been done and kmem
information is not yet available, we need to blindly set zfs_arc_max and
zfs_arc_min when they are configured via the tunable.
This fixes vfs.zfs.arc_(min|max) configuration via loader.conf broken by
r302265.
Approved by: re(gjb)
MFC after: 1 week
mp_maxid or CPU_FOREACH() as appropriate. This fixes a number of places in
the kernel that assumed CPU IDs are dense in [0, mp_ncpus) and would try,
for example, to run tasks on CPUs that did not exist or to allocate too
few buffers on systems with sparse CPU IDs in which there are holes in the
range and mp_maxid > mp_ncpus. Such circumstances generally occur on
systems with SMT, but on which SMT is disabled. This patch restores system
operation at least on POWER8 systems configured in this way.
There are a number of other places in the kernel with potential problems
in these situations, but where sparse CPU IDs are not currently known
to occur, mostly in the ARM machine-dependent code. These will be fixed
in a follow-up commit after the stable/11 branch.
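The difference between the two iteration patterns can be sketched in
userland as follows. The topology, the cpu_present table, and the demo_*
names are hypothetical stand-ins for mp_ncpus, mp_maxid, and the kernel's
notion of an absent CPU. In the kernel, CPU_FOREACH() performs the
equivalent walk and skips absent IDs.

```c
#include <stdbool.h>

#define	MAXCPU_DEMO	8

/* Hypothetical topology: 4 CPUs present at IDs 0, 2, 4 and 6. */
static const bool cpu_present[MAXCPU_DEMO] = {
	true, false, true, false, true, false, true, false
};
static const int demo_ncpus = 4;	/* stands in for mp_ncpus */
static const int demo_maxid = 6;	/* stands in for mp_maxid */

/* Buggy pattern: assumes CPU IDs are dense in [0, mp_ncpus). */
int
count_dense(void)
{
	int found = 0;

	for (int i = 0; i < demo_ncpus; i++)
		if (cpu_present[i])
			found++;
	return (found);
}

/* Fixed pattern: walk [0, mp_maxid] and skip IDs that are absent. */
int
count_sparse(void)
{
	int found = 0;

	for (int i = 0; i <= demo_maxid; i++) {
		if (!cpu_present[i])
			continue;
		found++;
	}
	return (found);
}
```

With this topology the dense loop visits IDs 0 through 3 and finds only
two of the four CPUs, while the sparse-aware loop finds all four.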
PR: kern/210106
Reviewed by: jhb
Approved by: re (glebius)
Those changes were found to confuse the FreeBSD libc ACL code, which does
not differentiate between ACLs for directories and files, and which
reported the ACLs of all directories created after those patches as
non-trivial. In addition, the changes were considered wrong from the
POSIX and NFSv4 points of view. Until further investigation is done
upstream, revert those changes locally in preparation for the FreeBSD
11.0 release.
Approved by: re (hrs)
Prior to this change, the ZFS ARC min / max could only be changed using
boot-time tunables. This change allows the values to be tuned at runtime
using the sysctls:
* vfs.zfs.arc_max
* vfs.zfs.arc_min
When the ZFS ARC minimum is lowered, the memory used will only shrink
to the new minimum under memory pressure.
Reviewed by: allanjude
Approved by: re (gjb)
MFC after: 2 weeks
Relnotes: yes
Sponsored by: Multiplay
Differential Revision: https://reviews.freebsd.org/D5907
getzfsvfs() called vfs_busy() in the waiting mode while having a hold on
a pool (via a call to dmu_objset_hold). In other words,
dp_config_rwlock was held in the shared mode while a thread could be
sleeping in vfs_busy().
The pool's txg sync thread needs to take dp_config_rwlock in the
exclusive mode for some actions, e.g., for executing sync tasks. If the
sync thread gets blocked, then any thread waiting for its sync task to
get executed is also blocked, which, in turn, could mean that
vfs_busy() will keep waiting indefinitely.
The solution is to use vfs_ref() in the locked section and to call
vfs_busy() only after dropping other locks.
Note that a reference on a struct mount object does not prevent an
associated zfsvfs_t object from being destroyed. So, we have to be
careful to operate only on the struct mount object until we successfully
vfs_busy it.
Approved by: re (gjb)
MFC after: 2 weeks
This apparently puts ARC back under the limits after the vnode pressure
rework in r291244, in particular due to the kmem exhaustion.
Based on patch by: mckusick
Reviewed by: avg, mckusick
Tested by: allanjude, madpilot
Sponsored by: The FreeBSD Foundation
Approved by: re (gjb)
The change is in arc_buf_l2_cdata_free().
Without this we can trip the assertion in arc_hdr_realloc()
if INVARIANTS option is enabled.
Approved by: re (kib)
MFC after: 1 week
This is a followup to r300131.
A filesystem's root vnode can be reached not only through VFS_ROOT, but
by other means as well, for example via a dot-dot lookup.
Also, a root vnode can get reclaimed and then re-created. For these
reasons it was insufficient to clear the VV_ROOT flag from the root vnode
of a snapshot mounted under .zfs in zfsctl_snapdir_lookup().
So, now we set the flag in zfs_znode_sa_init() only if a vnode
represents the root of a filesystem or a standalone snapshot.
That is, the flag is not set for snapshots mounted under .zfs.
MFC after: 2 weeks
It could happen in an unlikely case that we fail to lock the root vnode
with requested flags (which appear to never include LK_NOWAIT).
MFC after: 1 week
sys/cddl/contrib/opensolaris/uts/common/sys/acl.h:
Improve the English in a comment. No functional changes.
Submitted by: gibbs
MFC after: 4 weeks
Sponsored by: Spectra Logic Corp
Support for the new hashing algorithms in ZFS was introduced in r289422.
However, it was disconnected because FreeBSD lacked implementations of
SHA-512 (truncated to 256 bits) and Skein.
These implementations were introduced in r300921 and r300966 respectively.
This commit connects them to ZFS and enables these new checksum algorithms.
These new algorithms are not supported by the boot blocks, so do not use
them on your root dataset if you boot from ZFS.
Relnotes: yes
Sponsored by: ScaleEngine Inc.
Add zfsd, which deals with hard drive faults in ZFS pools. It manages
hotspares and replacements in drive slots that publish physical paths.
cddl/usr.sbin/zfsd
Add zfsd(8) and its unit tests
cddl/usr.sbin/Makefile
Add zfsd to the build
lib/libdevdctl
A C++ library that helps devd clients process events
lib/Makefile
share/mk/bsd.libnames.mk
share/mk/src.libnames.mk
Add libdevdctl to the build. It's a private library, unusable by
out-of-tree software.
etc/defaults/rc.conf
By default, set zfsd_enable to NO
etc/mtree/BSD.include.dist
Add a directory for libdevdctl's include files
etc/mtree/BSD.tests.dist
Add a directory for zfsd's unit tests
etc/mtree/BSD.var.dist
Add /var/db/zfsd/cases, where zfsd stores case files while it's shut
down.
etc/rc.d/Makefile
etc/rc.d/zfsd
Add zfsd's rc script
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c
Fix the resource.fs.zfs.statechange message. It had a number of
problems:
* It was only being emitted on a transition to the HEALTHY state.
That made it impossible for zfsd to take actions based on drives
getting sicker.
* It compared the new state to vdev_prevstate, which is the state that
the vdev had the last time it was opened. That doesn't make sense,
because a vdev can change state multiple times without being
reopened.
* vdev_set_state contains logic that will change the device's new
state based on various conditions. However, the statechange event
was being posted _before_ that logic took effect. Now it's being
posted after.
Submitted by: gibbs, asomers, mav, allanjude
Reviewed by: mav, delphij
Relnotes: yes
Sponsored by: Spectra Logic Corp, iX Systems
Differential Revision: https://reviews.freebsd.org/D6564
The sys/types.h fix I proposed was only tested with zfs(4), not with
libzpool, which is where the build failure actually existed.
Remove vm/vm_pageout.h from arc.c and zfs_vnops.c because it is unneeded
in both.
MFC after: 1 week
X-MFC with: r300865, r300870
In collaboration with: kib
Submitted by: alc
Sponsored by: EMC / Isilon Storage Division
ZFS's configuration needs to be updated whenever the physical path for a
device changes, but not when a new device is introduced. This is because new
devices necessarily cause config updates, but only if they are actually
accepted into the pool.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
Split vdev_geom_set_physpath out of vdev_geom_attrchanged. When
setting the vdev's physical path, only request a config update if
the physical path has changed. Don't request it when opening a
device for the first time, because the config sync will happen
anyway upstack.
sys/geom/geom_dev.c
Split g_dev_set_physpath and g_dev_set_media out of
g_dev_attrchanged
Submitted by: will, asomers
MFC after: 4 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D6428
vm/vm_pageout.h grew a dependency on the bool typedef in r300865.
arc.c didn't include sys/types.h, which provides the definition of that
typedef.
Other items (ofed, drm2) might need to be chased for this commit.
X-MFC with: r300865
MFC after: 1 week
Pointyhat to: alc
Sponsored by: EMC / Isilon Storage Division
former return the current status for the latter to use. Without this we
could enable interrupts when they shouldn't be.
It's still not quite right, as it should only update the bits we care
about, but it should be good enough until the correct fix can be tested.
PR: 204270
Obtained from: ABT Systems Ltd
Sponsored by: The FreeBSD Foundation
That was both redundant as zfs_znode_sa_init() already does the job and
insufficient as the root vnode can be reached via other means.
MFC after: 1 week
gfs code is (almost) completely agnostic of FreeBSD VFS locking, so it
does not handle doomed but not yet dead vnodes and may return them.
Check for those vnodes here and retry a lookup.
Note that ZFS and gfs have additional protections that ensure that a
parent vnode of the current vnode is never doomed.
The problem fixed here is an occasional failure to look up the
'snapshot' or 'shares' directory under .zfs.
Note that, for the above reason, all uses of zfsctl_root_lookup() would
be better replaced with VOP_LOOKUP.
MFC after: 5 weeks
Speedup is hard to measure because the only time vdev_geom_open_by_guids
gets called on many drives at the same time is during boot. But with
vdev_geom_open hacked to always call vdev_geom_open_by_guids, operations
like "zpool create" speed up by 65%.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
* Read all of a vdev's labels in parallel instead of sequentially.
* In vdev_geom_read_config, don't read the entire label, including
the uberblock. That's a waste of RAM. Just read the vdev config
nvlist. Reduces the IO and RAM involved with tasting from 1MB to
448KB.
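The 1MB and 448KB figures are consistent with the on-disk label layout,
as this small worked calculation shows. It is a sketch under stated
assumptions: four labels per vdev, 256KB per label, and a 112KB config
nvlist region in each label. The DEMO_* constants and function names are
hypothetical and are not taken from the patch.

```c
#include <stddef.h>

/* Assumed on-disk layout, in bytes; stand-ins for the real constants. */
#define	DEMO_VDEV_LABELS	4		/* labels per vdev */
#define	DEMO_LABEL_SIZE		(256 * 1024)	/* one whole label */
#define	DEMO_PHYS_SIZE		(112 * 1024)	/* config nvlist region */

/* Bytes read when tasting by reading every label in full. */
size_t
taste_bytes_full(void)
{
	return ((size_t)DEMO_VDEV_LABELS * DEMO_LABEL_SIZE);
}

/* Bytes read when tasting by reading only the config nvlist regions. */
size_t
taste_bytes_config(void)
{
	return ((size_t)DEMO_VDEV_LABELS * DEMO_PHYS_SIZE);
}
```

Under these assumptions, 4 x 256KB gives 1MB and 4 x 112KB gives 448KB.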
Reviewed by: avg
MFC after: 4 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D6153
FreeBSD's zfs_ioc_rename() has an option, not present upstream, that
allows snapshots to be renamed without unmounting them first. I am not
sure what the rationale for that option is, but its actual behavior was
the opposite of the intended behavior. That is, by default the snapshots
were not unmounted.
The option was introduced as part of a large update from upstream in
r248498.
One of the consequences was havoc under .zfs/snapshot after the rename.
The snapshots got new names but were mounted on top of directories with
old names, so readdir would list the new names, but lookup would still
find the old mounts.
PR: 209093
Reported by: Frédéric VANNIÈRE <f.vanniere@planet-work.com>
MFC after: 5 days
That was just wrong. In fact, we can safely keep this static entry when
it's inactive.
Now the destructive action is moved to the reclaim method and the
function is renamed from zfsctl_snapdir_inactive() to
zfsctl_snapdir_reclaim().
Also, we can use gfs_vop_reclaim() instead of gfs_dir_inactive() +
kmem_free().
Lastly, we can just assert that the node does not have any children when it
is reclaimed, even on the force unmount. That's because zfs_umount()
does an extra vflush() pass which should destroy all snapshot-mountpoint
vnodes that are the snapdir's children.
MFC after: 5 weeks
Those vnodes should not linger. "Stale" nodes may get out of
synchronization with actual snapshots, for example if we destroy a
snapshot and create a new one with the same name, or when we rename a
snapshot.
While there, fix the argument type for zfsctl_snapshot_reclaim().
Also, its original argument can be passed to gfs_vop_reclaim() directly.
Bug 209093 could be related although I have not specifically verified
that. Referencing just in case.
PR: 209093
MFC after: 5 weeks
Dropping the root vnode's lock after VFS_ROOT() didn't really change the
fact that we acquired that lock while holding the lock of its child,
.zfs, during the operation.
So, directly use zfs_zget() to get the root vnode.
While there, simplify the code in zfsctl_freebsd_root_lookup.
We know that .zfs is always exclusively locked.
We know that there is already a reference on *vpp, so no need for an
extra one.
Account for the fact that .. lookup may ask for a different lock type,
not necessarily LK_EXCLUSIVE. And handle a possible failure to acquire
the lock given the lock flags.
MFC after: 5 weeks