Also, make it explicit that V_XATTRDIR is not properly supported in gfs
code yet.
The bad code was plainly incorrect: (a) it broke the handling of v_usecount
reaching zero and (b) it leaked v_holdcnt.
The ugly code employs potentially unsafe locking tricks.
Ideally we should separate the vnode lifecycle from the gfs node lifecycle.
A gfs node should have its own reference count, in which its child nodes
should be accounted.
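As a hedged illustration of the counting rules involved (the gfs_* helper
names below are hypothetical; vrele/vdrop and the v_usecount/v_holdcnt
semantics are the standard FreeBSD VFS ones):

#include <sys/param.h>
#include <sys/vnode.h>

/*
 * Sketch only: dropping a use reference must go through vrele()/vput(),
 * so that v_usecount reaching zero triggers inactivation; and every
 * vhold() must be balanced by a vdrop(), or v_holdcnt leaks and the
 * vnode can never be recycled.
 */
static void
gfs_vnode_rele(struct vnode *vp)	/* hypothetical helper */
{
	vrele(vp);			/* never decrement v_usecount by hand */
}

static void
gfs_vnode_drop(struct vnode *vp)	/* hypothetical helper */
{
	vdrop(vp);			/* balances an earlier vhold() */
}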
PR: kern/151111
Reviewed by: kib
MFC after: 13 days
Introduce a new dataset aclmode setting, "restricted", to protect ACLs from
being destroyed or corrupted by a drive-by chmod.
illumos-gate 13889:a67716f16746
3254 add support in zfs for aclmode=restricted
References:
https://www.illumos.org/issues/3254
MFC after: 2 weeks
Import the zio nop-write improvement from Illumos. To reduce I/O,
nop-write skips overwriting data when the cryptographically secure
checksum of the new data matches the checksum of the existing data.
It also saves space if snapshots are in use.
It currently works only on datasets with compression enabled, deduplication
disabled and sha256 checksums.
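As a rough sketch of the decision (the types, fields and helper name below
are hypothetical; the real logic lives in the zio write pipeline and works
on the checksum stored in the block pointer):

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for a block's checksum and relevant properties. */
typedef struct { uint64_t word[4]; } cksum256_t;
struct blk_props {
	cksum256_t	cksum;		/* checksum stored for the block */
	bool		secure_cksum;	/* e.g. sha256 */
	bool		compressed;
	bool		dedup;
};

/*
 * Sketch: the write can become a no-op only when the dataset uses a
 * cryptographically secure checksum, has compression on and dedup off,
 * and the new data's checksum matches the existing block's checksum.
 */
static bool
nopwrite_possible(const struct blk_props *oldblk, const struct blk_props *newblk)
{
	if (!newblk->secure_cksum || !newblk->compressed || newblk->dedup)
		return (false);
	return (memcmp(&oldblk->cksum, &newblk->cksum,
	    sizeof (cksum256_t)) == 0);
}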
Illumos 13887:196932ec9e6a and 13888:7204b3392a58
3236 zio nop-write
References:
https://www.illumos.org/issues/3236
MFC after: 2 weeks
* There is no need for the delayed destruction of znodes via taskqueue,
now that we do not need to fear recursion from getnewvnode into
zfs_inactive and zfs_freebsd_reclaim; this makes the znode/vnode state
machine a bit simpler.
* More complete porting of zfs_inactive from the Solaris VFS model to the
FreeBSD vop_inactive and vop_reclaim model. All destructive actions are
done in zfs_freebsd_reclaim.
This allows the zfs_zget logic to be simplified.
* Allow zfs_zget to return a doomed vnode if the current thread already
holds an exclusive lock on the vnode.
* Clean up Solaris-isms like bailing out of reclaim/inactive on certain
values of v_usecount (aka v_count) or directly messing with this counter.
* Do not clear z_vnode while the znode is still accessible.
z_vnode should be cleared only after zfs_znode_dmu_fini.
Otherwise zfs_zget may get an effectively half-deconstructed znode.
This allows the zfs_zget logic to be simplified further.
The above changes fix at least two known/reported problems:
o An indefinite wait in the following code path:
vgone -> VOP_RECLAIM -> zfs_freebsd_reclaim -> vnode_destroy_vobject ->
put_pages -> zfs_write -> zil_commit -> zfs_zget
This happened because vgone marks a vnode as VI_DOOMED before calling
VOP_RECLAIM, but zfs_zget would not return a doomed vnode under any
circumstances.
The fix in this change is not complete, as it won't fix a deadlock between
two threads doing VOP_RECLAIM, where one thread is in zil_commit trying to
zfs_zget a znode/vnode being reclaimed by the other thread, which in turn is
blocked trying to enter zil_commit. This type of deadlock has not been
reported as of now.
o An indefinite wait in the unmount path caused by a znode "falling through
the cracks" in inactive+reclaim. This would happen if the znode is unlinked
while its vnode is still active.
To Do: pass a locking flags parameter to zfs_zget, so that the zfs-vfs
glue code doesn't have to re-lock a vnode but can ask for proper locking
from the very start. This would also allow the higher-level code to
obtain a doomed vnode when it is expected/requested, or to avoid blocking
when blocking is not allowed (see the zil_commit example above).
ffs_vgetf seems like a good source of inspiration.
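As a sketch of the doomed-vnode rule described above (VI_DOOMED,
VOP_ISLOCKED and LK_EXCLUSIVE are the stock FreeBSD VFS names of that era;
the helper and its exact placement inside zfs_zget are assumptions):

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/vnode.h>

/*
 * Hypothetical helper: a doomed (being-reclaimed) vnode may be handed out
 * only if the current thread already holds its vnode lock exclusively,
 * i.e. it is the thread doing the reclaim; otherwise the caller has to
 * wait for the reclaim to finish and retry the lookup.
 */
static int
zfs_zget_doomed_ok(struct vnode *vp)
{
	int ok;

	VI_LOCK(vp);
	ok = (vp->v_iflag & VI_DOOMED) == 0 ||
	    VOP_ISLOCKED(vp) == LK_EXCLUSIVE;
	VI_UNLOCK(vp);
	return (ok);
}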
Tested by: Willem Jan Withagen <wjw@digiware.nl>
MFC after: 6 weeks
... otherwise zfs_getpages would mostly be called with one page at a time.
It is expected that ZFS VOP_BMAP is only called from vnode_pager_haspage.
Since ZFS files can have variable block sizes, and because we don't
really know whether any given blocks are consecutive, we cannot report
any additional blocks behind or ahead of a given block. Since physical
block numbers do not make sense for ZFS, we do not do any real translation
and simply pass back blk = lblk. The net effect is that vnode_pager_haspage
knows that the block exists and that the pages backed by the block can be
accessed. vnode_pager_haspage may be wrong about the exact count of the
pages backed by the block, because of the variable block size, which
vnode_pager_haspage doesn't really know - it only knows the maximum block
size in the filesystem. So pages from multiple blocks can be passed to
zfs_getpages, but that is expected and correctly handled.
vnode_pager should not call zfs_bmap for any other reason, because ZFS
implements its own VOP_GETPAGES (and VOP_PUTPAGES), so the generic
vnode_pager paths that rely on VOP_BMAP are not used.
The vfs_cluster and vfs_bio code should not be called for ZFS, because ZFS
does not use the buffer cache layer.
Also, ZFS does not use vn_bmap_seekhole; it has its own private mechanism
for working with holes.
The above list should cover all the current calls to VOP_BMAP.
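A sketch of what such a VOP_BMAP amounts to (argument names follow the
standard vop_bmap_args layout; the in-tree zfs_freebsd_bmap may differ in
detail):

#include <sys/param.h>
#include <sys/vnode.h>
#include <sys/buf.h>

/*
 * Sketch: no real translation, report blk = lblk and no run of blocks
 * behind or ahead, so vnode_pager_haspage() only learns that the block
 * exists and that the pages backed by it can be accessed.
 */
static int
zfs_bmap_sketch(struct vop_bmap_args *ap)
{
	if (ap->a_bop != NULL)
		*ap->a_bop = &ap->a_vp->v_bufobj;
	if (ap->a_bnp != NULL)
		*ap->a_bnp = ap->a_bn;		/* identity: blk = lblk */
	if (ap->a_runp != NULL)
		*ap->a_runp = 0;		/* no blocks ahead */
	if (ap->a_runb != NULL)
		*ap->a_runb = 0;		/* no blocks behind */
	return (0);
}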
Reviewed by: kib
MFC after: 6 weeks
Illumos 13886:e3261d03efbf
3349 zpool upgrade -V bumps the on disk version number, but leaves
the in core version
References:
https://www.illumos.org/issues/3349
MFC after: 1 week
... because the latter makes some decisions based on the version.
This is especially important for raidz vdevs.
This is similar to what spa_load does.
This is not an issue for upstream because they do not seem to support
using raidz as a root pool.
Reported by: Andrei Lavreniyuk <andy.lavr@gmail.com>
Tested by: Andrei Lavreniyuk <andy.lavr@gmail.com>
MFC after: 6 days
The call is a NOP, because the pool version in spa_ubsync.ub_version is not
initialized and thus appears to be zero.
If the version were properly set, the call would lead to a NULL pointer
dereference, because the spa object is still under construction.
The same change was independently made upstream as part of
a larger change (4445fffbbb1ea25fd0e9ea68b9380dd7a6709025).
MFC after: 6 days
if we fail to generate a proper root pool configuration based on disk
probing. Currently we cannot properly generate the configuration for
multi-vdev pools. Make that explicit.
Reported by: madpilot, Bartosz Stec <bartosz.stec@it4pro.pl>
Tested by: madpilot, Bartosz Stec <bartosz.stec@it4pro.pl>
MFC after: 4 days
designator to select the process to be waited for. The system call
optionally returns the siginfo_t that would otherwise be provided to a
SIGCHLD handler, as well as an extended structure accounting for child
and cumulative grandchild resource usage.
Allow getting the current rusage information for non-exited processes
as well, similar to Solaris.
An explicit WEXITED flag is required to wait for exited processes,
allowing for more fine-grained control of the events the waiter is
interested in.
Fix the handling of siginfo for the WNOWAIT option across the whole
wait*(2) family by not removing the queued signal state.
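A userland sketch of how the extended interface is meant to be used
(error handling trimmed; the prototype is the one added by this change):

#include <sys/types.h>
#include <sys/wait.h>
#include <sys/resource.h>	/* struct __wrusage */
#include <err.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
	struct __wrusage wru;	/* child + cumulative grandchild rusage */
	siginfo_t si;		/* what a SIGCHLD handler would have seen */
	int status;
	pid_t pid, ret;

	if ((pid = fork()) == 0)
		_exit(7);	/* child */

	/*
	 * WEXITED must be requested explicitly to be notified about exited
	 * children; P_PID selects a single process by its process id.
	 */
	ret = wait6(P_PID, (id_t)pid, &status, WEXITED, &wru, &si);
	if (ret == -1)
		err(1, "wait6");

	printf("pid %d exited with status %d, si_code %d\n",
	    (int)ret, WEXITSTATUS(status), si.si_code);
	return (0);
}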
PR: standards/170346
Submitted by: "Jukka A. Ukkonen" <jau@iki.fi>
MFC after: 1 month
... before trying to destroy the zvol snapshots themselves.
PR: kern/173442
Reported by: Petri Helenius <petri@helenius.fi>,
mm
Obtained from: Brian Behlendorf <behlendorf1@llnl.gov>,
Illumos Bug #3170
Tested by: Petri Helenius <petri@helenius.fi>
MFC after: 10 days
Illumos r13840:97fd5cdf328a:
3145 single-copy arc
3212 ztest: race condition between vdev_online() and spa_vdev_remove()
Illumos r13849:3468a95b27cd:
3258 ztest's use of file descriptors is unstable
There is one known issue: Some probes will display an error message along the
lines of: "Invalid address (0)"
I tested this with both a simple dtrace probe and dtruss on a few different
binaries on 32-bit. I only compiled the 64-bit version and did not run it,
but I don't expect problems without the modules loaded. Volunteers are welcome.
MFC after: 1 month
Otherwise we could fail with an incorrect error if, e.g., the parent
object id has been removed too, or we could even return a wrong vnode if
the parent object has already been re-used.
Discussed with: pjd
Also see: http://article.gmane.org/gmane.os.freebsd.devel.file-systems/13863
MFC after: 26 days
vn_lock should do the right thing with respect to the given vnode lock
flags. If a caller doesn't mind a doomed vnode, then zfs should deliver one.
Reviewed by: kib
MFC after: 19 days
It turned out to be not that useful, because its default value may lead
to a problem when a root pool is present in zpool.cache, but its
on-disk status is 'exported'. This may happen if the pool was imported
in a different environment with the -f flag and then exported.
MFC after: 12 days
POOL_STATE_SPARE and POOL_STATE_L2CACHE were not handled correctly
and thus the cache and spare disks would not be correctly probed.
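A sketch of the intended handling during disk probing (the helper and its
surrounding logic are hypothetical; only the POOL_STATE_* names come from
the ZFS headers):

#include <sys/fs/zfs.h>		/* pool_state_t, POOL_STATE_* */
#include <errno.h>

/* Hypothetical helper: decide whether a probed disk should be considered. */
static int
probe_state_ok(pool_state_t state)
{
	switch (state) {
	case POOL_STATE_ACTIVE:
	case POOL_STATE_SPARE:		/* previously rejected by mistake */
	case POOL_STATE_L2CACHE:	/* previously rejected by mistake */
		return (0);		/* keep this disk for further probing */
	default:
		return (EINVAL);	/* not usable for the pool being built */
	}
}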
Reported by: Michael Schmiedgen <schmiedgen@gmx.net>,
Matthew D. Fuller <fullermd@over-yonder.net>
Tested by: Michael Schmiedgen <schmiedgen@gmx.net>,
flo
MFC after: 5 days
In particular, do not lock Giant conditionally when calling into the
filesystem module, and remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.
The VFS_VERSION is bumped to indicate the interface change, which does
not result in changes to the interface signatures.
Conducted and reviewed by: attilio
Tested by: pho
... otherwise the current thread might be holding ARC locks and thus run
into a deadlock. This happens, for example, when a thread does memory
allocation in the ARC code and runs into KVA shortage.
Also, it really makes the most sense to wait in pageproc, so that the
results of ARC reclamation are seen before the page cache is acted upon.
In other cases where vm_lowmem is invoked, e.g. on KVA space shortage,
the callers perform multiple attempts (up to 8) and wait for rather
long intervals between them (up to 4 seconds), so ARC reclaim results
should become visible even without explicit waiting on the ARC thread.
Note that this is not a critical issue for typical ZFS usages where KVA
space should already be large enough. On amd64 systems setting KVA size
to twice the physical memory size is known to mitigate KVA fragmentation
issues in practice.
Side note: perhaps vm_lowmem 'how' parameter should be used to
differentiate between causes of the event.
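A sketch of the resulting behaviour of the lowmem handler described above;
the flag, lock and condvar names are assumptions and only approximate the
in-tree ARC code:

#include <sys/zfs_context.h>	/* kmutex_t, kcondvar_t, mutex_*, cv_* */
#include <sys/proc.h>		/* curproc */
#include <vm/vm_pageout.h>	/* pageproc */

/* Assumed declarations from the surrounding ARC code (names approximate). */
static kmutex_t arc_reclaim_thr_lock;
static kcondvar_t arc_reclaim_thr_cv;	/* wakes the reclaim thread */
static kcondvar_t arc_reclaim_done_cv;	/* hypothetical: reclaim finished */
static int needfree;

/*
 * Sketch: on a vm_lowmem event, wake the ARC reclaim thread, but block
 * waiting for its results only when running as pageproc.  An arbitrary
 * thread may have got here from inside the ARC while holding ARC locks
 * and would deadlock against the reclaim thread.
 */
static void
arc_lowmem(void *arg __unused, int howto __unused)
{
	mutex_enter(&arc_reclaim_thr_lock);
	needfree = 1;				/* ask for reclamation */
	cv_signal(&arc_reclaim_thr_cv);		/* wake the reclaim thread */

	if (curproc == pageproc) {
		/* pageproc holds no ARC locks, so waiting here is safe. */
		while (needfree != 0)
			cv_wait(&arc_reclaim_done_cv, &arc_reclaim_thr_lock);
	}
	mutex_exit(&arc_reclaim_thr_lock);
}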
Reported by: Nikolay Denev <ndenev@gmail.com>
MFC after: 19 days
getnewvnode_reserve helps to avoid "recursing" back into ZFS code
via getnewvnode when the latter needs to reclaim some vnodes.
The ZFS code may hold a number of locks around getnewvnode and doesn't
expect any recursion to happen on those locks, because that never
happens on Solaris.
I believe that this change also eliminates the need for the delayed
znode destruction via the taskqueue.
Many thanks to kib for devising getnewvnode_reserve.
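The resulting allocation pattern, as a sketch (getnewvnode,
getnewvnode_reserve and getnewvnode_drop_reserve are the real VFS
primitives with their signatures of that time; the wrapper function and
the elided locking steps are placeholders):

#include <sys/param.h>
#include <sys/mount.h>
#include <sys/vnode.h>

extern struct vop_vector zfs_vnodeops;

/* Hypothetical wrapper showing where the reservation is taken and dropped. */
static int
zfs_alloc_vnode_sketch(struct mount *mp, struct vnode **vpp)
{
	int error;

	/*
	 * Reserve a vnode while no ZFS locks are held yet: this is the
	 * point where vnode reclamation (and thus re-entry into ZFS)
	 * may happen.
	 */
	getnewvnode_reserve(1);

	/* ... take ZFS object locks, allocate and set up the znode ... */

	/* Consumes the reservation; cannot recurse into reclamation now. */
	error = getnewvnode("zfs", mp, &zfs_vnodeops, vpp);

	/* ... finish setup, drop ZFS locks ... */

	getnewvnode_drop_reserve();	/* return an unused reservation, if any */
	return (error);
}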
Reported by: flo
Tested by: bapt, kwm, swills
MFC after: 2 weeks
X-MFC after: r241556
... instead of deferring the action until first open.
Unlike upstream, deferring this has no benefit on FreeBSD.
We know that as soon as the provider is created it is going to be tasted
and thus opened. An initial mediasize of zero causes a tasting failure
and subsequent retasting because of the size change.
MFC after: 14 days
This should allow mounting a dataset as the root filesystem even if
it belongs to a pool that is not described in zpool.cache.
This adds some overhead to the boot process, though.
If the root filesystem's pool is found in zpool.cache, then by default
its cached configuration will be used for the import.
vfs.zfs.rootpool.prefer_cached_config can be set to zero to force
the configuration to be re-tasted.
Discussed with: gibbs, pjd, des
MFC after: 25 days
doesn't exist on a dataset we are starting from. For example if we
have the following configuration:
tank
tank/foo
tank/foo@snap
tank/bar
tank/bar@snap
We can execute:
# zfs destroy -r tank@snap
even though tank@snap doesn't exist.
Unfortunately it is not possible to do the same with recursive rename:
# zfs rename -r tank@snap tank@pans
cannot open 'tank@snap': dataset does not exist
...until now. This change allows snapshots to be renamed recursively even
if the snapshot doesn't exist on the starting dataset.
Sponsored by: rsync.net
MFC after: 2 weeks
The code builds a map of regions that were freed. On every write the
code consults the map and eventually removes ranges that were freed
before, but are now overwritten.
Freed blocks are not TRIMmed immediately. There is a tunable that defines
how many txgs we should wait before TRIMming freed blocks (64 by default).
There is a low-priority thread that TRIMs ranges when the time comes.
During TRIM we keep in-flight ranges on a list to detect colliding
writes - we have to delay writes that collide with in-flight TRIMs, in
case something gets reordered and the write reaches the disk before
the TRIM. We don't have to do the same for in-flight writes, as colliding
writes simply remove the corresponding ranges from the set to be TRIMmed.
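A rough sketch of the write-path rules described above; every name here is
hypothetical and only illustrates the behaviour, not the actual trim map
implementation:

#include <stdbool.h>
#include <stdint.h>

typedef struct range {
	uint64_t start;
	uint64_t end;
} range_t;

struct trim_map;			/* hypothetical: freed + in-flight ranges */

bool trim_map_overlaps_inflight(struct trim_map *, const range_t *);
void trim_map_remove_pending(struct trim_map *, const range_t *);

/*
 * Sketch: called for every write.  Freed-but-not-yet-TRIMmed ranges that
 * are being overwritten no longer need a TRIM, so drop them from the map.
 * If the write overlaps a TRIM that is already in flight, the caller must
 * delay the write, or reordering could let the TRIM hit the new data.
 */
static bool
trim_map_write_start(struct trim_map *tm, const range_t *w)
{
	trim_map_remove_pending(tm, w);

	if (trim_map_overlaps_inflight(tm, w))
		return (false);		/* delay the write */
	return (true);			/* safe to issue now */
}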
Sponsored by: multiplay.co.uk
This work includes some important fixes and some improvements obtained
from the zfsonlinux project, including TRIMming entire vdevs on pool
create/add/attach and on pool import for spare and cache vdevs.
Obtained from: zfsonlinux
Submitted by: Etienne Dechamps <etienne.dechamps@ovh.net>
Do this by checking if spa_namespace_lock is already held and not taking
it again in that case.
Add a comment explaining why that is done and why it is safe.
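The pattern boils down to the following sketch (MUTEX_HELD, mutex_enter
and mutex_exit are the Solaris-compat primitives; the surrounding function
is a placeholder):

#include <sys/zfs_context.h>	/* kmutex_t, MUTEX_HELD, mutex_enter/exit */

extern kmutex_t spa_namespace_lock;

/* Placeholder for the code path that may be reached with the lock held. */
static void
do_namespace_work(void)
{
	boolean_t locked;

	/* Take the lock only if this thread does not hold it already. */
	locked = MUTEX_HELD(&spa_namespace_lock);
	if (!locked)
		mutex_enter(&spa_namespace_lock);

	/* ... work that requires spa_namespace_lock ... */

	/* Release only what we acquired ourselves. */
	if (!locked)
		mutex_exit(&spa_namespace_lock);
}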
Reviewed by: pjd
MFC after: 24 days