Commit Graph

551 Commits

Author SHA1 Message Date
Andriy Gapon
992ffc58ae gfs_file_inactive: replace bad code with ugly code
Also, make it explicit that V_XATTRDIR is not properly supported in gfs
code yet.

The bad code was plain incorrect: (a) it spoiled handling of v_usecount
reaching zero and (b) it leaked v_holdcnt.

The ugly code employs potentially unsafe locking tricks.

Ideally we should separate vnode lifecycle and gfs node lifecycle.
A gfs node should have its own reference count where its child nodes
should be accounted.

PR:		kern/151111
Reviewed by:	kib
MFC after:	13 days
2012-12-01 18:12:55 +00:00
Martin Matuska
7faa32552f MFV r243395:
Introduce a new dataset aclmode setting "restricted" to protect ACL's
being destroyed or corrupted by a drive-by chmod.

illumos-gate 13889:a67716f16746
3254 add support in zfs for aclmode=restricted

References:
https://www.illumos.org/issues/3254

MFC after:	2 weeks
2012-11-26 12:24:39 +00:00
Martin Matuska
53e5858c68 Add loader(8) tunable to enable/disable nopwrite functionality:
vfs.zfs.nopwrite_enabled

MFC after:	2 weeks
2012-11-25 16:54:43 +00:00
Martin Matuska
dd801aa546 MFV r243013 and r243267:
Import the zio nop-write improvement from Illumos. To reduce I/O,
nop-write omits overwriting data if the checksum (cryptographically
secure) of new data matches the checksum of existing data.
It also saves space if snapshots are in use.

It currently works only on datasets with enabled compression, disabled
deduplication and sha256 checksums.

IllumOS 13887:196932ec9e6a and 13888:7204b3392a58
3236 zio nop-write

References:
https://www.illumos.org/issues/3236

MFC after:	2 weeks
2012-11-25 16:32:07 +00:00
Andriy Gapon
3a0e1b57bb zfs_freebsd_reclaim: remove a stray variable
... which leaked from a subsequent local change.
Unfortunately I noticed that only after commit.

MFC after:	5 weeks
X-MFC with:	r243520
2012-11-25 15:46:29 +00:00
Andriy Gapon
4ff1c77d22 zfs: overhaul zfs-vfs glue for vnode life-cycle management
* There is no need for the delayed destruction of znodes via taskqueue,
now that we do not need to fear recursion from getnewvnode into
zfs_inactive and zfs_freebsd_reclaim, thus making znode/vnode state
machine a bit simpler.

* More complete porting of zfs_inactive from Solaris VFS model to FreeBSD
vop_inactive and vop_reclaim model.  All destructive actions are done
in zfs_freebsd_reclaim.
This allows to simplify zfs_zget logic.

* Allow zfs_zget to return a doomed vnode if the current thread already
has an exclusive lock on the vnode.

* Clean up Solaris-isms like bailing out of reclaim/inactive on certain
values of v_usecount (aka v_count) or directly messing with this counter.

* Do not clear z_vnode while znode is still accessible.
z_vnode should be cleared only after zfs_znode_dmu_fini.
Otherwise zfs_zget may get an effectively half-deconstructed znode.
This allows to simplify zfs_zget logic further.

The above changes fix at least two known/reported problems:

o An indefinite wait in the following code path:
vgone -> VOP_RECLAIM -> zfs_freebsd_reclaim -> vnode_destroy_vobject ->
put_pages -> zfs_write -> zil_commit -> zfs_zget
This happened because vgone marks a vnode as VI_DOOMED before calling
VOP_RECLAIM, but zfs_zget would not return a doomed vnode under any
circumstances.
The fix in this change is not complete as it won't fix a deadlock between
two threads doing VOP_RECLAIM where one thread is in zil_commit trying to
zfs_zget a znode/vnode being reclaimed by the other thread, which would be
blocked trying to enter zil_commit.  This type of deadlock has not been
reported as of now.

o An indefinite wait in the unmount path caused by a znode "falling through
the cracks" in inactive+reclaim.  This would happen if the znode is unlinked
while its vnode is still active.

To Do: pass locking flags parameter to zfs_zget, so that the zfs-vfs
glue code doesn't have to re-lock a vnode but could ask for proper locking
from the very start.  This would also allow for the higher level code to
obtain a doomed vnode when it is expected/requested.  Or to avoid blocking
when it is not allowed (see zil_commit example above).

ffs_vgetf seems like a good source of inspiration.

Tested by:	Willem Jan Withagen <wjw@digiware.nl>
MFC after:	6 weeks
2012-11-25 15:33:26 +00:00
Andriy Gapon
7ca5310ea3 zfs_fhtovp: there is no reason to amend lock flags with LK_RETRY here
MFC after:	12 days
2012-11-25 15:07:27 +00:00
Andriy Gapon
7192f62bcc add zfs_bmap to aid vnode_pager_haspage
... otherwise zfs_getpages would mostly be called with one page at a time.

It is expected that ZFS VOP_BMAP is only called from vnode_pager_haspage.
Since ZFS files can have variable block sizes and also because we don't
really know if any given blocks are consecutive, we can not really report
any additional blocks behind or ahead of a given block.  Since physical
block numbers do not make sense for ZFS, we do not do any real translation
and thus pass back blk = lblk.  The net effect is that vnode_pager_haspage
knows that the block exists and that the pages backed by the block can be
accessed.  vnode_pager_haspage may be wrong about the exact count of the
pages backed by the block, because of a variable block size, which
vnode_pager_haspage doesn't really know - it only knows max block size in
a filesystem.  So pages from multiple blocks can be passed to zfs_getpages,
but that is expected and correctly handled.

vnode_pager should not call zfs_bmap for any other reason, because ZFS
implements VOP_PUTPAGES and thus vnode_pager_generic_getpages is not used.

vfs_cluster code vfs_bio code should not be called for ZFS, because ZFS does
not use buffer cache layer.

Also, ZFS does not use vn_bmap_seekhole, it has its prviate mechanism for
working with holes.

The above list should cover all the current calls to VOP_BMAP.

Reviewed by:	kib
MFC after:	6 weeks
2012-11-25 15:01:12 +00:00
Andriy Gapon
b609e5f891 zfs_getpages: optimize for large block sizes
MFC after:	6 weeks
2012-11-25 14:53:26 +00:00
Martin Matuska
2f06dfc9a3 MFV r243012:
Illumos 13886:e3261d03efbf

3349 zpool upgrade -V bumps the on disk version number, but leaves
     the in core version

References:
https://www.illumos.org/issues/3349

MFC after:	1 week
2012-11-25 10:53:42 +00:00
Martin Matuska
2b8d4033cc MFV r242735:
Illumos 13879:4eac7a87eff2:
3329 spa_sync() spends 10-20% of its time in spa_free_sync_cb()
3330 space_seg_t should have its own kmem_cache
3331 deferred frees should happen after sync_pass 1
3335 make SYNC_PASS_* constants tunable

New loader-only tunables:
vfs.zfs.sync_pass_deferred_free
vfs.zfs.sync_pass_dont_compress
vfs.zfs.sync_pass_rewrite

References:
https://www.illumos.org/issues/3329
https://www.illumos.org/issues/3330
https://www.illumos.org/issues/3331
https://www.illumos.org/issues/3335

MFC after:	2 weeks
2012-11-25 09:06:32 +00:00
Andriy Gapon
328998eac1 zfs roopool: add support for multi-vdev configurations
Tested by:	madpilot
MFC after:	10 days
2012-11-24 13:23:15 +00:00
Andriy Gapon
e1fccde2c9 spa_import_rootpool: initialize ub_version before calling spa_config_parse
... because the latter makes some decision based on the version.
This is especially important for raidz vdevs.
This is similar to what spa_load does.

This is not an issue for upstream because they do not seem to support
using raidz as a root pool.

Reported by:	Andrei Lavreniyuk <andy.lavr@gmail.com>
Tested by:	Andrei Lavreniyuk <andy.lavr@gmail.com>
MFC after:	6 days
2012-11-24 13:16:49 +00:00
Andriy Gapon
cfca00a2fb spa_import_rootpool: do not call spa_history_log_version
The call is a NOP, because pool version in spa_ubsync.ub_version is not
initialized and thus appears to be zero.
If the version is properly set then the call leads to a NULL pointer
dereference because the spa object is still under-constructed.

The same change was independently made in the upstream as a part of
a larger change (4445fffbbb1ea25fd0e9ea68b9380dd7a6709025).

MFC after:	6 days
2012-11-24 13:14:53 +00:00
Andriy Gapon
c4f59a3c09 zfs: create devices/geoms from zvols after receiveing them
PR:		kern/167066
Tested by:	Andreas Nilsson <andrnils@gmail.com>
MFC after:	13 days
2012-11-24 13:07:31 +00:00
Andriy Gapon
dbe922173c zfs_remove: assert that delete_now case is never true on FreeBSD
That case is specific to Solaris VFS and it would violate pretty
fundamental contracts of FreeBSD VFS.

Discussed with:	pjd
MFC after:	12 days
2012-11-19 11:30:08 +00:00
Andriy Gapon
7b069f7fee zfs_remove: set VV_NOSYNC flag if a node is unlinked
Suggested by:	kib
MFC after:	12 days
2012-11-19 11:25:20 +00:00
Andriy Gapon
62875a6b10 spa_import_rootpool: fall back to use configuration from zpool.cache...
if we fail to generate a proper root pool configuration based on disk
probing.  Currently we can not properly generate the configuration for
multi-vdev pools.  Make that explicit.

Reported by:	madpilot, Bartosz Stec <bartosz.stec@it4pro.pl>
Tested by:	madpilot, Bartosz Stec <bartosz.stec@it4pro.pl>
MFC after:	4 days
2012-11-18 11:47:25 +00:00
Konstantin Belousov
f13b5a0f01 Add the wait6(2) system call. It takes POSIX waitid()-like process
designator to select a process which is waited for. The system call
optionally returns siginfo_t which would be otherwise provided to
SIGCHLD handler, as well as extended structure accounting for child
and cumulative grandchild resource usage.

Allow to get the current rusage information for non-exited processes
as well, similar to Solaris.

The explicit WEXITED flag is required to wait for exited processes,
allowing for more fine-grained control of the events the waiter is
interested in.

Fix the handling of siginfo for WNOWAIT option for all wait*(2)
family, by not removing the queued signal state.

PR:	standards/170346
Submitted by:	"Jukka A. Ukkonen" <jau@iki.fi>
MFC after:	1 month
2012-11-13 12:52:31 +00:00
Andriy Gapon
7631b580ff zfs_ioc_destroy_snaps_nvl: remove disk device entries for zvol snapshots
... before trying to destroy the zvol snapshots themselves.

PR:		kern/173442
Reported by:	Petri Helenius <petri@helenius.fi>,
		mm
Obtained from:	Brian Behlendorf <behlendorf1@llnl.gov>,
		Illumos Bug #3170
Tested by:	Petri Helenius <petri@helenius.fi>
MFC after:	10 days
2012-11-10 12:22:26 +00:00
Xin LI
a9b09a3f3c MFV r242729 (mm):
Illumos r13840:97fd5cdf328a:

3145 single-copy arc
3212 ztest: race condition between vdev_online() and spa_vdev_remove()

Illumos r13849:3468a95b27cd:

3258 ztest's use of file descriptors is unstable
2012-11-10 01:52:52 +00:00
Attilio Rao
bc2258da88 Complete MPSAFE VFS interface and remove MNTK_MPSAFE flag.
Porters should refer to __FreeBSD_version 1000021 for this change as
it may have happened at the same timeframe.
2012-11-09 18:02:25 +00:00
Justin Hibbits
c757049235 Implement DTrace for PowerPC. This includes both 32-bit and 64-bit.
There is one known issue:  Some probes will display an error message along the
lines of:  "Invalid address (0)"

I tested this with both a simple dtrace probe and dtruss on a few different
binaries on 32-bit.  I only compiled 64-bit, did not run it, but I don't expect
problems without the modules loaded.  Volunteers are welcome.

MFC after:	1 month
2012-11-07 23:45:09 +00:00
Andriy Gapon
2c6024ec1b zfs_dirlook: bailout early if directory is unlinked
Otherwise we could fail with an incorrect error if e.g. parent
object id is removed too or we can even return a wrong vnode if
parent object has been already re-used.

Discussed with:	pjd
Also see:	http://article.gmane.org/gmane.os.freebsd.devel.file-systems/13863
MFC after:	26 days
2012-11-04 14:50:08 +00:00
Andriy Gapon
5c997cc429 zfsctl_snapdir_lookup: obtain a snapname in the remount case
... which is triggered if somebody did regular umount on a snapshot mount.

Reviewed by:	Matthew Ahrens <mahrens@delphix.com>
MFC after:	20 days
2012-11-04 14:43:15 +00:00
Andriy Gapon
88c8884a71 zfs: set MNTK_EXTENDED_SHARED flag
Discussed with:	kib
MFC after:	20 days
2012-11-04 14:36:11 +00:00
Andriy Gapon
71900cfaf7 zfs_vnode_forget: dispose of larvae vnode using public vfs api (mostly)
Reviewed by:	kib
MFC after:	19 days
2012-11-04 14:24:00 +00:00
Andriy Gapon
a16e534dbe zfs_umount: no need to set MNTK_UNMOUNTF here, dounmount handles that
Reviewed by:	kib
MFC after:	19 days
2012-11-04 14:22:25 +00:00
Andriy Gapon
62eeeb8ff8 zfs_vnode_lock: no need to double-guess caller's intentions here
vn_lock should do the right thing with respect to given vnode lock
flags.  If a caller doesn't mind a doomed vnode, then zfs should deliver.

Reviewed by:	kib
MFC after:	19 days
2012-11-04 14:15:13 +00:00
Andriy Gapon
d548e8b66f zfs_mount: drop vfs.zfs.rootpool.prefer_cached_config tunable
It turned out to be not that useful, because its default value may lead
to a problem when a root pool is present in zpool.cache, but its
on-disk status is 'exported'.  This may happen if the pool was imported
in a different environment with -f flag and then exported.

MFC after:	12 days
2012-11-04 13:50:08 +00:00
Andriy Gapon
7ac8ca0d58 zfs_freebsd_close: call zfs_close with count=1 instead of count=0
Otherwise we may be leaking z_sync_cnt, which may lead to unnecessary
ZIL sync-ing.

MFC after:	12 days
2012-11-04 13:48:48 +00:00
Xin LI
7f24254add s/dettach/detach/g
Approved by:	pjd
MFC after:	1 month
2012-10-30 01:29:45 +00:00
Andriy Gapon
86812da016 zfs: fix label validation code in vdev_geom_read_config
POOL_STATE_SPARE and POOL_STATE_L2CACHE were not handled correctly
and thus the cache and spare disks would not be correctly probed.

Reported by:	Michael Schmiedgen <schmiedgen@gmx.net>,
		Matthew D. Fuller <fullermd@over-yonder.net>
Tested by:	Michael Schmiedgen <schmiedgen@gmx.net>,
		flo
MFC after:	5 days
2012-10-26 14:50:16 +00:00
Konstantin Belousov
5050aa86cf Remove the support for using non-mpsafe filesystem modules.
In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by:	attilio
Tested by:	pho
2012-10-22 17:50:54 +00:00
Andriy Gapon
42abbab471 zfs: wait in arc_lowmem only if curproc == pageproc
... otherwise the current thread might be holding ARC locks and thus run
into a deadlock.  This happens, for example, when a thread does memory
allocation in the ARC code and runs into KVA shortage.
Also, it really makes the most sense to wait in pageproc, so that the
results of ARC reclamation are seen before the page cache is acted upon.
In other cases where vm_lowmem is invoked, e.g. on KVA space shortage,
the callers perform multiple attempts (up to 8) and wait for rather
long intervals between them (up to 4 seconds), so ARC reclaim results
should become visible even without explicit waiting on the ARC thread.

Note that this is not a critical issue for typical ZFS usages where KVA
space should already be large enough.  On amd64 systems setting KVA size
to twice the physical memory size is known to mitigate KVA fragmentation
issues in practice.

Side note: perhaps vm_lowmem 'how' parameter should be used to
differentiate between causes of the event.

Reported by:	Nikolay Denev <ndenev@gmail.com>
MFC after:	19 days
2012-10-20 10:02:18 +00:00
Andriy Gapon
edb085b8ab zfs: make use of getnewvnode_reserve in zfs_mknode and zfs_zget
getnewvnode_reserve helps to avoid "recursing" back into zfs code
via getnewvnode when that latter needs to reclaim some vnodes.
zfs code may hold a number of locks around getnewvnode and doesn't
expect any recursion to happen on those locks, because that never
happens in solaris.

I believe that this change also eleiminates a need for the delayed
znode destruction via the taskqueue.

Many thanks to kib for devising getnewvnode_reserve.

Reported by:	flo
Tested by:	bapt, kwm, swills
MFC after:	2 weeks
X-MFC after:	r241556
2012-10-17 10:59:56 +00:00
Kevin Lo
9823d52705 Revert previous commit...
Pointyhat to:	kevlo (myself)
2012-10-10 08:36:38 +00:00
Kevin Lo
a10cee30c9 Prefer NULL over 0 for pointers 2012-10-09 08:27:40 +00:00
Andriy Gapon
8bf749ef3a zvol: set mediasize in geom provider right upon its creation
... instead of deferring the action until first open.
Unlike upstream this has no benefit on FreeBSD.
We know that as soon as the provider is created it is going to be tasted
and thus opened.  Initial mediasize of zero causes tasting failure
and subsequent retasting because of the size change.

MFC after:	14 days
2012-10-06 19:57:27 +00:00
Andriy Gapon
61e100ee3b zfs_mount: taste geom providers for root pool config
This should allow to mount a dataset as a root filesystem even if
it belongs to a pool that is not described in zpool.cache.
This adds some overhead to the boot process though.

If the root filesystem's pool is found in zpool.cache, the by default
its cached configuration will be used for import.
vfs.zfs.rootpool.prefer_cached_config could be set to zero to force
the config to be retasted.

Discussed with:	gibbs, pjd, des
MFC after:	25 days
2012-10-06 19:33:47 +00:00
Martin Matuska
8469b12c2e Merge recent vendor changes in ZFS.
Illumos issued covered:
2811 missing implementation: zfs send -r
3139 zdb dies when it tries to determine path of unlinked file
3189 kernel panic in ZFS test suite during hotspare_onoffline_004_neg
3208 moving zpool cross-endian results in incorrect user/group accounting

References:
  https://www.illumos.org/issues/ + [issue_id]

Obtained from:	illumos (vendor/illumos, vendor/illumos-sys)
MFC after:	2 weeks
2012-09-26 09:37:58 +00:00
Pawel Jakub Dawidek
c622f88dd2 It is possible to recursively destroy snapshots even if the snapshot
doesn't exist on a dataset we are starting from. For example if we
have the following configuration:

	tank
	tank/foo
	tank/foo@snap
	tank/bar
	tank/bar@snap

We can execute:

	# zfs destroy -t tank@snap

eventhough tank@snap doesn't exit.

Unfortunately it is not possible to do the same with recursive rename:

	# zfs rename -r tank@snap tank@pans
	cannot open 'tank@snap': dataset does not exist

...until now. This change allows to recursively rename snapshots even if
snapshot doesn't exist on the starting dataset.

Sponsored by:	rsync.net
MFC after:	2 weeks
2012-09-23 20:12:10 +00:00
Pawel Jakub Dawidek
bcb77be2b7 Add TRIM support.
The code builds a map of regions that were freed. On every write the
code consults the map and eventually removes ranges that were freed
before, but are now overwritten.

Freed blocks are not TRIMed immediately. There is a tunable that defines
how many txg we should wait with TRIMming freed blocks (64 by default).

There is a low priority thread that TRIMs ranges when the time comes.
During TRIM we keep in-flight ranges on a list to detect colliding
writes - we have to delay writes that collide with in-flight TRIMs in
case something will be reordered and write will reached the disk before
the TRIM. We don't have to do the same for in-flight writes, as
colliding writes just remove ranges to TRIM.

Sponsored by:	multiplay.co.uk

This work includes some important fixes and some improvements obtained
from the zfsonlinux project, including TRIMming entire vdevs on pool
create/add/attach and on pool import for spare and cache vdevs.

Obtained from:	zfsonlinux
Submitted by:	Etienne Dechamps <etienne.dechamps@ovh.net>
2012-09-23 19:40:58 +00:00
Andriy Gapon
81c4584e30 zfs: allow a zvol to be used as a pool vdev, again
Do this by checking if spa_namespace_lock is already held and not taking
it again in that case.
Add a comment explaining why that is done and why it is safe.

Reviewed by:	pjd
MFC after:	24 days
2012-09-22 17:42:53 +00:00
Pawel Jakub Dawidek
3c5a057574 As in r226967, r226987 and r232401 changes to UFS and TMPFS remove cache
entries associated with the source and the target of rename().

MFC after:	1 week
2012-09-22 17:32:40 +00:00
Andriy Gapon
6ed9e9f32f zfs: correctly calculate dn_bonuslen for saving SAs to disk
Since all attribute values start at 8-byte aligned boundary, we would
previously incorrectly calculate dn_bonuslen if any attribute but the
last had a variable-length value with length not multiple of 8.

Reported by:	Nicolas Rachinsky <fbsd-mas-0@ml.turing-complete.org>
Tested by:	Nicolas Rachinsky <fbsd-mas-0@ml.turing-complete.org>
Reviewed by:	Matthew Ahrens <mahrens@delphix.com> (for upstream)
MFC after:	2 weeks
2012-09-18 08:02:54 +00:00
Andriy Gapon
ea559fb573 zfs: allow both DEBUG and ZFS_DEBUG to be defined on command line
Discussed with:	pjd
MFC after:	10 days
2012-09-18 08:00:56 +00:00
Martin Matuska
4c5238d576 Merge recent zfs vendor changes, sync code and adjust userland DEBUG.
Illumos issued covered:
1884 Empty "used" field for zfs *space commands
3006 VERIFY[S,U,P] and ASSERT[S,U,P] frequently check if first argument
     is zero
3028 zfs {group,user}space -n prints (null) instead of numeric GID/UID
3048 zfs {user,group}space [-s|-S] is broken
3049 zfs {user,group}space -t doesn't really filter the results
3060 zfs {user,group}space -H output isn't tab-delimited
3061 zfs {user,group}space -o doesn't use specified fields order
3064 usr/src/cmd/zpool/zpool_main.c misspells "successful"
3093 zfs {user,group}space's -i is noop
3098 zfs userspace/groupspace fail without saying why when run as non-root

References:
  https://www.illumos.org/issues/ + [issue_id]

Obtained from:	illumos (vendor/illumos, vendor/illumos-sys)
MFC after:	2 weeks
2012-09-12 18:05:43 +00:00
Andriy Gapon
9c5c3cafbd zfs: fix sa_modify_attrs handling of variable-sized attributes
- skip length_idx index for a replaced variable-sized attribute
- skip length_idx index for a removed variable-sized attribute
- also re-arranged code to make sure that length_idx is always
  incremented for variable-sized attributes
- additionally add an assertion that the number of actually produced
  attributes is the same as the expected number of resulting
  attributes

In cooperation with:	Matthew Ahrens <mahrens@delphix.com>
Tested by:	Trent Nelson <trent@snakebite.org>
Reviewed by:	Matthew Ahrens <mahrens@delphix.com> (for upstream)
To do:		get this upstreamed
MFC after:	2 weeks
2012-09-11 07:07:52 +00:00
Martin Matuska
6643637f67 Add assfail() and assfail3() to the opensolaris module.
Remove obsoleted intermediate cddl/compat/opensolaris/sys/debug.h.

MFC after:	2 weeks
2012-09-10 10:24:57 +00:00