Commit Graph

604 Commits

Author SHA1 Message Date
Martin Matuska
9689178c3f MFV r246633:
Import vendor bugfixes regarding SA rounding, header size and layout.
This was already partially fixed by avg.

Illumos ZFS issues:
  3512 rounding discrepancy in sa_find_sizes()
  3513 mismatch between SA header size and layout

References:
  https://www.illumos.org/issues/3512
  https://www.illumos.org/issues/3513

MFC after:	2 weeks
2013-02-11 14:29:38 +00:00
Martin Matuska
bb03847418 MFV r246394:
Add tunable to allow block allocation on degraded vdevs.

Illumos ZFS issues:
  3507 Tunable to allow block allocation even on degraded vdevs

References:
  https://www.illumos.org/issues/3507

MFC after:	2 weeks
2013-02-11 13:59:57 +00:00
Martin Matuska
ff20578569 MFV r246392:
Import vendor ZFS bugfix fixing a possible deadlock in arc_read().

Illumos ZFS issues:
  3498 panic in arc_read(): !refcount_is_zero(&pbuf->b_hdr->b_refcnt)

References:
  https://www.illumos.org/issues/3498

MFC after:	2 weeks
2013-02-11 12:42:11 +00:00
Martin Matuska
8a2dc7faae MFV r246390:
Import minor type change in refcount.h header from vendor (illumos).

MFC after:	2 weeks
2013-02-11 07:48:57 +00:00
Martin Matuska
fd9778c236 MFV r246388:
Import vendor bugfixes

Illumos ZFS issues:
  3422 zpool create/syseventd race yield non-importable pool
  3425 first write to a new zvol can fail with EFBIG

References:
  https://www.illumos.org/issues/3422
  https://www.illumos.org/issues/3425

MFC after:	2 weeks
2013-02-10 19:32:55 +00:00
Xin LI
ef17620fc8 MFV r245512:
* Illumos zfs issue #3035 [1] LZ4 compression support in ZFS.

LZ4 is a new high-speed BSD-licensed compression algorithm created
by Yann Collet that delivers very high compression and decompression
performance compared to lzjb (>50% faster on compression, >80% faster
on decompression and around 3x faster on compression of incompressible
data), while giving better compression ratio [1].

This version of LZ4 corresponds to upstream's [2] revision 85.

Please note that for obvious reasons this is not backward read
compatible.  This means once a pool has LZ4 compressed data, that
data can no longer be read by older ZFS implementations.

Local changes:

 - On-stack hash table disabled and using kernel slab allocator
   instead, at this time.  This requires larger kernel thread stack
   for zio workers.  This may change in the future should we adjust
   the zio workers' thread stack size.
 - likely and unlikely will be undefined if they are already defined,
   this is required for i386 XEN build.
 - Removed the De Bruijn sequence based fallback for the __builtin_ctz
   family of builtins in favor of the builtins themselves.  Both GCC
   and clang support these builtins.
 - Changed the way the LZ4 code detects endianness.
 - Manual pages modifications to mention the feature based on Illumos
   counterpart.
 - Boot loader changes to make it support LZ4 decompression.

[1] https://www.illumos.org/issues/3035
[2] http://code.google.com/p/lz4/source/list

Obtained from:	Illumos (13921:9d721847e469)
Tested on:	FreeBSD/amd64
MFC after:	1 month
2013-02-09 06:39:28 +00:00
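
The "Local changes" list in the LZ4 commit above mentions a De Bruijn sequence based routine and the __builtin_ctz family. As a rough illustration of what the two approaches compute, here is a minimal Python model of a 32-bit count-trailing-zeros lookup. The constant 0x077CB531 is an assumption (a widely used 32-bit De Bruijn constant), and the function names are hypothetical; none of this is taken from the imported C code.

```python
# Illustrative model only; the LZ4 code in the commit is C and is not reproduced here.
DEBRUIJN32 = 0x077CB531  # assumed: a widely used 32-bit De Bruijn constant
MASK32 = 0xFFFFFFFF

# Build the lookup table: the top 5 bits of (1 << i) * DEBRUIJN32 differ for every i.
TABLE = [0] * 32
for i in range(32):
    TABLE[((DEBRUIJN32 << i) & MASK32) >> 27] = i


def ctz32_debruijn(v):
    """Count trailing zero bits of a non-zero 32-bit value via table lookup."""
    lowest = v & -v  # isolate the lowest set bit
    return TABLE[((lowest * DEBRUIJN32) & MASK32) >> 27]


def ctz32_direct(v):
    """Reference result, standing in for what a ctz builtin computes."""
    return (v & -v).bit_length() - 1
```

Both functions are undefined for zero input, which mirrors the usual contract of a count-trailing-zeros builtin.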
Andriy Gapon
0dcab786b8 zfs_vget, zfs_fhtovp: properly handle the z_shares_dir object
A special gfs vnode corresponds to that object.
A regular zfs vnode must not be returned.

This should be upstreamed.

Reported by:	pluknet
Submitted by:	rmacklem
Tested by:	pluknet
MFC after:	10 days
2013-02-08 07:49:54 +00:00
Andriy Gapon
e2bb19dce5 zfs: update comments about zfid_long_t to match the FreeBSD definitions
MFC after:	1 week
2013-02-08 07:44:15 +00:00
Andriy Gapon
c7d346f269 zfs: fix, improve and re-organize page_lookup and page_unlock
Now they are split into two pairs: page_hold/page_unhold for mappedread
and page_busy/page_unbusy for update_pages.

For mappedread we simply hold a page that is to be used as a source if it
is resident and valid (and not busy).  This is sufficient since we are
only doing page -> user buffer copying.  There is no page <-> backing
storage I/O involved.

update_pages is now better split to properly handle the putpages case
(page -> arc) and the regular write case (arc -> page).

For the latter we use complete protocol of marking an object with
paging-in-progress and marking a page with io_start (busy count).
Also, in this case we remove the write bit from all page mappings and
clear dirty bits of the pages, the former is needed to ensure that the
latter does the right thing.
Additionally we update a page if it is cached instead of just freeing it
as was done before.  This needs to be verified.

A minor detail: ZFS-backed pages should always be either fully valid
or fully invalid.  Assert this and use simpler API that does not deal
with sub-page blocks.

Reviewed by:	kib
MFC after:	26 days
2013-02-03 18:42:20 +00:00
Andriy Gapon
13235aaa89 zfs: add MODULE_VERSION for zfsctrl
This should allow the kernel linker to easily detect a situation
when the module is present both in a kernel and in a preloaded file
(zfs.ko).

Reviewed by:	jhb
MFC after:	5 days
2013-02-02 11:35:18 +00:00
Andriy Gapon
ea84c62f93 spa_generate_rootconf: add support for old vdev labels
It seems that old ZFS versions (v15) completely omit "vdev_children"
property when there is a single child.

Reported by:	jase
Tested by:	jase
MFC after:	1 week
2013-01-26 10:34:17 +00:00
Xin LI
5c74885e99 MFV r245510:
improve the comment in txg.c

Obtained from:	Illumos (13910:f3454e0a097c)
MFC after:	2 weeks
2013-01-16 22:59:50 +00:00
Konstantin Belousov
614b9f9130 For zfs vnodes, use the standard inode number based hash algorithm.
Reviewed and tested by:	peter
Sponsored by:	The FreeBSD Foundation
MFC after:	5 days
2013-01-14 05:45:33 +00:00
Xin LI
290a1ba9a4 The current ZFS code expects ddt_zap_count to always succeed by asserting
the underlying zap_count() to return no errors.  However, it is possible
that a pool reaches a state where zap_count would return an error,
leading to panics when a pool is imported.

This commit changes ddt_zap_count to return the error returned from
zap_count and handle the error appropriately.  With this change, it's now
possible to let zpool roll back damaged transaction groups and import the
pool.

Obtained from:	ZFS on Linux github (e8fd45a0f9)
MFC after:	1 month
2013-01-10 19:26:56 +00:00
Andriy Gapon
f71fbb1d12 zfs: solaris doesn't have KM_ZERO, kmem_zalloc should be used instead
To do:		remove KM_ZERO declaration
Pointyhat to:	avg (for mindlessly using the pseudo-flag)
MFC after:	instantly (to fix stable/8 build)
2012-12-23 19:58:41 +00:00
Steven Hartland
5780c4a723 Added vfs.zfs.vdev.trim_on_init sysctl which allows full vdev trim on
initialisation to be enabled (1) or disabled (0); it defaults to enabled.

This is useful for devices which have a slow trim speed and are either
new or have otherwise already been wiped e.g. secure erase.

PR:     kern/173116
Submitted by:   Steven Hartland
Approved by:    pjd (mentor)
2012-12-13 17:39:07 +00:00
Steven Hartland
c440a359ca Upgrades trim free request sizes before inserting them into the free map,
making range consolidation much more effective particularly for small
deletes.

This reduces memory used by the free map as well as reducing the number
of bio requests down to geom required to process all deletes.

In tests this achieved a factor of 10 reduction of trim ranges / geom
call downs.

While I'm here correct the description of zio_vdev_io_start.

PR:		kern/173254
Submitted by:	Steven Hartland
Approved by:	pjd (mentor)
2012-12-13 17:06:38 +00:00
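
To see why enlarging request sizes can help range consolidation (commit c440a359ca above), consider this minimal Python model. It assumes that "upgrading" means rounding each freed size up to a device block size, which is an interpretation and not confirmed by the commit message; `insert_free` and `round_up` are hypothetical helpers, not functions from the ZFS code.

```python
def round_up(value, align):
    """Round value up to the next multiple of align."""
    return -(-value // align) * align


def insert_free(free_map, offset, size, block_size):
    """Insert a freed range into a list of (start, end) ranges.

    The size is first rounded up to block_size (assumed behaviour), then
    every range that overlaps or touches the new one is merged into it.
    """
    start = offset
    end = offset + round_up(size, block_size)
    merged = []
    for s, e in free_map:
        if e < start or s > end:  # disjoint and not adjacent: keep as is
            merged.append((s, e))
        else:  # overlapping or touching: absorb into the new range
            start = min(start, s)
            end = max(end, e)
    merged.append((start, end))
    merged.sort()
    return merged
```

With a 4096-byte block size, three 512-byte frees at offsets 0, 4096 and 8192 collapse into one range in this model, whereas without rounding they stay as three separate ranges.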
Steven Hartland
7150222c0a Renamed zfs trim stats removing duplicate zio_trim identifier from the name
Added description option to kstats.
Added descriptions for zio_trim kstats

PR:		kern/173113
Submitted by:	Steven Hartland
Reviewed by:	pjd
Approved by:	pjd
MFC after:	2 weeks
2012-12-12 16:14:14 +00:00
Xin LI
2740382ebd Use SA_ZPL_CRTIME instead of SA_ZPL_CTIME for creation time.
Submitted by:	phil.stone at gmx.com
MFC after:	2 weeks
2012-12-03 04:25:37 +00:00
Andriy Gapon
289b3b96ac zfs_getpages: make use of vm_page_readahead_finish
Suggested by:	kib
MFC after:	5 days
2012-12-01 18:13:53 +00:00
Andriy Gapon
992ffc58ae gfs_file_inactive: replace bad code with ugly code
Also, make it explicit that V_XATTRDIR is not properly supported in gfs
code yet.

The bad code was plain incorrect: (a) it spoiled handling of v_usecount
reaching zero and (b) it leaked v_holdcnt.

The ugly code employs potentially unsafe locking tricks.

Ideally we should separate vnode lifecycle and gfs node lifecycle.
A gfs node should have its own reference count where its child nodes
should be accounted.

PR:		kern/151111
Reviewed by:	kib
MFC after:	13 days
2012-12-01 18:12:55 +00:00
Martin Matuska
7faa32552f MFV r243395:
Introduce a new dataset aclmode setting "restricted" to protect ACLs
from being destroyed or corrupted by a drive-by chmod.

illumos-gate 13889:a67716f16746
3254 add support in zfs for aclmode=restricted

References:
https://www.illumos.org/issues/3254

MFC after:	2 weeks
2012-11-26 12:24:39 +00:00
Martin Matuska
53e5858c68 Add loader(8) tunable to enable/disable nopwrite functionality:
vfs.zfs.nopwrite_enabled

MFC after:	2 weeks
2012-11-25 16:54:43 +00:00
Martin Matuska
dd801aa546 MFV r243013 and r243267:
Import the zio nop-write improvement from Illumos. To reduce I/O,
nop-write omits overwriting data if the checksum (cryptographically
secure) of new data matches the checksum of existing data.
It also saves space if snapshots are in use.

It currently works only on datasets with enabled compression, disabled
deduplication and sha256 checksums.

IllumOS 13887:196932ec9e6a and 13888:7204b3392a58
3236 zio nop-write

References:
https://www.illumos.org/issues/3236

MFC after:	2 weeks
2012-11-25 16:32:07 +00:00
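
The nop-write decision described in commit dd801aa546 above can be sketched as follows. This is a simplified Python model under stated assumptions: the function name and parameters are hypothetical, and the real code works on zio structures and block pointers rather than raw byte strings.

```python
import hashlib


def should_skip_write(new_data, old_checksum, *, compression, dedup, checksum):
    """Return True if the write could be omitted under the nop-write rules.

    Eligibility follows the commit message: compression enabled,
    deduplication disabled, and sha256 checksums in use.
    """
    if not compression or dedup or checksum != "sha256":
        return False
    # Same cryptographically secure checksum: treat the data as unchanged.
    return hashlib.sha256(new_data).digest() == old_checksum
```

For example, rewriting a block with identical contents on an eligible dataset would be skipped in this model, while the same rewrite on a dataset with deduplication enabled would not.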
Andriy Gapon
3a0e1b57bb zfs_freebsd_reclaim: remove a stray variable
... which leaked from a subsequent local change.
Unfortunately I noticed that only after commit.

MFC after:	5 weeks
X-MFC with:	r243520
2012-11-25 15:46:29 +00:00
Andriy Gapon
4ff1c77d22 zfs: overhaul zfs-vfs glue for vnode life-cycle management
* There is no need for the delayed destruction of znodes via taskqueue,
now that we do not need to fear recursion from getnewvnode into
zfs_inactive and zfs_freebsd_reclaim, thus making znode/vnode state
machine a bit simpler.

* More complete porting of zfs_inactive from Solaris VFS model to FreeBSD
vop_inactive and vop_reclaim model.  All destructive actions are done
in zfs_freebsd_reclaim.
This allows the zfs_zget logic to be simplified.

* Allow zfs_zget to return a doomed vnode if the current thread already
has an exclusive lock on the vnode.

* Clean up Solaris-isms like bailing out of reclaim/inactive on certain
values of v_usecount (aka v_count) or directly messing with this counter.

* Do not clear z_vnode while znode is still accessible.
z_vnode should be cleared only after zfs_znode_dmu_fini.
Otherwise zfs_zget may get an effectively half-deconstructed znode.
This allows the zfs_zget logic to be simplified further.

The above changes fix at least two known/reported problems:

o An indefinite wait in the following code path:
vgone -> VOP_RECLAIM -> zfs_freebsd_reclaim -> vnode_destroy_vobject ->
put_pages -> zfs_write -> zil_commit -> zfs_zget
This happened because vgone marks a vnode as VI_DOOMED before calling
VOP_RECLAIM, but zfs_zget would not return a doomed vnode under any
circumstances.
The fix in this change is not complete as it won't fix a deadlock between
two threads doing VOP_RECLAIM where one thread is in zil_commit trying to
zfs_zget a znode/vnode being reclaimed by the other thread, which would be
blocked trying to enter zil_commit.  This type of deadlock has not been
reported as of now.

o An indefinite wait in the unmount path caused by a znode "falling through
the cracks" in inactive+reclaim.  This would happen if the znode is unlinked
while its vnode is still active.

To Do: pass locking flags parameter to zfs_zget, so that the zfs-vfs
glue code doesn't have to re-lock a vnode but could ask for proper locking
from the very start.  This would also allow for the higher level code to
obtain a doomed vnode when it is expected/requested.  Or to avoid blocking
when it is not allowed (see zil_commit example above).

ffs_vgetf seems like a good source of inspiration.

Tested by:	Willem Jan Withagen <wjw@digiware.nl>
MFC after:	6 weeks
2012-11-25 15:33:26 +00:00
Andriy Gapon
7ca5310ea3 zfs_fhtovp: there is no reason to amend lock flags with LK_RETRY here
MFC after:	12 days
2012-11-25 15:07:27 +00:00
Andriy Gapon
7192f62bcc add zfs_bmap to aid vnode_pager_haspage
... otherwise zfs_getpages would mostly be called with one page at a time.

It is expected that ZFS VOP_BMAP is only called from vnode_pager_haspage.
Since ZFS files can have variable block sizes and also because we don't
really know if any given blocks are consecutive, we can not really report
any additional blocks behind or ahead of a given block.  Since physical
block numbers do not make sense for ZFS, we do not do any real translation
and thus pass back blk = lblk.  The net effect is that vnode_pager_haspage
knows that the block exists and that the pages backed by the block can be
accessed.  vnode_pager_haspage may be wrong about the exact count of the
pages backed by the block, because of a variable block size, which
vnode_pager_haspage doesn't really know - it only knows max block size in
a filesystem.  So pages from multiple blocks can be passed to zfs_getpages,
but that is expected and correctly handled.

vnode_pager should not call zfs_bmap for any other reason, because ZFS
implements VOP_PUTPAGES and thus vnode_pager_generic_getpages is not used.

vfs_cluster code and vfs_bio code should not be called for ZFS, because ZFS
does not use the buffer cache layer.

Also, ZFS does not use vn_bmap_seekhole, it has its private mechanism for
working with holes.

The above list should cover all the current calls to VOP_BMAP.

Reviewed by:	kib
MFC after:	6 weeks
2012-11-25 15:01:12 +00:00
Andriy Gapon
b609e5f891 zfs_getpages: optimize for large block sizes
MFC after:	6 weeks
2012-11-25 14:53:26 +00:00
Martin Matuska
2f06dfc9a3 MFV r243012:
Illumos 13886:e3261d03efbf

3349 zpool upgrade -V bumps the on disk version number, but leaves
     the in core version

References:
https://www.illumos.org/issues/3349

MFC after:	1 week
2012-11-25 10:53:42 +00:00
Martin Matuska
2b8d4033cc MFV r242735:
Illumos 13879:4eac7a87eff2:
3329 spa_sync() spends 10-20% of its time in spa_free_sync_cb()
3330 space_seg_t should have its own kmem_cache
3331 deferred frees should happen after sync_pass 1
3335 make SYNC_PASS_* constants tunable

New loader-only tunables:
vfs.zfs.sync_pass_deferred_free
vfs.zfs.sync_pass_dont_compress
vfs.zfs.sync_pass_rewrite

References:
https://www.illumos.org/issues/3329
https://www.illumos.org/issues/3330
https://www.illumos.org/issues/3331
https://www.illumos.org/issues/3335

MFC after:	2 weeks
2012-11-25 09:06:32 +00:00
Andriy Gapon
328998eac1 zfs rootpool: add support for multi-vdev configurations
Tested by:	madpilot
MFC after:	10 days
2012-11-24 13:23:15 +00:00
Andriy Gapon
e1fccde2c9 spa_import_rootpool: initialize ub_version before calling spa_config_parse
... because the latter makes some decision based on the version.
This is especially important for raidz vdevs.
This is similar to what spa_load does.

This is not an issue for upstream because they do not seem to support
using raidz as a root pool.

Reported by:	Andrei Lavreniyuk <andy.lavr@gmail.com>
Tested by:	Andrei Lavreniyuk <andy.lavr@gmail.com>
MFC after:	6 days
2012-11-24 13:16:49 +00:00
Andriy Gapon
cfca00a2fb spa_import_rootpool: do not call spa_history_log_version
The call is a NOP, because pool version in spa_ubsync.ub_version is not
initialized and thus appears to be zero.
If the version is properly set then the call leads to a NULL pointer
dereference because the spa object is still under-constructed.

The same change was independently made in the upstream as a part of
a larger change (4445fffbbb1ea25fd0e9ea68b9380dd7a6709025).

MFC after:	6 days
2012-11-24 13:14:53 +00:00
Andriy Gapon
c4f59a3c09 zfs: create devices/geoms from zvols after receiving them
PR:		kern/167066
Tested by:	Andreas Nilsson <andrnils@gmail.com>
MFC after:	13 days
2012-11-24 13:07:31 +00:00
Andriy Gapon
dbe922173c zfs_remove: assert that delete_now case is never true on FreeBSD
That case is specific to Solaris VFS and it would violate pretty
fundamental contracts of FreeBSD VFS.

Discussed with:	pjd
MFC after:	12 days
2012-11-19 11:30:08 +00:00
Andriy Gapon
7b069f7fee zfs_remove: set VV_NOSYNC flag if a node is unlinked
Suggested by:	kib
MFC after:	12 days
2012-11-19 11:25:20 +00:00
Andriy Gapon
62875a6b10 spa_import_rootpool: fall back to use configuration from zpool.cache...
if we fail to generate a proper root pool configuration based on disk
probing.  Currently we can not properly generate the configuration for
multi-vdev pools.  Make that explicit.

Reported by:	madpilot, Bartosz Stec <bartosz.stec@it4pro.pl>
Tested by:	madpilot, Bartosz Stec <bartosz.stec@it4pro.pl>
MFC after:	4 days
2012-11-18 11:47:25 +00:00
Konstantin Belousov
f13b5a0f01 Add the wait6(2) system call. It takes POSIX waitid()-like process
designator to select a process which is waited for. The system call
optionally returns siginfo_t which would be otherwise provided to
SIGCHLD handler, as well as extended structure accounting for child
and cumulative grandchild resource usage.

Allow to get the current rusage information for non-exited processes
as well, similar to Solaris.

The explicit WEXITED flag is required to wait for exited processes,
allowing for more fine-grained control of the events the waiter is
interested in.

Fix the handling of siginfo for WNOWAIT option for all wait*(2)
family, by not removing the queued signal state.

PR:	standards/170346
Submitted by:	"Jukka A. Ukkonen" <jau@iki.fi>
MFC after:	1 month
2012-11-13 12:52:31 +00:00
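
The WEXITED and WNOWAIT semantics mentioned in the wait6(2) commit above can be tried out with the POSIX waitid() interface that wait6 is modelled on. The sketch below uses Python's os.waitid, not wait6 itself (which Python's standard library does not expose), so it shows only the process designator and flag behaviour, not the extended resource usage structure. The helper names are hypothetical, and it assumes a Unix platform where os.waitid is available.

```python
import os


def spawn_child(exit_code):
    """Fork a child that exits immediately with exit_code; return its pid."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)  # child: exit without running parent cleanup
    return pid


def peek_then_reap(pid):
    """Wait for pid twice: first without reaping, then reaping it.

    WEXITED must be given explicitly to wait for exited processes.
    WNOWAIT leaves the child in a waitable state, so the second call
    still finds it.
    """
    peek = os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    final = os.waitid(os.P_PID, pid, os.WEXITED)
    return peek, final
```

Both results carry siginfo-style fields such as si_pid, si_status and si_code, which is the information a SIGCHLD handler would otherwise receive.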
Andriy Gapon
7631b580ff zfs_ioc_destroy_snaps_nvl: remove disk device entries for zvol snapshots
... before trying to destroy the zvol snapshots themselves.

PR:		kern/173442
Reported by:	Petri Helenius <petri@helenius.fi>,
		mm
Obtained from:	Brian Behlendorf <behlendorf1@llnl.gov>,
		Illumos Bug #3170
Tested by:	Petri Helenius <petri@helenius.fi>
MFC after:	10 days
2012-11-10 12:22:26 +00:00
Xin LI
a9b09a3f3c MFV r242729 (mm):
Illumos r13840:97fd5cdf328a:

3145 single-copy arc
3212 ztest: race condition between vdev_online() and spa_vdev_remove()

Illumos r13849:3468a95b27cd:

3258 ztest's use of file descriptors is unstable
2012-11-10 01:52:52 +00:00
Attilio Rao
bc2258da88 Complete MPSAFE VFS interface and remove MNTK_MPSAFE flag.
Porters should refer to __FreeBSD_version 1000021 for this change as
it may have happened at the same timeframe.
2012-11-09 18:02:25 +00:00
Justin Hibbits
c757049235 Implement DTrace for PowerPC. This includes both 32-bit and 64-bit.
There is one known issue:  Some probes will display an error message along the
lines of:  "Invalid address (0)"

I tested this with both a simple dtrace probe and dtruss on a few different
binaries on 32-bit.  I only compiled 64-bit, did not run it, but I don't expect
problems without the modules loaded.  Volunteers are welcome.

MFC after:	1 month
2012-11-07 23:45:09 +00:00
Andriy Gapon
2c6024ec1b zfs_dirlook: bailout early if directory is unlinked
Otherwise we could fail with an incorrect error if e.g. parent
object id is removed too or we can even return a wrong vnode if
parent object has been already re-used.

Discussed with:	pjd
Also see:	http://article.gmane.org/gmane.os.freebsd.devel.file-systems/13863
MFC after:	26 days
2012-11-04 14:50:08 +00:00
Andriy Gapon
5c997cc429 zfsctl_snapdir_lookup: obtain a snapname in the remount case
... which is triggered if somebody did regular umount on a snapshot mount.

Reviewed by:	Matthew Ahrens <mahrens@delphix.com>
MFC after:	20 days
2012-11-04 14:43:15 +00:00
Andriy Gapon
88c8884a71 zfs: set MNTK_EXTENDED_SHARED flag
Discussed with:	kib
MFC after:	20 days
2012-11-04 14:36:11 +00:00
Andriy Gapon
71900cfaf7 zfs_vnode_forget: dispose of larvae vnode using public vfs api (mostly)
Reviewed by:	kib
MFC after:	19 days
2012-11-04 14:24:00 +00:00
Andriy Gapon
a16e534dbe zfs_umount: no need to set MNTK_UNMOUNTF here, dounmount handles that
Reviewed by:	kib
MFC after:	19 days
2012-11-04 14:22:25 +00:00
Andriy Gapon
62eeeb8ff8 zfs_vnode_lock: no need to double-guess caller's intentions here
vn_lock should do the right thing with respect to given vnode lock
flags.  If a caller doesn't mind a doomed vnode, then zfs should deliver.

Reviewed by:	kib
MFC after:	19 days
2012-11-04 14:15:13 +00:00
Andriy Gapon
d548e8b66f zfs_mount: drop vfs.zfs.rootpool.prefer_cached_config tunable
It turned out to be not that useful, because its default value may lead
to a problem when a root pool is present in zpool.cache, but its
on-disk status is 'exported'.  This may happen if the pool was imported
in a different environment with -f flag and then exported.

MFC after:	12 days
2012-11-04 13:50:08 +00:00
Andriy Gapon
7ac8ca0d58 zfs_freebsd_close: call zfs_close with count=1 instead of count=0
Otherwise we may be leaking z_sync_cnt, which may lead to unnecessary
ZIL sync-ing.

MFC after:	12 days
2012-11-04 13:48:48 +00:00
Xin LI
7f24254add s/dettach/detach/g
Approved by:	pjd
MFC after:	1 month
2012-10-30 01:29:45 +00:00
Andriy Gapon
86812da016 zfs: fix label validation code in vdev_geom_read_config
POOL_STATE_SPARE and POOL_STATE_L2CACHE were not handled correctly
and thus the cache and spare disks would not be correctly probed.

Reported by:	Michael Schmiedgen <schmiedgen@gmx.net>,
		Matthew D. Fuller <fullermd@over-yonder.net>
Tested by:	Michael Schmiedgen <schmiedgen@gmx.net>,
		flo
MFC after:	5 days
2012-10-26 14:50:16 +00:00
Konstantin Belousov
5050aa86cf Remove the support for using non-mpsafe filesystem modules.
In particular, do not lock Giant conditionally when calling into the
filesystem module, remove the VFS_LOCK_GIANT() and related
macros. Stop handling buffers belonging to non-mpsafe filesystems.

The VFS_VERSION is bumped to indicate the interface change which does
not result in the interface signatures changes.

Conducted and reviewed by:	attilio
Tested by:	pho
2012-10-22 17:50:54 +00:00
Andriy Gapon
42abbab471 zfs: wait in arc_lowmem only if curproc == pageproc
... otherwise the current thread might be holding ARC locks and thus run
into a deadlock.  This happens, for example, when a thread does memory
allocation in the ARC code and runs into KVA shortage.
Also, it really makes the most sense to wait in pageproc, so that the
results of ARC reclamation are seen before the page cache is acted upon.
In other cases where vm_lowmem is invoked, e.g. on KVA space shortage,
the callers perform multiple attempts (up to 8) and wait for rather
long intervals between them (up to 4 seconds), so ARC reclaim results
should become visible even without explicit waiting on the ARC thread.

Note that this is not a critical issue for typical ZFS usages where KVA
space should already be large enough.  On amd64 systems setting KVA size
to twice the physical memory size is known to mitigate KVA fragmentation
issues in practice.

Side note: perhaps vm_lowmem 'how' parameter should be used to
differentiate between causes of the event.

Reported by:	Nikolay Denev <ndenev@gmail.com>
MFC after:	19 days
2012-10-20 10:02:18 +00:00
Andriy Gapon
edb085b8ab zfs: make use of getnewvnode_reserve in zfs_mknode and zfs_zget
getnewvnode_reserve helps to avoid "recursing" back into zfs code
via getnewvnode when that latter needs to reclaim some vnodes.
zfs code may hold a number of locks around getnewvnode and doesn't
expect any recursion to happen on those locks, because that never
happens in solaris.

I believe that this change also eliminates a need for the delayed
znode destruction via the taskqueue.

Many thanks to kib for devising getnewvnode_reserve.

Reported by:	flo
Tested by:	bapt, kwm, swills
MFC after:	2 weeks
X-MFC after:	r241556
2012-10-17 10:59:56 +00:00
Kevin Lo
9823d52705 Revert previous commit...
Pointyhat to:	kevlo (myself)
2012-10-10 08:36:38 +00:00
Kevin Lo
a10cee30c9 Prefer NULL over 0 for pointers
2012-10-09 08:27:40 +00:00
Andriy Gapon
8bf749ef3a zvol: set mediasize in geom provider right upon its creation
... instead of deferring the action until first open.
Unlike upstream, deferring it has no benefit on FreeBSD.
We know that as soon as the provider is created it is going to be tasted
and thus opened.  Initial mediasize of zero causes tasting failure
and subsequent retasting because of the size change.

MFC after:	14 days
2012-10-06 19:57:27 +00:00
Andriy Gapon
61e100ee3b zfs_mount: taste geom providers for root pool config
This should allow to mount a dataset as a root filesystem even if
it belongs to a pool that is not described in zpool.cache.
This adds some overhead to the boot process though.

If the root filesystem's pool is found in zpool.cache, then by default
its cached configuration will be used for import.
vfs.zfs.rootpool.prefer_cached_config could be set to zero to force
the config to be retasted.

Discussed with:	gibbs, pjd, des
MFC after:	25 days
2012-10-06 19:33:47 +00:00
Martin Matuska
8469b12c2e Merge recent vendor changes in ZFS.
Illumos issues covered:
2811 missing implementation: zfs send -r
3139 zdb dies when it tries to determine path of unlinked file
3189 kernel panic in ZFS test suite during hotspare_onoffline_004_neg
3208 moving zpool cross-endian results in incorrect user/group accounting

References:
  https://www.illumos.org/issues/ + [issue_id]

Obtained from:	illumos (vendor/illumos, vendor/illumos-sys)
MFC after:	2 weeks
2012-09-26 09:37:58 +00:00
Pawel Jakub Dawidek
c622f88dd2 It is possible to recursively destroy snapshots even if the snapshot
doesn't exist on a dataset we are starting from. For example if we
have the following configuration:

	tank
	tank/foo
	tank/foo@snap
	tank/bar
	tank/bar@snap

We can execute:

	# zfs destroy -r tank@snap

even though tank@snap doesn't exist.

Unfortunately it is not possible to do the same with recursive rename:

	# zfs rename -r tank@snap tank@pans
	cannot open 'tank@snap': dataset does not exist

...until now. This change allows snapshots to be renamed recursively even if
the snapshot doesn't exist on the starting dataset.

Sponsored by:	rsync.net
MFC after:	2 weeks
2012-09-23 20:12:10 +00:00
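
The behaviour described in commit c622f88dd2 above can be modelled in a few lines. This Python sketch represents the pool as a dictionary from dataset name to a set of snapshot names; the function name, the data layout and the error raised when nothing matches are all assumptions made for illustration, not the libzfs implementation.

```python
def rename_snapshots_recursive(snapshots, root, old, new):
    """Rename snapshot `old` to `new` on root and all of its descendants.

    Datasets that lack the snapshot, including root itself, are skipped
    instead of causing a failure. Returns the datasets that were changed.
    """
    renamed = []
    for dataset in sorted(snapshots):
        in_subtree = dataset == root or dataset.startswith(root + "/")
        if in_subtree and old in snapshots[dataset]:
            snapshots[dataset].remove(old)
            snapshots[dataset].add(new)
            renamed.append(dataset)
    if not renamed:
        # Assumed behaviour: report an error only if no dataset matched.
        raise LookupError(f"no snapshot '@{old}' under '{root}'")
    return renamed
```

Applied to the configuration from the commit message, renaming "snap" to "pans" under "tank" changes tank/foo and tank/bar even though tank itself has no such snapshot.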
Pawel Jakub Dawidek
bcb77be2b7 Add TRIM support.
The code builds a map of regions that were freed. On every write the
code consults the map and eventually removes ranges that were freed
before, but are now overwritten.

Freed blocks are not TRIMed immediately. There is a tunable that defines
how many txgs we should wait before TRIMming freed blocks (64 by default).

There is a low priority thread that TRIMs ranges when the time comes.
During TRIM we keep in-flight ranges on a list to detect colliding
writes - we have to delay writes that collide with in-flight TRIMs in
case something is reordered and the write reaches the disk before
the TRIM. We don't have to do the same for in-flight writes, as
colliding writes just remove ranges to TRIM.

Sponsored by:	multiplay.co.uk

This work includes some important fixes and some improvements obtained
from the zfsonlinux project, including TRIMming entire vdevs on pool
create/add/attach and on pool import for spare and cache vdevs.

Obtained from:	zfsonlinux
Submitted by:	Etienne Dechamps <etienne.dechamps@ovh.net>
2012-09-23 19:40:58 +00:00
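
The bookkeeping described in the TRIM commit above (bcb77be2b7) can be approximated by a small model: freed ranges are recorded with their txg, writes carve overwritten parts out of the map, and ranges become eligible for TRIM only after a delay. The class below is a hypothetical Python sketch; it does not model the low priority thread or the in-flight collision handling, and whether the comparison against the delay is inclusive is an assumption.

```python
class TrimMap:
    """Toy model of a map of freed regions awaiting TRIM."""

    def __init__(self, txg_delay=64):
        self.txg_delay = txg_delay  # 64 is the default named in the commit
        self.freed = []  # list of (start, end, txg), end exclusive

    def free(self, start, end, txg):
        """Record that [start, end) was freed in transaction group txg."""
        self.freed.append((start, end, txg))

    def write(self, start, end):
        """Remove the overwritten part [start, end) from every freed range."""
        kept = []
        for s, e, txg in self.freed:
            if e <= start or s >= end:  # no overlap with the write
                kept.append((s, e, txg))
                continue
            if s < start:  # piece before the write survives
                kept.append((s, start, txg))
            if e > end:  # piece after the write survives
                kept.append((end, e, txg))
        self.freed = kept

    def take_due(self, current_txg):
        """Return and drop the ranges that have waited at least txg_delay txgs."""
        due = [(s, e) for s, e, t in self.freed
               if current_txg - t >= self.txg_delay]
        self.freed = [r for r in self.freed
                      if current_txg - r[2] < self.txg_delay]
        return due
```

A range freed in txg 10 and partly overwritten later yields only its untouched pieces once txg 74 is reached in this model.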
Andriy Gapon
81c4584e30 zfs: allow a zvol to be used as a pool vdev, again
Do this by checking if spa_namespace_lock is already held and not taking
it again in that case.
Add a comment explaining why that is done and why it is safe.

Reviewed by:	pjd
MFC after:	24 days
2012-09-22 17:42:53 +00:00
Pawel Jakub Dawidek
3c5a057574 As in r226967, r226987 and r232401 changes to UFS and TMPFS remove cache
entries associated with the source and the target of rename().

MFC after:	1 week
2012-09-22 17:32:40 +00:00
Andriy Gapon
6ed9e9f32f zfs: correctly calculate dn_bonuslen for saving SAs to disk
Since all attribute values start at 8-byte aligned boundary, we would
previously incorrectly calculate dn_bonuslen if any attribute but the
last had a variable-length value with length not multiple of 8.

Reported by:	Nicolas Rachinsky <fbsd-mas-0@ml.turing-complete.org>
Tested by:	Nicolas Rachinsky <fbsd-mas-0@ml.turing-complete.org>
Reviewed by:	Matthew Ahrens <mahrens@delphix.com> (for upstream)
MFC after:	2 weeks
2012-09-18 08:02:54 +00:00
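
The alignment rule behind commit 6ed9e9f32f above can be illustrated with a small calculation. The Python sketch below is a model under assumptions: `header_len` stands in for the SA header size, the function names are hypothetical, and `length_ignoring_alignment` is only a guess at the kind of miscalculation involved, not the actual pre-fix code.

```python
def round_up8(n):
    """Round n up to the next multiple of 8."""
    return (n + 7) & ~7


def length_with_alignment(header_len, attr_lens):
    """Total length when every attribute value starts on an 8-byte boundary."""
    offset = round_up8(header_len)
    for length in attr_lens:
        offset = round_up8(offset)  # pad up to the next aligned start
        offset += length
    return offset


def length_ignoring_alignment(header_len, attr_lens):
    """Hypothetical calculation that forgets the padding between attributes."""
    return round_up8(header_len) + sum(attr_lens)
```

With an 8-byte header and attribute lengths 8, 5 and 8, the aligned layout needs 32 bytes while the calculation without padding gives 29. When the odd-sized attribute is last (8, 8, 5) both give 29, which matches the commit's remark that only attributes other than the last one trigger the problem.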
Andriy Gapon
ea559fb573 zfs: allow both DEBUG and ZFS_DEBUG to be defined on command line
Discussed with:	pjd
MFC after:	10 days
2012-09-18 08:00:56 +00:00
Martin Matuska
4c5238d576 Merge recent zfs vendor changes, sync code and adjust userland DEBUG.
Illumos issues covered:
1884 Empty "used" field for zfs *space commands
3006 VERIFY[S,U,P] and ASSERT[S,U,P] frequently check if first argument
     is zero
3028 zfs {group,user}space -n prints (null) instead of numeric GID/UID
3048 zfs {user,group}space [-s|-S] is broken
3049 zfs {user,group}space -t doesn't really filter the results
3060 zfs {user,group}space -H output isn't tab-delimited
3061 zfs {user,group}space -o doesn't use specified fields order
3064 usr/src/cmd/zpool/zpool_main.c misspells "successful"
3093 zfs {user,group}space's -i is noop
3098 zfs userspace/groupspace fail without saying why when run as non-root

References:
  https://www.illumos.org/issues/ + [issue_id]

Obtained from:	illumos (vendor/illumos, vendor/illumos-sys)
MFC after:	2 weeks
2012-09-12 18:05:43 +00:00
Andriy Gapon
9c5c3cafbd zfs: fix sa_modify_attrs handling of variable-sized attributes
- skip length_idx index for a replaced variable-sized attribute
- skip length_idx index for a removed variable-sized attribute
- also re-arranged code to make sure that length_idx is always
  incremented for variable-sized attributes
- additionally add an assertion that the number of actually produced
  attributes is the same as the expected number of resulting
  attributes

In cooperation with:	Matthew Ahrens <mahrens@delphix.com>
Tested by:	Trent Nelson <trent@snakebite.org>
Reviewed by:	Matthew Ahrens <mahrens@delphix.com> (for upstream)
To do:		get this upstreamed
MFC after:	2 weeks
2012-09-11 07:07:52 +00:00
Martin Matuska
6643637f67 Add assfail() and assfail3() to the opensolaris module.
Remove obsoleted intermediate cddl/compat/opensolaris/sys/debug.h.

MFC after:	2 weeks
2012-09-10 10:24:57 +00:00
Martin Matuska
4a24a25b2f Merge recent vendor changes and sync code:
1862 incremental zfs receive fails for sparse file > 8PB
3112 ztest does not honor ZFS_DEBUG
3122 zfs destroy filesystem should prefetch blocks
3129 'zpool reopen' restarts resilvers
3130 ztest failure: Assertion failed:
       0 == dmu_objset_destroy(name, B_FALSE) (0x0 == 0x10)

References:
  https://www.illumos.org/issues/1862
  https://www.illumos.org/issues/3112
  https://www.illumos.org/issues/3122
  https://www.illumos.org/issues/3129
  https://www.illumos.org/issues/3130

Obtained from:	illumos (vendor/illumos, vendor/illumos-sys)
MFC after:	2 weeks
2012-09-05 12:02:09 +00:00
Ed Schouten
3f0fb35417 Use a proper destructor function.
When calling a revoke(2) on a dtrace device, dtrace_close() could be
called, even if threads are still stuck in the device. Defer the actual
deallocation of datastructures to the cdevpriv destructor.

While there, remove the unneeded D_TRACKCLOSE and D_NEEDMINOR flags. For
the helper device, we never need it. For the regular dtrace devices, we
only need these flags on FreeBSD pre-8.

MFC after:	1 month
2012-08-28 18:33:12 +00:00
Martin Matuska
6e767def16 Merge recent vendor changes:
3100 zvol rename fails with EBUSY when dirty
3104 eliminate empty bpobjs
3120 zinject hangs in zfsdev_ioctl() due to uninitialized zc

References:
  https://www.illumos.org/issues/3100
  https://www.illumos.org/issues/3104
  https://www.illumos.org/issues/3120

Obtained from:	illumos (vendor/illumos, vendor/illumos-sys)
MFC after:	2 weeks
2012-08-28 12:25:37 +00:00
Martin Matuska
671303c6d5 Merge recent vendor changes:
3086 unnecessarily setting DS_FLAG_INCONSISTENT on async destroyed datasets
3090 vdev_reopen() during reguid causes vdev to be treated as corrupt
3102 vdev_uberblock_load() and vdev_validate() may read the wrong label

References:
  https://www.illumos.org/issues/3086
  https://www.illumos.org/issues/3090
  https://www.illumos.org/issues/3102

PR:		kern/170912, kern/170914
Obtained from:	illumos (changeset #13776, #13777)
MFC after:	2 weeks
2012-08-23 19:32:57 +00:00
Martin Matuska
bb9b1f7a8b Backport fix for vendor issue #3085
3085 zfs diff panics, then panics in a loop on booting

References:
  https://www.illumos.org/issues/3085

PR:		kern/170763
Obtained from:	ssh://anonhg@hg.illumos.org/illumos-gate (r13772)
MFC after:	1 week
2012-08-19 09:59:41 +00:00
Hans Petter Selasky
07da61a6cc Streamline use of cdevpriv and correct some corner cases.
1) It is not useful to call "devfs_clear_cdevpriv()" from
"d_close" callbacks, since, for example, read, write, ioctl and
so on might be sleeping when "d_close" is called, and the
freed private data can then still be accessed.
Examples: dtrace, linux_compat, ksyms (all fixed by this patch)

2) In sys/dev/drm* there are some cases in which memory will
be freed twice, if open fails, first by code in the open
routine, secondly by the cdevpriv destructor. Move registration
of the cdevpriv to the end of the drm open routines.

3) devfs_clear_cdevpriv() is not called if the "d_open" callback
registered cdevpriv data and the "d_open" callback function
returned an error. Fix this.

Discussed with:	phk
MFC after:	2 weeks
2012-08-15 16:19:39 +00:00
Marius Strobl
787c338407 Include <vm/vm_param.h> for PA_LOCK_COUNT in order to fix kernel build
with options ZFS after r239065.
2012-08-05 20:19:27 +00:00
Martin Matuska
e9832bb1da Partial MFV (illumos-gate 13753:2aba784c276b)
2762 zpool command should have better support for feature flags

References:
https://www.illumos.org/issues/2762

MFC after:	2 weeks
2012-07-30 23:14:24 +00:00
Edward Tomasz Napierala
b75ca29147 Make ZVOL resizing ('zfs set volsize') properly resize the GEOM provider.
Sponsored by:	FreeBSD Foundation
2012-07-20 16:56:34 +00:00
Pawel Jakub Dawidek
92484a2615 vdev_io_done stage is not used for ioctls.
MFC after:	1 week
2012-07-04 17:39:29 +00:00
Martin Matuska
a6a8d8377f Expose scrub and resilver tunables.
This allows the user to tune the priority trade-off between scrub/resilver
and other ZFS I/O.

MFC after:	2 weeks
Discussed with:	pjd
2012-07-02 07:27:14 +00:00
Pedro F. Giffuni
9a9df34345 Bump dtrace_helper_actions_max from 32 to 128
Dave Pacheco from Joyent (and Dtrace.org) bumped the cap to 1024 but,
according to his blog, 128 is the recommended minimum.

For now bump it safely to 128 although we may have to bump it further
if there is demand in the future.

Reference:

http://www.illumos.org/issues/2558
http://dtrace.org/blogs/dap/2012/01/50/where-does-your-node-program-spend-its-time/
2012-06-29 18:49:14 +00:00
Pedro F. Giffuni
675cf9154b Bring llquantize support into Dtrace.
Bryan Cantrill implemented the equivalent of semi-log graph
paper for Dtrace so llquantize will use one logarithmic and
one linear scale.

Special thanks to Mark Peek for providing a fix for an
assertion and to Fabian Keill for testing the port.

Illumos Revision:	13355:15b74a2a9a9d

Reference:
https://www.illumos.org/issues/905

Obtained from:	Illumos
Tested by:	Fabian Keill, mp
MFC after:	4 days
2012-06-27 04:39:30 +00:00
Martin Matuska
de37372f73 Import Illumos revision 13736:9f1d48e1681f
2901 ZFS receive fails for exabyte sparse files

References:
https://www.illumos.org/issues/2901

Obtained from:	illumos (issue #2901)
MFC after:	1 week
2012-06-22 20:42:11 +00:00
Martin Matuska
2d9cf57e18 Introduce "feature flags" for ZFS pools (bump SPA version to 5000).
Add first feature "com.delphix:async_destroy" (asynchronous destroy
of ZFS datasets).
Implement features support in ZFS boot code.

Illumos revisions merged:
13700:2889e2596bd6
13701:1949b688d5fb
2619 asynchronous destruction of ZFS file systems
2747 SPA versioning with zfs feature flags

References:
https://www.illumos.org/issues/2619
https://www.illumos.org/issues/2747

Obtained from:	illumos (issue #2619, #2747)
MFC after:	1 month
2012-06-11 11:35:22 +00:00
Pawel Jakub Dawidek
25892bfc61 ds_guid of 0 is special, as it is used by snapshot receive code to
differentiate between an incremental and full stream.
Be sure not to generate guid equal to 0.

Reported by:	someone who saw 0 being generated as 64bit random guid
MFC after:	3 days
2012-06-09 20:16:19 +00:00
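
The ds_guid commit above reserves 0 as a marker and asks that a generated guid never equal it. A minimal sketch of that idea follows, assuming a retry loop; the names generate_nonzero_guid, rand64_fn, seq_state and seq_next are hypothetical and are not taken from the ZFS sources, and the random source is passed in so the behaviour can be checked deterministically.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical source of 64-bit values; stands in for the kernel's RNG. */
typedef uint64_t (*rand64_fn)(void *state);

/* Draw values until one is nonzero, so that 0 stays reserved as a marker. */
uint64_t
generate_nonzero_guid(rand64_fn next, void *state)
{
	uint64_t guid;

	do {
		guid = next(state);
	} while (guid == 0);

	assert(guid != 0);
	return (guid);
}

/* Deterministic stand-in generator: replays a fixed sequence of values. */
struct seq_state {
	const uint64_t *values;
	size_t count;
	size_t pos;
};

uint64_t
seq_next(void *state)
{
	struct seq_state *s = state;
	uint64_t v = s->values[s->pos % s->count];

	s->pos++;
	return (v);
}
```

Feeding the loop a sequence that starts with zeros shows that those draws are skipped and the first nonzero value is returned.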
Pawel Jakub Dawidek
97e9ad8ec4 Tighten up the assertion: because size can't be 0 and even if sm_space is equal
to sm_size, any 'sm_space - size' will be less than sm_size.

MFC after:	3 days
2012-05-29 18:11:45 +00:00
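
The reasoning in the assertion commit above can be written out as a small check. This is only a sketch: the message does not name the function that holds the assertion, so space_map_model and space_map_model_remove are hypothetical, and the preconditions (size is nonzero, size does not exceed sm_space, sm_space does not exceed sm_size) are assumptions drawn from the wording of the message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, trimmed-down model: only the two fields the message names. */
struct space_map_model {
	uint64_t sm_size;	/* total size covered by the map */
	uint64_t sm_space;	/* space currently accounted for */
};

/* Remove 'size' bytes of accounted space, checking the strict bound. */
void
space_map_model_remove(struct space_map_model *sm, uint64_t size)
{
	assert(size != 0);
	assert(size <= sm->sm_space);
	assert(sm->sm_space <= sm->sm_size);

	/*
	 * size >= 1 and sm_space <= sm_size, so the difference is
	 * strictly below sm_size; '<' can replace '<='.
	 */
	assert(sm->sm_space - size < sm->sm_size);
	sm->sm_space -= size;
}
```

Even in the boundary case where sm_space equals sm_size, removing the smallest possible size of 1 leaves sm_size - 1, which is why the strict comparison holds.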
Pawel Jakub Dawidek
837a617728 Eliminate 'where' argument, we don't use it.
MFC after:	3 days
2012-05-29 18:09:14 +00:00
Pawel Jakub Dawidek
8ac2669cc8 Remove unused variable.
MFC after:	3 days
2012-05-29 18:05:24 +00:00
Pawel Jakub Dawidek
e21c77d804 Remove unused sysctl.
MFC after:	3 days
2012-05-29 17:53:11 +00:00
Martin Matuska
2182d44714 Import illumos changeset 13570:3411fd5f1589
1948 zpool list should show more detailed pool information

Display per-vdev information with "zpool list -v".
The added expandsize property currently has no value on FreeBSD.
This changeset allows adding expansion support to individual vdevs
in the future.

References:
https://www.illumos.org/issues/1948

Obtained from:	illumos (issue #1948)
MFC after:	2 weeks
2012-05-27 16:00:00 +00:00
Martin Matuska
d9727dc29c Import illumos changeset 13605:b5c2b5db80d6 (partial)
763 FMD msg URLs should refer to something visible

Replace sun.com URLs with illumos.org

References:
https://www.illumos.org/issues/763

Obtained from:	illumos (issue #763)
MFC after:	1 week
2012-05-27 12:31:57 +00:00
Edward Tomasz Napierala
9280affe16 Fix enforcement of file size limit with O_APPEND on ZFS.
vn_rlimit_fsize takes uio->uio_offset and uio->uio_resid into account
when determining whether a given write would exceed RLIMIT_FSIZE.

When the APPEND flag is specified, ZFS updates uio->uio_offset to point
to the end of file.

But this happens after the call to vn_rlimit_fsize, so the vn_rlimit_fsize
check can be rendered ineffective by a thread that opens a file with
O_APPEND and lseeks below RLIMIT_FSIZE before calling write.

Submitted by:	Mateusz Guzik <mjguzik at gmail dot com>
MFC after:	2 weeks
2012-05-22 10:54:42 +00:00
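
The ordering problem described in the O_APPEND commit above can be modelled in user space. The sketch below is not the kernel code: struct write_req is a hypothetical, simplified stand-in for struct uio, exceeds_fsize_limit only approximates what vn_rlimit_fsize decides, and the second function shows one ordering that avoids the problem, which is an assumption about the shape of the fix.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified write request (not the kernel's struct uio). */
struct write_req {
	uint64_t offset;	/* caller's current file position */
	uint64_t resid;		/* number of bytes to write */
	bool append;		/* true when opened with O_APPEND */
};

/* Simplified limit check: would the write end past the limit? */
bool
exceeds_fsize_limit(uint64_t offset, uint64_t resid, uint64_t limit)
{
	/* Written this way to avoid overflow in offset + resid. */
	return (resid > limit || offset > limit - resid);
}

/* Problematic order: check first, then move the offset to end of file. */
bool
write_allowed_check_first(struct write_req req, uint64_t file_size,
    uint64_t limit)
{
	bool allowed = !exceeds_fsize_limit(req.offset, req.resid, limit);

	if (req.append)
		req.offset = file_size;	/* too late: check used stale offset */
	return (allowed);
}

/* Safer order: resolve the append offset, then check. */
bool
write_allowed_offset_first(struct write_req req, uint64_t file_size,
    uint64_t limit)
{
	if (req.append)
		req.offset = file_size;
	return (!exceeds_fsize_limit(req.offset, req.resid, limit));
}
```

With a 1000-byte file, a limit of 1024 bytes and a 100-byte append issued after seeking to offset 0, the first ordering lets the write through while the second rejects it.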
Martin Matuska
a837775a9e Import illumos changeset 13686:4bc0783f6064
2703 add mechanism to report ZFS send progress

If the zfs send command is used with the -v flag, the number of bytes
transmitted is reported in per-second updates.

References:
https://www.illumos.org/issues/2703

Obtained from:	illumos (issue #2703)
MFC after:	2 weeks
2012-05-10 10:39:45 +00:00
Marius Strobl
35225ae651 Partially revert r232938; ZFS only requires nfs4 but not posix1e.
Submitted by:	jhb
2012-04-29 16:21:47 +00:00
Ryan Stone
c6024848dd Implement the D "cpu" variable, which returns curcpu. I have chosen not
to follow the example of OpenSolaris and its descendants, which implemented
cpu as an inline that took a value out of curthread.  At certain points in
the FreeBSD scheduler curthread->td_oncpu will no longer be valid (in
particular, just before the thread gets descheduled) so instead I have
implemented this as its own built-in variable.

Sponsored by:	Sandvine Inc.
MFC after:	1 week
2012-04-26 01:07:03 +00:00
Edward Tomasz Napierala
af6e6b87ad Remove unused thread argument to vrecycle().
Reviewed by:	kib
2012-04-23 14:10:34 +00:00
Attilio Rao
a0f2c37b6f - Introduce a cache-miss optimization for consistency with other
accesses of the cache member of vm_object objects.
- Use the new vm_page_is_cached() for checks outside of the vm subsystem.

Reviewed by:	alc
MFC after:	2 weeks
X-MFC:		r234039
2012-04-09 17:05:18 +00:00
Andriy Gapon
70542ee01f zfs_ioctl: no need for ddi_copyin/out here because sys_ioctl handles that
On FreeBSD the direct ioctl argument is automatically copied in/out
as necessary by the kernel ioctl entry point.

PR:		kern/164445
Submitted by:	Luis Garces-Erice <lge@ieee.org>
Tested by:	Attila Nagy <bra@fsn.hu>
MFC after:	5 days
2012-04-05 07:59:59 +00:00
Oleksandr Tymoshenko
5083ce5c09 Add MIPS support to cddl/contrib part:
- header and stub .c file for fasttrap module. It's not supported on
    MIPS yet, but there is no way to disable support completely
- Follow amd64 in trying to limit allocated memory
2012-03-24 04:52:18 +00:00