774 Commits

Author SHA1 Message Date
Xin LI
bd3d1456a5 MFV r253783:
Skip eviction step of processing free records when doing ZFS
receive to avoid the expensive search operation of non-existent
dbufs in dn_dbufs.

Illumos ZFS issues:
  3834 incremental replication of 'holey' file systems is slow

MFC after:      2 weeks
2013-07-30 21:35:02 +00:00
Xin LI
1c4ead73c6 MFV r253782:
To quote Illumos issue #3888:

When 'zfs recv -F' is used with an incremental recv it rolls
back any changes made since the last snapshot in case new
changes were made to the file system while the recv is in
progress (without -F the recv would fail when it does its
final check to commit the recv-ed data as the recv-ed data
conflicts with the newly written data).

However, if there is a snapshot taken after the recv began
rolling back to the 'latest' snapshot will not help and the
recv will still fail. 'zfs recv -F' should be extended to
destroy any snapshots created since the source snapshot when
finishing the recv (effectively rolling back through all
snapshots, instead of just to the latest snapshot).

Illumos ZFS issues:
  3888 zfs recv -F should destroy any snapshots created since the
       incremental source
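The extended rollback described above can be sketched as follows. This is a minimal model, not the ZFS implementation: the function name `snapshots_to_destroy` and the representation of a snapshot chain as a list ordered oldest to newest are assumptions made for illustration.

```python
def snapshots_to_destroy(snapshots, incremental_source):
    """Return the snapshots newer than the incremental source, newest first.

    `snapshots` is assumed to be ordered oldest to newest.
    """
    if incremental_source not in snapshots:
        raise ValueError(f"unknown incremental source: {incremental_source}")
    newer = snapshots[snapshots.index(incremental_source) + 1:]
    # Newest first, so each step rolls back to the previous snapshot.
    return list(reversed(newer))


chain = ["@mon", "@tue", "@local-1", "@local-2"]
print(snapshots_to_destroy(chain, "@tue"))
```

Rolling back only to the latest snapshot would leave `@local-1` and `@local-2` in place, which is the case where the recv still fails.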

MFC after:	2 weeks
2013-07-30 21:20:12 +00:00
Xin LI
d637247e1f MFV r253781 + r253871:
Illumos ZFS issues:
  3894 zfs should not allow snapshot of inconsistent dataset

MFC after:	2 weeks
2013-07-30 21:02:09 +00:00
Xin LI
44e362e207 MFV r253780:
To quote Illumos #3875:

The problem here is that if we ever end up in the error
path, we drop the locks protecting access to the zfsvfs_t
prior to forcibly unmounting the filesystem. Because z_os
is NULL, any thread that had already picked up the zfsvfs_t
and was sitting in ZFS_ENTER() when we dropped our locks
in zfs_resume_fs() will now acquire the lock, attempt to
use z_os, and panic.

Illumos ZFS issues:
  3875 panic in zfs_root() after failed rollback

MFC after:	2 weeks
2013-07-30 20:37:32 +00:00
Alexander Motin
ec4d2e0d96 Allow three IOCTLs to be used on a suspended pool, restoring the state that
existed before the IOCTL code refactoring merged as change 4445fffb from illumos
at r248571.

This change allows `zpool clear` to be used again to recover a suspended pool.
It seems to be the only way supported by the code to restore pool operation after
reconnecting lost disks that were required for data completeness.  There
are still cases where the `zpool clear` command can simply get stuck due to
deadlocks inside the ZFS kernel part, but that is probably better than having
no chance to recover at all.
2013-07-30 14:50:44 +00:00
Alexander Motin
698cd997d6 Partially close the race between calls of the orphan() method from GEOM and
the close() method from ZFS core, which reliably causes a use-after-free panic if
an SSD vdev is detached during the initial erase.
2013-07-28 20:07:34 +00:00
Alexander Motin
ffacde9be5 Following r222950, revert unintentional change cls -> class in argument name
in r245264.  Aside from non-uniformity, that again confused C++ compilers.
2013-07-25 08:41:22 +00:00
Andriy Gapon
f66c1f6482 zfs module: perform cleanup during shutdown in addition to module unload
- move init and fini code into separate functions (like it is done upstream)
- invoke fini code via shutdown_post_sync event hook

This should make zfs close its underlying devices during shutdown,
which may be important for their drivers.

MFC after:	20 days
2013-07-24 09:59:16 +00:00
Andriy Gapon
886dbd270f zfs: move vnode creation from zfs_znode_cache_constructor to zfs_znode_alloc
All other places where a znode is allocated do not need z_vnode at all.
These are:
- zfs_create_share_dir
- zfs_create_fs

This change ensures two things:
- VN_LOCK_ASHARE is not erroneously called for VFIFO vnodes
- vn_lock is called on a fully constructed vnode with correct v_ops

The change also allows zfs_znode_cache_constructor to become a normal
kmem_cache constructor again (as it is upstream).
This avoids a problem where zfs_znode_cache_destructor
may be called on unconstructed znodes.

MFC after:	17 days
2013-07-24 09:15:59 +00:00
Xin LI
c92bc5e996 Manually merge part of vendor import r238583 from Illumos.
Illumos changeset: 13680:2bd022a765e2
Illumos ZFS issue:

    2671 zpool import should not fail if vdev ashift has increased

MFC after:	3 days
2013-07-18 00:22:42 +00:00
Andriy Gapon
37b8b2d4d8 dtrace/fasttrap: install hook functions only after all data is
initialized

Sponsored by:	HybridCluster
MFC after:	 7 days
2013-07-09 09:05:00 +00:00
Andriy Gapon
9c1f50af0a zfs: try to properly handle i/o errors in mappedread_sf
Unconditionally freeing a page is not good, especially if it is the page
that was wired by the caller.  The checks are picked up from
kern_sendfile.

MFC after:	3 weeks
2013-07-09 08:47:11 +00:00
Andriy Gapon
78ed7a7855 zfs: load zpool.cache after a root fs is mounted
MFC after:	3 weeks
2013-07-09 08:37:42 +00:00
Mark Johnston
46d27dbb38 Hide references to mod_lock. In FreeBSD it is always acquired with the
provider lock held, so its use has no effect.
2013-07-05 22:42:10 +00:00
Martin Matuska
12df7d65b0 MFV r252839:
Quoting illumos issue #3836:
  Currently zio_free() always puts the zio on a list for subsequent
  processing by zio_free_sync().  This is only necessary for frees that
  might need to issue reads (gang and dedup blocks).

  By processing the majority of the frees as we encounter them, we reduce
  the amount of time that the spa_sync() thread spends burning CPU and
  not doing any i/o, thus increasing the overall write throughput of the
  system.

Illumos ZFS issues:
  3836 zio_free() can be processed immediately in the common case
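The split between immediate and deferred frees can be modelled roughly like this. It is a toy sketch under stated assumptions: the `FreeProcessor` class, the dictionary block representation and the `gang`/`dedup` flags are hypothetical and only mirror the rule quoted from the issue.

```python
from collections import deque


def needs_deferred_free(block):
    # Per the quoted issue, only frees that might need to issue reads
    # (gang and dedup blocks) have to wait for the sync pass.
    return block.get("gang", False) or block.get("dedup", False)


class FreeProcessor:
    def __init__(self):
        self.deferred = deque()   # processed later, in the sync pass
        self.freed_now = []       # processed as they are encountered

    def free(self, block):
        if needs_deferred_free(block):
            self.deferred.append(block)
        else:
            self.freed_now.append(block)

    def sync(self):
        # Drain the deferred list and return what was processed.
        done = list(self.deferred)
        self.deferred.clear()
        return done
```

In the common case the deferred list stays short, so the sync pass has less work to do.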

MFC after:	1 week
2013-07-05 21:29:59 +00:00
Mark Johnston
0022f867b4 Be sure to destroy the fasttrap cleanup mutex when unloading the fasttrap
module. This should be MFCed with r250953.
2013-07-01 23:12:59 +00:00
Robert Millan
2592710c47 Enable kernel-specific code for FreeBSD also on other systems that use
the kernel of FreeBSD.

Reviewed by:	pjd
2013-06-30 23:14:55 +00:00
Steven Hartland
baa0b41221 Remove invalid ASSERT which causes a panic on zfs renames when run with ASSERTS.
Removal was missed in merge of illumos 3464 (r248571)

MFC after:	2 days
2013-06-29 23:15:45 +00:00
Martin Matuska
f82ca5238a Unbreak "zfs jail" and "zfs unjail" (broken since r248571)
I failed to register zfs_ioc_jail and zfs_ioc_unjail as legacy ioctls
with the new zfs_ioctl_register_legacy() function.

These operations do not modify pools or datasets so there is no need to
log them to pool history.

Reported by:	Alexander Leidinger <ale@FreeBSD.org> and others on current@
MFC after:	3 days
2013-06-29 16:45:37 +00:00
Gavin Atkinson
af582854d8 Don't try to re-insert an already present but invalid page.
This could happen if a thread doing a page-in loses a ZFS range lock
race to a thread writing to the same range.

This fixes "panic: vm_page_alloc: pindex already allocated" in
http://docs.FreeBSD.org/cgi/mid.cgi?1372165971.96049.42.camel

Submitted by:	avg
MFC after:	1 week
2013-06-28 07:51:12 +00:00
Xin LI
e33806a54a MFV r252215:
Restore the behavior from before r251646: when destroying ZFS
snapshots, the ioctl returns ENOENT when it hits that error for any of
them in the errlist (the new behavior was to return ENOENT only when
all of them returned an error).

Illumos ZFS issues:
  3829 fix for 3740 changed behavior of zfs destroy/hold/release ioctl
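The difference between the two behaviours can be illustrated with a small model. Both function names and the `results` mapping (snapshot name to errno, 0 for success) are assumptions for illustration, and the reading of "all of them returned an error" is an interpretation of the commit message rather than the actual ioctl code.

```python
import errno


def restored_result(results):
    # Restored behaviour: ENOENT as soon as any snapshot reported it.
    return errno.ENOENT if errno.ENOENT in results.values() else 0


def r251646_result(results):
    # Behaviour attributed to r251646: ENOENT only when every entry failed.
    failed = [e for e in results.values() if e != 0]
    if results and len(failed) == len(results):
        return errno.ENOENT
    return 0


mixed = {"tank@a": 0, "tank@b": errno.ENOENT}
print(restored_result(mixed), r251646_result(mixed))
```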

MFC after:	1 week
2013-06-25 22:14:32 +00:00
Steven Hartland
9446debe6b Fix intermittent ZFS lock panic when kernel is compiled with debugging caused
by access of uninitialized smlock in mmutex_init.

MFC after:	1 week
2013-06-21 15:47:10 +00:00
Steven Hartland
5f921c5911 Fixed import of destroyed ZFS pools failing due to vdev_geom incorrectly
preventing config loads from devices associated with destroyed pools.

Reviewed by:	avg
MFC after:	1 week
2013-06-21 12:02:09 +00:00
Xin LI
9625321547 MFV r251644:
Poor ZFS send / receive performance due to snapshot
hold / release processing (by smh@)

Illumos ZFS issues:
  3740 Poor ZFS send / receive performance due to snapshot
       hold / release processing

MFC after:      2 weeks
2013-06-12 07:07:06 +00:00
Xin LI
ed8fd1989f MFV r251626:
ZFS event processing should work on R/O root filesystems

Illumos ZFS issues:
  3749 zfs event processing should work on R/O root filesystems

MFC after:      2 weeks
2013-06-11 19:35:44 +00:00
Xin LI
3b245f3ee1 MFV r251624:
txg commit callbacks don't work

Illumos ZFS issues:
  3747 txg commit callbacks don't work

MFC after:      2 weeks
2013-06-11 19:29:31 +00:00
Xin LI
3f3a9cac29 MFV r251622:
ZFS shouldn't ignore errors unmounting snapshots

Illumos ZFS issues:
  3744 zfs shouldn't ignore errors unmounting snapshots

MFC after:      2 weeks
2013-06-11 19:22:20 +00:00
Xin LI
57e06a1a63 MFV r251621:
ZFS needs a refcount audit

Illumos ZFS issues:
  3741 zfs needs a refcount audit

MFC after:      2 weeks
2013-06-11 19:16:14 +00:00
Xin LI
a91afe8a8d MFV r251620:
ZFS comments need cleaner, more consistent style

Illumos ZFS issues:
  3741 zfs comments need cleaner, more consistent style

MFC after:      2 weeks
2013-06-11 19:12:06 +00:00
Xin LI
4acaabea05 MFV r251619:
ZFS needs better comments.

Illumos ZFS issues:
  3741 zfs needs better comments

MFC after:      2 weeks
2013-06-11 19:02:36 +00:00
Xin LI
9e43a32a5c MFV r251519:
* Illumos ZFS issue #3805 arc shouldn't cache freed blocks

Quote from the Illumos issue:

    ZFS should proactively evict freed blocks from the cache.

    Even though these freed blocks will never be used again, and thus
    will eventually be evicted, this causes us to use memory
    inefficiently for 2 reasons:

    1. A block that is freed has no chance of being accessed again, but
       will be kept in memory preferentially to a block that was accessed
       before it (and is thus older) but has not been freed and thus has
       at least some chance of being accessed again.

    2. We partition the ARC into several buckets:
       user data that has been accessed only once (MRU)
       metadata that has been accessed only once (MRU)
       user data that has been accessed more than once (MFU)
       metadata that has been accessed more than once (MFU)

    The user data vs metadata split is somewhat arbitrary, and the
    primary control on how much memory is used to cache data vs metadata
    is to simply try to keep the proportion the same as it has been in the
    past (each bucket "evicts against" itself).  The secondary control is
    to evict data before evicting metadata.

    Because of this bucketing, we may end up with one bucket mostly
    containing freed blocks that are very old, while another bucket has
    more recently accessed, still-allocated blocks.  Data in the useful
    bucket (with still-allocated blocks) may be evicted in preference to
    data in the useless bucket (with old, freed blocks).

    On dcenter, we saw that the MFU metadata bucket was 230MB, while the
    MFU data bucket was 27GB and the MRU metadata bucket was 256GB.
    However, the vast majority of data in the MRU metadata bucket (256GB)
    was freed blocks, and thus useless.  Meanwhile, the MFU metadata bucket
    (230MB) was constantly evicting useful blocks that will be soon needed.

    The problem of cache segmentation is a larger problem that needs more
    investigation.  However, if we stop caching freed blocks, it should
    reduce the impact of this more fundamental issue.
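Reason 1 in the quote above can be reproduced with a toy LRU cache. This is only an illustration: `TinyCache`, its single LRU list and the `run` helper are hypothetical, and the real ARC has separate MRU/MFU buckets that this sketch does not model.

```python
from collections import OrderedDict


class TinyCache:
    def __init__(self, capacity, evict_on_free):
        self.capacity = capacity
        self.evict_on_free = evict_on_free
        self.blocks = OrderedDict()   # least recently used first

    def access(self, key):
        if key in self.blocks:
            self.blocks.move_to_end(key)
        else:
            self.blocks[key] = True
            while len(self.blocks) > self.capacity:
                self.blocks.popitem(last=False)   # evict the oldest entry

    def free(self, key):
        # A freed block can never be read again, so drop it right away
        # when proactive eviction is enabled.
        if self.evict_on_free:
            self.blocks.pop(key, None)


def run(evict_on_free):
    cache = TinyCache(capacity=2, evict_on_free=evict_on_free)
    cache.access("old-live")
    cache.access("new-freed")
    cache.free("new-freed")
    cache.access("next")
    return list(cache.blocks)
```

Without proactive eviction, the freed block survives and the older but still-allocated block is pushed out; with it, the live block stays cached.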

MFC after:	2 weeks
2013-06-08 09:11:20 +00:00
Xin LI
ca8a27d4b1 MFV r251474:
* Illumos zfs issue #3137 L2ARC compression

Whether or not to compress buffers entering the L2ARC is
controlled by the "compression" setting on the dataset: when
compression is not "off", L2ARC compression is enabled.

The compression method is always LZ4 for L2ARC when enabled
because it works best for the scenario.
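The policy stated above amounts to a two-branch rule. A minimal sketch, where the function name `l2arc_compression` and the string values are assumptions used only to restate the commit message:

```python
def l2arc_compression(dataset_compression):
    """Return the L2ARC compression algorithm, or None when disabled."""
    if dataset_compression == "off":
        return None
    # Whatever the dataset uses, the L2ARC side is described as always LZ4.
    return "lz4"


for setting in ("off", "lzjb", "gzip"):
    print(setting, "->", l2arc_compression(setting))
```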

MFC after:	2 weeks
2013-06-06 23:21:41 +00:00
Mark Johnston
427bc75e19 The fasttrap provider cleans up probes asynchronously when a process with
USDT probes exits. This was previously done with a callout; however, it is
possible to sleep while holding the DTrace mutexes, so a panic will occur
on INVARIANTS kernels if the callout handler can't immediately acquire one
of these mutexes. This panic will be frequently triggered on systems where
a USDT-enabled program (perl, for instance) is often run.

This revision changes the fasttrap cleanup mechanism so that a dedicated
thread is used instead of a callout. The old behaviour is otherwise
preserved.

Reviewed by:	rpaulo
MFC after:	1 month
2013-05-24 03:29:32 +00:00
Mark Johnston
09e6105ff4 Bring back part of r249367 by adding DTrace's temporal option, which allows
users to guarantee that the output of DTrace scripts will be time-ordered.
This option is enabled by adding the line

  #pragma D option temporal

to the beginning of a script, or by adding '-x temporal' to the arguments of
dtrace(1).

This change fixes a bug in the original port of the temporal option. This
bug was causing some assertions to fail, so they had been disabled; in this
revision the assertions are working properly and are enabled.

The DTrace version number has been bumped from 1.9.0 to 1.9.1 to reflect
the language change that's being introduced.

This change corresponds to part of illumos-gate commit e5803b76927480:
  3021 option for time-ordered output from dtrace(1M)

Reviewed by:	pfg
Obtained from:	illumos
MFC after:	1 month
2013-05-12 16:26:33 +00:00
Davide Italiano
a28d25df31 In case ZFS doesn't use UMA for buffers there's no need to waste memory
creating zones that will remain empty.

Reviewed by:	pjd
2013-05-01 17:34:44 +00:00
Steven Hartland
562a9d583b Changed ZFS TRIM sysctl from vfs.zfs.trim_disable -> vfs.zfs.trim.enabled
Enabled ZFS TRIM by default

Reviewed by:	pjd (mentor)
Approved by:	pjd (mentor)
MFC after:	2 weeks
2013-04-26 11:24:20 +00:00
Martin Matuska
95b6497f5e MFV r249857:
Merge vendor bugfix for a possible deadlock related to async destroy
and improve write performance by introducing a new lock protecting
tx_open_txg.

Illumos ZFS issues:
  3642 dsl_scan_active() should not issue I/O to determine if async
       destroying is active
  3643 txg_delay should not hold the tc_lock

MFC after:	1 week
2013-04-24 21:21:03 +00:00
Martin Matuska
90eafc0bb8 The zfs synctask code restructuring introduced a new bug that makes it
impossible to set quota and reservation on pools lower than version 22.
Problem has been reported and a solution discussed with vendor.

Illumos ZFS issues:
  3739 cannot set zfs quota or reservation on pool version < 22

Reviewed by:	Matthew Ahrens <mahrens@delphix.com>
Reported by:	Steve Wills <swills@FreeBSD.org>
MFC after:	3 days
2013-04-23 06:28:35 +00:00
Pedro F. Giffuni
03836978be DTrace: Revert r249367
The following change from illumos caused DTrace to
pause in an interactive environment:

3026 libdtrace should set LD_NOLAZYLOAD=1 to help the pid provider

This was not detected during testing because it doesn't
affect scripts.

We shouldn't be changing the environment, especially since the
LD_NOLAZYLOAD option doesn't apply to our (GNU) ld.
Unfortunately the change from upstream was made in such a way
that it is very difficult to separate this change from the
others so, at least for now, it's better to just revert
everything.

Reference:
https://www.illumos.org/issues/3026

Reported by:	Navdeep Parhar and Mark Johnston
2013-04-17 02:20:17 +00:00
Pedro F. Giffuni
ddd5b8e9b4 DTrace: option for time-ordered output
Merge changes from illumos:

3021 option for time-ordered output from dtrace(1M)
3022 DTrace: keys should not affect the sort order when sorting by value
3023 it should be possible to dereference dynamic variables
3024 D integer narrowing needs some work
3025 register leak in D code generation
3026 libdtrace should set LD_NOLAZYLOAD=1 to help the pid provider

This brings yet another feature implemented in upstream DTrace.
A complete description is available here:
http://dtrace.org/blogs/ahl/2012/07/28/my-new-dtrace-favorite/

This change bumps the DT_VERS_* number to 1.9.1 in
accordance with what is done in illumos.

This change was somewhat complicated because upstream mixed many
changes into a single commit and some of the tests don't really
apply to us.

There also appear to be differences in timestamping with Solaris,
so we had to work around some assertions, making sure no regression
happened.

Special thanks to Fabian Keil for changes and testing.

Illumos Revisions:	13758:23432da34147

Reference:
https://www.illumos.org/issues/3021
https://www.illumos.org/issues/3022
https://www.illumos.org/issues/3023
https://www.illumos.org/issues/3024
https://www.illumos.org/issues/3025
https://www.illumos.org/issues/1694

Tested by:	Fabian Keil
Obtained from:	Illumos
MFC after:	1 month
2013-04-11 16:24:36 +00:00
Martin Matuska
2cb0c5e424 MFV r249354:
Merge bugfixes accepted and integrated by vendor. Underlying problems
have been reported by us and fixed in r240942 and r249196.

Illumos ZFS issues:
  3645 dmu_send_impl: possibility of pool hold leak
  3692 Panic on zfs receive of a recursive deduplicated stream

MFC after:	8 days
2013-04-11 07:40:30 +00:00
Martin Matuska
86161c3eeb Cast to (void *)(uintptr_t) on copyout and copyin of zfs_iocparm_t.zfs_cmd
MFC after:	9 days
2013-04-10 07:01:17 +00:00
Martin Matuska
83b4af1142 ZFS expects a copyout of zfs_cmd_t on an ioctl error. Our sys_ioctl()
doesn't copyout in this case.

To solve this issue a new struct zfs_iocparm_t is introduced consisting of:
- zfs_ioctl_version (future backwards compatibility purposes)
- user space pointer to zfs_cmd_t (copyin and copyout)
- size of zfs_cmd_t (verification purposes)

The copyin and copyout of zfs_cmd_t is now done the illumos (vendor) way
which makes porting of new changes easier and ensures correct behavior if
returning an error.
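The layout described above can be sketched with `ctypes`. This is an approximation under stated assumptions: `zfs_ioctl_version` and `zfs_cmd` are named in these commit messages, but the size field name `zfs_cmd_size`, the integer widths and the tiny `ZfsCmd` stand-in are hypothetical and do not reproduce the real headers.

```python
import ctypes


class ZfsCmd(ctypes.Structure):
    # Stand-in for zfs_cmd_t; the real structure is much larger.
    _fields_ = [("zc_name", ctypes.c_char * 256)]


class ZfsIocParm(ctypes.Structure):
    _fields_ = [
        ("zfs_ioctl_version", ctypes.c_uint32),  # future compatibility
        ("zfs_cmd", ctypes.c_uint64),            # user-space address of zfs_cmd_t
        ("zfs_cmd_size", ctypes.c_uint64),       # assumed name; used for verification
    ]


def make_iocparm(cmd, version):
    return ZfsIocParm(version, ctypes.addressof(cmd), ctypes.sizeof(cmd))


def size_matches(parm):
    # The receiving side can reject a request whose size does not match
    # its own idea of zfs_cmd_t before copying anything in or out.
    return parm.zfs_cmd_size == ctypes.sizeof(ZfsCmd)
```

Passing a pointer plus size instead of the structure itself is what lets the handler copy `zfs_cmd_t` out explicitly even when the ioctl fails.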

MFC after:	10 days
2013-04-09 22:27:44 +00:00
Martin Matuska
a548fef5dc MFV r249186:
Do not list read-only pools in zpool.cache
Reduce diff against vendor in unused vdev_disk.c

Illumos ZFS issues:
  3639 zpool.cache should skip over readonly pools
  3640 want automatic devid updates

MFC after:	1 week
2013-04-06 17:24:00 +00:00
Martin Matuska
95e7edacfe MFV r248660:
Merge vendor change - modify time processing in deadman thread.

Illumos ZFS issues:
  3618 ::zio dcmd does not show timestamp data

MFC after:	3 weeks
2013-04-06 17:15:47 +00:00
Martin Matuska
367437755d Provide a fix for kernel panic if receiving recursive deduplicated streams.
Problem reported to vendor.

Illumos ZFS issues:
  3692 Panic on zfs receive of a recursive deduplicated stream

MFC after:	2 weeks
2013-04-06 11:54:41 +00:00
Martin Matuska
f1b5c26470 MFV r248217:
Merge change from vendor to reduce diff only.
ZFS dtrace probes are not supported on FreeBSD yet.

Illumos ZFS issues:
  3598 want to dtrace when errors are generated in zfs

MFC after:	3 weeks
2013-04-06 10:39:38 +00:00
Martin Matuska
bae7bccf39 MFV r242816:
Import vendor change to reduce diff, no effect on FreeBSD.

Illumos ZFS issues:
  3517 importing pool with autoreplace=on and "hole" vdevs crashes syseventd
2013-04-06 08:21:37 +00:00
Andriy Gapon
9ff9b984c9 spa_open_common: fix argument to zvol_create_minors
Prior to r248571 spa_open was always called with a bare pool name,
but now it is called with a dataset name instead (spa_lookup handles
that).
So, when a ZFS root is mounted spa_open is called with a name of a root
dataset, which can very well be different from the pool name.
But zvol_create_minors should be called with the pool name, because it
performs a recursive traversal of all datasets under the name to find
all those that are volumes.
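Deriving the pool name from whatever name spa_open received can be sketched as below. The helper name `pool_name` is hypothetical; it only relies on the standard ZFS naming convention that the pool is the component before the first `/` or `@`.

```python
def pool_name(dataset):
    """Return the pool component of a dataset, snapshot or bare pool name."""
    for i, ch in enumerate(dataset):
        # '/' starts a child dataset, '@' a snapshot of the root dataset.
        if ch in "/@":
            return dataset[:i]
    return dataset


print(pool_name("tank/ROOT/default"))
```

Traversing from the pool name rather than the root dataset name is what makes the recursive search cover every volume in the pool.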

MFC after:	7 days
2013-04-03 11:06:26 +00:00
Martin Matuska
03863a70e1 Fix possible pool hold leak in dmu_send_impl()
Problem reported to vendor:
  https://www.illumos.org/issues/3645

Reported by:	Andriy Gapon <avg@FreeBSD.org>
MFC after:	15 days
2013-04-03 09:52:30 +00:00
Martin Matuska
41451f4a0e Do not check against uninitialized rc and comment out vendor code
MFC after:	16 days
2013-04-02 08:15:39 +00:00
Pedro F. Giffuni
9f4c7ba460 Dtrace: enablings on defunct providers prevent providers from unregistering
Merge change from illumos:

1368 enablings on defunct providers prevent providers from unregistering

We try to address some underlying differences between the Solaris
and FreeBSD implementations: dtrace_attach() / dtrace_detach() are
currently unimplemented in FreeBSD but the new code from illumos
makes use of taskq so some adaptations were made to dtrace_open()
and dtrace_close() to handle them appropriately.

Illumos Revision:	r13430:8e6add739e38

Reference:
https://www.illumos.org/issues/1368

Reviewed by:	gnn
Tested by:	Fabian Keil
Obtained from:	Illumos
MFC after:	3 weeks
2013-04-01 19:13:46 +00:00
Martin Matuska
20547d41f8 Call dmu_snapshot_list_next() in zvol.c with dsl_pool_config lock held
Submitted by:	Andriy Gapon <avg@FreeBSD.org>
MFC after:	17 days
2013-04-01 16:14:57 +00:00
Pedro F. Giffuni
f5678b698a Dtrace: dtrace.c erroneously checks for memory alignment on amd64.
Merge change from illumos:

3511 dtrace.c erroneously checks for memory alignment on amd64

Illumos Revision:	c93cc65

Reference:
https://www.illumos.org/issues/3511

Obtained from:	Illumos
MFC after:	3 weeks
2013-03-26 20:17:08 +00:00
Pedro F. Giffuni
5472787377 Dtrace: Add SUN MDB-like type-aware print() action.
Merge change from illumos:

1694 Add type-aware print() action

This is a very nice feature implemented in upstream Dtrace.
A complete description is available here:
http://dtrace.org/blogs/eschrock/2011/10/26/your-mdb-fell-into-my-dtrace/

This change bumps the DT_VERS_* number to 1.9.0 in
accordance with what is done in illumos.

While here also include some minor cleanups to ease further merging
and appease clang with a fix by Fabian Keil.

Illumos Revisions:	13501:c3a7090dbc16
			13483:f413e6c5d297

Reference:
https://www.illumos.org/issues/1560
https://www.illumos.org/issues/1694

Tested by:	Fabian Keil
Obtained from:	Illumos
MFC after:	1 month
2013-03-25 20:38:09 +00:00
Pedro F. Giffuni
730cecb05a Dtrace: add toupper()/tolower() and enhancements to lltostr().
Merge changes from illumos:

1451 DTrace needs toupper()/tolower() subroutines
1457 lltostr() D subroutine should take an optional base

This change bumps the DT_VERS_* number to 1.8.1 in
accordance with what is done in illumos.

The test suite we currently include is outdated and
doesn't support some updates in tst.subr.d which had to
be left out for now.

Illumos Revisions:	r13458 5e394d8db762
			r13459 c3454574dd1a

Reference:
https://www.illumos.org/issues/1451
https://www.illumos.org/issues/1457

Tested by:	Fabian Keil
Obtained from:	Illumos
MFC after:	1 month
2013-03-25 15:40:57 +00:00
Pedro F. Giffuni
f2e66d30b8 Dtrace: add optional size argument to tracemem().
Merge change from illumos:

1455 DTrace tracemem() should take an optional size argument

Our local enhancements to dt_print_bytes were equivalent to
those in illumos but we made it match the illumos version
to ease further code merges.

For now leave out tst.smallsize.d and tst.smallsize.d.out
since those don't seem to work cleanly on FreeBSD.

This change bumps the DT_VERS_* number to 1.7.1 in accordance
with what is done in illumos.

Illumos Revision:	13457:571b0355c2e3

Reference:
https://www.illumos.org/issues/1455

Tested by:	Fabian Keil
Obtained from:	Illumos
MFC after:	1 month
2013-03-24 19:12:08 +00:00
Will Andrews
58567a1b4e ZFS: Fix a panic while unmounting a busy filesystem.
This particular scenario was easily reproduced using a NFS export.  When the
first 'zfs unmount' occurred, it returned EBUSY via this path, while
vflush() had flushed references on the filesystem's root vnode, which in
turn caused its v_interlock to be destroyed.  The next time 'zfs unmount'
was called, vflush() tried to obtain this lock, which caused this panic.

Since vflush() on FreeBSD is a definitive call, there is no need to check
vfsp->vfs_count after it completes.  Simply #ifdef sun this check.

Submitted by:	avg
Reviewed by:	avg
Approved by:	ken (mentor)
MFC after:	1 month
2013-03-23 16:34:56 +00:00
Steven Hartland
def84b9736 Fix for building libzpool under i386.
Reviewed by:	pjd (mentor)
Approved by:	pjd (mentor)
MFC after:	2 weeks
2013-03-21 23:06:11 +00:00
Steven Hartland
2b114ad2a4 Add missing descriptions for ZFS sysctls
Reviewed by:	pjd (mentor)
Approved by:	pjd (mentor)
MFC after:	2 weeks
2013-03-21 11:25:21 +00:00
Steven Hartland
adea827b21 Optimisation of TRIM processing.
Previously TRIM processing was very bursty. This was made worse by the fact
that TRIM requests on SSDs are typically much slower than reads or writes.
This often resulted in stalls while large numbers of TRIMs were processed.

In addition, due to the way the TRIM thread was only woken by writes, deletes
could stall in the queue for extensive periods of time.

This patch adds a number of controls to how often the TRIM thread for each
SPA processes its outstanding delete requests.
vfs.zfs.trim.timeout: Delay TRIMs by up to this many seconds
vfs.zfs.trim.txg_delay: Delay TRIMs by up to this many TXGs (reduced to 32)
vfs.zfs.vdev.trim_max_bytes: Maximum pending TRIM bytes for a vdev
vfs.zfs.vdev.trim_max_pending: Maximum pending TRIM segments for a vdev
vfs.zfs.trim.max_interval: Maximum interval between TRIM queue processing
(seconds)

Given the most common TRIM implementation is ATA TRIM the current defaults
are targeted at that.

Reviewed by:	pjd (mentor)
Approved by:	pjd (mentor)
MFC after:	2 weeks
2013-03-21 11:02:08 +00:00
Steven Hartland
6ad46cec23 Names the ZFS TRIM thread
Reviewed by:	pjd (mentor)
Approved by:	pjd (mentor)
MFC after:	2 weeks
2013-03-21 10:41:30 +00:00
Steven Hartland
89e5b43079 TRIM cache devices based on time instead of TXGs.
Currently, the trim module uses the same algorithm for data and cache
devices when deciding to issue TRIM requests, based on how far in the
past the TXG is.

Unfortunately, this is not ideal for cache devices, because the L2ARC
doesn't use the concept of TXGs at all. In fact, when using a pool for
reading only, the L2ARC is written but the TXG counter doesn't
increase, and so no new TRIM requests are issued to the cache device.

This patch fixes the issue by using time instead of the TXG number as
the criterion for trimming on cache devices. The basic delay principle
stays the same, but parameters are expressed in seconds instead of
TXGs. The new parameters are named trim_l2arc_limit and
trim_l2arc_batch, and both default to 30 seconds.

Reviewed by:	pjd (mentor)
Approved by:	pjd (mentor)
Obtained from:	17122c31ac
MFC after:	2 weeks
2013-03-21 10:29:05 +00:00
Steven Hartland
78ad0c1c80 Improve TXG handling in the TRIM module.
This patch adds some improvements to the way the trim module considers
TXGs:

 - Free ZIOs are registered with the TXG from the ZIO itself, not the
   current SPA syncing TXG (which may be out of date);
 - L2ARC are registered with a zero TXG number, as L2ARC has no concept
   of TXGs;
 - The TXG limit for issuing TRIMs is now computed from the last synced
   TXG, not the currently syncing TXG. Indeed, under extremely unlikely
   race conditions, there is a risk we could trim blocks which have been
   freed in a TXG that has not finished syncing, resulting in potential
   data corruption in case of a crash.
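The third point can be illustrated with a small model of the eligibility check. The function names, the dictionary segment representation and the exact comparison are assumptions made for this sketch; only the idea of deriving the limit from the last synced TXG comes from the commit message.

```python
def trim_txg_limit(last_synced_txg, txg_delay):
    # Derive the limit from the last *synced* TXG, so nothing freed in a
    # TXG that is still syncing can ever fall under it.
    return max(last_synced_txg - txg_delay, 0)


def trimmable(segments, last_synced_txg, txg_delay):
    limit = trim_txg_limit(last_synced_txg, txg_delay)
    return [s for s in segments if s["txg"] <= limit]


segments = [
    {"id": "a", "txg": 90},
    {"id": "b", "txg": 99},
    {"id": "l2", "txg": 0},   # L2ARC entries are registered with TXG zero
]
print([s["id"] for s in trimmable(segments, last_synced_txg=100, txg_delay=5)])
```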

Reviewed by:	pjd (mentor)
Approved by:	pjd (mentor)
Obtained from:	5b46ad40d9
MFC after:	2 weeks
2013-03-21 10:16:10 +00:00
Steven Hartland
e07e3a3792 Don't register repair writes in the trim map.
The trim map inflight writes tree assumes non-conflicting writes, i.e.
that there will never be two simultaneous write I/Os to the same range
on the same vdev. This seemed like a sane assumption; however, in
actual testing, it appears that repair I/Os can very well conflict
with "normal" writes.

I'm not quite sure if these conflicting writes are supposed to happen
or not, but in the mean time, let's ignore repair writes for now. This
should be safe considering that, by definition, we never repair blocks
that are freed.

Reviewed by:	pjd (mentor)
Approved by:	pjd (mentor)
Obtained from:	6a3cebaf7c
2013-03-21 10:02:32 +00:00
Steven Hartland
e05aad2d33 Add TRIM support for L2ARC.
This adds TRIM support to cache vdevs. When ARC buffers are removed
from the L2ARC in arc_hdr_destroy(), arc_release() or l2arc_evict(),
the size previously occupied by the buffer gets scheduled for TRIMming.
As always, actual TRIMs are only issued to the L2ARC after
txg_trim_limit.

Reviewed by:	pjd (mentor)
Approved by:	pjd (mentor)
Obtained from:	31aae37399
MFC after:	2 weeks
2013-03-21 09:34:41 +00:00
Martin Matuska
192d547574 Release hold on pool before calling zvol_create_minor()
2013-03-20 09:56:20 +00:00
Martin Matuska
a0abc0d302 Run zvol_create_minors() only if in non-error case
2013-03-19 22:27:15 +00:00
Martin Matuska
e56718d734 Run zvol_create_minors() on snapshot creation
2013-03-19 22:14:50 +00:00
Martin Matuska
07091d8f14 MFV r247580:
Merge synctask code restructuring from vendor.

Modify forward and backward compatibility to support new change.

Illumos ZFS issues:
  3464 zfs synctask code needs restructuring

Sponsored by:	Hybrid Logic Ltd.
2013-03-19 12:51:18 +00:00
Martin Matuska
520268fb97 MFC @248493
2013-03-19 11:09:15 +00:00
Martin Matuska
87a5cb4650 Plug memory leak in dsl_check_snap_cb()
This was unnoticed because the function is very rarely used.

MFC after:	3 days
2013-03-19 07:47:51 +00:00
Martin Matuska
a602517b63 Add missing zvol_create_minors() on zfs_ioc_create()
2013-03-18 20:22:40 +00:00
Martin Matuska
876a84e867 MFC @248461
2013-03-18 09:39:51 +00:00
Martin Matuska
6f4accc2de Move common zfs ioctl compatibility functions (userland) into libzfs_compat.c
Introduce additional constants for zfs ioctl versions
2013-03-18 09:32:29 +00:00
Justin Hibbits
80a5635c8b Add FBT for PowerPC DTrace. Also, clean up the DTrace assembly code,
much of which is not necessary for PowerPC.

The FBT module can likely be factored into 3 separate files: common,
intel, and powerpc, rather than duplicating most of the code between
the x86 and PowerPC flavors.

All DTrace modules for PowerPC will be MFC'd together once Fasttrap is
completed.
2013-03-18 05:30:18 +00:00
Martin Matuska
af2e40ccd1 Merge libzfs_core part of r239388
Illumos ZFS issues:
  3085 zfs diff panics, then panics in a loop on booting

References:
  https://www.illumos.org/issues/3085
2013-03-17 18:49:11 +00:00
Martin Matuska
70b0720877 Fix accidentally changed ioc variable for old v15 compatibility
2013-03-17 17:28:06 +00:00
Martin Matuska
6cf922c88b Fix typo in sysctl description
Reported by:	Jeremy Chadwick
MFC after:	3 days
2013-03-17 15:53:27 +00:00
Martin Matuska
e2b4467975 libzfs_core:
- provide complete backwards compatibility (old utility, new kernel)
  - add zfs_cmd_t compatibility mapping in both directions
  - determine ioctl address in zfs_ioctl_compat.c
2013-03-17 10:57:04 +00:00
Martin Matuska
4f33cfb284 Initialize "error" variable where illumos does.
2013-03-16 20:28:38 +00:00
Martin Matuska
a03fbc7ecf MFC @248093
2013-03-09 11:57:51 +00:00
Attilio Rao
89f6b8632c Switch the vm_object mutex to be a rwlock. This will enable in the
future further optimizations where the vm_object lock will be held
in read mode most of the time the page cache resident pool of pages
are accessed for reading purposes.

The change is mostly mechanical, but a few notes follow:
* The KPI changes as follows:
  - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK()
  - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK()
  - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK()
  - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED()
    (in order to avoid visibility of implementation details)
  - The read-mode operations are added:
    VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(),
    VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED()
* The vm/vm_pager.h namespace pollution avoidance (which required
  consumers to include sys/mutex.h directly to cater for its inline
  functions using VM_OBJECT_LOCK()) means that all vm/vm_pager.h
  consumers must now also include sys/rwlock.h.
* zfs requires a rather convoluted fix to include FreeBSD rwlocks in
  the compat layer, because the name clash between the FreeBSD and
  Solaris versions must be avoided.
  For this purpose zfs redefines the vm_object locking functions
  directly, isolating the FreeBSD components in specific compat stubs.

The KPI is heavily broken by this commit.  Third-party ports must
be updated accordingly (I can think off-hand of VirtualBox, for example).

Sponsored by:	EMC / Isilon storage division
Reviewed by:	jeff
Reviewed by:	pjd (ZFS specific review)
Discussed with:	alc
Tested by:	pho
2013-03-09 02:32:23 +00:00
Martin Matuska
91edf37414 Comment out unfeasible illumos copyin code and restore previous behavior. 2013-03-07 23:45:16 +00:00
Martin Matuska
b3b6851c78 Add missing init functions
Reduce diff to illumos
2013-03-06 11:33:25 +00:00
Xin LI
2f79ac7f21 Diff reduction with Illumos 2013-03-06 01:21:56 +00:00
Xin LI
227d24fc59 Use adx2 instead of adx in the second vsprintf, this fixes a
panic.
2013-03-05 22:58:53 +00:00
Martin Matuska
400c4069a5 MFV r247845:
Import ZFS bpobj bugfix from vendor.

Illumos ZFS issues:
  3603 panic from bpobj_enqueue_subobj()
  3604 zdb should print bpobjs more verbosely

References:
  https://www.illumos.org/issues/3603
  https://www.illumos.org/issues/3604

MFC after:	1 week
2013-03-05 18:54:41 +00:00
Martin Matuska
dce1a726f2 WiP merge of libzfs_core (MFV r238590, r238592)
not yet working, ioctl handling needs to be changed
2013-03-05 08:09:53 +00:00
Justin T. Gibbs
7e2a739f03 Fix assertion failure when using userland DTrace probes from
the pid provider on a kernel compiled with INVARIANTS.

sys/cddl/contrib/opensolaris/uts/intel/dtrace/fasttrap_isa.c:
	In fasttrap_probe_pid(), attempts to write to the
	address space of the thread that fired the probe
	must be performed with the process of the thread
	held.  Use _PHOLD() to ensure this is the case.

	In fasttrap_probe_pid(), use proc_write_regs() instead
	of calling set_regs() directly.  proc_write_regs()
	performs invariant checks to verify the calling
	environment of set_regs().  PROC_LOCK()/UNLOCK() around
	the call to proc_write_regs() so that its invariants
	are satisfied.

Sponsored by:	Spectra Logic Corporation
Reviewed by:	gnn, rpaulo
MFC after:	1 week
2013-03-04 22:07:36 +00:00
Pawel Jakub Dawidek
2609222ab4 Merge Capsicum overhaul:
- Capability is no longer a separate descriptor type. Now every descriptor
  has its own set of capability rights.

- The cap_new(2) system call is left, but it is no longer documented and
  should not be used in new code.

- The new syscall cap_rights_limit(2), which limits capability rights of
  the given descriptor without creating a new one, should be used instead
  of cap_new(2).

- The cap_getrights(2) syscall is renamed to cap_rights_get(2).

- If the CAP_IOCTL capability right is present we can further reduce the
  list of allowed ioctls with the new cap_ioctls_limit(2) syscall. The list
  of allowed ioctls can be retrieved with the cap_ioctls_get(2) syscall.

- If the CAP_FCNTL capability right is present we can further reduce the
  fcntls that can be used with the new cap_fcntls_limit(2) syscall and
  retrieve them with cap_fcntls_get(2).

- To support ioctl and fcntl white-listing the filedesc structure was
  heavily modified.

- The audit subsystem, kdump and procstat tools were updated to
  recognize new syscalls.

- Capability rights were revised and even though I tried hard to provide
  backward API and ABI compatibility, there are some incompatible changes
  that are described in detail below:

	CAP_CREATE old behaviour:
	- Allow for openat(2)+O_CREAT.
	- Allow for linkat(2).
	- Allow for symlinkat(2).
	CAP_CREATE new behaviour:
	- Allow for openat(2)+O_CREAT.

	Added CAP_LINKAT:
	- Allow for linkat(2). ABI: Reuses CAP_RMDIR bit.
	- Allow to be target for renameat(2).

	Added CAP_SYMLINKAT:
	- Allow for symlinkat(2).

	Removed CAP_DELETE. Old behaviour:
	- Allow for unlinkat(2) when removing non-directory object.
	- Allow to be source for renameat(2).

	Removed CAP_RMDIR. Old behaviour:
	- Allow for unlinkat(2) when removing directory.

	Added CAP_RENAMEAT:
	- Required for source directory for the renameat(2) syscall.

	Added CAP_UNLINKAT (effectively it replaces CAP_DELETE and CAP_RMDIR):
	- Allow for unlinkat(2) on any object.
	- Required if target of renameat(2) exists and will be removed by this
	  call.

	Removed CAP_MAPEXEC.

	CAP_MMAP old behaviour:
	- Allow for mmap(2) with any combination of PROT_NONE, PROT_READ and
	  PROT_WRITE.
	CAP_MMAP new behaviour:
	- Allow for mmap(2)+PROT_NONE.

	Added CAP_MMAP_R:
	- Allow for mmap(PROT_READ).
	Added CAP_MMAP_W:
	- Allow for mmap(PROT_WRITE).
	Added CAP_MMAP_X:
	- Allow for mmap(PROT_EXEC).
	Added CAP_MMAP_RW:
	- Allow for mmap(PROT_READ | PROT_WRITE).
	Added CAP_MMAP_RX:
	- Allow for mmap(PROT_READ | PROT_EXEC).
	Added CAP_MMAP_WX:
	- Allow for mmap(PROT_WRITE | PROT_EXEC).
	Added CAP_MMAP_RWX:
	- Allow for mmap(PROT_READ | PROT_WRITE | PROT_EXEC).

	Renamed CAP_MKDIR to CAP_MKDIRAT.
	Renamed CAP_MKFIFO to CAP_MKFIFOAT.
	Renamed CAP_MKNOD to CAP_MKNODAT.

	CAP_READ old behaviour:
	- Allow pread(2).
	- Disallow read(2), readv(2) (if there is no CAP_SEEK).
	CAP_READ new behaviour:
	- Allow read(2), readv(2).
	- Disallow pread(2) (CAP_SEEK was also required).

	CAP_WRITE old behaviour:
	- Allow pwrite(2).
	- Disallow write(2), writev(2) (if there is no CAP_SEEK).
	CAP_WRITE new behaviour:
	- Allow write(2), writev(2).
	- Disallow pwrite(2) (CAP_SEEK was also required).

	Added convenient defines:

	#define	CAP_PREAD		(CAP_SEEK | CAP_READ)
	#define	CAP_PWRITE		(CAP_SEEK | CAP_WRITE)
	#define	CAP_MMAP_R		(CAP_MMAP | CAP_SEEK | CAP_READ)
	#define	CAP_MMAP_W		(CAP_MMAP | CAP_SEEK | CAP_WRITE)
	#define	CAP_MMAP_X		(CAP_MMAP | CAP_SEEK | 0x0000000000000008ULL)
	#define	CAP_MMAP_RW		(CAP_MMAP_R | CAP_MMAP_W)
	#define	CAP_MMAP_RX		(CAP_MMAP_R | CAP_MMAP_X)
	#define	CAP_MMAP_WX		(CAP_MMAP_W | CAP_MMAP_X)
	#define	CAP_MMAP_RWX		(CAP_MMAP_R | CAP_MMAP_W | CAP_MMAP_X)
	#define	CAP_RECV		CAP_READ
	#define	CAP_SEND		CAP_WRITE

	#define	CAP_SOCK_CLIENT \
		(CAP_CONNECT | CAP_GETPEERNAME | CAP_GETSOCKNAME | CAP_GETSOCKOPT | \
		 CAP_PEELOFF | CAP_RECV | CAP_SEND | CAP_SETSOCKOPT | CAP_SHUTDOWN)
	#define	CAP_SOCK_SERVER \
		(CAP_ACCEPT | CAP_BIND | CAP_GETPEERNAME | CAP_GETSOCKNAME | \
		 CAP_GETSOCKOPT | CAP_LISTEN | CAP_PEELOFF | CAP_RECV | CAP_SEND | \
		 CAP_SETSOCKOPT | CAP_SHUTDOWN)

	Added defines for backward API compatibility:

	#define	CAP_MAPEXEC		CAP_MMAP_X
	#define	CAP_DELETE		CAP_UNLINKAT
	#define	CAP_MKDIR		CAP_MKDIRAT
	#define	CAP_RMDIR		CAP_UNLINKAT
	#define	CAP_MKFIFO		CAP_MKFIFOAT
	#define	CAP_MKNOD		CAP_MKNODAT
	#define	CAP_SOCK_ALL		(CAP_SOCK_CLIENT | CAP_SOCK_SERVER)
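
The way the composite rights above are built from single bits can be sketched in a few lines of C. This is an illustration only: the EX_ names, the ex_rights_t type, the ex_rights_contains() helper and every bit value below are hypothetical placeholders, not the values or API from the real header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical rights type; a plain 64-bit mask is assumed here. */
typedef uint64_t ex_rights_t;

/* Placeholder bit assignments, chosen only for this illustration. */
#define	EX_CAP_READ	0x0000000000000001ULL
#define	EX_CAP_WRITE	0x0000000000000002ULL
#define	EX_CAP_SEEK	0x0000000000000004ULL
#define	EX_CAP_MMAP	0x0000000000000010ULL

/* Composite rights, following the shape of the defines in the message. */
#define	EX_CAP_PREAD	(EX_CAP_SEEK | EX_CAP_READ)
#define	EX_CAP_PWRITE	(EX_CAP_SEEK | EX_CAP_WRITE)
#define	EX_CAP_MMAP_R	(EX_CAP_MMAP | EX_CAP_SEEK | EX_CAP_READ)
#define	EX_CAP_MMAP_W	(EX_CAP_MMAP | EX_CAP_SEEK | EX_CAP_WRITE)
#define	EX_CAP_MMAP_RW	(EX_CAP_MMAP_R | EX_CAP_MMAP_W)

/* A descriptor may perform an operation only if it holds every needed bit. */
static bool
ex_rights_contains(ex_rights_t have, ex_rights_t need)
{
	return ((have & need) == need);
}
```

Under these assumptions a descriptor holding only EX_CAP_READ passes the check for read(2)-style access but not for EX_CAP_PREAD, which matches the new CAP_READ behaviour described above.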

Sponsored by:	The FreeBSD Foundation
Reviewed by:	Christoph Mallon <christoph.mallon@gmx.de>
Many aspects discussed with:	rwatson, benl, jonathan
ABI compatibility discussed with:	kib
2013-03-02 00:53:12 +00:00
Xin LI
5c737b11df MFV r247575:
Import a fix to tighten the assertion on SPA versions from vendor (Illumos).

Illumos ZFS issue:

  3543 Feature flags causes assertion in spa.c to miss certain cases

MFC after:	2 weeks
2013-03-01 22:20:13 +00:00
Martin Matuska
4abd59512a MFV r247316:
Merge new read-only zfs properties from vendor (illumos)

Illumos ZFS issues:
  3588 provide zfs properties for logical (uncompressed) space used and
       referenced

References:
  https://www.illumos.org/issues/3588

MFC after:	2 weeks
2013-03-01 21:58:51 +00:00
Martin Matuska
bb508e7732 Fix the zfs_ioctl compat layer to support zfs_cmd size change introduced
in r247265 (ZFS deadman thread). Both new utilities now support the old
kernel, and the new kernel properly detects old utilities.

For future backwards compatibility, the vfs.zfs.version.ioctl read-only
sysctl has been introduced. With this sysctl zfs utilities will be able
to detect the ioctl interface version of the currently loaded zfs module.

As a side effect, the zfs utilities between r247265 and this revision don't
support the old kernel module. If you are running HEAD at or after
r247265, install the new kernel module (or whole kernel) first.

MFC after:	10 days
2013-03-01 09:42:58 +00:00
Martin Matuska
24245e76ea MFV 247176, 247178, 247315:
Import metaslab_sync() speedup from vendor (illumos).

Illumos ZFS issues:
  3552 condensing one space map burns 3 seconds of CPU in spa_sync() thread
  3564 spa_sync() spends 5-10% of its time in metaslab_sync() (when not
       condensing)
  3578 transferring the freed map to the defer map should be constant time
  3579 ztest trips assertion in metaslab_weight()

References:
  https://www.illumos.org/issues/3552
  https://www.illumos.org/issues/3564
  https://www.illumos.org/issues/3578
  https://www.illumos.org/issues/3579

MFC after:	2 weeks
2013-02-27 14:45:23 +00:00
Martin Matuska
e4428d63a8 Be more verbose on ZFS deadman I/O panic
Patch suggested upstream.

Suggested by:	Olivier Cinquin
MFC after:	12 days
2013-02-26 20:41:27 +00:00
Martin Matuska
e70664bafc MFV v242732:
Merge the ZFS I/O deadman thread from vendor (illumos).
This feature panics the system on hanging ZFS I/O, helps debugging
and resumes failed service.

The panic behavior can be controlled with the loader-only tunables:
vfs.zfs.deadman_enabled (enable or disable panic on stalled ZFS I/O)
vfs.zfs.deadman_synctime (expiration time for stalled ZFS I/O)

ZFS I/O deadman is enabled by default on amd64 and i386,
excluding virtual guest machines.

Illumos ZFS issues:
  3246 ZFS I/O deadman thread

References:
  https://www.illumos.org/issues/3246

MFC after:	2 weeks
2013-02-25 12:33:31 +00:00
Martin Matuska
781c0f87d3 MFV r246653:
Import vendor change to avoid "uninitialized variable" warnings.

Illumos ZFS issues:
  3522 zfs module should not allow uninitialized variables

References:
  https://www.illumos.org/issues/3522
2013-02-23 11:21:05 +00:00
Justin T. Gibbs
6a8f90edf5 Avoid panic when tearing down the DTrace pid provider for a
process that has crashed.

sys/cddl/contrib/opensolaris/uts/common/dtrace/fasttrap.c:
	In fasttrap_pid_disable(), we cannot PHOLD the proc
	structure for a process that no longer exists, but
	we still have other, fasttrap specific, state that
	must be cleaned up for probes that existed in the
	dead process.  Instead of returning early if the
	process related to our probes isn't found,
	conditionalize the locking and carry on with a NULL
	proc pointer.  The rest of the fasttrap code already
	understands that a NULL proc is possible and does
	the right things in this case.

Sponsored by:	Spectra Logic Corporation
Reviewed by:	rpaulo, gnn
MFC after:	1 week
2013-02-20 17:55:17 +00:00
Xin LI
e469d5a70f Eliminate real_LZ4_uncompress. It's unused and does not perform sufficient
check against input stream (i.e. it could read beyond specified input
buffer).
2013-02-14 21:02:18 +00:00
Martin Matuska
7c695febc9 Change vfs.zfs.write_to_degraded from CTLFLAG_RW to CTLFLAG_RWTUN
Suggested by:	pjd
2013-02-13 23:11:25 +00:00
Xin LI
314caba11c Restore De Bruijn algorithm for sparc64 where the compiler relies on a
library function for __builtin_c?z.
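
For readers unfamiliar with the technique, a De Bruijn count-trailing-zeros can be sketched as follows. This is the generic 32-bit form with the widely published multiplier and table, not the code from the LZ4 sources; the ex_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Lookup table matching the De Bruijn multiplier 0x077CB531. */
static const int ex_debruijn_pos[32] = {
	0, 1, 28, 2, 29, 14, 24, 3, 30, 22, 20, 15, 25, 17, 4, 8,
	31, 27, 13, 23, 21, 19, 16, 7, 26, 12, 18, 6, 11, 5, 10, 9
};

/*
 * Count trailing zero bits without a compiler builtin.
 * The input must be non-zero, as with __builtin_ctz().
 */
static int
ex_ctz32_debruijn(uint32_t v)
{
	uint32_t lowest;

	assert(v != 0);
	lowest = v & (~v + 1U);		/* isolate the lowest set bit */
	return (ex_debruijn_pos[(uint32_t)(lowest * 0x077CB531U) >> 27]);
}
```

The multiply and table lookup need no library call, which is the property that matters on a target where the builtin is lowered to one.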

Tested by:	Michael Moll <kvedulv kvedulv de>
2013-02-13 17:30:54 +00:00
Martin Matuska
6a33bbc041 Merge zfs_ioctl.c code that should have been merged together with ZFS v28.
Fixes several problems if working with read-only pools.

Changed code originally introduced in onnv-gate 13061:bda0decf867b
Contains changes up to illumos-gate 13700:4bc0783f6064

PR:		kern/175897
Suggested by:	avg

MFC after:	2 weeks
2013-02-11 21:10:55 +00:00
Martin Matuska
9689178c3f MFV r246633:
Import vendor bugfixes regarding SA rounding, header size and layout.
This was already partially fixed by avg.

Illumos ZFS issues:
  3512 rounding discrepancy in sa_find_sizes()
  3513 mismatch between SA header size and layout

References:
  https://www.illumos.org/issues/3512
  https://www.illumos.org/issues/3513

MFC after:	2 weeks
2013-02-11 14:29:38 +00:00
Martin Matuska
bb03847418 MFV r246394:
Add tunable to allow block allocation on degraded vdevs.

Illumos ZFS issues:
  3507 Tunable to allow block allocation even on degraded vdevs

References:
  https://www.illumos.org/issues/3507

MFC after:	2 weeks
2013-02-11 13:59:57 +00:00
Martin Matuska
ff20578569 MFV r246392:
Import vendor ZFS bugfix fixing a possible deadlock in arc_read().

Illumos ZFS issues:
  3498 panic in arc_read(): !refcount_is_zero(&pbuf->b_hdr->b_refcnt)

References:
  https://www.illumos.org/issues/3498

MFC after:	2 weeks
2013-02-11 12:42:11 +00:00
Martin Matuska
8a2dc7faae MFV r246390:
Import minor type change in refcount.h header from vendor (illumos).

MFC after:	2 weeks
2013-02-11 07:48:57 +00:00
Martin Matuska
fd9778c236 MFV r246388:
Import vendor bugfixes

Illumos ZFS issues:
  3422 zpool create/syseventd race yield non-importable pool
  3425 first write to a new zvol can fail with EFBIG

References:
  https://www.illumos.org/issues/3422
  https://www.illumos.org/issues/3425

MFC after:	2 weeks
2013-02-10 19:32:55 +00:00
Xin LI
ef17620fc8 MFV r245512:
* Illumos zfs issue #3035 [1] LZ4 compression support in ZFS.

LZ4 is a new high-speed BSD-licensed compression algorithm created
by Yann Collet that delivers very high compression and decompression
performance compared to lzjb (>50% faster on compression, >80% faster
on decompression and around 3x faster on compression of incompressible
data), while giving better compression ratio [1].

This version of LZ4 corresponds to upstream's [2] revision 85.

Please note that for obvious reasons this is not backward read
compatible.  This means once a pool has LZ4 compressed data, that
data can no longer be read by older ZFS implementations.

Local changes:

 - On-stack hash table disabled in favor of the kernel slab allocator,
   at this time.  The on-stack table requires a larger kernel thread
   stack for zio workers.  This may change in the future should we
   adjust the zio workers' thread stack size.
 - likely and unlikely will be undefined if they are already defined,
   this is required for i386 XEN build.
 - Removed the De Bruijn sequence based implementation in favor of the
   __builtin_ctz family of builtins.  Both GCC and clang support these
   builtins.
 - Changed the way the LZ4 code detects endianness.
 - Manual pages modifications to mention the feature based on Illumos
   counterpart.
 - Boot loader changes to make it support LZ4 decompression.

[1] https://www.illumos.org/issues/3035
[2] http://code.google.com/p/lz4/source/list

Obtained from:	Illumos (13921:9d721847e469)
Tested on:	FreeBSD/amd64
MFC after:	1 month
2013-02-09 06:39:28 +00:00
Andriy Gapon
0dcab786b8 zfs_vget, zfs_fhtovp: properly handle the z_shares_dir object
A special gfs vnode corresponds to that object.
A regular zfs vnode must not be returned.

This should be upstreamed.

Reported by:	pluknet
Submitted by:	rmacklem
Tested by:	pluknet
MFC after:	10 days
2013-02-08 07:49:54 +00:00
Andriy Gapon
e2bb19dce5 zfs: update comments about zfid_long_t to match the FreeBSD definitions
MFC after:	1 week
2013-02-08 07:44:15 +00:00
Andriy Gapon
c7d346f269 zfs: fix, improve and re-organize page_lookup and page_unlock
Now they are split into two pairs: page_hold/page_unhold for mappedread
and page_busy/page_unbusy for update_pages.

For mappedread we simply hold a page that is to be used as a source if it
is resident and valid (and not busy).  This is sufficient since we are
only doing page -> user buffer copying.  There is no page <-> backing
storage I/O involved.

update_pages is now better split to properly handle the putpages case
(page -> arc) and the regular write case (arc -> page).

For the latter we use complete protocol of marking an object with
paging-in-progress and marking a page with io_start (busy count).
Also, in this case we remove the write bit from all page mappings and
clear dirty bits of the pages, the former is needed to ensure that the
latter does the right thing.
Additionally we update a page if it is cached instead of just freeing it
as was done before.  This needs to be verified.

A minor detail: ZFS-backed pages should always be either fully valid
or fully invalid.  Assert this and use simpler API that does not deal
with sub-page blocks.

Reviewed by:	kib
MFC after:	26 days
2013-02-03 18:42:20 +00:00
Andriy Gapon
13235aaa89 zfs: add MODULE_VERSION for zfsctrl
This should allow the kernel linker to easily detect a situation
when the module is present both in a kernel and in a preloaded file
(zfs.ko).

Reviewed by:	jhb
MFC after:	5 days
2013-02-02 11:35:18 +00:00
Andriy Gapon
ea84c62f93 spa_generate_rootconf: add support for old vdev labels
It seems that old ZFS versions (v15) completely omit "vdev_children"
property when there is a single child.

Reported by:	jase
Tested by:	jase
MFC after:	1 week
2013-01-26 10:34:17 +00:00
Xin LI
5c74885e99 MFV r245510:
improve the comment in txg.c

Obtained from:	Illumos (13910:f3454e0a097c)
MFC after:	2 weeks
2013-01-16 22:59:50 +00:00
Konstantin Belousov
614b9f9130 For zfs vnodes, use the standard inode number based hash algorithm.
Reviewed and tested by:	peter
Sponsored by:	The FreeBSD Foundation
MFC after:	5 days
2013-01-14 05:45:33 +00:00
Xin LI
290a1ba9a4 The current ZFS code expects ddt_zap_count to always succeed by asserting
that the underlying zap_count() returns no error.  However, it is possible
for the pool to reach a state where zap_count returns an error,
leading to panics when the pool is imported.

This commit changes ddt_zap_count to return the error returned by
zap_count and handles the error appropriately.  With this change, it's now
possible to let zpool roll back damaged transaction groups and import the
pool.
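
The shape of this change, returning an error instead of asserting success, can be sketched as below. All ex_ functions and the damaged flag are hypothetical stand-ins; the real ddt and zap interfaces take different arguments.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for zap_count(): fails on a damaged object. */
static int
ex_zap_count(int damaged, uint64_t *count)
{
	if (damaged)
		return (EIO);
	*count = 42;		/* arbitrary entry count for the sketch */
	return (0);
}

/*
 * Old shape (not shown): assert that the count call returned 0, which
 * panics on a damaged pool.  New shape: hand the error to the caller.
 */
static int
ex_ddt_zap_count(int damaged, uint64_t *count)
{
	return (ex_zap_count(damaged, count));
}

/* A caller that gives up on the load instead of panicking. */
static int
ex_ddt_load_one(int damaged, uint64_t *count)
{
	int error;

	error = ex_ddt_zap_count(damaged, count);
	if (error != 0)
		return (error);	/* import can fail cleanly or roll back */
	return (0);
}
```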

Obtained from:	ZFS on Linux github (e8fd45a0f9)
MFC after:	1 month
2013-01-10 19:26:56 +00:00
Andriy Gapon
f71fbb1d12 zfs: solaris doesn't have KM_ZERO, kmem_zalloc should be used instead
To do:		remove KM_ZERO declaration
Pointyhat to:	avg (for mindlessly using the pseudo-flag)
MFC after:	instantly (to fix stable/8 build)
2012-12-23 19:58:41 +00:00
Steven Hartland
5780c4a723 Added vfs.zfs.vdev.trim_on_init sysctl which allows full vdev trim on
initialisation to be enabled (1) / disabled (0) defaults to enabled.

This is useful for devices which have a slow trim speed and are either
new or have otherwise already been wiped e.g. secure erase.

PR:     kern/173116
Submitted by:   Steven Hartland
Approved by:    pjd (mentor)
2012-12-13 17:39:07 +00:00
Steven Hartland
c440a359ca Upgrades trim free request sizes before inserting them into the free map,
making range consolidation much more effective particularly for small
deletes.

This reduces memory used by the free map as well as reducing the number
of bio requests down to geom required to process all deletes.

In tests this achieved a factor of 10 reduction of trim ranges / geom
call downs.

While I'm here correct the description of zio_vdev_io_start.

PR:		kern/173254
Submitted by:	Steven Hartland
Approved by:	pjd (mentor)
2012-12-13 17:06:38 +00:00
Steven Hartland
7150222c0a Renamed zfs trim stats removing duplicate zio_trim identifier from the name
Added description option to kstats.
Added descriptions for zio_trim kstats

PR:		kern/173113
Submitted by:	Steven Hartland
Reviewed by:	pjd
Approved by:	pjd
MFC after:	2 weeks
2012-12-12 16:14:14 +00:00
Xin LI
2740382ebd Use SA_ZPL_CRTIME instead of SA_ZPL_CTIME for creation time.
Submitted by:	phil.stone at gmx.com
MFC after:	2 weeks
2012-12-03 04:25:37 +00:00
Andriy Gapon
289b3b96ac zfs_getpages: make use of vm_page_readahead_finish
Suggested by:	kib
MFC after:	5 days
2012-12-01 18:13:53 +00:00
Andriy Gapon
992ffc58ae gfs_file_inactive: replace bad code with ugly code
Also, make it explicit that V_XATTRDIR is not properly supported in gfs
code yet.

The bad code was plain incorrect: (a) it spoiled handling of v_usecount
reaching zero and (b) it leaked v_holdcnt.

The ugly code employs potentially unsafe locking tricks.

Ideally we should separate vnode lifecycle and gfs node lifecycle.
A gfs node should have its own reference count where its child nodes
should be accounted.

PR:		kern/151111
Reviewed by:	kib
MFC after:	13 days
2012-12-01 18:12:55 +00:00
Martin Matuska
7faa32552f MFV r243395:
Introduce a new dataset aclmode setting "restricted" to protect ACLs from
being destroyed or corrupted by a drive-by chmod.

illumos-gate 13889:a67716f16746
3254 add support in zfs for aclmode=restricted

References:
https://www.illumos.org/issues/3254

MFC after:	2 weeks
2012-11-26 12:24:39 +00:00
Martin Matuska
53e5858c68 Add loader(8) tunable to enable/disable nopwrite functionality:
vfs.zfs.nopwrite_enabled

MFC after:	2 weeks
2012-11-25 16:54:43 +00:00
Martin Matuska
dd801aa546 MFV r243013 and r243267:
Import the zio nop-write improvement from Illumos. To reduce I/O,
nop-write omits overwriting data if the checksum (cryptographically
secure) of new data matches the checksum of existing data.
It also saves space if snapshots are in use.

It currently works only on datasets with enabled compression, disabled
deduplication and sha256 checksums.
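
The decision described above can be sketched as a small predicate. This is a simplified model, not the zio pipeline code: the ex_ structures and field names are hypothetical, and the checksums are taken as given rather than computed.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define	EX_CKSUM_LEN	32	/* a SHA-256 digest is 32 bytes */

/* Hypothetical per-dataset settings relevant to nop-write. */
struct ex_dataset {
	bool compression;	/* compression enabled */
	bool dedup;		/* deduplication enabled */
	bool sha256;		/* checksum property is sha256 */
};

/* Eligibility as stated in the message: compression on, dedup off, sha256. */
static bool
ex_nop_write_eligible(const struct ex_dataset *ds)
{
	return (ds->compression && !ds->dedup && ds->sha256);
}

/* Skip the write only when eligible and the two checksums are identical. */
static bool
ex_can_skip_write(const struct ex_dataset *ds,
    const uint8_t old_cksum[EX_CKSUM_LEN],
    const uint8_t new_cksum[EX_CKSUM_LEN])
{
	if (!ex_nop_write_eligible(ds))
		return (false);
	return (memcmp(old_cksum, new_cksum, EX_CKSUM_LEN) == 0);
}
```

Requiring a cryptographically secure checksum is what makes it reasonable to treat equal checksums as equal data.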

IllumOS 13887:196932ec9e6a and 13888:7204b3392a58
3236 zio nop-write

References:
https://www.illumos.org/issues/3236

MFC after:	2 weeks
2012-11-25 16:32:07 +00:00
Andriy Gapon
3a0e1b57bb zfs_freebsd_reclaim: remove a stray variable
... which leaked from a subsequent local change.
Unfortunately I noticed that only after commit.

MFC after:	5 weeks
X-MFC with:	r243520
2012-11-25 15:46:29 +00:00
Andriy Gapon
4ff1c77d22 zfs: overhaul zfs-vfs glue for vnode life-cycle management
* There is no need for the delayed destruction of znodes via taskqueue,
now that we do not need to fear recursion from getnewvnode into
zfs_inactive and zfs_freebsd_reclaim, thus making znode/vnode state
machine a bit simpler.

* More complete porting of zfs_inactive from Solaris VFS model to FreeBSD
vop_inactive and vop_reclaim model.  All destructive actions are done
in zfs_freebsd_reclaim.
This allows simplifying the zfs_zget logic.

* Allow zfs_zget to return a doomed vnode if the current thread already
has an exclusive lock on the vnode.

* Clean up Solaris-isms like bailing out of reclaim/inactive on certain
values of v_usecount (aka v_count) or directly messing with this counter.

* Do not clear z_vnode while znode is still accessible.
z_vnode should be cleared only after zfs_znode_dmu_fini.
Otherwise zfs_zget may get an effectively half-deconstructed znode.
This allows simplifying the zfs_zget logic further.

The above changes fix at least two known/reported problems:

o An indefinite wait in the following code path:
vgone -> VOP_RECLAIM -> zfs_freebsd_reclaim -> vnode_destroy_vobject ->
put_pages -> zfs_write -> zil_commit -> zfs_zget
This happened because vgone marks a vnode as VI_DOOMED before calling
VOP_RECLAIM, but zfs_zget would not return a doomed vnode under any
circumstances.
The fix in this change is not complete as it won't fix a deadlock between
two threads doing VOP_RECLAIM where one thread is in zil_commit trying to
zfs_zget a znode/vnode being reclaimed by the other thread, which would be
blocked trying to enter zil_commit.  This type of deadlock has not been
reported as of now.

o An indefinite wait in the unmount path caused by a znode "falling through
the cracks" in inactive+reclaim.  This would happen if the znode is unlinked
while its vnode is still active.

To Do: pass locking flags parameter to zfs_zget, so that the zfs-vfs
glue code doesn't have to re-lock a vnode but could ask for proper locking
from the very start.  This would also allow for the higher level code to
obtain a doomed vnode when it is expected/requested.  Or to avoid blocking
when it is not allowed (see zil_commit example above).

ffs_vgetf seems like a good source of inspiration.

Tested by:	Willem Jan Withagen <wjw@digiware.nl>
MFC after:	6 weeks
2012-11-25 15:33:26 +00:00
Andriy Gapon
7ca5310ea3 zfs_fhtovp: there is no reason to amend lock flags with LK_RETRY here
MFC after:	12 days
2012-11-25 15:07:27 +00:00
Andriy Gapon
7192f62bcc add zfs_bmap to aid vnode_pager_haspage
... otherwise zfs_getpages would mostly be called with one page at a time.

It is expected that ZFS VOP_BMAP is only called from vnode_pager_haspage.
Since ZFS files can have variable block sizes and also because we don't
really know if any given blocks are consecutive, we can not really report
any additional blocks behind or ahead of a given block.  Since physical
block numbers do not make sense for ZFS, we do not do any real translation
and thus pass back blk = lblk.  The net effect is that vnode_pager_haspage
knows that the block exists and that the pages backed by the block can be
accessed.  vnode_pager_haspage may be wrong about the exact count of the
pages backed by the block, because of a variable block size, which
vnode_pager_haspage doesn't really know - it only knows max block size in
a filesystem.  So pages from multiple blocks can be passed to zfs_getpages,
but that is expected and correctly handled.
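
The behaviour described in this paragraph amounts to an identity mapping with no reported neighbours. A minimal sketch, using a hypothetical ex_bmap() with plain arguments instead of the kernel's VOP argument structure:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical block-number type for the sketch. */
typedef int64_t ex_daddr_t;

/*
 * Report that logical block lblk exists, without any real translation
 * (blk = lblk) and without claiming contiguous blocks ahead (runp) or
 * behind (runb).  Output pointers may be NULL when the caller does not
 * care about that value.
 */
static int
ex_bmap(ex_daddr_t lblk, ex_daddr_t *blk, int *runp, int *runb)
{
	if (blk != NULL)
		*blk = lblk;
	if (runp != NULL)
		*runp = 0;
	if (runb != NULL)
		*runb = 0;
	return (0);
}
```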

vnode_pager should not call zfs_bmap for any other reason, because ZFS
implements VOP_PUTPAGES and thus vnode_pager_generic_getpages is not used.

vfs_cluster and vfs_bio code should not be called for ZFS, because ZFS does
not use the buffer cache layer.

Also, ZFS does not use vn_bmap_seekhole; it has its private mechanism for
working with holes.

The above list should cover all the current calls to VOP_BMAP.

Reviewed by:	kib
MFC after:	6 weeks
2012-11-25 15:01:12 +00:00
Andriy Gapon
b609e5f891 zfs_getpages: optimize for large block sizes
MFC after:	6 weeks
2012-11-25 14:53:26 +00:00
Martin Matuska
2f06dfc9a3 MFV r243012:
Illumos 13886:e3261d03efbf

3349 zpool upgrade -V bumps the on disk version number, but leaves
     the in core version

References:
https://www.illumos.org/issues/3349

MFC after:	1 week
2012-11-25 10:53:42 +00:00
Martin Matuska
2b8d4033cc MFV r242735:
Illumos 13879:4eac7a87eff2:
3329 spa_sync() spends 10-20% of its time in spa_free_sync_cb()
3330 space_seg_t should have its own kmem_cache
3331 deferred frees should happen after sync_pass 1
3335 make SYNC_PASS_* constants tunable

New loader-only tunables:
vfs.zfs.sync_pass_deferred_free
vfs.zfs.sync_pass_dont_compress
vfs.zfs.sync_pass_rewrite

References:
https://www.illumos.org/issues/3329
https://www.illumos.org/issues/3330
https://www.illumos.org/issues/3331
https://www.illumos.org/issues/3335

MFC after:	2 weeks
2012-11-25 09:06:32 +00:00
Andriy Gapon
328998eac1 zfs rootpool: add support for multi-vdev configurations
Tested by:	madpilot
MFC after:	10 days
2012-11-24 13:23:15 +00:00
Andriy Gapon
e1fccde2c9 spa_import_rootpool: initialize ub_version before calling spa_config_parse
... because the latter makes some decision based on the version.
This is especially important for raidz vdevs.
This is similar to what spa_load does.

This is not an issue for upstream because they do not seem to support
using raidz as a root pool.

Reported by:	Andrei Lavreniyuk <andy.lavr@gmail.com>
Tested by:	Andrei Lavreniyuk <andy.lavr@gmail.com>
MFC after:	6 days
2012-11-24 13:16:49 +00:00
Andriy Gapon
cfca00a2fb spa_import_rootpool: do not call spa_history_log_version
The call is a NOP, because pool version in spa_ubsync.ub_version is not
initialized and thus appears to be zero.
If the version is properly set then the call leads to a NULL pointer
dereference because the spa object is still under-constructed.

The same change was independently made in the upstream as a part of
a larger change (4445fffbbb1ea25fd0e9ea68b9380dd7a6709025).

MFC after:	6 days
2012-11-24 13:14:53 +00:00
Andriy Gapon
c4f59a3c09 zfs: create devices/geoms from zvols after receiving them
PR:		kern/167066
Tested by:	Andreas Nilsson <andrnils@gmail.com>
MFC after:	13 days
2012-11-24 13:07:31 +00:00
Andriy Gapon
dbe922173c zfs_remove: assert that delete_now case is never true on FreeBSD
That case is specific to Solaris VFS and it would violate pretty
fundamental contracts of FreeBSD VFS.

Discussed with:	pjd
MFC after:	12 days
2012-11-19 11:30:08 +00:00
Andriy Gapon
7b069f7fee zfs_remove: set VV_NOSYNC flag if a node is unlinked
Suggested by:	kib
MFC after:	12 days
2012-11-19 11:25:20 +00:00
Andriy Gapon
62875a6b10 spa_import_rootpool: fall back to use configuration from zpool.cache...
if we fail to generate a proper root pool configuration based on disk
probing.  Currently we can not properly generate the configuration for
multi-vdev pools.  Make that explicit.

Reported by:	madpilot, Bartosz Stec <bartosz.stec@it4pro.pl>
Tested by:	madpilot, Bartosz Stec <bartosz.stec@it4pro.pl>
MFC after:	4 days
2012-11-18 11:47:25 +00:00
Konstantin Belousov
f13b5a0f01 Add the wait6(2) system call. It takes POSIX waitid()-like process
designator to select a process which is waited for. The system call
optionally returns siginfo_t which would be otherwise provided to
SIGCHLD handler, as well as extended structure accounting for child
and cumulative grandchild resource usage.

Allow getting the current rusage information for non-exited processes
as well, similar to Solaris.

The explicit WEXITED flag is required to wait for exited processes,
allowing for more fine-grained control of the events the waiter is
interested in.

Fix the handling of siginfo for WNOWAIT option for all wait*(2)
family, by not removing the queued signal state.

PR:	standards/170346
Submitted by:	"Jukka A. Ukkonen" <jau@iki.fi>
MFC after:	1 month
2012-11-13 12:52:31 +00:00
Andriy Gapon
7631b580ff zfs_ioc_destroy_snaps_nvl: remove disk device entries for zvol snapshots
... before trying to destroy the zvol snapshots themselves.

PR:		kern/173442
Reported by:	Petri Helenius <petri@helenius.fi>,
		mm
Obtained from:	Brian Behlendorf <behlendorf1@llnl.gov>,
		Illumos Bug #3170
Tested by:	Petri Helenius <petri@helenius.fi>
MFC after:	10 days
2012-11-10 12:22:26 +00:00
Xin LI
a9b09a3f3c MFV r242729 (mm):
Illumos r13840:97fd5cdf328a:

3145 single-copy arc
3212 ztest: race condition between vdev_online() and spa_vdev_remove()

Illumos r13849:3468a95b27cd:

3258 ztest's use of file descriptors is unstable
2012-11-10 01:52:52 +00:00
Attilio Rao
bc2258da88 Complete MPSAFE VFS interface and remove MNTK_MPSAFE flag.
Porters should refer to __FreeBSD_version 1000021 for this change as
it may have happened at the same timeframe.
2012-11-09 18:02:25 +00:00
Justin Hibbits
c757049235 Implement DTrace for PowerPC. This includes both 32-bit and 64-bit.
There is one known issue:  Some probes will display an error message along the
lines of:  "Invalid address (0)"

I tested this with both a simple dtrace probe and dtruss on a few different
binaries on 32-bit.  I only compiled 64-bit, did not run it, but I don't expect
problems without the modules loaded.  Volunteers are welcome.

MFC after:	1 month
2012-11-07 23:45:09 +00:00
Andriy Gapon
2c6024ec1b zfs_dirlook: bail out early if directory is unlinked
Otherwise we could fail with an incorrect error if e.g. parent
object id is removed too or we can even return a wrong vnode if
parent object has been already re-used.

Discussed with:	pjd
Also see:	http://article.gmane.org/gmane.os.freebsd.devel.file-systems/13863
MFC after:	26 days
2012-11-04 14:50:08 +00:00
Andriy Gapon
5c997cc429 zfsctl_snapdir_lookup: obtain a snapname in the remount case
... which is triggered if somebody did regular umount on a snapshot mount.

Reviewed by:	Matthew Ahrens <mahrens@delphix.com>
MFC after:	20 days
2012-11-04 14:43:15 +00:00
Andriy Gapon
88c8884a71 zfs: set MNTK_EXTENDED_SHARED flag
Discussed with:	kib
MFC after:	20 days
2012-11-04 14:36:11 +00:00
Andriy Gapon
71900cfaf7 zfs_vnode_forget: dispose of larvae vnode using public vfs api (mostly)
Reviewed by:	kib
MFC after:	19 days
2012-11-04 14:24:00 +00:00