Commit Graph

1628 Commits

Author SHA1 Message Date
Mariusz Zaborski
347a39b4a6 Add support for the arrays in nvlist library.
- Add
  nvlist_{add,get,take,move,exists,free}_{number,bool,string,nvlist,
  descriptor} functions.
- Add support for (un)packing arrays.
- Add the nvl_array_next field to the nvlist structure.
  If an array is added by the nvlist_{move,add}_nvlist_array function
  this field will contain the next element in the array.
- Add the nitems field to the nvpair and nvpair_header structure.
  This field contains number of elements in the array.
- Add special flag (NV_FLAG_IN_ARRAY) which is set if nvlist is a part of
  an array.
- Add special type (NV_TYPE_NVLIST_ARRAY_NEXT). This type is used only
  on packing/unpacking.
- Add new API for traversing arrays (nvlist_get_array_next).
- Add the nvlist_get_pararr function which combines the
  nvlist_get_array_next and nvlist_get_parent functions. If the nvlist is
  in an array it will return the next element from the array. If the nvlist
  is the last element in the array or it isn't in an array it will return
  its container (parent). This function should simplify traversing an nvlist.
- Add tests for new features.
- Add documentation for new functions.
- Add my copyright.
- Regenerate the sys/cddl/compat/opensolaris/sys/nvpair.h file.

PR:		191083
Reviewed by:	allanjude (doc)
Approved by:	pjd (mentor)
2015-08-15 06:34:49 +00:00
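The traversal rule that 347a39b4a6 describes for nvlist_get_pararr can be modelled in a few lines. This is a toy Python model, not the libnv API: the `Node` class and the `make_array` and `get_pararr` helpers are hypothetical names that only mirror the rule stated in the commit message (return the next array element if there is one, otherwise the parent).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Toy stand-in for an nvlist; every field here is illustrative only."""
    name: str
    parent: Optional["Node"] = None
    array_next: Optional["Node"] = None  # models the nvl_array_next field


def make_array(parent: Node, names: list[str]) -> list[Node]:
    """Build child nodes of `parent` and chain them through array_next."""
    items = [Node(n, parent=parent) for n in names]
    for cur, nxt in zip(items, items[1:]):
        cur.array_next = nxt
    return items


def get_pararr(node: Node) -> Optional[Node]:
    """Return the next array element if any, otherwise the parent."""
    if node.array_next is not None:
        return node.array_next
    return node.parent


root = Node("root")
a, b, c = make_array(root, ["a", "b", "c"])
```

Calling `get_pararr` repeatedly from `a` visits `b`, then `c`, then climbs to `root`, which is the single-call walk the message says the function is meant to simplify.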
Alexander Motin
d8928f479b Remove some random accumulated diff from Illumos.
Submitted by:	avg (partially)
2015-08-14 13:43:12 +00:00
Alexander Motin
6cb8dbf791 2618 arc.c mistypes in the comments
Reviewed by: Jason King <jason.brian.king@gmail.com>
Reviewed by: Josef Sipek <jeffpc@josefsipek.net>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Bart Coddens <bart.coddens@gmail.com>

illumos/illumos-gate@fc98fea58e
2015-08-14 13:10:30 +00:00
Alexander Motin
997d864ce0 Fix r286766 build with debug. 2015-08-14 11:47:53 +00:00
Alexander Motin
ab4d08c3d3 Fix a minor mismerge from some time earlier. 2015-08-14 09:48:23 +00:00
Alexander Motin
5ba12a280a MFV r286765: 5817 change type of arcs_size from uint64_t to refcount_t
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Prakash Surya <prakash.surya@delphix.com>

illumos/illumos-gate@2fd872a734

As a way to make it more difficult to introduce bugs into the ARC, and to
make it easier to diagnose issues when bugs do creep in, it would be
beneficial to change the type of the arc_state_t's arcs_size field to be
a refcount_t instead of a uint64_t. This would allow us to make stricter
checks when incrementing and decrementing the value with debugging enabled,
but still fallback to simple, fast atomic operations when debugging is
disabled.
2015-08-14 09:39:23 +00:00
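The trade-off behind 5817 (strict checks with debugging enabled, a plain counter otherwise) can be illustrated with a small sketch. `SizeCounter` is a hypothetical class written for this note and is not the illumos refcount_t interface. It also does not model atomicity. It only shows why tracking holders in debug builds catches mismatched increments and decrements that a bare integer would silently accept.

```python
class SizeCounter:
    """Toy counter: plain integer normally, per-holder tracking in debug mode."""

    def __init__(self, debug: bool = False) -> None:
        self.debug = debug
        self.total = 0
        self.holders: dict[str, int] = {}  # only populated when debug is True

    def add(self, amount: int, holder: str) -> int:
        if self.debug:
            self.holders[holder] = self.holders.get(holder, 0) + amount
        self.total += amount
        return self.total

    def remove(self, amount: int, holder: str) -> int:
        if self.debug:
            held = self.holders.get(holder, 0)
            # Stricter check: a holder may only remove what it added itself.
            if held < amount:
                raise AssertionError(
                    f"{holder} removes {amount} but holds only {held}")
            self.holders[holder] = held - amount
        self.total -= amount
        return self.total
```

With `debug=False` a removal by the wrong holder goes unnoticed because the total still balances. With `debug=True` the same call raises immediately.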
Alexander Motin
ab4930d98c MFV r285025: 6033 arc_adjust() should search MFU lists for oldest buffer
when adjusting MFU size.

illumos/illumos-gate@31c46cf23c

https://www.illumos.org/issues/6033
  When we're looking for the list containing the oldest buffer we never
  actually look at the MFU lists even when we try to evict from MFU.
  This looks like a copy-paste error; the fix is here:

Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Xin Li <delphij@delphij.net>
Reviewed by: Prakash Surya <me@prakashsurya.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Alek Pinchuk <alek@nexenta.com>
Obtained from:  illumos
2015-08-14 09:33:46 +00:00
Alexander Motin
e0360e14d2 MFV r277431: 5497 lock contention on arcs_mtx
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Prakash Surya <prakash.surya@delphix.com>

illumos/illumos-gate@244781f10d

This patch attempts to reduce lock contention on the current arc_state_t
mutexes. These mutexes are used liberally to protect the number of LRU
lists within the ARC (e.g. ARC_mru, ARC_mfu, etc). The granularity at
which these locks are acquired has been shown to greatly affect the
performance of highly concurrent, cached workloads.
2015-08-14 09:31:07 +00:00
Alexander Motin
267b62ec43 Revert part of r205231, introducing multiple ARC state locks.
This local implementation will be replaced by one from Illumos to reduce
code divergence and make further merges easier.
2015-08-14 09:25:54 +00:00
Alexander Motin
49114ce463 MFV 286711: 6096 ZFS_SMB_ACL_RENAME needs to cleanup better
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Gordon Ross <gordon.w.ross@gmail.com>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>

illumos/illumos-gate@8f5190a540
2015-08-13 00:13:55 +00:00
Alexander Motin
3b1f51e911 MFV 286709:
6093 zfsctl_shares_lookup should only VN_RELE() on zfs_zget() success

Reviewed by: Gordon Ross <gwr@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Dan McDonald <danmcd@omniti.com>

illumos/illumos-gate@0f92170f1e
2015-08-13 00:10:36 +00:00
Alexander Motin
0d0def87fe MFV 286707: 5959 clean up per-dataset feature count code
Reviewed by: Toomas Soome <tsoome@me.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@ca0cc3918a

One ZFS feature flag (large blocks) tracks its refcount as the number of
datasets that have ever used the feature. Several features of this type
are planned to be added (new checksum functions). This code should be made
common infrastructure rather than duplicating the code for each feature.
2015-08-12 23:59:17 +00:00
Alexander Motin
b696497df0 MFV r286704: 5960 zfs recv should prefetch indirect blocks
5925 zfs receive -o origin=

Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Author: Paul Dagnelie <pcd@delphix.com>

While running 'zfs recv' we noticed that every 128th 8K block required a
read. We were seeing that restore_write() was calling dmu_tx_hold_write()
and the indirect block was not cached. We should prefetch upcoming indirect
blocks to avoid having to go to disk and blocking the restore_write().

Allow an incremental send stream to be received as a clone, even if the
stream does not mark it as a clone.
2015-08-12 22:41:06 +00:00
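The "every 128th 8K block" observation in b696497df0 is consistent with simple block-pointer arithmetic. The sketch below assumes a 128-byte block pointer and a 16 KiB level-1 indirect block. Both values are assumptions made for this illustration and are not stated in the commit message.

```python
BLKPTR_SIZE = 128                 # bytes per block pointer (assumed)
INDIRECT_BLOCK_SIZE = 16 * 1024   # bytes per level-1 indirect block (assumed)
DATA_BLOCK_SIZE = 8 * 1024        # the 8K data blocks from the commit message


def pointers_per_indirect(indirect_size: int = INDIRECT_BLOCK_SIZE,
                          blkptr_size: int = BLKPTR_SIZE) -> int:
    """Number of data blocks that one indirect block can reference."""
    return indirect_size // blkptr_size


def needs_new_indirect(block_id: int, per_indirect: int) -> bool:
    """True when block_id is the first block under a new indirect block."""
    return block_id % per_indirect == 0


def bytes_per_indirect(per_indirect: int,
                       data_block_size: int = DATA_BLOCK_SIZE) -> int:
    """Amount of file data covered by one indirect block."""
    return per_indirect * data_block_size
```

Under these assumptions each indirect block covers 128 data blocks, so a sequential receive crosses into an uncached indirect block once every 128 writes unless the next one has been prefetched.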
Alexander Motin
d0687a01d7 MFV r284763: 5981 Deadlock in dmu_objset_find_dp
illumos/illumos-gate@1d3f896f54

https://www.illumos.org/issues/5981
  When dmu_objset_find_dp gets called with a read lock held, it fans out
  the work to the task queue. Each task in turn acquires its own read
  lock before calling the callback. If during this process anyone tries
  to acquire a write lock, it will stall all read lock requests. Thus
  the tasks will never finish, the read lock of the caller will never
  get freed and the write lock will never be acquired: a deadlock.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Arne Jansen <jansen@webgods.de>
2015-08-12 19:10:29 +00:00
Alexander Motin
101a6d4eac MFV r284762: 5269 zpool import slow
illumos/illumos-gate@12380e1e70

https://www.illumos.org/issues/5269
  When importing a pool (at boot or with zpool import) with many
  filesystems, the process can take minutes. It doesn't matter whether
  the pool has been exported cleanly or uncleanly.  The problem is that
  each dataset has its own log chain. On import, all datasets have to be
  checked if there are logs to replay.  The idea is to speed up this
  process by parallelizing it.

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Arne Jansen <jansen@webgods.de>
2015-08-12 18:47:30 +00:00
Alexander Motin
ebf527de10 MFV r286682: 5765 add support for estimating send stream size with
lzc_send_space when source is a bookmark

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Albert Lee <trisk@nexenta.com>
Author: Max Grossman <max.grossman@delphix.com>

illumos/illumos-gate@643da460c8
2015-08-12 18:23:08 +00:00
Alexander Motin
2d41b1006f MFV r286224: 5695 dmu_sync'ed holes do not retain birth time
illumos/illumos-gate@70163ac57e

https://www.illumos.org/issues/5695
  In dmu_sync_ready(), a hole block pointer will have its logical size
  explicitly set as it's necessary for replay purposes. To "undo" this,
  dmu_sync_done() will zero out any hole that it finds. This becomes a
  problem when using the "hole_birth" feature, as this will also wipe out
  any birth time that might have happened to be set on the hole.
  ...
  As a fix, the logic to zero out a hole is only applied to old style
  holes with a birth time of zero. Holes created with the "hole_birth"
  feature enabled will have a non-zero birth time, and will be skipped
  (thus preserving the ltime, type, and level information as well).
  In addition, zdb was updated to also print the ltime, type, and level
  information for these new style holes. Previously, only the logical
  birth time would be printed.

Author: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
2015-08-12 17:21:41 +00:00
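The rule introduced by 2d41b1006f (zero only old-style holes, skip holes that carry a birth time) reduces to one conditional. The sketch below is a toy model: `ToyBlockPointer` and `sync_done` are hypothetical names, and the real block pointer has many more fields than the few shown here.

```python
from dataclasses import dataclass


@dataclass
class ToyBlockPointer:
    """Illustrative subset of block pointer state; not the real layout."""
    is_hole: bool
    birth: int    # 0 for old-style holes, non-zero with hole_birth enabled
    lsize: int    # logical size set temporarily for replay purposes
    level: int


def sync_done(bp: ToyBlockPointer) -> ToyBlockPointer:
    """Zero an old-style hole (birth == 0); leave every other pointer alone."""
    if bp.is_hole and bp.birth == 0:
        bp.lsize = 0
        bp.level = 0
    return bp
```

A hole with a non-zero birth time passes through unchanged, which is how the fix preserves the extra information for new-style holes.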
Alexander Motin
ef00c83db1 Fix set of sign extension bugs in r286625. 2015-08-12 08:36:58 +00:00
Alexander Motin
b3fc966389 Fix assertion panic caused by combination of r286598 and TRIM. 2015-08-11 19:15:55 +00:00
Alexander Motin
3caed89878 Fix r286625 build on i386. 2015-08-11 12:38:01 +00:00
Alexander Motin
a3b3a9752c Fix minor mismerge in r286574. 2015-08-11 12:22:16 +00:00
Alexander Motin
c350858a50 MFV r277425:
5376 arc_kmem_reap_now() should not result in clearing arc_no_grow
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@2ec99e3e98
2015-08-11 10:39:19 +00:00
Alexander Motin
6be7d38913 Remove extra lock, that IMO only creates potential problems now. 2015-08-11 09:18:51 +00:00
Alexander Motin
1af86496cb MFV 286604: 5812 assertion failed in zrl_tryenter(): zr_owner==NULL
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@8df173054c
2015-08-10 21:36:51 +00:00
Alexander Motin
799f47828d MFV 286602: 5810 zdb should print details of bpobj
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@732885fca0
2015-08-10 21:32:40 +00:00
Alexander Motin
c70c15ffa9 MFV 286599: 5808 spa_check_logs is not necessary on readonly pools
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@23367a2f2c
2015-08-10 21:19:42 +00:00
Alexander Motin
f7bf11ab59 MFV 286597: 5701 zpool list reports incorrect "alloc" value for cache devices
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Alek Pinchuk <alek.pinchuk@nexenta.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Prakash Surya <prakash.surya@delphix.com>

illumos/illumos-gate@a52fc310ba
2015-08-10 21:13:59 +00:00
Alexander Motin
2e92f38b63 Local addition and mismerge fix for r286579. 2015-08-10 20:34:46 +00:00
Alexander Motin
de8b7ceff1 MFV 286588: 5820 verify failed in zio_done(): BP_EQUAL(bp, io_bp_orig)
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@34e8acef00
2015-08-10 19:38:07 +00:00
Alexander Motin
57f7c5acf5 MFV 286586: 5746 more checksumming in zfs send
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Albert Lee <trisk@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@98110f08fa
2015-08-10 19:32:58 +00:00
Alexander Motin
9b4b955150 MFV r277430: 5313 Allow I/Os to be aggregated across ZIO priority classes
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Will Andrews <willa@SpectraLogic.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Justin T. Gibbs <justing@spectralogic.com>

illumos/illumos-gate@fe319232d2
2015-08-10 12:39:10 +00:00
Alexander Motin
498b9d6c63 Fix r286574 build in user-space. 2015-08-10 12:25:26 +00:00
Alexander Motin
0702ce1a52 Fix r286570 build with debug. 2015-08-10 11:52:54 +00:00
Alexander Motin
83a6947e11 MFV r277428: 5056 ZFS deadlock on db_mtx and dn_holds
Reviewed by: Will Andrews <willa@spectralogic.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Justin Gibbs <justing@spectralogic.com>

illumos/illumos-gate@bc9014e6a8
2015-08-10 11:30:07 +00:00
Alexander Motin
f13e9e1470 MFV r277427: 5445 Add more visibility via arcstats; specifically
arc_state_t stats and differentiate between "data" and "metadata"

Reviewed by: Basil Crow <basil.crow@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Bayard Bell <bayard.bell@nexenta.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Prakash Surya <prakash.surya@delphix.com>

illumos/illumos-gate@4076b1bf41
2015-08-10 10:59:58 +00:00
Alexander Motin
c908dc6f4b MFV r277426: 5408 managing ZFS cache devices requires lots of RAM
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Don Brady <dev.fs.zfs@gmail.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Chris Williamson <Chris.Williamson@delphix.com>

illumos/illumos-gate@89c86e3229

Currently, every buffer cached in the L2ARC is accompanied by a 240-byte
header in memory, leading to very high memory consumption when using very
large cache devices. These changes significantly reduce this overhead.

Currently:

L1-only header = 176 bytes
L1 + L2 or L2-only header = 176 bytes + 32 byte checksum + 32 byte l2hdr
    = 240 bytes

Memory-optimized:

L1-only header = 176 bytes
L1 + L2 header = 176 bytes + 32 byte checksum = 208 bytes
L2-only header = 96 bytes + 32 byte checksum = 128 bytes

So overall:

          Trunk  Optimized
        +-----------------+
L1-only | 176 B  | 176 B  | (same)
        +-----------------+
L1 & L2 | 240 B  | 208 B  | (saved 32 bytes)
        +-----------------+
L2-only | 240 B  | 128 B  | (saved 112 bytes)
        +-----------------+

For an average blocksize of 8KB, this means that for the L2ARC, the ratio
of metadata to data has gone down from about 2.92% to 1.56%.  For a
'storage optimized' EC2 instance with 1600GB of SSD and 60GB of RAM, this
means that we expect a completely full L2ARC to use (1600 GB * 0.0156) /
60GB = 41% of the available memory, down from 78%.
2015-08-10 10:34:23 +00:00
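The header arithmetic in c908dc6f4b can be re-derived from the component sizes given in the message. The sketch below uses only those figures (176, 32, 32 and 96 bytes, 8 KB blocks, 1600 GB of cache, 60 GB of RAM). The function names are made up for this illustration. With these inputs the L2-only saving works out to 240 - 128 = 112 bytes.

```python
# Header component sizes in bytes, taken from the commit message.
L1_HDR = 176
CHECKSUM = 32
L2HDR = 32
L2_ONLY_HDR = 96
GB = 2 ** 30


def header_sizes() -> dict[str, tuple[int, int]]:
    """Return (trunk, optimized) header bytes for each buffer state."""
    trunk_l2 = L1_HDR + CHECKSUM + L2HDR
    return {
        "L1-only": (L1_HDR, L1_HDR),
        "L1 & L2": (trunk_l2, L1_HDR + CHECKSUM),
        "L2-only": (trunk_l2, L2_ONLY_HDR + CHECKSUM),
    }


def ram_fraction(l2arc_bytes: int, ram_bytes: int, header_bytes: int,
                 block_bytes: int = 8 * 1024) -> float:
    """Fraction of RAM used by headers when the L2ARC is completely full."""
    buffers = l2arc_bytes // block_bytes
    return buffers * header_bytes / ram_bytes


if __name__ == "__main__":
    for state, (trunk, optimized) in header_sizes().items():
        print(f"{state}: {trunk} B -> {optimized} B (saved {trunk - optimized})")
    print(f"optimized: {ram_fraction(1600 * GB, 60 * GB, 128):.1%} of RAM")
    print(f"trunk:     {ram_fraction(1600 * GB, 60 * GB, 240):.1%} of RAM")
```

The two RAM fractions come out at roughly 41.7% and 78.1%, in line with the 41% and 78% quoted in the message.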
Alexander Motin
1838fc920f MFV 286555: Avoid 128K kmem allocations in mzap_upgrade()
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Approved by: Rich Lowe <richlowe@richlowe.net>

illumos/illumos-gate@be3e2ab906
2015-08-09 20:41:44 +00:00
Alexander Motin
475e2caa3d MFV 286553: 5769 Cast 'zfs bad bloc' to ULL for x86
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Richard PALO <richard@NetBSD.org>
Approved by: Dan McDonald <danmcd@omniti.com>

illumos/illumos-gate@8c76e0763b
2015-08-09 20:32:30 +00:00
Alexander Motin
a5a2e782be MFV 286550: 5694 traverse_prefetcher does not prefetch enough
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: George Wilson <george.wilson@delphix.com>

illumos/illumos-gate@34d7ce052c
2015-08-09 20:23:35 +00:00
Alexander Motin
927211755b MFV 286548:
5693 ztest fails in dbuf_verify: buf[i] == 0, due to dedup and bp_override

Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@7f7ace3700
2015-08-09 20:08:38 +00:00
Alexander Motin
4ff9527edc MFV 286546:
5661 ZFS: "compression = on" should use lz4 if feature is enabled

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed by: Xin LI <delphij@freebsd.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Justin T. Gibbs <justing@spectralogic.com>

illumos/illumos-gate@db1741f555
2015-08-09 20:02:16 +00:00
Alexander Motin
9834135947 MFV 286544:
5630 stale bonus buffer in recycled dnode_t leads to data corruption

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Justin T. Gibbs <justing@spectralogic.com>
2015-08-09 19:35:39 +00:00
Alexander Motin
07470009be MFV 286542: 5592 NULL pointer dereference in dsl_prop_notify_all_cb()
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Will Andrews <will@freebsd.org>
Approved by: Robert Mustacchi <rm@joyent.com>

illumos/illumos-gate@9d47dec048
2015-08-09 19:29:10 +00:00
Alexander Motin
1d367fea94 MFV 286540: 5531 NULL pointer dereference in dsl_prop_get_ds()
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Justin T. Gibbs <justing@spectralogic.com>

illumos/illumos-gate@e57a022b8f
2015-08-09 19:26:21 +00:00
Alexander Motin
c44a37d0b7 MFV 286538:
5562 ZFS sa_handle's violate kmem invariants, debug kernels panic on boot

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Robert Mustacchi <rm@fingolfin.org>
Reviewed by: George Wilson <george@delphix.com>
Reviewed by: Rich Lowe <richlowe@richlowe.net>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Justin T. Gibbs <justing@spectralogic.com>

illumos/illumos-gate@0fda3cc5c1
2015-08-09 19:17:59 +00:00
Mark Johnston
1c9a705223 Remove a couple of unused fields from the FBT probe struct. 2015-08-03 17:39:36 +00:00
Steven Hartland
ebbc56ecd6 Fix KSTACK_PAGES check in ZFS module
The check introduced by r285946 failed to add the dependency on
opt_kstack_pages.h which meant the default value for the platform instead
of the customised options KSTACK_PAGES=X was being tested.

Also wrap in #ifdef __FreeBSD__ for portability.

MFC after:	3 days
Sponsored by:	Multiplay
2015-08-03 09:34:09 +00:00
Mark Johnston
48fcd357c4 Avoid dereferencing curthread->td_proc->p_cred in DTrace probe context.
When a process is exiting, there is a narrow window where p_cred may be
NULL while its threads are still executing. Specifically, the last thread
to exit a process sets the process state to PRS_ZOMBIE with the proc
spinlock held and then calls thread_exit(). thread_exit() drops the spin
lock, permitting the process to be reaped and thus causing its cred struct
to be released. However, the exiting thread may still cause DTrace probes
to fire by calling sched_throw(), resulting in a double fault if such a
probe enabling attempts to access the GID or UID DIF variables.

The thread's cred reference is not susceptible to this race since it is not
released until after the thread has exited.

MFC after:	1 week
2015-08-02 00:11:56 +00:00
Steven Hartland
1635369e99 Add warning about low KSTACK_PAGES for ZFS use
As ZFS requires more kernel stack pages than the default on some
architectures, e.g. i386, warn if KSTACK_PAGES is less than
ZFS_MIN_KSTACK_PAGES (which is 4 at the time of writing).

MFC after:	3 days
Sponsored by:	Multiplay
2015-07-28 11:19:38 +00:00
Mark Johnston
d258fd1d98 Remove checks for a NULL return value from M_WAITOK allocations. 2015-07-21 23:44:36 +00:00
Ed Schouten
5a170c1b0e Add an API for easily creating userspace threads in kernelspace.
This change refactors the existing create_thread() function to be more
generic. It replaces almost all of its arguments by a callback that can
be used to extract the thread ID and copy it out to the right place, but
also to perform additional initialization steps, such as setting the
trapframe. This also makes the difference between thr_new() and
thr_create() more clear in my opinion.

This function is going to be used by the CloudABI compatibility layer.

It looks like the OpenSolaris compatibility framework already provides a
function called thread_create(). Rename this function to
do_thread_create() and use a macro to deal with the namespacing
conflict. A similar approach is already used for thread_exit().

MFC after:	1 month
2015-07-20 10:20:04 +00:00
Mark Johnston
32cd0147fa Implement the lockstat provider using SDT(9) instead of the custom provider
in lockstat.ko. This means that lockstat probes now have typed arguments and
will utilize SDT probe hot-patching support when it arrives.

Reviewed by:	gnn
Differential Revision:	https://reviews.freebsd.org/D2993
2015-07-19 22:14:09 +00:00
Mark Johnston
efe8b26b82 Modify lockstat_nsecs() to just return unless lockstat probes are actually
enabled. The cost of a timecounter read can be quite significant, and the
problem became more apparent after r284297, since that change resulted in
a call to lockstat_nsecs() for each acquisition of an rwlock read lock.

PR:		201642
Reviewed by:	avg
Tested by:	Jason Unovitch
MFC after:	3 days
Differential Revision:	https://reviews.freebsd.org/D3073
2015-07-18 00:22:00 +00:00
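The optimization in efe8b26b82 is an early return that skips an expensive clock read when nobody is listening. The sketch below models that idea only. `LockstatClock` is a hypothetical class, `time.monotonic_ns()` stands in for the timecounter read, and returning 0 when disabled is an assumption of this sketch rather than something the message states.

```python
import time


class LockstatClock:
    """Toy model: read the clock only while probes are enabled."""

    def __init__(self) -> None:
        self.enabled = False   # stands in for "lockstat probes are enabled"
        self.reads = 0         # counts how often the costly read happens

    def _read_timecounter(self) -> int:
        self.reads += 1
        return time.monotonic_ns()

    def nsecs(self) -> int:
        # Early return: skip the costly read when no probe needs the value.
        if not self.enabled:
            return 0
        return self._read_timecounter()
```

Every lock acquisition can then call `nsecs()` unconditionally, and the cost of the clock read is only paid while tracing is active.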
Konstantin Belousov
888e282ab4 When checking for the valid value of the frame pointer, verify that it
belongs to the kernel stack address range for the thread.  Right now,
the code checks that the new frame is not farther than KSTACK_PAGES pages
from the current frame, which allows the address to point past the top of
the stack.

Reviewed by:	andrew, emaste, markj
Differential revision:	https://reviews.freebsd.org/D3108
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-07-16 19:40:18 +00:00
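The difference between the two frame-pointer checks in 888e282ab4 shows up with one concrete address. The sketch below is a simplified reading of the description, not the kernel code: the function names, the 4 KiB page size and the value of `KSTACK_PAGES` are assumptions chosen for the example.

```python
PAGE_SIZE = 4096     # assumed page size
KSTACK_PAGES = 4     # assumed kernel stack size in pages
KSTACK_BYTES = KSTACK_PAGES * PAGE_SIZE


def distance_check(cur_fp: int, new_fp: int) -> bool:
    """Old rule: accept a frame that lies within KSTACK_PAGES pages above
    the current frame, regardless of where the stack actually ends."""
    return cur_fp < new_fp <= cur_fp + KSTACK_BYTES


def range_check(kstack_base: int, new_fp: int) -> bool:
    """New rule: accept a frame only inside the thread's own stack range."""
    return kstack_base <= new_fp < kstack_base + KSTACK_BYTES


kstack_base = 0x10000                  # hypothetical stack bottom
stack_top = kstack_base + KSTACK_BYTES
cur_fp = stack_top - 0x100             # current frame near the top
bad_fp = stack_top + 0x1000            # bogus frame past the top
```

Here `bad_fp` is close enough to `cur_fp` to pass the distance rule even though it lies beyond the stack, while the range rule rejects it.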
Mateusz Guzik
752fc07d33 vfs: implement v_holdcnt/v_usecount manipulation using atomic ops
Transitions 0->1 and 1->0 (which decide e.g. on putting the vnode on the free
list) of either counter are still guarded with vnode interlock.

Reviewed by:	kib (earlier version)
Tested by:	pho
2015-07-16 13:57:05 +00:00
Mateusz Guzik
8a08cec166 Create a dedicated function for ensuring that cdir and rdir are populated.
Previously several places were doing it on their own, partially
incorrectly (e.g. without the filedesc locked) or even in an actively
harmful way, by populating jdir or assigning rootvnode without vrefing it.

Reviewed by:	kib
2015-07-11 16:22:48 +00:00
Mariusz Zaborski
306a82f8f4 Rename zfs nvpair files to not collide with our nvlist.
PR:		201356
Approved by:	pjd (mentor)
2015-07-09 21:53:40 +00:00
Mateusz Guzik
f131759f54 fd: make 'rights' a mandatory argument to fget* functions 2015-07-05 19:05:16 +00:00
Konstantin Belousov
6fdfd88220 Use single instance of the identical INKERNEL() and PMC_IN_KERNEL()
macros on amd64 and i386.  Move the definition to machine/param.h.
kgdb defines INKERNEL() too, the conflict is resolved by renaming kgdb
version to PINKERNEL().

On i386, correct the lowest kernel address.  After the shared page was
introduced, USRSTACK no longer points to the last user address + 1 [*]

Submitted by:	Oliver Pinter [*]
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-07-02 14:37:21 +00:00
Andriy Gapon
74f75cb1bd zfs_mount(MS_REMOUNT): protect zfs_(un)register_callbacks calls
We now take z_teardown_lock as a writer to ensure that there is no I/O
while the filesystem state is in flux.  Also, zfs_suspend_fs() ->
zfsvfs_teardown() call zfs_unregister_callbacks() and zfs_resume_fs() ->
zfsvfs_setup() call zfs_register_callbacks().  Previously there was no
synchronization between those calls and the calls in the re-mounting
case.  That could lead to concurrent execution and a crash.

PR:		180060
Differential Revision:	https://reviews.freebsd.org/D2865
Suggested by:	mahrens
Reviewed by:	delphij, pho, mahrens, will
MFC after:	13 days
Sponsored by:	ClusterHQ
2015-07-02 08:32:02 +00:00
Ruslan Bukin
b78ee15e9f First cut of DTrace for AArch64.
Reviewed by:	andrew, emaste
Sponsored by:	ARM Limited
Differential Revision:	https://reviews.freebsd.org/D2738
2015-07-01 15:51:11 +00:00
Ruslan Bukin
0ff41755cd Add a central location for exclusion checks. We check
here if a function is excluded from FBT instrumentation.

Reviewed by:	andrew, emaste, markj
Differential Revision:	https://reviews.freebsd.org/D2899
2015-07-01 14:09:59 +00:00
Andriy Gapon
bc97daa07e MFV r284412: 5911 ZFS "hangs" while deleting file
Reviewed by: Bayard Bell <buffer.g.overflow@gmail.com>
Reviewed by: Alek Pinchuk <alek@nexenta.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>

illumos/illumos-gate@46e1baa6cf

https://www.illumos.org/issues/5911
Sometimes ZFS appears to hang while deleting a file. It is actually
making slow progress at the file deletion, but other operations
(administrative and writes via the data path) "hang" until the file
removal completes, which can take a long time if the file has many
blocks. The deletion (or most of it) happens in a single txg, and the
sync thread spends most of its time reading indirect blocks via this
stack trace:
	swtch+0x141()
	cv_wait+0x70()
	zio_wait+0x5b()
	dbuf_read+0x2c0()
	free_children+0x50()
	free_children+0x12a()
	free_children+0x12a()
	free_children+0x12a()
	dnode_sync_free_range_impl+0xdf()
	dnode_sync_free_range+0x52()
	range_tree_vacate+0x65()
	dnode_sync+0x1d8()
	dmu_objset_sync_dnodes+0x77()
	dmu_objset_sync+0x19f()
	dsl_dataset_sync+0x51()
	dsl_pool_sync+0x9a()
	spa_sync+0x2ff()
	txg_sync_thread+0x21f()
	thread_start+8()
One way to reproduce the problem is if we are over the arc_meta_limit,
e.g. because lots of indirect blocks are pinned because we have L0
dbufs under them.  It could be that most of the L1 indirects are cached,
in which case when dmu_free_long_range_impl() calls dmu_tx_hold_free(),
it will complete very quickly. This allows dmu_free_long_range_impl() to
put many (perhaps all of its) transactions in the same TXG. However,
dmu_free_long_range_impl() calls dnode_evict_dbufs (and
dnode_free_range()), which removes the L0 dbufs, thus reducing the hold
count on the L1 indirect blocks above it, allowing them to be evicted.
Because we are over the arc_meta_limit, these L1 blocks will be
evicted ASAP. Thus when we get to syncing context, the L1 indirects are
no longer cached and must be read in.

Obtained from:	illumos
MFC after:	15 days
2015-06-19 06:58:05 +00:00
Andriy Gapon
ab50c99d40 illumos compat: use flsl/flsll for highbit/highbit64
Do that only when fast inline versions are available.
At the moment that can be the case only in the kernel and not for all
platforms.

The original code uses the binary search and that's kept as a fallback.
This is a micro optimization.

Differential Revision:	https://reviews.freebsd.org/D2839
Reviewed by:	delphij, mahrens, mav
MFC after:	17 days
2015-06-19 06:41:53 +00:00
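The two strategies mentioned in ab50c99d40, a binary search kept as the fallback and a find-last-set primitive as the fast path, compute the same value. The sketch below is a Python illustration with made-up function names. It uses `int.bit_length()` as a stand-in for flsll and assumes the convention that the result is the 1-based index of the highest set bit, with 0 for an input of 0.

```python
def highbit64_binary_search(value: int) -> int:
    """1-based index of the highest set bit of a 64-bit value, 0 if none.
    Halves the search window at each step: 32, 16, 8, 4, 2, 1 bits."""
    if value == 0:
        return 0
    index = 1
    for shift in (32, 16, 8, 4, 2, 1):
        if value >> shift:       # anything set in the upper half?
            value >>= shift      # continue the search in that half
            index += shift
    return index


def highbit64_fls(value: int) -> int:
    """Same result through a find-last-set style primitive."""
    return value.bit_length()
```

Both functions return 0 for 0, 1 for 1 and 64 for `1 << 63`. The binary search always takes six steps, which is the cost the commit avoids where an inline flsl/flsll is available.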
Gleb Smirnoff
093ebe1d28 o Un-inline vm_pager_get_pages(), vm_pager_get_pages_async().
o Provide an extensive set of assertions for input array of pages.
o Remove now duplicate assertions from different pagers.

Sponsored by:	Nginx, Inc.
Sponsored by:	Netflix
2015-06-17 22:44:27 +00:00
Andriy Gapon
783379a942 Revert r284511 because it caused build failures on many platforms
The problem is that when inline versions of flsl and flsll are not
available, then libkern.h must be included for their declarations
in kernel sources.
The fix would be trivial, but I would like to figure out first if
it even makes sense to use the libkern provided implementations.

Reported by:	bz
Pointyhat to:	avg
2015-06-17 17:16:06 +00:00
Andriy Gapon
6470c31911 l2arc: pass correct size to trim requests
b_size is a logical size of a buffer in memory, b_asize is its physical
size that accounts for possible compression.
Currently the latter is the best approximation for the allocated, on-disk
size.

L2ARC TRIM support was committed a few weeks before L2ARC compression
was imported, so originally the code was correct, because b_size was
the size.

Further thoughts.  Given that the cache device is being overwritten
in a circular fashion it is not clear if a TRIM per each evicted L2ARC
buffer has any benefits.
Maybe it would be sufficient to issue a single trim request for the whole
device when it is loaded, e.g. after a bootup, or when it is unloaded, e.g.
before a shutdown.  At least as long as L2ARC is not persistent across
reboots.

Discussed with:	smh
MFC after:	19 days
2015-06-17 12:28:13 +00:00
Andriy Gapon
1fa1d4a651 illumos compat: use flsl/flsll for highbit/highbit64
This is a micro optimization.
The upstream code uses the binary search.

Differential Revision:	https://reviews.freebsd.org/D2839
Reviewed by:	delphij, mav
MFC after:	15 days
2015-06-17 12:05:04 +00:00
Andriy Gapon
bab89d0897 MFV r284036: 5961 Fix stack overflow in zfs_create_fs
illumos/illumos-gate@c701fde691

Author:		glebius
MFC after:	11 days
2015-06-12 11:10:49 +00:00
Andriy Gapon
ff7e06fbf4 MFV r284030: 5818 zfs {ref}compressratio is incorrect with 4k sector size
illumos/illumos-gate@81cd5c555f

Author:	Matthew Ahrens <mahrens@delphix.com>
MFC after:	17 days
2015-06-12 10:57:05 +00:00
Andriy Gapon
8e9f0d5803 MFV r283534: 5515 dataset user hold doesn't reject empty tags
illumos/illumos-gate@752fd8dabc

Author:	Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
MFC after:	10 days
2015-06-12 10:52:53 +00:00
Andriy Gapon
dde4126314 MFV r284040: check that datasets are snapshots
5946 zfs_ioc_space_snaps must check that firstsnap and lastsnap refer to snapshots
5945 zfs_ioc_send_space must ensure that fromsnap refers to a snapshot
Reviewed by: Steven Hartland <killing@multiplay.co.uk>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Gordon Ross <gordon.ross@nexenta.com>

illumos/illumos-gate@24218bebb4

Note that the upstream commit is modified during MFV: in the upstream
the check is done by inspecting ds_is_snapshot field while in FreeBSD
we call dsl_dataset_is_snapshot().
This is because illumos/illumos-gate@bc9014e6a8
(r277428 in vendor-sys/illumos) is not MFV-ed yet.

MFC after:	10 days
2015-06-12 10:41:24 +00:00
Ruslan Bukin
8bd0e17595 Don't re-define LOCORE when dtrace is built-in to the kernel. 2015-06-10 09:59:26 +00:00
Andriy Gapon
de93769f1d compat nvpair.h: make sure that the names are mangled only for kernel
Currently there is no good reason to mangle the userland API.
The change was introduced in eac1d566b4,
r279437.  Also see https://reviews.freebsd.org/D1881.

I am still convinced that nv should not have introduced intentionally
conflicting API.

Discussed with:	rstone
X-MFC with:	r279437
Sponsored by:	ClusterHQ
2015-06-07 08:54:25 +00:00
Konstantin Belousov
63261dad32 Add missed {}.
Noted by:	Morten Rodal <morten@rodal.no>
MFC after:	2 weeks
2015-05-27 19:28:14 +00:00
Konstantin Belousov
780dca1b1e Right now, dounmount() is called with an unreferenced mount point.
Nothing stops a parallel unmount from succeeding before the given call to
dounmount() checks and locks the covered vnode.  Prevent dounmount()
from acting on the freed (although type-stable) memory by changing the
interface to require the mount point to be referenced.  dounmount()
consumes the reference on return, regardless of the successful or
erroneous result.

Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2015-05-27 09:22:50 +00:00
Andriy Gapon
4b040d9513 zfs: fixes for a full stream received into an existing dataset
- this should fail early unless the force flag is set
- if the force flag is set then any local modifications including
  snapshots should be undone

See:	https://www.illumos.org/issues/5912
See:	https://reviews.csiden.org/r/220/

Reviewed by:	mahrens, Paul Dagnelie <pcd@delphix.com>
MFC after:	15 days
Sponsored by:	ClusterHQ
2015-05-25 11:56:57 +00:00
Andriy Gapon
e80d8b4b7c dsl_dataset_promote_check: ensure that shared snaps do not become too long
... after they are transferred from the old origin to the new one.

See:	https://www.illumos.org/issues/5909
See:	https://reviews.csiden.org/r/219/

Reviewed by:	mahrens
MFC after:	10 days
Sponsored by:	ClusterHQ
2015-05-25 11:48:15 +00:00
Konstantin Belousov
e61d4e626e Remove excess Giant acquisition around the dounmount() call.
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2015-05-25 09:08:19 +00:00
Mark Johnston
11027ebcbb Remove unused references to calltrap.
MFC after:	3 days
2015-05-25 01:22:56 +00:00
Jung-uk Kim
fd90e2ed54 CALLOUT_MPSAFE has lost its meaning since r141428, i.e., for more than ten
years for head.  However, it is continuously misused as the mpsafe argument
for callout_init(9).  Deprecate the flag and clean up callout_init() calls
to make them more consistent.

Differential Revision:	https://reviews.freebsd.org/D2613
Reviewed by:	jhb
MFC after:	2 weeks
2015-05-22 17:05:21 +00:00
Steven Hartland
c017a87e08 Add copyright info missing from r282205
Add the copyright info missing from ZoL origin version.

MFC after:	2 days
Sponsored by:	Multiplay
2015-05-14 08:13:01 +00:00
Andriy Gapon
defce67748 zfs ioctls: use fget_write / fget_read instead of getf wrapper for fget
This ensures that we do not write to a file that was opened
for reading only or vice versa.

Also, use the correct capability in zfs_ioc_send_new().

Differential Revision:	https://reviews.freebsd.org/D2382
Reviewed by:	delphij
MFC after:	17 days
Sponsored by:	ClusterHQ
2015-05-11 10:07:31 +00:00
Mark Johnston
5a9f9cb38e Remove some commented-out upstream code for handling traps from usermode
DTrace probes. This handling is already done in trap() on i386 and amd64.
2015-05-10 22:27:48 +00:00
Justin Hibbits
45bf6d59de Fix a couple bugs in 64-bit powerpc fasttrap argument retrieval.
Found by code inspection.
2015-05-10 04:33:01 +00:00
Andriy Gapon
96b60db0d7 MFV r282630: 5809 Blowaway full receive in v1 pool causes kernel panic
MFC after:	5 days
2015-05-08 14:03:14 +00:00
Andriy Gapon
24dd1a8242 zfs: do not hold an extra reference on a root vnode while a filesystem is mounted
At present zfs_domount() acquires a reference on the filesystem's root vnode
and that reference is kept until zfs_umount.
The latter calls vflush(rootrefs = 1) to dispose of the extra reference.

There is no explanation of why that reference is kept - what problem it
solves or what behavior it improves.
Also, that logic is FreeBSD specific.

There is one real problem with that reference, though.
zfs recv -F may receive a full, non-incremental stream to a mounted filesystem.
In that case the received root object is likely to have a different z_gen
attribute value. Because of that, zfs_rezget will leave the previous root znode
and vnode disassociated from the actual object (z_sa_hdl == NULL).
Thus, future calls to VFS_ROOT() -> zfs_root() will produce a new vnode-znode
pair, while the old one will be kept alive by the outstanding reference.
So, the outstanding reference will not actually be for the new root vnode
(or, more precisely, vnodes - because a root vnode may be recycled and a newer
one can be created).
As a result, when vflush(rootrefs = 1) is called there will be two problems:

- a leaked reference on the old root vnode preventing a graceful unmount
- insufficient references on the actual root vnode leading to a crash upon
  access to the vnode after it is destroyed by vgone() + vdrop()

The second issue will actually override the first one.

Differential Revision:	https://reviews.freebsd.org/D2353
Reviewed by:		delphij, kib, smh
MFC after:	17 days
2015-05-05 11:01:06 +00:00
Andriy Gapon
ce0023d851 dmu_recv_end_check: don't leak hold if dsl_destroy_snapshot_check_impl fails
The leak may happen if !drc_newfs && drc_force and there is an error
iterating through snapshots or any of snapshot checks fails.

See https://www.illumos.org/issues/5870
See https://reviews.csiden.org/r/206/

Reviewed by:	mahrens (as mahrens@delphix.com)
MFC after:	15 days
Sponsored by:	ClusterHQ
2015-05-05 10:56:16 +00:00
Steven Hartland
aeb9d4dad9 Fix misuse of input argument in traverse_visitbp
In traverse_visitbp(), the input argument dnp is modified in the middle
to point to a temporary buffer. Originally this doesn't matter, because
no user of TRAVERSE_POST dereferences it. However, in fbeddd6 a piece of
code is added dereferencing dnp after the modification, creating a possible
bug.

We fix this by creating a new local variable cdnp for the DMU_OT_DNODE case,
so we don't modify the input argument. Also we introduce different local
variables in the DMU_OT_OBJSET case to prevent confusion between the input
argument.

Obtained from:	zfsonlinux (a585f2f844ed3d4270221fed88f5e494eb55d932)
MFC after:	2 weeks
Sponsored by:	Multiplay
2015-04-28 22:46:58 +00:00
Andriy Gapon
9bc3222765 replace a comment about zfs recv -F corner case with a longer, more detailed one
The old comment in zfs_rezget explains what situation the code handles,
the new comment also describes how the situation can arise.

Also, re-join a line that became sufficiently short some time ago.

Differential Revision:	https://reviews.freebsd.org/D2352
Reviewed by:	delphij, smh
MFC after:	12 days
2015-04-28 09:19:40 +00:00
Andriy Gapon
1af760ce1b zfs_onexit_fd_hold: return EBADF even if devfs_get_cdevpriv gave ENOENT
/dev/zfs always has per-open data, so when it is missing the file
descriptor is for some other file.  Returning ENOENT in this case
is confusing as a variety of other conditions (like a missing dataset)
may result in the same error.  It's better to consistently return
EBADF for any problems with the file descriptor.

Note that zfs_onexit_fd_hold() is used with 'automatic cleanup fd'
- when that fd is closed, typically because a process is terminated,
some cleanup action is taken by ZFS driver.  E.g. a temporary
snapshot hold is released.

Perhaps, it would even be worthwhile changing devfs_get_cdevpriv()
to return EBADF if there is no associated data.

Differential Revision:	https://reviews.freebsd.org/D2370
Reviewed by:	delphij, smh
MFC after:	12 days
2015-04-28 09:11:47 +00:00
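The error policy described in this message can be sketched in a few lines. The helper name is hypothetical and the sketch takes the message literally: any failure to obtain the per-open data is reported as EBADF.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical helper: normalize the result of a per-open-data lookup.
 * Any failure means the descriptor is not a usable /dev/zfs handle,
 * so the caller sees EBADF rather than, for example, ENOENT.
 */
int
sketch_onexit_fd_error(int lookup_error)
{
	assert(lookup_error >= 0);
	return (lookup_error == 0 ? 0 : EBADF);
}
```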
Andriy Gapon
37a9b4136e dsl_dir_rename_check: return EXDEV on cross-pool rename attempt
Obtained from:	zfsonlinux/zfs@9063f65476
Obtained from:	Boris Protopopov <boris.protopopov@actifio.com>
MFC after:	10 days
2015-04-28 08:04:16 +00:00
Andriy Gapon
99d058c8a7 MFV r282123: 5610 zfs clone from different source and target pools produces coredump
MFC after:	10 days
2015-04-28 07:42:28 +00:00
Andriy Gapon
28d15239af MFV r282124: 5393 spurious failures from dsl_dataset_hold_obj()
The actual bugfix was pro-actively committed in r275515.
This MFV is cosmetic, it just aligns code style with the upstream.

MFC after:	10 days
2015-04-28 07:37:38 +00:00
Andriy Gapon
39b6f1d6c1 nvpair_type_is_array: DATA_TYPE_INT8_ARRAY was not recognized
To do:	upstream (https://www.illumos.org/issues/5778)
MFC after:	10 days
2015-04-28 06:34:55 +00:00
Robert Watson
a12df97ed2 Adjust PROF_ARTIFICIAL_FRAMES in the DTrace profile provider on ARM to
skip 10, rather than 9, frames.  This appears to work quite well in
practice on the BeagleBone Black, so remove a comment about the value
being bogus and replace it with a slightly less negative one.  However,
the number of frames to skip is quite sensitive to details of the timer
and interrupt handling paths, so this is necessarily fragile -- but no
more so than on x86.

Sponsored by:	DARPA, AFRL
2015-04-25 15:43:12 +00:00
Mark Johnston
8241ee3b2c Fix DTrace's panic() action.
It would previously call into some unfinished Solaris compatibility code and
return without actually calling panic(9). The compatibility code is
unneeded, however, so just remove it and have dtrace_panic() call vpanic(9)
directly.

Differential Revision:	https://reviews.freebsd.org/D2349
Reviewed by:	avg
MFC after:	2 weeks
Sponsored by:	EMC / Isilon Storage Division
2015-04-24 03:19:30 +00:00
Xin LI
384f656a1a Remove vfs.zfs.snapshot_list_prefetch, the corresponding code was
gone in r248571 already.

MFC after:	1 week
2015-04-17 21:21:11 +00:00
Mark Johnston
67cf27b70f libdtrace: add support for lazyload mode.
Passing "-x lazyload" to dtrace -G during compilation causes dtrace(1) to
not link drti.o into the output object file, so the USDT probes are not created
during process startup. Instead, dtrace(1) will automatically discover and
create probes on the process' behalf when attaching.

Differential Revision:	https://reviews.freebsd.org/D2203
Reviewed by:		rpaulo
MFC after:		1 month
2015-04-08 02:36:37 +00:00
Alexander Motin
91b9f63738 Add DTrace probe to the new ARC reclaim cause added in r281026.
MFC after:	1 month
2015-04-05 14:45:52 +00:00
Alexander Motin
2e9ccb32a1 Make ZFS ARC track both KVA usage and fragmentation.
Even on Illumos, with its much larger KVA, ZFS ARC steps back if KVA usage
reaches certain threshold (3/4 on i386 or 16/17 otherwise).  FreeBSD has
even less KVA, but had no such limit on archs with direct map as amd64.
As result, on machines with a lot of RAM, during load with very small user-
space memory pressure, such as `zfs send`, it was possible to reach state,
when there is enough both physical RAM and KVA (I've seen up to 25-30%),
but no continuous KVA range to allocate even single 128KB I/O request.

Address this situation from two sides:
 - restore KVA usage limitations in a way the most close to Illumos;
 - introduce new requirement for KVA fragmentation, specifying that we
should have at least one sequential KVA range of zfs_max_recordsize bytes.

Experiments show that first limitation done alone is not sufficient.  On
machine with 64GB of RAM it is sometimes needed to drop up to half of ARC
size to get at least one 1MB KVA chunk.  Statically limiting ARC to half
of KVA/RAM is too strict, so second limitation makes it to work in cycles:
accumulate trash up to certain critical mass, do massive spring-cleaning,
and then start littering again. :)

MFC after:	1 month
2015-04-03 14:45:48 +00:00
Andrew Turner
7572a8c8f1 Add the arm64 defines for cddl code.
Differential Revision:	https://reviews.freebsd.org/D2186
Reviewed by:	emaste
Sponsored by:	The FreeBSD Foundation
2015-04-01 08:31:56 +00:00
Mark Johnston
09a15aa38d Import a missing piece of commit b8fac8e162eda7e98d from illumos-gate.
This adds an upper bound, dtrace_ustackdepth_max, to the number of frames
traversed when computing the userland stack depth. Some programs - notably
firefox - are otherwise able to trigger an infinite loop in
dtrace_getustack_common(), causing a panic.

MFC after:	1 week
2015-03-30 03:55:51 +00:00
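The effect of an upper bound on the frame walk can be sketched as below. The structure and function names are hypothetical; only the idea of a maximum depth comes from the message. A cyclic chain of frames, which would otherwise loop forever, terminates once the bound is reached.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user stack frame: each frame links to its caller. */
struct sketch_frame {
	struct sketch_frame *fr_caller;
};

/*
 * Count frames, giving up after max_depth so that a corrupt or cyclic
 * chain cannot keep the walker spinning forever.
 */
int
sketch_ustackdepth(const struct sketch_frame *fp, int max_depth)
{
	int depth = 0;

	assert(max_depth >= 0);
	while (fp != NULL && depth < max_depth) {
		depth++;
		fp = fp->fr_caller;
	}
	return (depth);
}
```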
Alexander Motin
e5dcb72f45 Some cosmetic polishing. No functional change.
MFC after:	1 week
2015-03-29 20:28:18 +00:00
Mark Johnston
97f2f66479 Remove unused upstream DTrace provider implementations that are duplicates
of providers under sys/cddl/dev/. Also remove sdt_subr.c, which isn't used
in FreeBSD's SDT implementation.

Suggested by:	rwatson
2015-03-16 01:15:08 +00:00
Robert Watson
9dcce6e267 Now that DTrace stack traces handle exception frames better, skip fewer
stack frames for FBT 'entry' probes on ARM.

MFC after:	3 days
Sponsored by:	DARPA, AFRL
2015-03-15 15:19:02 +00:00
Robert Watson
2a1d3dee4f On ARM, unlike some other architectures, saved $pc values from in-kernel
traps do appear in the regular call stack, rather than only in a special
trap frame, so we don't need to inject the trap-frame $pc into a returned
stack trace in DTrace.

MFC after:	3 days
Sponsored by:	DARPA, AFRL
2015-03-15 15:17:34 +00:00
Robert Watson
808d83b01a Replace the completely arbitrary '3' with '9' for the number of frames to
skip using the DTrace 'profile' provider on ARM.  This causes stack traces
to skip various driver- and callout-related things as they do on x86, where
the likewise arbitrary values are '6' (32-bit) and '10' (64-bit) for
similar sorts of reasons.

MFC after:	3 days
Sponsored by:	DARPA, AFRL
2015-03-15 14:12:40 +00:00
Steven Hartland
208264283d Allow zvol_geom_worker to process BIO_DELETE's
If zvol_geom_start is called with a BIO_DELETE from a thread which can
sleep it queues it for later processing by the zvol_geom_worker. The
zvol_geom_worker didn't have a delete case so would simply lose the bio
hence preventing the original caller from ever completing. In addition
any other unknown types would suffer the same fate.

Allow zvol_geom_worker to process BIO_DELETE's via zvol_strategy and
return unsupported for all unknown bio types.

MFC after:	2 weeks
Sponsored by:	Multiplay
2015-03-14 17:35:04 +00:00
Alexander Motin
0d45c37cb6 Make DIOCGATTR in device mode handle "GEOM::candelete".
MFC after:	3 days
2015-03-12 16:19:18 +00:00
George V. Neville-Neil
a340dc5348 Add support for walltimestamp to DTrace on ARM. 2015-03-07 04:38:25 +00:00
Andrew Turner
be9bc81174 dtrace_cas32 and dtrace_casptr should return the data loaded from target
not the new value.

Sponsored by:	ABT Systems Ltd
2015-03-05 18:03:42 +00:00
Andrew Turner
4a8169d97b Add the MD parts of dtrace needed to use fbt on ARM. For this we need to
emulate the instructions used in function entry and exit.

For function entry ARM will use a push instruction to push up to 16
registers to the stack. While we don't expect all 16 to be used we need to
handle any combination the compiler may generate, even if it doesn't make
sense (e.g. pushing the program counter).

On function return we will either have a pop or branch instruction. The
former is similar to the push instruction, but with care to make sure we
update the stack pointer and program counter correctly in the cases they
are either in the list of registers or not. For branch we need to take the
24-bit offset, sign-extend it, and add that number of 4-byte words to the
program counter. Care needs to be taken as, due to historical reasons, the
address the branch is relative to is not the current instruction, but 8
bytes later.

This allows us to use the following probes on ARM boards:
  dtrace -n 'fbt::malloc:entry { stack() }'
and
  dtrace -n 'fbt::free:return { stack() }'

Differential Revision:	https://reviews.freebsd.org/D2007
Reviewed by:	gnn, rpaulo
Sponsored by:	ABT Systems Ltd
2015-03-05 17:55:31 +00:00
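The branch arithmetic described in this message can be sketched as follows. The function name is hypothetical and the code is an illustration of the calculation, not the committed emulation: take the low 24 bits, sign-extend them, scale by 4 bytes, and add the result to the instruction address plus 8.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Compute the target of an ARM (A32) B/BL instruction located at pc.
 * Unsigned arithmetic is used throughout so that negative offsets wrap
 * modulo 2^32 in a well-defined way.
 */
uint32_t
sketch_arm_branch_target(uint32_t pc, uint32_t instr)
{
	uint32_t off = instr & 0x00ffffffu;	/* 24-bit word offset */

	/* Bits 27:25 are 101 for B and BL. */
	assert(((instr >> 25) & 0x7u) == 0x5u);

	/* Sign-extend from bit 23. */
	if (off & 0x00800000u)
		off |= 0xff000000u;

	/* The offset is relative to the instruction address plus 8. */
	return (pc + 8u + off * 4u);
}
```

For example, the encoding 0xEAFFFFFE (a branch to itself) has an offset of -2 words, so at address 0x1000 the target is 0x1000 + 8 - 8 = 0x1000.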
Nathan Whitehorn
4c790d26d1 Fix build after unifying DAR/DEAR storage in trap frame. 2015-03-05 17:02:22 +00:00
Robert Watson
ca162c0888 Don't allow DTrace's FBT on ARM to instrument undefinedinstruction(), as
this would lead to DTrace reentrance.

Sponsored by:	DARPA, AFRL
2015-03-05 07:40:41 +00:00
Andrew Turner
2b6af94bc8 Fix the dtrace ARM atomic compare-and-set functions. These functions are
expected to return the data in the memory location pointed at by target
after the operation. The FreeBSD atomic functions previously used return
either 0 or 1 to indicate if the comparison succeeded or not respectively.

With this change, these functions support only ARMv6 and later.

Sponsored by:	ABT Systems Ltd
2015-03-01 10:04:14 +00:00
Ryan Stone
296a8144aa Allow Illumos code to co-exist with nv(9)
Differential Revision:		https://reviews.freebsd.org/D1881
Reviewed by:			jfv, will
Suggested by:			pjd
MFC after:			1 month
Sponsored by:			Sandvine Inc
2015-03-01 00:22:45 +00:00
Andrew Turner
aeca5b8bc9 Use the ARM unwinder with dtrace to extract the stack when asked. With this
dtrace is able to display a stack trace similar to the one below.

# dtrace -p 603 -n 'tcp:kernel::receive { stack(); }'
  0     70                         :receive
              kernel`ip_input+0x140
              kernel`netisr_dispatch_src+0xb8
              kernel`ether_demux+0x1c4
              kernel`ether_nh_input+0x3a8
              kernel`netisr_dispatch_src+0xb8
              kernel`ether_input+0x60
              kernel`cpsw_intr_rx+0xac
              kernel`intr_event_execute_handlers+0x128
              kernel`ithread_loop+0xb4
              kernel`fork_exit+0x84
              kernel`swi_exit
              kernel`swi_exit

Tested by:	gnn
Sponsored by:	ABT Systems Ltd
2015-02-19 12:20:21 +00:00
George V. Neville-Neil
a57dc98e3a Clean up machine dependent code for DTrace on ARM.
Submitted by:	markj
2015-02-11 17:27:37 +00:00
George V. Neville-Neil
fcb5606706 Initial version of DTrace on ARM32.
Submitted by:	Howard Su based on work by Oleksandr Tymoshenko
Reviewed by:	ian, andrew, rpaulo, markj
2015-02-10 19:41:30 +00:00
Mark Johnston
3277b9a257 Fix a typo in r278137: make sure to free provider state.
X-MFC-With:     r278136
2015-02-08 03:55:12 +00:00
Pedro F. Giffuni
3ccccdc17d MFV r266995:
4767 dtrace_probe() always has the timestamp

Reference:
https://illumos.org/issues/4767

Obtained from:	Illumos
MFC after:	2 weeks
2015-02-03 20:06:30 +00:00
Pedro F. Giffuni
eadcd0fadf MFV r266993:
4469 DTrace helper tracing should be dynamic

Reference:
https://illumos.org/issues/4469

Obtained from:	Illumos
Phabric:	D1551
Reviewed by:	markj
MFC after:	2 weeks
2015-02-03 19:39:53 +00:00
Mark Johnston
c36bd253fa Continue to handle the case where state is NULL, though this currently
cannot happen on FreeBSD. r278136 overlooked the fact that a destructor
registered with devfs_set_cdevpriv(9) is invoked even in the case of an
error.

X-MFC-With:	r278136
2015-02-03 06:04:16 +00:00
Mark Johnston
ac21b651bf Diff reduction with illumos, in preparation for merging r266993 from the
vendor branch. No functional change.

MFC after:	1 week
2015-02-03 05:38:52 +00:00
Steven Hartland
370a13bfff Prevent inlining txg_quiesce
This allows dtrace to monitor the calls to txg_quiesce which can be really
helpful.

Also standardise __noinline order for arc_kmem_reap_now.

Sponsored by:	Multiplay
2015-02-02 00:17:36 +00:00
Mark Johnston
a70a59ea73 Don't attempt to disable enabled fasttrap probes in an exiting process.
There's no need to do so, and we can't hold an exiting process, so this
race can result in panics.

MFC after:	1 week
2015-01-30 05:03:23 +00:00
Mark Johnston
1eb8ad64ea In fasttrap_sigtrap(), use tdsendsignal() rather than tdksignal() to send
SIGTRAP. The latter requires that its thread argument be non-NULL, but
fasttrap_sigtrap() does not.

PR:		193593
MFC after:	1 week
Reported by:	danilo
2015-01-30 04:51:59 +00:00
Xin LI
63cffd61d1 MFV r255258:
Diff reduction with upstream.  The actual change was merged in r272483
already.

MFC after:	2 weeks
2015-01-28 08:56:48 +00:00
Will Andrews
b4e360d239 When creating or updating a node, use vfs_timestamp() for "now" instead
of gethrestime(), to allow the administrator to decide the appropriate
timestamp precision instead of always using nanosecond precision.
2015-01-24 00:43:02 +00:00
Will Andrews
bd3a7c08c4 Remove commented log messages. 2015-01-21 19:30:01 +00:00
Will Andrews
35b540bfb2 Ignore sync requests from the system syncher, i.e. VFS_SYNC(waitfor=MNT_LAZY).
ZFS already commits outstanding data every zfs_txg_timeout seconds, so these
syncs are unnecessarily intrusive.

Submitted by:	gibbs
Sponsored by:	Spectra Logic
MFSpectraBSD:	1105759 on 2014/12/11
2015-01-21 19:25:57 +00:00
Will Andrews
2a2c1d424a Eliminate an #ifdef illumos for zfs_ioc_rename().
Since allow_mounted is a FreeBSD-specific change, default to B_TRUE, then
locally check for the magic bit.  Unconditionally check allow_mounted below.
Convert the setting of allow_mounted to an explicit boolean.

MFC after:	1 week
Sponsored by:	Spectra Logic
MFSpectraBSD:	672578 (in part) on 2013/07/19
2015-01-21 19:20:36 +00:00
Will Andrews
55ddf051d8 Add vfs.zfs.reference_tracking_enable sysctl/tunable.
This is primarily for developer/debugging use; it enables built-in tagged
tracking of refcounts inside ZFS.  It can only be enabled from the loader,
since it modifies how in-core state is managed.  Default remains disabled.

MFC after:	1 week
Sponsored by:	Spectra Logic
2015-01-21 17:03:11 +00:00
Will Andrews
798cbb7523 Fix arc__shrink DTrace probe's to_free argument.
Remove the unnecessary #ifdef _KERNEL, which did not differ in the true or
false cases.  Actually set the value of to_free before using it.

MFC after:	1 week
Sponsored by:	Spectra Logic
2015-01-20 22:39:10 +00:00
Will Andrews
fe20fb9fb0 Use the "zfs_gfs" tag for GFS vnodes to make them easier to identify.
MFC after:	1 week
Sponsored by:	Spectra Logic
2015-01-20 22:31:26 +00:00
Will Andrews
34abed55f3 NSEC_TO_TICK(usec) -> NSEC_TO_TICK(nsec) 2015-01-20 22:29:27 +00:00
Will Andrews
c9c5e04711 Remove unused strdup() #define. 2015-01-20 22:27:45 +00:00
Alexander Motin
d6245e3d44 Allow skipping dmu_buf_will_dirty() call in dsl_dir_transfer_space().
dsl_dir_transfer_space() is mostly called after dsl_dir_diduse_space(),
which already calls dmu_buf_will_dirty() for the same dbuf and tx, so
its duplicate call in those cases will change nothing, only spend time.

Skipping this call by four times reduces time spent in dbuf_write_done()
and descendants, updating dataset statistics with several congested lock
acquisitions.  When rewriting 8K zvol blocks at 1GB/s rate, this reduces
CPU time spent inside dbuf_write_done(), according to profiling, from 45%
of 683K samples to 18% of 422K.

MFC after:	2 weeks
2015-01-20 13:09:12 +00:00
Steven Hartland
5eab7e5406 Clean ZFS spa config before syncing
A number of entries that can be present in the spa config shouldn't be saved
to disk so add a method to ensure this is the case. Without this if the last
caller to vdev_config_generate requested stats then the stats can end up in
the cache file.

Also only skip a non-writable pool in the cache file generation if it's
active. This prevents unavailable pools incorrectly getting removed from
cache file.

Tested by:	delphij
MFC after:	2 weeks
Sponsored by:	Multiplay
2015-01-18 23:15:49 +00:00
Steven Hartland
bc96366c86 Mechanically convert cddl sun #ifdef's to illumos
Since the upstream for cddl code is now illumos not sun, mechanically
convert all sun #ifdef's to illumos #ifdef's which have been used in all
newer code for some time.

Also do a manual pass to correct the use of #ifdef comments as per style(9)
as well as few uses of #if defined(__FreeBSD__) vs #ifndef illumos.

MFC after:	1 month
Sponsored by:	Multiplay
2015-01-17 14:44:59 +00:00
Alexander Motin
38feff972b Fix overflow bug from r248577, turning 30s TRIM timeout into ~4s.
MFC after:	2 weeks
2015-01-14 16:22:00 +00:00
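The message does not say which expression overflowed. As a purely illustrative assumption, the sketch below converts a timeout in seconds to nanoseconds and then truncates the product to 32 bits; for 30 seconds this happens to leave about 4.23 seconds, which is consistent with the "~4s" figure. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define	SKETCH_NANOSEC	1000000000ULL

/* Timeout in nanoseconds computed in 64 bits, as intended. */
uint64_t
sketch_timeout_ns64(unsigned int seconds)
{
	assert(seconds <= 3600);	/* keep the sketch to sane timeouts */
	return ((uint64_t)seconds * SKETCH_NANOSEC);
}

/* The same product truncated to 32 bits, as an overflow would leave it. */
uint32_t
sketch_timeout_ns32(unsigned int seconds)
{
	return ((uint32_t)sketch_timeout_ns64(seconds));
}
```

Here 30 * 10^9 = 30000000000, and 30000000000 mod 2^32 = 4230196224 ns.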
Alexander Motin
d4f46a775d Reimplement TRIM throttling added in r248577.
Previous throttling implementation approached problem from the wrong side.
It significantly limited useful delaying of TRIM requests and aggregation
potential, while not so much controlled TRIM burstiness under heavy load.

With this change random 4K write benchmarks (probably the worst case for
TRIM) show me IOPS increase by 20%, average latency reduction by 30%, peak
TRIM bursts reduction by 3 times and same peak TRIM map size (memory usage).

Also the new logic does not force map size down so heavily, really allowing
to keep deleted data for 32 TXG or 30 seconds under moderate load.  It was
practically impossible with old throttling logic, which pushed map down to
only 64 segments.

Reviewed by:	smh
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2015-01-14 09:39:57 +00:00
Alexander Motin
5b3a65823d Skip extra bcopy() when scrubbing vdev without redundancy.
According to profiler, this bcopy() can use about 10% of CPU time.

MFC after:	2 weeks
2015-01-12 22:38:55 +00:00
Alexander Motin
f5b85f6551 When aggregating TRIM segments, move the new one to the list end.
New segment at the list head may block all TRIM requests until txg of that
segment can be processed.  On my random I/O tests this change reduces peak
TRIM list length from 650 to 450 segments.  Hopefully it should reduce TRIM
burstiness when list processing is unblocked.

MFC after:	2 weeks
2015-01-11 16:36:39 +00:00
Alexander Motin
2de874ed23 Add LBA as secondary sort key for synchronous I/O requests.
On FreeBSD gethrtime() is implemented via getnanouptime(), which has 1ms (1/hz)
precision.  It makes primary sort key (timestamp) collision very possible.
In such situations sorting by secondary key of LBA is much more reasonable
than by totally meaningless zio pointer value.

With this change on multi-threaded synchronous ZVOL read I've measured 10%
throughput increase and average latency reduction.

MFC after:	2 weeks
2015-01-11 00:26:18 +00:00
Alexander Motin
13ea8106d9 Use new optimized dmu_read_uio_dbuf() for ZVOLs in device mode.
This slightly reduces overhead by avoiding dnode_hold()/dnode_rele() calls.

MFC after:	2 weeks
2015-01-10 18:28:58 +00:00
Steven Hartland
8de799ea3a Correct zpool list displaying invalid EXPANDSZ for unavailable pool vdevs
When pools are unavailable their vdevs are also unavailable which means
that vdev_max_asize remains at the default zero. This default was being
used to calculate vs_esize resulting in a negative number as vdev_asize >
vdev_max_asize, which caused zpool list -v to display 16.0E for EXPANDSZ
of these vdevs.
2014-12-31 04:54:48 +00:00
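The arithmetic behind the bogus 16.0E value can be sketched as below. With unsigned 64-bit sizes, subtracting a larger allocated size from a zero maximum wraps around to a value just under 2^64 bytes, which is 16 EiB. The function names are hypothetical, and the guarded form is only one possible way to avoid the wrap; the message does not say how the fix is implemented.

```c
#include <assert.h>
#include <stdint.h>

/* Unchecked form: wraps around when max_asize is smaller than asize. */
uint64_t
sketch_esize_unchecked(uint64_t max_asize, uint64_t asize)
{
	return (max_asize - asize);
}

/* Guarded form: report no expandable space instead of wrapping. */
uint64_t
sketch_esize_guarded(uint64_t max_asize, uint64_t asize)
{
	uint64_t esize = (max_asize > asize) ? max_asize - asize : 0;

	assert(esize <= max_asize);
	return (esize);
}
```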
Mark Johnston
cafe874475 Restore the trap type argument to the DTrace trap hook, removed in r268600.
It's redundant at the moment since it can be obtained from the trapframe
on the architectures where DTrace is supported, but this won't be the case
with ARM.
2014-12-23 15:38:19 +00:00
Steven Hartland
51f529b50b Always sync the global ZFS config cache to reflect the new mosconfig
This fixes out of date zpool.cache for root pools, which can cause issues
such as confusion of zdb etc.

MFC after:	1 month
2014-12-23 09:31:24 +00:00
Steven Hartland
6831cf6aa6 Fix panic when resizing ZFS zvol's
Resizing a ZFS ZVOL with debug enabled would result in a panic due to
recursion on dp_config_rwlock.

The upstream change "3464 zfs synctask code needs restructuring" changed
zvol_set_volsize to avoid the recursion on dp_config_rwlock, but this was
missed when originally merged in by r248571 due to significant differences
in our codebases in this area.

These changes also relied on bringing in changes from upstream:
3557 dumpvp_size is not updated correctly when a dump zvol's size is
changed, which were also not present.

In order to help prevent future issues in this area, a direct comparison
and diff minimisation against the current upstream version (b515258) of
zvol.c was performed.

Differential Revision:	https://reviews.freebsd.org/D1302
MFC after:	1 month
X-MFC-With:	r276063 & r276066
Sponsored by:	Multiplay
2014-12-22 18:39:38 +00:00
Steven Hartland
08c0f91ecc Refactor zvol locking to minimise diff with upstream
Use #define zfsdev_state_lock spa_namespace_lock instead of replacing all
zfsdev_state_lock with spa_namespace_lock to minimise changes from upstream.

Differential Revision:	D1302
MFC after:	1 month
X-MFC-With	r276063
Sponsored by:	Multiplay
2014-12-22 17:04:51 +00:00
Steven Hartland
39c478eac4 Standardise on illumos for #ifdef's in zvol.c
Also correct as per style(9) on the use of #ifdef comments.

This is a no-op change as pre-cursor to a full cleanup and merge with
upstream zvol changes.

Sponsored by:	Multiplay
2014-12-22 16:38:29 +00:00
Konstantin Belousov
789bdfdbc6 Handle MAKEENTRY cnp flag in the VOP_CREATE(). Curiously, some
fs, e.g. smbfs, already did it.

Tested by:	pho (previous version)
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-12-21 13:29:33 +00:00
Xin LI
62722f7d5b Add missing continue: we can't proceed further if the
kernel does not panic with zfs_panic_recover.

Illumos issue:
    5438 zfs_blkptr_verify should continue after zfs_panic_recover

Reported by:	Coverity
CID:		1232014
2014-12-19 00:20:29 +00:00
Xin LI
7d073f8411 MFV r275914:
As of r270383, the dbuf_compare comparator compares the dbuf
attributes in the following order:

	db_level (indirect level)
	db_blkid (block number)
	db_state (current state)
	the address of the element

Because db_state is being considered before the element's address,
changing of db_state would affect balancedness of the AVL tree,
even when the address of element compares differently.  For
instance, in dbuf_create, db_state may be altered after the
node is inserted into the AVL tree and may break AVL tree
balancedness.

Instead of using db_state as a comparison criterion (introduced
in r270383), consider it only when we are doing a lookup, that
is one of the two dbuf pointers contains DB_SEARCH.

Illumos issue:
    5422 preserve AVL invariants in dn_dbufs

MFC after:	2 weeks
2014-12-18 23:45:26 +00:00
Konstantin Belousov
6c21f6edb8 The VOP_LOOKUP() implementations for CREATE op do not put the name
into namecache, to avoid cache trashing when doing large operations.
E.g., tar archive extraction is not usually followed by access to many
of the files created.

Right now, each VOP_LOOKUP() implementation explicitly knows about
this quirk and tests for both MAKEENTRY flag presence and op != CREATE
to make the call to cache_enter().  Centralize the handling of the
quirk into VFS, by deciding to cache only by MAKEENTRY flag in VOP.
VFS now sets NOCACHE flag for CREATE namei() calls.

Note that the change in semantic is backward-compatible and could be
merged to the stable branch, and is compatible with non-changed
third-party filesystems which correctly handle MAKEENTRY.

Suggested by:	Chris Torek <torek@pi-coral.com>
Reviewed by:	mckusick
Tested by:	pho
Sponsored by:	The FreeBSD Foundation
MFC after:	2 weeks
2014-12-18 10:01:12 +00:00
Xin LI
a771fba68c MFV r275783:
Convert ARC flags to use enum.  Previously, public flags were defined in
arc.h and private flags were defined in arc.c, which can lead to confusion
and programming errors.

Consistently use 'hdr' (when referencing arc_buf_hdr_t) instead of 'buf'
or 'ab' because arc_buf_t are often named 'buf' as well.

Illumos issue:
    5369 arc flags should be an enum
    5370 consistent arc_buf_hdr_t naming scheme

MFC after:	2 weeks
2014-12-15 18:22:45 +00:00
Xin LI
7728b3d28c MFV r275551:
Remove "dbuf phys" db->db_data pointer aliases.

Use function accessors that cast db->db_data to the appropriate
"phys" type, removing the need for clients of the dmu buf user
API to keep properly typed pointer aliases to db->db_data in order
to conveniently access their data.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap_leaf.c:
        In zap_leaf() and zap_leaf_byteswap, now that the pointer alias
        field l_phys has been removed, use the db_data field in an on
        stack dmu_buf_t to point to the leaf's phys data.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c:
        Remove the db_user_data_ptr_ptr field from dbuf and all logic
        to maintain it.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dbuf.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dnode.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dbuf.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dmu.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dataset.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dir.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sa.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap_micro.c:
        Modify the DMU buf user API to remove the ability to specify
        a db_data aliasing pointer (db_user_data_ptr_ptr).

cddl/contrib/opensolaris/cmd/zdb/zdb.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_diff.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_objset.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_send.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_traverse.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu_tx.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_bookmark.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dataset.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_deadlist.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_deleg.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_destroy.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_dir.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_pool.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_prop.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_scan.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_synctask.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_userhold.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sa.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_history.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap_leaf.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zap_micro.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_ioctl.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_dataset.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_dir.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zap_impl.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zap_leaf.h:
        Create and use the new "phys data" accessor functions
        dsl_dir_phys(), dsl_dataset_phys(), zap_m_phys(),
        zap_f_phys(), and zap_leaf_phys().

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_dataset.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/dsl_dir.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zap_impl.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zap_leaf.h:
        Remove now unused "phys pointer" aliases to db->db_data
        from clients of the DMU buf user API.

Illumos issue:
    5314 Remove "dbuf phys" db->db_data pointer aliases in ZFS

MFC after:	2 weeks
2014-12-15 07:52:23 +00:00
Xin LI
a676454a98 MFV r275550:
In addition to r273158, make the code in spa_sync() that checks if the
current TXG is a no-op TXG less fragile.

Illumos issue:
    5347 idle pool may run itself out of space

MFC after:	2 weeks
2014-12-15 05:10:55 +00:00
Xin LI
51e527396b MFV r275549:
Add a loader tunable, vfs.zfs.arc_meta_min, which controls how much metadata
ZFS should keep in ARC at minimum.

In arc_evict(), when doing recycle, take more factors into account by
applying the following policy:

  1. If no evictable data, evict metadata;
  2. If no evictable metadata, evict data;
  3. If we hit arc_meta_limit, evict metadata;
  4. If we haven't hit arc_meta_min, evict data;
  5* (Illumos only, not present in new FreeBSD code, yet) evict the oldest
     cached element from data and metadata.
     (FreeBSD) evict the data type specified by caller, which is the
     existing behavior.

Note that because of our split locks (implemented in r205231 to improve
scalability by reducing lock contention), implementing the fifth Illumos
behavior will not be cheap, so for now just implement the 1-4 and fall back
to current behavior for 5.
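As a rough sketch of rules 1-4 with the FreeBSD fallback, the selection could look like the function below. The struct, its field names, and the exact comparisons ("hit" read as >= and "haven't hit" read as <) are assumptions made for illustration; they are not taken from arc.c.

```c
#include <stdint.h>

enum evict_type { EVICT_DATA, EVICT_METADATA };

/* Hypothetical snapshot of the counters the policy consults. */
struct arc_snapshot {
	uint64_t evictable_data;
	uint64_t evictable_metadata;
	uint64_t meta_used;
	uint64_t meta_limit;
	uint64_t meta_min;
};

/* Pick what to evict; 'requested' is the type the caller asked for. */
static enum evict_type
choose_evict_type(const struct arc_snapshot *s, enum evict_type requested)
{
	if (s->evictable_data == 0)		/* rule 1 */
		return (EVICT_METADATA);
	if (s->evictable_metadata == 0)		/* rule 2 */
		return (EVICT_DATA);
	if (s->meta_used >= s->meta_limit)	/* rule 3 */
		return (EVICT_METADATA);
	if (s->meta_used < s->meta_min)		/* rule 4 */
		return (EVICT_DATA);
	return (requested);	/* FreeBSD fallback in place of rule 5 */
}
```

The caller's requested type only matters when none of the first four rules applies, which matches the "fall back to current behavior" wording.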

Illumos issue:
    5368 ARC should cache more metadata

MFC after:	2 months (assuming we don't find a better solution)
2014-12-15 04:51:36 +00:00
Xin LI
5571af1d8f MFV r247174:
Expose arc_meta_limit, et al via kstats.

Note that as a result, vfs.zfs.arc_meta_used is removed.
The existing vfs.zfs.arc_meta_limit sysctl/tunable is retained
with a SYSCTL_PROC wrapper.

Illumos ZFS issues:
    3561 arc_meta_limit should be exposed via kstats

Relnotes:	yes
MFC after:	2 weeks
2014-12-13 19:17:28 +00:00
Xin LI
d25a5fd878 MFV r275548:
Verify that the block pointer is structurally valid, before attempting to
read it in.  It can only be invalid in the case of a ZFS bug, but this
change will help identify such bugs in a more transparent way, by
panic'ing with a relevant message, rather than indexing off the end of an
array or something.

Illumos issue:
    5349 verify that block pointer is plausible before reading

MFC after:	2 weeks
2014-12-13 02:08:18 +00:00
Xin LI
ca7354fa07 MFV r275546:
Reduce scrub activity when there is enough dirty data, namely when
dirty data is more than zfs_vdev_async_write_active_min_dirty_percent (once
we start to increase the number of concurrent async writes).

While there, also correct a rounding error which would make scrub end up
pausing for (zfs_txg_timeout + 1) seconds instead of the desired
zfs_txg_timeout seconds.

Illumos issue:
    5351 scrub goes for an extra second each txg
    5352 scrub should pause when there is some dirty data

MFC after:	2 weeks
2014-12-13 01:39:24 +00:00
Xin LI
8b2dbf8af3 MFV r275545:
If zio_checksum_error() returns other than ECKSUM (e.g. EINVAL), it does not
fill in the "zio_bad_cksum_t *info" parameter. Caller should not attempt to
use it in this case.

Illumos issue:
    5348 zio_checksum_error() only fills in info if ECKSUM

MFC after:	2 weeks
2014-12-13 01:26:06 +00:00
Xin LI
08fda6cd66 MFV r275544:
Clean up some duplicated code in dnode_sync() around freeing spill blocks.

Illumos issue:
    5350 clean up code in dnode_sync()

MFC after:	2 weeks
2014-12-13 01:18:23 +00:00
Xin LI
48438ce4c4 MFV r275543:
Remove always true tests for ds->ds_phys' presence.

Clean up assertions in dsl_dataset_disown.

Remove unreachable code in dsl_dataset_disown().

Illumos issue:
    5310 Remove always true tests for non-NULL ds->ds_phys

MFC after:	2 weeks
2014-12-13 01:14:59 +00:00
Xin LI
c90d7474a8 MFV r275542:
If a dnode has a spill block and there is an error while accessing
a data block then traverse_dnode() loses information about that error
and returns a status of visiting the spill block.

This issue was discovered by Spectra Logic.

Illumos issue:
    5311 traverse_dnode may report success when it should not

Original author:	gibbs
MFC after:		2 weeks
2014-12-13 01:10:17 +00:00
Xin LI
c2161091ad MFV r275540:
When importing a pool, don't assume that the pool configuration passed
to vdev_load is always valid.  A stale configuration may come with
extra vdevs, in which case metaslab_init() would fail because a lower
layer returns an error.

Change the code so that metaslab_init() returns errors from the lower
layer to the upper layer, which now handles them.

Illumos issue:
    5213 panic in metaslab_init due to space_map_open returning ENXIO

MFC after:	2 weeks
2014-12-08 06:04:42 +00:00
Andriy Gapon
036a8c5dac remove opensolaris cyclic code, replace with high-precision callouts
In the old days callout(9) had 1 tick precision and that was inadequate
for some uses, e.g. DTrace profile module, so we had to emulate cyclic
API and behavior.  Now we can directly use callout(9) in the very few
places where cyclic was used.

Differential Revision:	https://reviews.freebsd.org/D1161
Reviewed by:	gnn, jhb, markj
MFC after:	2 weeks
2014-12-07 11:21:41 +00:00
Andrew Turner
f9307cced7 Apply the same fix in r274697 to the ARM case. 2014-12-06 12:03:09 +00:00
Xin LI
54f76dcb4a MFV r275535:
Unexpand ISP2() and MSEC2NSEC().
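For readers unfamiliar with the macros, the snippet below contrasts an open-coded power-of-two test with the macro form. The macro bodies are written from memory of the usual illumos definitions and should be treated as assumptions; the helper function names are hypothetical.

```c
#include <stdint.h>

/* Assumed definitions, modeled on the illumos headers. */
#define	ISP2(x)		(((x) & ((x) - 1)) == 0)
#define	MSEC2NSEC(m)	((int64_t)(m) * (1000000000LL / 1000LL))

/* Open-coded form of the kind this cleanup replaces. */
static int
is_p2_open_coded(uint64_t x)
{
	return ((x & (x - 1)) == 0);
}

/* The same test expressed through the macro. */
static int
is_p2_macro(uint64_t x)
{
	return (ISP2(x));
}

/* Millisecond value converted without a hand-written multiplier. */
static int64_t
timeout_ns(int msec)
{
	return (MSEC2NSEC(msec));
}
```

Note that this form of the test also reports 0 as a power of two, so callers that care must check for zero separately.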

Illumos issue:
    5255 uts shouldn't open-code ISP2

MFC after:	2 weeks
2014-12-06 09:38:28 +00:00
Xin LI
6054c38913 MFV r275534:
Sync with Illumos.  This has no effect on FreeBSD.

Illumos issue:
    5285 pass in cpu_pause_func via pause_cpus

MFC after:	2 weeks
2014-12-06 09:14:46 +00:00
Xin LI
d4548c2e8e MFC r275533:
Sync with Illumos.  This has no effect on FreeBSD.

Illumos issue:
    5100 sparc build failed after 5004

MFC after:	2 weeks
2014-12-06 09:11:13 +00:00
Xin LI
4603a0aeb2 Use %d instead of %u for error number. This way we see ERESTART as -1
not 4294967295 when doing DTrace.
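The effect is easy to reproduce in userland C, assuming a 32-bit int as on FreeBSD's supported platforms. The helper name below is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Render the same error number with %u and with %d.
 * Returns 1 if the two renderings differ, 0 otherwise.
 */
static int
renderings_differ(int error, char *as_u, char *as_d, size_t len)
{
	snprintf(as_u, len, "%u", (unsigned int)error);
	snprintf(as_d, len, "%d", error);
	return (strcmp(as_u, as_d) != 0);
}
```

For non-negative error numbers the two formats agree, which is why the problem only shows up for kernel-internal negative values such as ERESTART.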

MFC after:	2 weeks
2014-12-05 22:56:10 +00:00
Xin LI
26f96d922b Fix a regression introduced in r274337 (large block support)
In dsl_dataset_hold_obj() we used zap_contains(.., DS_FIELD_LARGE_BLOCKS)
to determine whether the extensible (zapified) dataset has large blocks.
The code expects the result to be either 0 (found) or ENOENT (not found),
but it reused the variable 'err', which later code expects to be 0.

Fix this by adopting a code construct similar to the one used later for
DS_FIELD_BOOKMARK_NAMES, which uses a temporary variable zaperr to catch
errors from zap_* routines.
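The shape of the bug and of the fix can be sketched as follows. The stub stands in for zap_contains() and the two wrapper functions are hypothetical; only the pattern of keeping the probe result in its own variable reflects the commit.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for zap_contains(): 0 if found, ENOENT if not. */
static int
zap_contains_stub(bool present)
{
	return (present ? 0 : ENOENT);
}

/* Buggy shape: the probe result overwrites the function's status. */
static int
hold_buggy(bool field_present, bool *large)
{
	int err;

	err = zap_contains_stub(field_present);
	if (err == 0)
		*large = true;
	return (err);	/* spurious ENOENT when the field is merely absent */
}

/* Fixed shape: the probe result lives in a separate variable. */
static int
hold_fixed(bool field_present, bool *large)
{
	int err = 0;
	int zaperr = zap_contains_stub(field_present);

	if (zaperr == 0)
		*large = true;
	else if (zaperr != ENOENT)
		err = zaperr;	/* a real failure still propagates */
	return (err);
}
```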

Reported by:	Peter J. Creath (on FreeNAS; FreeNAS bug #6848)
Illumos issue:	5393 spurious failures from dsl_dataset_hold_obj()
Reviewed by:	mahrens
Sponsored by:	iXsystems, Inc.
X-MFC with:	r274337
2014-12-05 18:29:01 +00:00
Alexander Motin
ef8daf3fed Add GET LBA STATUS command support to CTL.
It is implemented for LUNs backed by ZVOLs in "dev" mode and files.
GEOM has no such API, so for LUNs backed by raw devices all LBAs will
be reported as mapped/unknown.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2014-12-04 11:34:19 +00:00
Andriy Gapon
782c06dfc8 zfs_putpages: actually update mtime and ctime
Reported by:	Paul Koch <paul.koch@akips.com>
Tested by:	Paul Koch <paul.koch@akips.com>
MFC after:	2 weeks
2014-12-02 11:44:56 +00:00
Xin LI
d10e52d627 Revert r273060 per discussion with avg@ as we need to make L2ARC
aware of 4K devices and this one is not the right fix anyway.
2014-11-26 02:20:25 +00:00
Dimitry Andric
c577699e4b Fix the following -Werror warning from clang 3.5.0, while building cddl/lib/libctf:
In file included from cddl/contrib/opensolaris/common/ctf/ctf_create.c:31:
In file included from sys/cddl/contrib/opensolaris/uts/common/sys/sysmacros.h:34:
sys/cddl/contrib/opensolaris/uts/common/sys/isa_defs.h:334:9: warning: '_ILP32' macro redefined [-Wmacro-redefined]
#define _ILP32
        ^
<built-in>:26:9: note: previous definition is here
#define _ILP32 1
        ^
1 warning generated.

This is because clang 3.5.0 started predefining _ILP32 and __ILP32__ for
the i386 arch.  (Earlier versions already predefined _LP64 and __LP64__
for the x86_64 arch.)

Reviewed by:	emaste, avg, smh, delphij, markj
Differential Revision: https://reviews.freebsd.org/D1187
2014-11-19 07:44:21 +00:00
Xin LI
8c3d6a4ab2 Make vfs.zfs.max_recordsize read-write at runtime.
MFC after:	2 weeks
2014-11-18 22:35:19 +00:00
Xin LI
8efcd876ca Add a tunable for spa_slop_shift which controls how much space we
would reserve by default.  Tuning is not recommended.

MFC after:	2 weeks
2014-11-18 18:52:38 +00:00
Xin LI
18144ab1a3 Allow tuning zfs_max_recordsize via loader tunable. Tuning is NOT
recommended.

Requested by:	Slawa Olhovchenkov <slw zxy spb ru>
MFC after:	2 weeks
2014-11-18 18:40:01 +00:00
Andriy Gapon
2c51c83bc8 l2arc: restore correct rounding up of asize of compressed data
This rounding up was lost in a mismerge of illumos code.
See r268075 MFV r267565.
After that commit zio_compress_data() no longer performs any compressed
size adjustment, so it needs to be done externally.  On FreeBSD we round
up the size using vdev_ashift rather than SPA_MINBLOCKSIZE so that 4KB
devices are properly supported.

Additionally, zero out the buffer tail only if compression succeeds.
The compression is considered successful if the size of compressed
data after rounding up to account for the vdev ashift is less than the
original data size.  It does not make sense to have the data compressed
if all the savings are lost to rounding up.
With the new zio_compress_data() it could have been possible that the
rounded compressed size would be greater than the original size and thus
we could zero beyond the allocated buffer if the zeroing code was kept
at the original place.
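The success criterion described above can be sketched like this. The function names are hypothetical, and the special handling the real code has for all-zero buffers is not modeled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Round x up to a multiple of align; align must be a power of two. */
static uint64_t
roundup_p2(uint64_t x, uint64_t align)
{
	return ((x + align - 1) & ~(align - 1));
}

/*
 * Decide whether to keep the compressed copy.  csize is the raw
 * compressed size, ashift the device's sector shift.  On success
 * *asize receives the size that would actually be written.
 */
static bool
l2_keep_compressed(uint64_t orig_size, uint64_t csize, unsigned ashift,
    uint64_t *asize)
{
	uint64_t rounded = roundup_p2(csize, (uint64_t)1 << ashift);

	if (rounded >= orig_size)
		return (false);	/* savings lost to rounding */
	*asize = rounded;	/* only now is zeroing the tail safe */
	return (true);
}
```

For an 8 KB buffer that compresses to 4097 bytes, a 512-byte sector device keeps the compressed copy (4608 bytes), while a 4 KB sector device does not, because rounding brings it back to 8192 bytes.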

Discussed with:	delphij, gibbs
MFC after:	2 weeks
X-MFC with:	r274627
2014-11-17 14:45:42 +00:00
Andriy Gapon
0908b20b7e Revert r269093 which introduced physical zio alignment transform
Size of physical ZIOs must never be implicitly adjusted, it's
a responsibility of a caller to make sure that such a ZIO has proper offset
and size.

Discussed with:	delphij, gibbs
MFC after:	2 weeks
2014-11-17 14:16:02 +00:00
Steven Hartland
a559adfbce Disable TRIM on file backed ZFS vdevs and fix TRIM on init
After r265152 TRIM requests are ZIO_TYPE_FREE instead of ZIO_TYPE_IOCTL.
This meant file backed vdevs attempted to process the ZIO as a write,
causing a panic.

We now disable TRIM on file backed vdevs and ASSERT the ZIO types supported
by each vdev type to ensure we explicitly support the ZIO type being
processed.

Also ensure that TRIM on init is not processed for devices which declare they
don't support TRIM via vdev_notrim.

PR:		195061, 194976, 191573
Sponsored by:	Multiplay
2014-11-17 11:32:10 +00:00
Konstantin Belousov
6e646651d3 Remove the no-at variants of the kern_xx() syscall helpers. E.g., we
have both kern_open() and kern_openat(); change the callers to use
kern_openat().

This removes one (sometimes two) levels of indirection and
consolidates arguments checks.

Reviewed by:	mckusick
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-11-13 18:01:51 +00:00
Xin LI
8bcd603968 MFV r274273:
ZFS large block support.

Please note that booting from datasets that have recordsize greater
than 128KB is not supported (but it's Okay to enable the feature on
the pool).  This *may* remain unchanged because of memory constraints.

A limited safety belt is provided for the mounted root filesystem, but
caution is advised.

Illumos issue:
    5027 zfs large block support

MFC after:	1 month
2014-11-10 08:20:21 +00:00
Xin LI
42350b6bde MFV r274272 and diff reduction with upstream.
Illumos issue:
    5244 zio pipeline callers should explicitly invoke next stage

Tested with:	ztest plus ZFS over GELI configuration
MFC after:	1 month
2014-11-09 07:37:00 +00:00
Xin LI
81f1255e58 MFV r274271:
Improve zdb -b performance:

 - Reduce gethrtime() call to 1/100th of blkptr's;
 - Skip manipulating the size-ordered tree;
 - Issue more (10, previously 3) async reads;
 - Use lighter weight testing in traverse_visitbp();

Illumos issue:
    5243 zdb -b could be much faster

MFC after:	2 weeks
2014-11-08 07:30:40 +00:00
Andriy Gapon
2fd3cc0cb2 fix l2arc compression buffers leak
We have observed that arc_release() can be called concurrently with a
l2arc in-flight write.
Also, we have observed that arc_hdr_destroy() can be called from
arc_write_done() for a zio with ZIO_FLAG_IO_REWRITE flag in similar
circumstances.

Previously the l2arc headers would be freed while leaking their
associated compression buffers.  Now the buffers are placed on
l2arc_free_on_write list for delayed freeing.  This is similar to what
was already done to arc buffers that were supposed to be freed
concurrently with in-flight writes of those buffers.

In addition to fixing the discovered leaks this change also adds some
protective code to assert that a compression buffer associated with a
l2arc header is never leaked.

A new kstat l2_cdata_free_on_write is added.  It keeps a count of
delayed compression buffer frees which previously would have been leaks.

Tested by:	Vitalij Satanivskij <satan@ukr.net> et al
Requested by:	many
MFC after:	2 weeks
Sponsored by:	HybridCluster / ClusterHQ
2014-11-06 11:08:02 +00:00
Alexander Motin
c3e7ba3e6d Add to CTL support for logical block provisioning threshold notifications.
For ZVOL-backed LUNs this allows informing initiators when the storage's
used or available space rises above or falls below the configured thresholds.

MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2014-11-06 00:48:36 +00:00
Josh Paetzel
14127f5b21 This change addresses 4 bugs in ZFS exposed by Richard Kojedzinszky's
crash.sh script attached to FreeNAS bug 4109:
https://bugs.freenas.org/issues/4109

Three are in the snapshot layer:
a) AVG explains in his notes: https://wiki.freebsd.org/AvgVfsSolarisVsFreeBSD

"VOP_INACTIVE must not do any destructive actions to a vnode
and its filesystem node, nor invalidate them in any way."
gfs_vop_inactive and zfsctl_snapshot_inactive did just that. In
OpenSolaris VOP_INACTIVE is much closer to FreeBSD's VOP_RECLAIM.
Rename & move them to gfs_vop_reclaim and zfsctl_snapshot_reclaim
and merge in the requisite vnode_destroy from zfsctl_common_reclaim.

b) gfs_lookup_dot and various zfsctl functions do not honor the
FreeBSD VFS convention of only locking from the root downward. When
looking up ".." the convention is to drop the current leaf vnode lock before
acquiring the directory vnode and then subsequently re-acquiring the lock on the
leaf vnode. This fixes that in all the places that are exercised by crash.sh.

c) The snapshot may already be unmounted when the directory vnode is reclaimed.
Check for this case and return.

One in the common layer:
d) Callers of traverse expect the reference to the vnode passed in to be
maintained. Don't release it.

This last one may be an unclear contract. There may in fact be some callers that
do expect the reference to be dropped on success in addition to callers that
expect it to be released. In this case a further audit of the callers is needed
and a consensus on the correct behavior.

PR:	184677
Submitted by:	kmacy
Reviewed by:	delphij, will, avg
MFC after:	2 weeks
Sponsored by:	iXsystems
2014-10-25 17:42:44 +00:00
Justin Hibbits
3ff2096995 Whitespace
X-MFC-with:	r273570
MFC after:	1 week
2014-10-24 03:34:21 +00:00
Justin Hibbits
24d5dfb116 Three updates to PowerPC FBT:
* Use a constant to define the number of stack frames in a probe exception.
* Only allow function symbols in powerpc64 ('.' prefixed)
* Set the fbtp_roffset for return probes, so the correct dtrace_probe call is
  made.

MFC after:	1 week
2014-10-24 03:33:01 +00:00
Hans Petter Selasky
f0188618f2 Fix multiple incorrect SYSCTL arguments in the kernel:
- Wrong integer type was specified.

- Wrong or missing "access" specifier. The "access" specifier
sometimes included the SYSCTL type, which it should not, except for
procedural SYSCTL nodes.

- Logical OR where binary OR was expected.

- Properly assert the "access" argument passed to all SYSCTL macros,
using the CTASSERT macro. This applies to both static- and dynamically
created SYSCTLs.

- Properly assert the data type for both static and dynamic
SYSCTLs. In the case of static SYSCTLs we only assert that the data
pointed to by the SYSCTL data pointer has the correct size, since
there is no easy way to assert types in the C language outside a
C-function.

- Rewrote some code which doesn't pass a constant "access" specifier
when creating dynamic SYSCTL nodes, which is now a requirement.

- Updated "EXAMPLES" section in SYSCTL manual page.
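The "logical OR where binary OR was expected" item in the list above is worth a small illustration. The flag values used in the usage note are hypothetical examples rather than the kernel's CTLFLAG_* definitions, and the function names are made up for this sketch.

```c
/* What a mistaken "a || b" yields: a truth value, always 0 or 1. */
static unsigned int
combine_logical(unsigned int a, unsigned int b)
{
	return (a || b);
}

/* What was intended: the union of the two flag bit patterns. */
static unsigned int
combine_bitwise(unsigned int a, unsigned int b)
{
	return (a | b);
}
```

With two example flags 0x80000000 and 0x00040000, the logical form produces 1 while the bitwise form produces 0x80040000, so the mistaken version silently drops both intended flags and sets an unrelated low bit.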

MFC after:	3 days
Sponsored by:	Mellanox Technologies
2014-10-21 07:31:21 +00:00
Xin LI
78701de4b7 Add tunable vfs.zfs.space_map_blksz for space map's maximum block size.
MFC after:	2 weeks
2014-10-18 22:11:10 +00:00
Davide Italiano
2be111bf7d Follow up to r225617. In order to maximize the re-usability of kernel code
in userland rename in-kernel getenv()/setenv() to kern_setenv()/kern_getenv().
This fixes a namespace collision with libc symbols.

Submitted by:   kmacy
Tested by:      make universe
2014-10-16 18:04:43 +00:00
Steven Hartland
ca6505b818 Prevent ZFS leaking pool free space
When processing async destroys ZFS would leak space every txg timeout
(5 seconds by default), if no writes occurred, until the pool is totally
full. At this point it would be unfixable without a pool recreation.

In addition, if the machine was rebooted with the pool in this situation,
it would fail to import on boot, hanging indefinitely, as the import process
requires the ability to write data to the pool. Any attempts to query the
pool status during the hung import would not return as the import holds
the pool lock.

The only way to import such a pool would be to specify -o readonly=on
to the zpool import.

zdb -bb <pool> can be used to check for "deferred free" size which is where
this lost space will be counted.

MFC after:	3 days
Sponsored by:	Multiplay
2014-10-16 02:23:27 +00:00
Xin LI
ba6e85e0cf Use write_psize instead of write_asize when doing vdev_space_update.
Without this change the accounting of L2ARC usage would be wrong and
report 16EB of free space because the number becomes negative and wraps around.
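The 16EB figure is what an unsigned 64-bit counter shows after going below zero. The sketch below assumes only that the amounts added and subtracted were measured differently; the function name and the byte counts in the usage note are hypothetical.

```c
#include <stdint.h>

/*
 * Apply one "add on write, subtract on free" cycle to an unsigned
 * usage counter.  If the two sizes disagree and 'removed' is larger,
 * the result wraps around instead of going negative.
 */
static uint64_t
space_after(uint64_t used, uint64_t added, uint64_t removed)
{
	return (used + added - removed);
}
```

Starting from 0, adding 4096 and removing 8192 leaves 2^64 - 4096, which tools then display as roughly 16 exbibytes.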

Obtained from:	FreeNAS (issue #6239)
MFC after:	2 weeks
2014-10-13 20:39:51 +00:00
Xin LI
a4f5b8db9f Add a tunable for arc_shrink_shift (vfs.zfs.arc_shrink_shift) that
controls what fraction, 1/2^arc_shrink_shift, should be reclaimed
when there is memory pressure.
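In other words the reclaim target is a right shift of the current size. A minimal sketch, with a hypothetical function name and no claim about the default shift value:

```c
#include <stdint.h>

/* Amount to reclaim: size / 2^shift, computed as a right shift. */
static uint64_t
reclaim_target(uint64_t size, unsigned int shift)
{
	return (size >> shift);
}
```

A larger shift therefore means gentler shrinking: with a 1 GiB size, a shift of 5 asks for 32 MiB while a shift of 7 asks for 8 MiB.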

Submitted by:	Richard Kojedzinszky <krichy at tvnetwork.hu>
MFC after:	2 weeks
2014-10-13 05:34:10 +00:00
Xin LI
eba15cf463 MFV r272804:
Refactor the code and stop restore_object from creating two transactions.

Illumos issue:
    3693 restore_object uses at least two transactions to restore an object

MFC after:	2 weeks
2014-10-09 07:52:51 +00:00
Xin LI
ce44f14b41 MFV r272803:
Illumos issue:
    5175 implement dmu_read_uio_dbuf() to improve cached read performance

MFC after:	2 weeks
2014-10-09 07:18:40 +00:00
Andriy Gapon
c3d1d2e104 l2arc_write_buffers: reduce headroom value
FreeBSD has ARC_BUFC_NUMMETADATALISTS metadata lists and ARC_BUFC_NUMDATALISTS
data lists (currently both are 16) while illumos has just a single list
of each kind.

headroom determines how much data is scanned on a single list
during each run of the l2arc feed thread.
Because FreeBSD has more lists we proportionally decrease the limit.

Reviewed by:	Brendan Gregg (earlier version)
MFC after:	2 weeks
Sponsored by:	HybridCluster
2014-10-07 16:08:21 +00:00
Andriy Gapon
9f96723ec5 revert r272702: wrong (earlier) change was committed 2014-10-07 16:06:10 +00:00
Andriy Gapon
4c3b02bfce reduce L2ARC_WRITE_SIZE on FreeBSD
FreeBSD has ARC_BUFC_NUMMETADATALISTS metadata lists and ARC_BUFC_NUMDATALISTS
data lists (currently both are 16) while illumos has just a single list
of each kind.

L2ARC_WRITE_SIZE determines the default value of l2arc_write_max which
defines limits on how much data is scanned and written to a cache device
during each run of the l2arc feed thread.  The limits are applied on the
per buffer list basis.
Because FreeBSD has more lists we proportionally reduce the limits.

Reviewed by:	Brendan Gregg (earlier version)
MFC after:	2 weeks
Sponsored by:	HybridCluster
2014-10-07 14:30:24 +00:00
Andriy Gapon
ab26525af2 make userland __assfail from opensolaris compat honor 'aok' variable
This should allow the zdb -A option to actually make a difference.

MFC after:	2 weeks
2014-10-07 14:15:50 +00:00
Xin LI
1b5bcb8425 MFV r272591:
Use loaned ARC buffer for zfs receive to avoid copy.

Illumos issue:
    5162 zfs recv should use loaned arc buffer to avoid copy

MFC after:	2 weeks
2014-10-06 07:29:17 +00:00
Xin LI
8fb26f5aef MFV r272585:
Split the godfather zio into one zio per CPU to reduce lock
contention.

Illumos issue:
    5176 lock contention on godfather zio

MFC after:	2 weeks
2014-10-06 07:03:17 +00:00
Xin LI
dcb20006f0 MFV r272501:
Illumos issue:
    5177 remove dead code from dsl_scan.c

MFC after:	2 weeks
2014-10-06 05:46:51 +00:00
Xin LI
00769ce74d MFV r272500:
Don't inherit flags other than DS_FLAG_CI_DATASET and DS_FLAG_INCONSISTENT
when cloning.  This prevents DS_FLAG_DEFER_DESTROY being inherited from a
clone that is marked for deferred destroy, which causes snapshots of the
clone to be destroyed when getting a hold or clone.

Illumos issue:
    5150 zfs clone of a defer_destroy snapshot causes strangeness

MFC after:	1 week
2014-10-06 05:42:20 +00:00
Xin LI
4bb264ae15 Don't make nested definition for range_seg_cache.
Reported by:	ian
MFC after:	1 week
X-MFC-With:	r272506
2014-10-04 15:42:52 +00:00
Xin LI
4750c382a9 MFV r272499:
Illumos issue:
    5174 add sdt probe for blocked read in dbuf_read()

MFC after:	2 weeks
2014-10-04 08:55:08 +00:00
Xin LI
eb0b70068c Add a new sysctl, vfs.zfs.vol.unmap_enabled, which allows the system
administrator to toggle whether ZFS should ignore UNMAP requests.

Illumos issue:
    5149 zvols need a way to ignore DKIOCFREE

MFC after:	2 weeks
2014-10-04 08:51:57 +00:00
Xin LI
2d36d67c72 Diff reduction with upstream. The code change is not really applicable
to FreeBSD.

Illumos issue:
    5148 zvol's DKIOCFREE holds zfsdev_state_lock too long

MFC after:	1 month
2014-10-04 08:41:23 +00:00
Xin LI
523b4c7fdf MFV r272496:
Add tunable for number of metaslabs per vdev
(vfs.zfs.vdev.metaslabs_per_vdev).  The default remains
at 200.

Illumos issue:
    5161 add tunable for number of metaslabs per vdev

MFC after:	2 weeks
2014-10-04 08:29:48 +00:00
Xin LI
a8d7512709 MFV r272495:
In arc_kmem_reap_now(), reap range_seg_cache too to reclaim memory in
response to memory pressure.

Illumos issue:
    5163 arc should reap range_seg_cache

MFC after:	1 week
2014-10-04 08:14:10 +00:00
Xin LI
8c20e2ff11 MFV r272494:
Make space_map_truncate() always do space_map_reallocate().  Without
this, setting space_map_max_blksz would cause a panic for existing pools,
as dmu_objset_set_blocksize would fail if the object has multiple blocks.

Illumos issues:
   5164 space_map_max_blksz causes panic, does not work
   5165 zdb fails assertion when run on pool with recently-enabled
	spacemap_histogram feature

MFC after:	2 weeks
2014-10-04 08:05:39 +00:00
Steven Hartland
14a0d74ea8 Refactor ZFS ARC reclaim checks and limits
Remove previously added kmem methods in favour of defines which
allow diff minimisation with the upstream code base.

Rebalance ARC free target to be vm_pageout_wakeup_thresh by default,
which eliminates the issue where ARC gets minimised instead of balancing
with VM pageout. This restores the target point prior to r270759.

Bring in missing upstream only changes which move unused code to
further eliminate code differences.

Add additional DTRACE probe to aid monitoring of ARC behaviour.

Enable upstream i386 code paths on platforms which don't define
UMA_MD_SMALL_ALLOC.

Fix mixture of byte and page values in arc_memory_throttle i386 code
path value assignment of available_memory.

PR:		187594
Review:		D702
Reviewed by:	avg
MFC after:	1 week
X-MFC-With:	r270759 & r270861
Sponsored by:	Multiplay
2014-10-03 20:34:55 +00:00
Steven Hartland
99140218aa Fix various issues with zvols
When performing snapshot renames we could deadlock due to the locking
in zvol_rename_minors. In order to avoid this use the same workaround
as zvol_open in zvol_rename_minors.

Add missing zvol_rename_minors to dsl_dataset_promote_sync.

Protect against invalid index into zv_name in zvol_remove_minors.

Replace zvol_remove_minor calls with zvol_remove_minors to ensure
any potential children are also renamed.

Don't fail zvol_create_minors if zvol_create_minor returns EEXIST.

Restore the valid pool check in zfs_ioc_destroy_snaps to ensure we
don't call zvol_remove_minors when zfs_unmount_snap fails.

PR:		193803
MFC after:	1 week
Sponsored by:	Multiplay
2014-10-03 14:49:48 +00:00
Marcelo Araujo
d8a5961f88 Fix failures and warnings reported by newpynfs20090424 test tool.
This fix addresses only issues with the pynfs reports; none of these
issues are known to create problems for extant real clients.

Submitted by:	Bart Hsiao <bart.hsiao@gmail.com>
Reworked by:	myself
Reviewed by:	rmacklem
Approved by:	rmacklem
Sponsored by:	QNAP Systems Inc.
2014-10-03 02:24:41 +00:00
Xin LI
43ac3722ac Diff reduction with kernel code: instruct the compiler that the data of
these types may be unaligned to their "normal" alignment and exercise
caution when accessing them.

PR:		194071
MFC after:	3 days
2014-10-02 00:13:08 +00:00
Will Andrews
fbce0221eb zfsvfs_create(): Refuse to mount datasets whose names are too long.
This is checked for in the zfs_snapshot_004_neg STF/ATF test (currently
still in projects/zfsd rather than head).

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vfsops.c:
- zfsvfs_create(): Check whether the objset name fits into
  statfs.f_mntfromname, and return ENAMETOOLONG if not.  Although
  the filesystem can be unmounted via the umount(8) command, any
  interface that relies on iterating on statfs (e.g. libzfs) will
  fail to find the filesystem by its objset name, and thus assume
  it's not mounted.  This causes "zfs unmount", "zfs destroy",
  etc. to fail on these filesystems, whether or not -f is passed.
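A minimal sketch of such a length check is shown below. The buffer size constant and the function name are hypothetical; the real check compares against the size of statfs.f_mntfromname on the running system.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical size of the mount-from-name buffer, including the NUL. */
#define	MNTFROMNAME_LEN_SKETCH	88

/* Return 0 if the objset name fits in the buffer, else ENAMETOOLONG. */
static int
check_objset_name(const char *osname)
{
	if (strlen(osname) >= MNTFROMNAME_LEN_SKETCH)
		return (ENAMETOOLONG);
	return (0);
}
```

Refusing the mount up front avoids the situation described above, where a mounted filesystem cannot be found again by its truncated name.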

MFC after:	1 month
Sponsored by:	Spectra Logic
MFSpectraBSD:	974872 on 2013/08/09
2014-10-01 14:12:02 +00:00
Xin LI
0b66c7c514 Fix a mismerge in r260183 which prevents snapshot zvol devices being
removed and re-instate the fix in r242862.

Reported by:	Leon Dang <ldang nahannisys com>, smh
MFC after:	3 days
2014-09-30 18:50:45 +00:00
Steven Hartland
8caa3daf35 Remove sys/types.h include as per style(9)
SDT requires sys/param.h due to use of NULL

Reported by:	Garrett
Sponsored by:	Multiplay
2014-09-18 20:38:18 +00:00
Steven Hartland
71f3caaf31 Add dtrace probe support for zfs SET_ERROR(..)
MFC after:	1 week
Sponsored by:	Multiplay
2014-09-18 20:00:36 +00:00
Will Andrews
91dda985cc Remove debug.zfs_flags in favor of the new vfs.zfs.debug_flags.
Replace TUNABLE_INT with CTLFLAG_RWTUN.

Submitted by:	avg (debug.zfs_flags removal), smh (TUNABLE_INT replacement)
2014-09-18 18:46:38 +00:00
Will Andrews
f8c2f66a6c Enable ZFS debug flags to be modified via vfs.zfs.debug_flags.
This is primarily only of interest to ZFS developers, but it makes it
easier to get additional debugging.

Submitted by:	gibbs
MFC after:	1 month
Sponsored by:	Spectra Logic
MFSpectraBSD:	517074 on 2011/12/15 (by will), 662343 on 2013/03/20 (by gibbs)
2014-09-18 16:55:41 +00:00
Will Andrews
cf0a1157d7 Reorder sysctls for spa.c global tunables; add sysctl for ccw_retry_interval.
MFC after:	1 month
Sponsored by:	Spectra Logic
2014-09-18 16:38:03 +00:00
Will Andrews
cf7a096e72 bpobj_iterate_impl(): Close a refcount leak iterating on a sublist.
If bpobj_space() returned non-zero here, the sublist would have been
left open, along with the bonus buffer hold it requires.  This call
does not invoke any calls to bpobj_close() itself.

This bug doesn't have any known vector, but was found on inspection.

MFC after:	1 week
Sponsored by:	Spectra Logic
Affects:	All ZFS versions starting 21 May 2010 (illumos cde58dbc)
MFSpectraBSD:	r1050998 on 2014/03/26
2014-09-18 15:37:53 +00:00
Steven Hartland
d1d469e22b Remove unused ZFS ARC functions
* arc_data_buf_alloc
* arc_data_buf_free

MFC after:	1 week
Sponsored by:	Multiplay
2014-09-18 10:46:51 +00:00
Justin Hibbits
e40a5cd3ec Fix the stack tracing for dtrace/powerpc.
Summary:
Fix the stack tracing for dtrace/powerpc by using the trapexit/asttrapexit
return address sentinels instead of checking within the kernel address space.

As part of this, I had to add new inline functions.  FBT traces the kernel, so
we have to have special case handling for this, since a trap will create a full
new trap frame, and there's no way to pass around the 'real' stack.  I handle
this by special-casing 'aframes == 0' with the trap frame.  If aframes counts
out to the trap frame, then assume we're looking for the full kernel trap frame,
so switch to the real stack pointer.

Test Plan: Tested on powerpc64

Reviewers: rpaulo, markj, nwhitehorn

Reviewed By: markj, nwhitehorn

Differential Revision: https://reviews.freebsd.org/D788

MFC after:	3 weeks
Relnotes:	Yes
2014-09-17 02:43:47 +00:00
Steven Hartland
a889b18c52 Added missing ZFS sysctls
* vfs.zfs.vdev.async_write_active_min_dirty_percent
* vfs.zfs.vdev.async_write_active_max_dirty_percent

Added validation of min / max for ZFS sysctl
* vfs.zfs.dirty_data_max_percent

MFC after:	3 days
2014-09-14 12:23:00 +00:00
Xin LI
f9290bc2c9 MFV r271518:
Correctly report hole at end of file.

When asked to find a hole, the DMU sees that there are no holes in the
object, and returns ESRCH.  The ZPL interprets this as "no holes before
the end of the file", and therefore inserts the "virtual hole" at the
end of the file.  Because DMU and ZPL have different ideas of where the
end of an object/file is, we will end up returning the end of file,
which is generally larger, instead of returning the end of object.

The fix is to handle the "virtual hole" in the DMU. If no hole is found,
the DMU will return a hole at the end of the file, rather than an error.

Illumos issue:
    5139 SEEK_HOLE failed to report a hole at end of file

MFC after:	1 week
2014-09-13 17:48:44 +00:00
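The "virtual hole" behaviour described in f9290bc2c9 above can be illustrated with a small sketch. It is not the DMU code: the object is modelled as a hypothetical array of per-block allocation flags, and `toy_find_hole` and its parameters are names invented for this example. The sketch assumes `start` lies within the object.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Toy model (hypothetical, not the DMU implementation):
 * allocated[i] != 0 means block i of the object holds data.
 * Returns the offset of the first hole at or after 'start'.
 * If the object has no hole, returns the end of the object
 * (the "virtual hole") instead of failing with an error.
 */
static uint64_t
toy_find_hole(const unsigned char *allocated, size_t nblocks,
    uint64_t blksz, uint64_t start)
{
	size_t i;

	for (i = (size_t)(start / blksz); i < nblocks; i++) {
		if (!allocated[i]) {
			uint64_t off = (uint64_t)i * blksz;

			/* A hole that began before 'start' is reported at 'start'. */
			return (off > start ? off : start);
		}
	}
	/* No real hole found: report the virtual hole at end of object. */
	return ((uint64_t)nblocks * blksz);
}
```

Handling the end-of-object case in one place means the caller never has to guess where the lower layer thinks the object ends, which is the mismatch the commit message describes.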
Xin LI
dc147754b7 MFV r271517:
In zil_claim, don't issue warning if we get EBUSY (inconsistent) when
opening an objset, instead, ignore it silently.

Illumos issue:

    5140 message about "%recv could not be opened" is printed when booting after crash

MFC after:	1 week
2014-09-13 17:36:34 +00:00
Xin LI
be1b14a063 MFV r271515:
Add a new tunable/sysctl, vfs.zfs.free_max_blocks, which can be used to
limit how many blocks can be freed before a new transaction group is
created.  The default is no limit (infinite), but we should probably have
a lower default, e.g. 100,000.

With this limit, we can guard against the case where ZFS could run out of
memory when destroying large numbers of blocks in a single transaction
group, as the entire DDT needs to be brought into memory.

Illumos issue:
    5138 add tunable for maximum number of blocks freed in one txg

MFC after:	2 weeks
2014-09-13 17:24:56 +00:00
Xin LI
ff0fc48bde MFV r271512:
Illumos issue:
    5136 fix write throttle comment in dsl_pool.c

MFC after:	2 weeks
2014-09-13 16:51:23 +00:00
Xin LI
263f396e2b MFV r271510:
Enforce 4K as smallest indirect block size (previously the smallest
indirect block size was 1K but that was never used).

This makes some space estimates more accurate and uses less memory
for some data structures.

Illumos issue:
    5141 zfs minimum indirect block size is 4K

MFC after:	2 weeks
2014-09-13 16:26:14 +00:00
Steven Hartland
3cdd9138c3 Persist vdev_resilver_txg changes to avoid panic caused by validation
vs a vdev_resilver_txg value from a previous resilver.

MFC after:	1 week
2014-09-11 16:21:51 +00:00
Gleb Smirnoff
27ad26d8c7 Remove unused arguments for VOP_GETPAGES(), VOP_PUTPAGES(). 2014-09-10 12:36:41 +00:00
Alexander Motin
ee9534ed96 Make ZVOL writes in device mode support IO_SYNC flag.
MFC after:	1 month
2014-09-09 11:29:55 +00:00
Xin LI
817d804595 MFV r271223:
In dnode_sync(), do dnode_increase_indirection() before processing
the dn_next_nblkptr.

Illumos issue:
    5117 space map reallocation can cause corruption

MFC after:	3 days
2014-09-07 13:13:42 +00:00
Peter Wemm
d903c21a64 Move the restored #ifdef i386 test back inside the #ifdef _KERNEL block
where it originally was.
2014-08-31 09:05:02 +00:00
Steven Hartland
92ac3eb59f Ensure that ZFS ARC free memory checks include cached pages
Also restore kmem_used() check for i386 as it has KVA limits that the raw
page counts above don't consider

PR:		187594
Reviewed by:	peter
X-MFC-With: r270759
Review:	D700
Sponsored by:	Multiplay
2014-08-30 21:44:32 +00:00
Mateusz Guzik
6662ce5aab Add missing proctree locking to fill_kinfo_proc consumers.
This fixes r270444.

Pointy hat:	mjg
Reported by:	many
MFC after:	1 week
2014-08-30 03:10:55 +00:00
Steven Hartland
4d19f4ad1f Refactor ZFS ARC reclaim logic to be more VM cooperative
Prior to this change we triggered ARC reclaim when kmem usage passed 3/4
of the total available, as indicated by vmem_size(kmem_arena, VMEM_ALLOC).

This could leave large amounts of RAM unused, e.g. on a 192GB machine with
ARC as the only major RAM consumer, 40GB of RAM would remain unused.

The old method has also been seen to result in extreme RAM usage under
certain loads, causing poor performance and stalls.

We now trigger ARC reclaim when the number of free pages drops below the
value defined by the new sysctl vfs.zfs.arc_free_target, which defaults
to the value of vm.v_free_target.

Credit to Karl Denninger for the original patch on which this update was
based.

PR:		191510 and 187594
Tested by:	dteske
MFC after:	1 week
Relnotes:	yes
Sponsored by:	Multiplay
2014-08-28 19:50:08 +00:00
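The change of trigger described in 4d19f4ad1f above can be sketched as two predicates. This is a minimal illustration only: the function names and parameters are hypothetical stand-ins for the kernel's kmem accounting, its free page count and the vfs.zfs.arc_free_target sysctl, and the exact comparison operators are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Old heuristic as described in the commit message: reclaim once
 * kmem usage passes 3/4 of the total (hypothetical parameters).
 */
static bool
toy_arc_reclaim_old(uint64_t kmem_used, uint64_t kmem_total)
{
	return (kmem_used > (kmem_total / 4) * 3);
}

/*
 * New heuristic: reclaim once the number of free pages drops below
 * a target (standing in for vfs.zfs.arc_free_target).
 */
static bool
toy_arc_reclaim_new(uint64_t free_pages, uint64_t arc_free_target)
{
	return (free_pages < arc_free_target);
}
```

The new predicate depends only on how much memory is actually free, so a machine whose only large consumer is the ARC is not forced to leave a fixed fraction of RAM idle.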
Mark Johnston
35127d3c0f Restore the correct value when disabling probes. Otherwise the instrumented
tracepoints would continue to generate traps, which would be ignored but
could consume noticeable amounts of CPU if, say, all functions in the kernel
were instrumented.

X-MFC-With:	r270067
2014-08-24 17:10:47 +00:00
Xin LI
ec1b564650 Instead of using timestamp in the AVL, use the memory address when
comparing.

Illumos issue:
    5095 panic when adding a duplicate dbuf to dn_dbufs

MFC after:	3 days
2014-08-22 23:13:53 +00:00
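The idea in ec1b564650 above, breaking ties on the memory address rather than on a timestamp, can be sketched as an AVL-style comparator. The structure and its single key field are hypothetical; the real dbuf comparator's key fields are not given in the commit message.

```c
#include <stdint.h>

/* Hypothetical stand-in for a dbuf; only one key field is modelled. */
typedef struct toy_dbuf {
	uint64_t db_blkid;
} toy_dbuf_t;

/*
 * Compare by key first, then by address.  Two distinct live objects
 * never share an address, so the comparator returns 0 only for the
 * same object and the tree can never see a duplicate.
 */
static int
toy_dbuf_compare(const void *x1, const void *x2)
{
	const toy_dbuf_t *d1 = x1;
	const toy_dbuf_t *d2 = x2;

	if (d1->db_blkid < d2->db_blkid)
		return (-1);
	if (d1->db_blkid > d2->db_blkid)
		return (1);
	if ((uintptr_t)d1 < (uintptr_t)d2)
		return (-1);
	if ((uintptr_t)d1 > (uintptr_t)d2)
		return (1);
	return (0);
}
```

Unlike a clock reading, an address cannot collide for two objects that exist at the same time, however coarse the timer is.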
Xin LI
fa4484104c MFV r270197:
Illumos issue:
    5066 remove support for non-ANSI compilation
    5068 Remove SCCSID() macro from <macros.h>

MFC after:	2 weeks
2014-08-22 22:13:36 +00:00
Xin LI
d291a3bd9c Provide compatibility shim for atomic_dec_64_nv.
X-MFC-with:	r270247
MFC after:	13 days
2014-08-21 08:25:46 +00:00
Xin LI
7c1db36b28 MFV r270196:
Illumos issue:
    5047 don't use atomic_*_nv if you discard the return value

MFC after:	2 weeks
2014-08-20 22:39:26 +00:00
Xin LI
249ddb42f6 MFC r270195:
Illumos issue:
    5045 use atomic_{inc,dec}_* instead of atomic_add_*

MFC after:	2 weeks
2014-08-20 21:44:48 +00:00
Xin LI
2bcc37f99c MFV r270193:
Illumos issues:
    5042 stop using deprecated atomic functions

MFC after:	2 weeks
2014-08-20 18:29:18 +00:00
Mark Johnston
266b4a78c2 Factor out the common code for function boundary tracing instead of
duplicating the entire implementation for both x86 and powerpc. This makes
it easier to add support for other architectures and has no functional
impact.

Phabric:	D613
Reviewed by:	gnn, jhibbits, rpaulo
Tested by:	jhibbits (powerpc)
MFC after:	2 weeks
2014-08-16 21:42:55 +00:00
Xin LI
60723bfe21 MFV r269542:
In vdev_get_stats, check that the vdev is not a hole before computing the
fragmentation.  This fixes a panic when removing log device.

Illumos issue:
    5049 panic when removing log device

Author:		Alex Reece <alex@delphix.com>
MFC after:	2 weeks
2014-08-05 00:07:21 +00:00
Mark Johnston
2661328745 Return 0 for the PPID of threads in process 0, as process 0 doesn't have a
parent process.

MFC after:	2 weeks
2014-08-04 19:02:30 +00:00
Xin LI
cd741a5e1d Revert r269404 and use cpu_ticks() for dbuf allocation.
Encode CPU's number by XOR'ing the CPU ID against the 64-bit cpu_ticks().

Reviewed by:	mav, gibbs
Differential Revision: https://phabric.freebsd.org/D521
MFC after:	2 weeks
2014-08-03 09:47:51 +00:00
Xin LI
1dcef10eac MFV r269427:
In dnode_children_t, use C99's "[]" idiom for declaring the variable
sized array dnc_children at the end of the structure.

This prevents the compiler from mistakenly optimizing away accesses
beyond the array's defined size.

Illumos issue:
    5038 Remove "old-style" flexible array usage in ZFS.
    Author: Justin T. Gibbs <justing@spectralogic.com>

MFC after:	2 weeks
2014-08-02 08:34:22 +00:00
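The C99 "[]" idiom mentioned in 1dcef10eac above looks roughly like the following. The structure and allocator are hypothetical illustrations, not the dnode_children_t definition.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical structure ending in a C99 flexible array member. */
typedef struct toy_children {
	size_t	tc_count;
	void	*tc_children[];	/* size chosen at allocation time */
} toy_children_t;

/* Allocate the header plus 'count' trailing slots, all zeroed. */
static toy_children_t *
toy_children_alloc(size_t count)
{
	toy_children_t *tc;

	tc = calloc(1, sizeof (*tc) + count * sizeof (tc->tc_children[0]));
	if (tc != NULL)
		tc->tc_count = count;
	return (tc);
}
```

With the older `[1]` declaration the compiler may assume that only index 0 is valid; the `[]` form declares no fixed bound, so accesses up to the allocated count are well defined.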
Ian Lepore
c311f7078c When arm 64-bit atomic ops are available, define ARM_HAVE_ATOMIC64. Use
that symbol (which will be correct in both kernel and userland contexts)
rather than just __arm__ to decide whether to use a local implementation.
2014-08-02 03:44:27 +00:00
Ian Lepore
814f4c5896 Use the 64-bit atomics now provided by arm machine/atomic.h instead of
(conflicting) local versions.
2014-08-01 23:45:50 +00:00
Steven Hartland
6a369c018c Don't return ZIO_PIPELINE_CONTINUE from vdev_op_io_start methods
This prevents recursion of vdev_queue_io_done as per r265321 but
using a different method as recommended on the openzfs list.

We now use zio_interrupt(zio) and return ZIO_PIPELINE_STOP instead
of returning ZIO_PIPELINE_CONTINUE from vdev_*_io_start methods.

zio_vdev_io_start now ASSERTs that vdev_op_io_start returns
ZIO_PIPELINE_STOP to ensure future changes don't reintroduce
ZIO_PIPELINE_CONTINUE returns.

Cleanup flow in vdev_geom_io_start while I'm here.

Also fix some cases not using SET_ERROR(..)

MFC after:	2 weeks
X-MFC-With:	r265321
2014-08-01 23:16:48 +00:00
Xin LI
125f68e708 Split gethrtime() and gethrtime_waitfree() and make the former use
nanouptime() instead of getnanouptime().  nanouptime(9) provides a more
precise result at the expense of being slower.

In r269223, gethrtime() is used as the creation time of a dbuf, which in turn
acts as part of the lookup key to maintain the AVL invariant that there cannot
be duplicate items.  Before this change, gethrtime() favored execution
time over precision, which may lead to a panic
on busy systems with:

	panic: avl_find() succeeded inside avl_add()

Reported by:	allanjude, mav
PR:		kern/192284
MFC after:	11 days
X-MFC-with:	r269223
2014-08-01 22:33:23 +00:00
Rui Paulo
d18aa577d5 Copy strtolctype.h to sys/cddl/contrib/opensolaris/common/util to keep
the kernel self-contained.

Requested by:	jhb
2014-07-31 08:07:23 +00:00
Xin LI
9b046b421f MFV r269224:
Increase default ARC buf_hash_table size.  When typical block size is small,
the hash table could be too small, which would lead to long hash chains and
limit performance for cached reads.

A new loader tunable, vfs.zfs.arc_average_blocksize, has been added which
allows users to override the default assumption of average (typical) block
size.  Old default was 65536 (64 KiB) and new default is 8192 (8 KiB).

Illumos issue:
    5034 ARC's buf_hash_table is too small

MFC after:	2 weeks
2014-07-29 09:36:48 +00:00
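The effect of the average block size assumption in 9b046b421f above can be sketched as follows. This is an illustration under stated assumptions: the function name, the starting size of 2^12 buckets and the doubling loop are choices made for this example, not a copy of the ARC code.

```c
#include <stdint.h>

/*
 * Pick a power-of-two bucket count large enough that, if every
 * cached block had the assumed average size, there would be about
 * one bucket per block of physical memory.
 * The 2^12 starting point is an assumption of this sketch.
 */
static uint64_t
toy_hash_buckets(uint64_t physmem_bytes, uint64_t avg_blocksize)
{
	uint64_t hsize = 1ULL << 12;

	while (hsize * avg_blocksize < physmem_bytes)
		hsize <<= 1;
	return (hsize);
}
```

Under these assumptions a 16 GiB machine gets 2^18 buckets with the old 64 KiB default and 2^21 buckets with the new 8 KiB default, i.e. eight times as many, which shortens hash chains when typical blocks are small.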
Xin LI
a3cbca537e MFV r269223:
Change dn->dn_dbufs from linked list to AVL tree.

Illumos issues:
  4873 zvol unmap calls can take a very long time for larger datasets

MFC after:	2 weeks
2014-07-29 08:42:22 +00:00
Xin LI
343c95a24e Reschedule the 'deadman' callout after handling; this makes our
code behave more like it does on Solaris.

Reported by:	avg
Reviewed by:	avg, mav (but bugs are mine)

Differential Revision: https://phabric.freebsd.org/D457
2014-07-29 06:57:13 +00:00
Konstantin Belousov
fe0e9a63e0 Initialize zfs vnode v_hash when the vnode is allocated, instead of
postponing it to zfs_vget().  zfs_root() returned vnode with the
default value of v_hash, which caused inconsistent v_hash value when
root vnode was obtained from zfs_vget().

Nullfs allocated two upper vnodes for the root zfs vnode due to
different hashes, causing consistency problems.

Reported and tested by:	Harald Schmalzbauer <h.schmalzbauer@omnilan.de>
Sponsored by:	The FreeBSD Foundation
MFC after:	1 week
2014-07-28 14:24:18 +00:00
Xin LI
50b74c6ef1 Add two sysctls for newly added tunables.
MFC after:	2 weeks
2014-07-26 19:07:08 +00:00
Xin LI
7e37b1e609 MFV r269010:
Import Illumos changes to address the following Illumos issues:
  4976 zfs should only avoid writing to a failing non-redundant
       top-level vdev
  4978 ztest fails in get_metaslab_refcount()
  4979 extend free space histogram to device and pool
  4980 metaslabs should have a fragmentation metric
  4981 remove fragmented ops vector from block allocator
  4982 space_map object should proactively upgrade when feature
       is enabled
  4984 device selection should use fragmentation metric

MFC after:	2 weeks
2014-07-26 10:20:48 +00:00
Alexander Motin
1bc04f6a8c Make sysctls under vfs.zfs.zfetch writeable.
I don't see any reason for them to be read-only, while tuning them without
reboot is much more convenient for experiments.

MFC after:	2 weeks
2014-07-26 09:09:14 +00:00
Xin LI
0aa4ce9b7d Transform the I/O when vdev_physical_ashift is greater than
SPA_MINBLOCKSHIFT.

MFC after:	2 weeks
2014-07-25 18:41:56 +00:00
Xin LI
883d80c104 As of r268075, the responsibility of rounding up the buffer to optimal size has
been transferred from zio_compress_data to its caller.  Therefore, passing
the 'minblocksize' down will be a no-op.

Eliminate the parameter to reduce diff against upstream.

MFC after:	2 weeks
2014-07-25 06:53:20 +00:00
Xin LI
3d4d6b0883 Correct typo introduced with r268855.
MFC after:	10 days
X-MFC with:	r268855
2014-07-22 08:37:01 +00:00
Mark Johnston
5a5f9d21dd Use a C wrapper for trap() instead of checking and calling the DTrace trap
hook in assembly.

Suggested by:	kib
Reviewed by:	kib (original version)
X-MFC-With:	r268600
2014-07-19 02:27:31 +00:00
Xin LI
b4bb49887b Reduce lock contention on the z_teardown_lock under heavily cached
read workload by splitting the single teardown rrw lock into
RRM_NUM_LOCKS (17) of them.

Read acquisitions are randomly distributed among these locks based
on the curthread pointer.  Write acquisitions take all of the
locks, which should be rare for the way this type of lock is used.

Illumos issue:
    5008 lock contention (rrw_exit) while running a read only load

MFC after:	2 weeks
2014-07-19 00:26:03 +00:00
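The reader-side distribution described in b4bb49887b above can be sketched as a mapping from a thread pointer to one of the 17 locks. The macro and function names are hypothetical, and using a plain modulo of the pointer value is an assumption of this sketch.

```c
#include <stdint.h>

#define	TOY_RRM_NUM_LOCKS	17	/* count taken from the commit message */

/*
 * Map a thread pointer to the index of the lock a reader takes.
 * A writer would instead have to take all TOY_RRM_NUM_LOCKS locks.
 */
static unsigned
toy_rrm_lock_index(const void *thread)
{
	return ((unsigned)((uintptr_t)thread % TOY_RRM_NUM_LOCKS));
}
```

A prime lock count keeps aligned pointer values from all landing on the same few indices, so readers on different threads mostly touch different locks.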
Xin LI
82599d31fe MFV r268851:
When a sync task is waiting for a txg to complete, we should hurry it along
by increasing the number of outstanding async writes (i.e. make
vdev_queue_max_async_writes() return a larger number).

Illumos issue:
    4753 increase number of outstanding async writes when sync task is waiting

MFC after:	2 weeks
2014-07-18 22:34:01 +00:00
Xin LI
f886b6e3bc MFV r268850:
Change the interaction between the DMU and ARC so that when the DMU is
shutting down an objset, we do not evict the data from the ARC.  Instead
we simply coordinate the destruction of the DMU's data with the ARC.

The only case where we actually need to explicitly evict from the ARC is
when dbuf_rele_and_unlock() determines that the administrator has requested
that it not be kept in memory, via the primarycache/secondarycache properties.
In this case, we evict the data from the ARC by its blkptr_t, the same way
as when a block is freed we explicitly evict it from the ARC.

Illumos issue:
    4631 zvol_get_stats triggering too many reads

MFC after:	2 weeks
2014-07-18 22:04:21 +00:00
Xin LI
7882b61f60 MFV r268848:
Instead of asserting all zio's be properly aligned, only assert
on the logical ones.

Cap uberblocks at 8k, otherwise with ashift=17, there would be
only one uberblock.

This fixes a problem where zdb would trip an assert on pools with
ashift >= 0xe (8k).

While there, also change the code so it only attempts to condense the
space map when the uncondensed size consumes more than
zfs_metaslab_condense_block_threshold blocks.

Illumos issue:
  4958 zdb trips assert on pools with ashift >= 0xe

MFC after:	2 weeks
2014-07-18 20:41:40 +00:00
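The arithmetic behind the 8k uberblock cap in 7882b61f60 above can be sketched as follows. The constants are assumptions chosen to match the commit message (a 128 KiB uberblock ring, which is what makes ashift=17 yield a single uberblock, and a 1 KiB minimum uberblock size); the names are hypothetical. The uncapped variant assumes ashift <= 17.

```c
#define	TOY_UB_RING_SHIFT	17	/* assumed 128 KiB uberblock ring */
#define	TOY_UB_MIN_SHIFT	10	/* assumed 1 KiB minimum uberblock */
#define	TOY_UB_MAX_SHIFT	13	/* the 8k cap from the commit message */

/* log2 of the uberblock size for a vdev with the given ashift. */
static unsigned
toy_ub_shift(unsigned ashift, int capped)
{
	unsigned shift;

	shift = ashift > TOY_UB_MIN_SHIFT ? ashift : TOY_UB_MIN_SHIFT;
	if (capped && shift > TOY_UB_MAX_SHIFT)
		shift = TOY_UB_MAX_SHIFT;
	return (shift);
}

/* Number of uberblocks that fit in the ring. */
static unsigned
toy_ub_count(unsigned ashift, int capped)
{
	return (1u << (TOY_UB_RING_SHIFT - toy_ub_shift(ashift, capped)));
}
```

Under these assumptions, ashift=17 gives one uberblock without the cap and 16 with it.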
Xin LI
7079d5877c MFV r268714:
Improve extreme rewind import.

When doing an "extreme rewind" import ("zpool import -XF"), we attempt
to verify all data in the pool, essentially scrubbing the entire pool.
The problem is that spa_load_verify_cb() issues an unbounded number of
concurrent scrub i/os.  This can lead to all of memory being used for
these zio's, wedging the system. Like normal scrub, we need to put a
cap on the number of outstanding i/os, and have the traverse thread
block when we reach this cap.

For this purpose the cap can be very large (10,000) to optimize the
elevator algorithm.  Three kernel tunables have been added:

	vfs.zfs.spa_load_verify_maxinflight
	vfs.zfs.spa_load_verify_metadata
	vfs.zfs.spa_load_verify_data

The latter two tunables control whether metadata and/or user data
are verified when doing extreme rewind.

Make 'zpool import -T' imply scrub.

Make zpool import -T <txg> accept hexadecimal values for the txg when
prefixed with 0x.

Skip txg's for which there is no uberblock when doing extreme rewind.

Skip reading all user data twice by skipping prefetches when doing
extreme rewinds as we do not access via the ARC.

Illumos issues:
  4970 need controls on i/o issued by zpool import -XF
  4971 zpool import -T should accept hex values
  4972 zpool import -T implies extreme rewind, and thus a scrub
  4973 spa_load_retry retries the same txg
  4974 spa_load_verify() reads all data twice

MFC after:	2 weeks
2014-07-15 22:44:04 +00:00
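Accepting a 0x-prefixed txg, as 7079d5877c above describes for zpool import -T, can be sketched with the standard strtoull() function and base 0. Whether zpool parses the argument this way is an assumption; `toy_parse_txg` is a hypothetical helper.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse a txg given in decimal or, with a 0x prefix, in hexadecimal.
 * Returns 0 on success and -1 on malformed or out-of-range input.
 * Note that base 0 also treats a leading 0 as octal.
 */
static int
toy_parse_txg(const char *arg, uint64_t *txg)
{
	char *end;
	unsigned long long v;

	errno = 0;
	v = strtoull(arg, &end, 0);
	if (errno != 0 || end == arg || *end != '\0')
		return (-1);
	*txg = (uint64_t)v;
	return (0);
}
```

Checking both `end == arg` and the trailing character rejects empty strings and inputs such as "12z" that strtoull() would otherwise accept in part.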
Xin LI
eb75155228 MFV r268702:
Add missing *_destroy() calls in various places with ZFS.

Illumos issue:
  4975 missing mutex_destroy() calls in zfs

MFC after:	2 weeks
2014-07-15 20:32:23 +00:00
Mark Johnston
291624fdf6 Invoke the DTrace trap handler before calling trap() on amd64. This matches
the upstream implementation and helps ensure that a trap induced by tracing
fbt::trap:entry is handled without recursively generating another trap.

This makes it possible to run most (but not all) of the DTrace tests under
common/safety/ without triggering a kernel panic.

Submitted by:	Anton Rang <anton.rang@isilon.com> (original version)
Phabric:	D95
2014-07-14 04:38:17 +00:00
Xin LI
1b174fa1eb MFV r268455:
Use reserved space for ZFS administrative commands.

We reserve 1/2^spa_slop_shift = 1/32 or 3.125% of pool space (or 32MB at
least) for system use.  Most ZPL operations, e.g. write(2), creat(2), will
fail with ENOSPC if we fall below this.

Certain operations, e.g. file removal and most administrative actions,
are still permitted until half of the slop space is used.  This would allow
users to use these operations to free up space in the pool when the pool is
close to full but half of slop space is still free.

A very restricted set of operations that frees up space or change quota
are always permitted, regardless of the amount of free space.

MFC after:	 2 weeks
2014-07-09 23:14:59 +00:00
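The three tiers of space checks described in 1b174fa1eb above can be sketched as follows. The enum, function names and exact comparisons are hypothetical; only the 1/32 fraction, the 32MB floor and the "half of the slop space" rule come from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical operation classes matching the three tiers. */
enum toy_space_check {
	TOY_CHECK_NORMAL,	/* e.g. write(2), creat(2) */
	TOY_CHECK_RESERVED,	/* file removal, most administrative actions */
	TOY_CHECK_NONE		/* operations that free space or change quota */
};

/* Slop space: 1/2^5 = 1/32 of the pool, but at least 32MB. */
static uint64_t
toy_slop_space(uint64_t pool_bytes)
{
	uint64_t slop = pool_bytes >> 5;
	uint64_t floor = 32ULL << 20;

	return (slop > floor ? slop : floor);
}

/* Decide whether an operation may proceed given the free space left. */
static bool
toy_op_allowed(enum toy_space_check cls, uint64_t pool_bytes,
    uint64_t free_bytes)
{
	uint64_t slop = toy_slop_space(pool_bytes);

	switch (cls) {
	case TOY_CHECK_NORMAL:
		return (free_bytes > slop);
	case TOY_CHECK_RESERVED:
		return (free_bytes > slop / 2);
	case TOY_CHECK_NONE:
		return (true);
	}
	return (false);
}
```

For a 1 TiB pool the sketch reserves 32 GiB: ordinary writes stop once free space falls to that level, while removals keep working until only 16 GiB remains.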
Xin LI
fdc0ee2cf5 MFV r268452:
Explicitly mark file removal transactions as "presumed to result
in a net free of space" so they will not fail with ENOSPC.

Illumos issue:	4950 files sometimes can't be removed from a full
		filesystem
MFC after:	2 weeks
2014-07-09 18:32:40 +00:00
Alexander Motin
e327a057a7 Remove IO_SYNC flag when writing extended file attributes on ZFS.
While it is possible to create and write a file, modify its permissions, etc.
without ever doing a sync, it looks odd that one is required for setting
extended file attributes on ZFS.  UFS does not do a sync there either.

Samba uses those extended attributes to store some of its data, and doing so
synchronously reduces file creation performance many times over on systems
without a SLOG device.

Reviewed by:	delphij, jpaetzel, silence on fs@
MFC after:	2 weeks
Sponsored by:	iXsystems, Inc.
2014-07-08 17:26:08 +00:00
Marcel Moolenaar
e7d939bda2 Remove ia64.
This includes:
o   All directories named *ia64*
o   All files named *ia64*
o   All ia64-specific code guarded by __ia64__
o   All ia64-specific makefile logic
o   Mention of ia64 in comments and documentation

This excludes:
o   Everything under contrib/
o   Everything under crypto/
o   sys/xen/interface
o   sys/sys/elf_common.h

Discussed at: BSDcan
2014-07-07 00:27:09 +00:00
Pedro F. Giffuni
5f40879138 Merge from OpenSolaris (24-Jul-2010):
6679140 asymmetric alloc/dealloc activity can induce dynamic variable drops
6679193 dtrace_dynvar walker produces flood of dtrace_dynhash_sink

This finishes a set of merges from the older OpenSolaris releases.
Still the FreeBSD port has many differences that are difficult to
account for but that seems normal given that the kernels are different.

MFC after:	1 week
2014-07-05 15:36:17 +00:00
Pedro F. Giffuni
99d8c6efbd Merge from OpenSolaris (30-Jun-2009):
6851093 system drops to kmdb with anonymous dtrace probes + kmdb

This has no effect on FreeBSD (code is ifdef'ed) but is useful as
reference for future merges.

MFC after:	1 week
2014-07-03 19:25:24 +00:00
Pedro F. Giffuni
87e109c3e0 Merge from OpenSolaris (22-Apr-2008):
6823388 DTrace ioctl handlers must validate all structure members

MFC after:	1 week
2014-07-03 19:07:37 +00:00
Pedro F. Giffuni
e099b3a948 Merge from OpenSolaris (20-Apr-2008):
6822482 DOF validation needs to handle loadable sections flagged as unloadable

MFC after:	1 week
2014-07-03 17:36:59 +00:00
Alexander Motin
5a178afd41 Fix bug in sync control in new "dev" mode of ZVOL (r265678).
Don't check ZVOL_WCE flag, used in Solaris to control device "write cache".
It is not applicable on FreeBSD and by default set to "disable".

MFC after:	3 days
2014-07-02 21:25:32 +00:00
Pedro F. Giffuni
0b8f286e83 Merge from OpenSolaris (15-Sep-2008):
6735480 race between probe enabling and provider registration

MFC after:	1 week
2014-07-01 23:37:24 +00:00
Xin LI
30324e945a MFV r268122:
4929 want prevsnap property

illumos/illumos-gate@b461c7460e

MFC after:	2 weeks
2014-07-01 22:42:53 +00:00
Xin LI
9cc8a15b2e MFV r268121:
4924 LZ4 Compression for metadata

illumos/illumos-gate@b8289d24d8

MFC after:	2 weeks
2014-07-01 22:31:09 +00:00
Pedro F. Giffuni
f384ec379c Small merges from OpenSolaris:
These have no effect on FreeBSD (in fact they are ifdef'ed out),
but make future merges easier:

6699767 panic in spec_open()

6718877 crgetzoneid() use can cause problems when forking processes with
USDT providers in a non global zone

MFC after:	3 days
2014-07-01 22:16:44 +00:00
Xin LI
aa882b9048 MFV r268119:
4914 zfs on-disk bookmark structure should be named *_phys_t

illumos/illumos-gate@7802d7bf98

MFC after:	2 weeks
2014-07-01 21:51:30 +00:00
Xin LI
55f6421982 - Fix handling of "new" style of ioctl in compatibility mode [1];
- Reorganize code and reduce diff from upstream;
- Improve forward compatibility shims for previous kernel;

Reported by:	sbruno [1]
X-MFC-With:	r268075
2014-07-01 20:57:39 +00:00
Pedro F. Giffuni
c6d712caf3 Revert r268007, and re-adapt MFV r260708:
4427 pid provider rejects probes with valid UTF-8 names

Use of u8_textprep.c required -Wno-cast-qual for powerpc.

MFC after:	2 weeks
2014-07-01 15:36:05 +00:00
Xin LI
be78a8db97 MFV r267570:
4756 metaslab_group_preload() could deadlock

illumos/illumos-gate@30beaff42d

MFC after:	2 weeks
2014-07-01 08:36:56 +00:00
Xin LI
3a0f8ff95e MFV r267569:
4897 Space accounting mismatch in L2ARC/zpool

illumos/illumos-dist@3038a2b421

MFC after:	2 weeks
2014-07-01 08:28:49 +00:00
Xin LI
93b8d53c09 MFV r267567:
4881 zfs send performance degradation when embedded block pointers are
     encountered

illumos/illumos-gate@06315b795c

MFC after:	2 weeks
2014-07-01 07:56:07 +00:00
Xin LI
71eaf0fda7 MFV r267566:
4390 i/o errors when deleting filesystem/zvol can lead to space map corruption

MFC after:	2 weeks
2014-07-01 07:29:42 +00:00
Xin LI
29441ba3fa MFV r267565:
4757 ZFS embedded-data block pointers ("zero block compression")
4913 zfs release should not be subject to space checks

MFC after:	2 weeks
2014-07-01 06:43:15 +00:00
Pedro F. Giffuni
0135aadfc3 Reduce some warnings in the Solaris unicode support.
Clean some warnings from parenthesis and minor style issues.

MFC after:	3 days
2014-06-29 02:28:05 +00:00
Pedro F. Giffuni
f34dd28f7d Revert r267869:
MFV	r260708
4427 pid provider rejects probes with valid UTF-8 names

Use of u8_textprep.c broke the build on powerpc.

Reported by:	bz, rpaulo and tinderbox.
Pointyhat:	me
2014-06-28 19:59:12 +00:00
Hans Petter Selasky
af3b2549c4 Pull in r267961 and r267973 again. Fix for issues reported will follow. 2014-06-28 03:56:17 +00:00
Glen Barber
37a107a407 Revert r267961, r267973:
These changes prevent sysctl(8) from returning proper output,
such as:

 1) no output from sysctl(8)
 2) erroneously returning ENOMEM with tools like truss(1)
    or uname(1)
 truss: can not get etype: Cannot allocate memory
2014-06-27 22:05:21 +00:00
Hans Petter Selasky
3da1cf1e88 Extend the meaning of the CTLFLAG_TUN flag to automatically check if
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating the children of a static/extern SYSCTL
node. This operation should probably be factored out into a
common macro, since some device drivers use it. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading kludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.

Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, since malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.

MFC after:	2 weeks
Sponsored by:	Mellanox Technologies
2014-06-27 16:33:43 +00:00
Rui Paulo
a43f0be9fe MFV illumos
4471 DTrace count() with histogram
4472 DTrace full width distribution histograms
4473 DTrace frequency trails

MFC after:	2 weeks
2014-06-26 23:24:59 +00:00
Rui Paulo
8e648814b0 MFV illumos
4474 DTrace Userland CTF Support
4475 DTrace userland Keyword
4476 DTrace tests should be better citizens
4479 pid provider types
4480 dof emulation is missing checks

MFC after:	2 weeks
2014-06-26 23:21:11 +00:00
Rui Paulo
b1f9167f94 MFV illumos
4477 DTrace should speak JSON

MFC after:	2 weeks
2014-06-26 21:45:49 +00:00
Rui Paulo
0c2b601953 MFV illumos r266986:
2915 DTrace in a zone should see "cpu", "curpsinfo", et al
2916 DTrace in a zone should be able to access fds[]
2917 DTrace in a zone should have limited provider access

MFC after:	2 weeks
2014-06-26 19:38:16 +00:00
Rui Paulo
dd9b2abed8 Revert r267898. 2014-06-26 17:34:42 +00:00
Rui Paulo
d8e37c5f72 Bring the following change from the illumos-joyent repository:
commit 78e24ab6803bbe11ba37642624e1498ede5b239d
Author: Bryan Cantrill <bryan@joyent.com>
Date:   Thu Oct 31 01:20:54 2013

    OS-1688 DTrace count() with histogram
    OS-2360 DTrace full width distribution histograms
    OS-2361 DTrace frequency trails

MFC after:	2 weeks
2014-06-26 07:06:43 +00:00
Pedro F. Giffuni
af8bd6e468 MFV r260708
4427 pid provider rejects probes with valid UTF-8 names

This makes use of Solaris' u8_validate(), which we have happened to
use since r185029 for ZFS.

Illumos Revision:	1444d846b126463eb1059a572ff114d51f7562e5

Reference:
https://www.illumos.org/issues/4427

Obtained from:	Illumos
MFC after:	2 weeks
2014-06-25 14:23:30 +00:00
Davide Italiano
a99098e2ba Continue the crusade towards a dev_clone()-free kernel, removing its
usage from dtrace. The dtrace code has used cdevpriv(9) since FreeBSD
8, so this change should be quite harmless.

Reviewed by:	markj
Approved by:	markj
MFC after:	never
2014-06-25 03:54:02 +00:00
Mark Johnston
efa1aff675 Fix some bugs when fetching probe arguments in i386. Firstly ensure that
the 4 byte-aligned dtrace_invop_callsite can be found and that it
immediately follows the call to dtrace_invop(). Secondly, fix some pointer
arithmetic to account for differences between struct i386_frame and illumos'
struct frame. Finally, ensure that dtrace_getarg() isn't inlined. It works
by following a fixed number of frame pointers to the probe site, so inlining
breaks it.

MFC after:	3 weeks
2014-06-23 02:00:14 +00:00
Mark Johnston
8382ec9e6a Fix a couple of bugs on amd64 when fetching probe arguments beyond the
first five for probes entered through a UD fault (i.e. FBT probes).

Specifically, handle the fact that dtrace_invop_callsite must be
16 byte-aligned and thus may not immediately follow the call to
dtrace_invop() in dtrace_invop_start(). Also fetch register arguments and
the stack pointer through a struct trapframe instead of a struct reg.

PR:		191260
Submitted by:	luke.tw@gmail.com
MFC after:	3 weeks
2014-06-23 01:10:56 +00:00
Mark Johnston
9338d20884 Allow creation of SDT probes from a module in which no providers are
defined. This ensures that the sdt:zfs:: probes appear despite the fact
the sdt provider is defined in the kernel rather than in zfs.ko.

Reported by:	hiren
Tested by:	hiren
MFC after:	2 weeks
2014-06-21 19:29:40 +00:00
Steven Hartland
74ddec2b18 Removed stale comment about multi-vdev root pool config not working
MFC after:	1 week
2014-06-09 13:04:58 +00:00
Bryan Drewery
f3a7518361 - Naively fix build by partially reverting r267029 to still use
gethrtime() when building libzpool.

X-MFC-With:	267029
2014-06-04 05:04:15 +00:00
Alexander Motin
4220ebcf71 Replace gethrtime() with cpu_ticks() as the source of randomness for taskqueue
selection.  gethrtime() in our port is updated at HZ rate, so it is unusable for
this specific purpose, completely negating the benefit of multiple taskqueues.

MFC after:	2 weeks
2014-06-03 21:06:03 +00:00
Xin LI
f4c7dd6dd0 MFV 266913+266914:
3897 zfs filesystem and snapshot limits (fix leak)
4901 zfs filesystem/snapshot limit leaks

MFC after:	3 days
2014-05-31 01:00:22 +00:00
Xin LI
2bdf7f79bc MFV r266766:
Add a new zfs property, "redundant_metadata" which can have values "all" or
"most".  The default will be "all", which is the current behavior.  When set
to all, ZFS stores an extra copy of all metadata.  If a single on-disk block
is corrupt, at worst a single block of user data (which is recordsize bytes
long) can be lost.

Setting to "most" will cause us to only store 1 copy of level-1 indirect
blocks of user data files.  This can improve performance of random writes,
because less metadata has to be written.  In practice,  at worst about
100 blocks (of recordsize bytes each) of user data can be lost if a single
on-disk block is corrupt.

The exact behavior of which metadata blocks are stored redundantly may change
in future releases.

Illumos issue: 3835 zfs need not store 2 copies of all metadata

MFC after:	2 weeks
2014-05-27 19:46:11 +00:00
Allan Jude
ecd9567c1a Improve sysctl descriptions for new ZFS sysctls:
vfs.zfs.dirty_data_max
vfs.zfs.dirty_data_max_max
vfs.zfs.dirty_data_sync

Reviewed by:	smh
Approved by:	wblock (mentor)
2014-05-22 05:30:38 +00:00
Steven Hartland
df23182a62 Added sysctls / tunables for ZFS dirty data tuning
Added the following new sysctls / tunables:
* vfs.zfs.dirty_data_max
* vfs.zfs.dirty_data_max_max
* vfs.zfs.dirty_data_max_percent
* vfs.zfs.dirty_data_sync
* vfs.zfs.delay_min_dirty_percent
* vfs.zfs.delay_scale

PR:		kern/189865
MFC after:	2 weeks
2014-05-21 13:36:04 +00:00
Peter Grehan
c3ddb60e2d Update dis_tables.c to the latest Illumos version.
This includes decodes of recent Intel instructions, in particular
VT-x and related instructions. This allows the FBT provider to
locate the exit points of routines that include these new
instructions.

Illumos issues:
 3414 Need a new word of AT_SUN_HWCAP bits
 3415 Add isainfo support for f16c and rdrand
 3416 Need disassembler support for rdrand and f16c
 3413 isainfo -v overflows 80 columns
 3417 mdb disassembler confuses rdtscp for invlpg
 1518 dis should support AMD SVM/AMD-V/Pacifica instructions
 1096 i386 disassembler should understand complex nops
 1362 add kvmstat for monitoring of KVM statistics
 1363 add vmregs[] variable to DTrace
 1364 need disassembler support for VMX instructions
 1365 mdb needs 16-bit disassembler support

This corresponds to Illumos-gate (github) version
eb23829ff08a873c612ac45d191d559394b4b408

Reviewed by:	markj
MFC after:	1 week
2014-05-15 01:06:27 +00:00
Xin LI
b8cdcb8ad8 Import George Wilson's change for Illumos #4730:
4730 metaslab group taskq should be destroyed in metaslab_group_destroy()
	Reviewed by: Alex Reece <alex.reece@delphix.com>
	Reviewed by: Matthew Ahrens <mahrens@delphix.com>
	Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>

	Original author: George Wilson

MFC after:	3 days
2014-05-06 19:03:04 +00:00
Steven Hartland
4f64781818 Use a zio flag to prevent recursion of vdev_queue_io_done which can
cause stack overflow for IO's which return ZIO_PIPELINE_CONTINUE
from the zio_vdev_io_start stage and hence don't suspend and complete
in a different thread.
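
The recursion problem can be modelled roughly as follows. This is a simplified sketch and not the ZFS implementation: QueueModel, execute, io_done and max_nesting are hypothetical names, and the guard is kept on the queue object for brevity, whereas the commit uses a flag on the zio.

```python
from collections import deque


class QueueModel:
    """Toy model of a queue whose completion handler issues the next I/O."""

    def __init__(self, nios, use_flag):
        self.pending = deque(range(nios))
        self.use_flag = use_flag
        self.in_done = False  # guard flag (hypothetical stand-in for the zio flag)
        self.depth = 0
        self.max_depth = 0

    def execute(self, io):
        # The I/O finishes in the calling thread instead of suspending,
        # so completion handling runs immediately on the same stack.
        self.io_done()

    def io_done(self):
        if self.use_flag and self.in_done:
            # Already draining further up the stack: let that loop continue.
            return
        self.in_done = True
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)
        while self.pending:
            self.execute(self.pending.popleft())
        self.depth -= 1
        self.in_done = False


def max_nesting(nios, use_flag):
    """Return the deepest io_done nesting reached while draining nios I/Os."""
    q = QueueModel(nios, use_flag)
    q.io_done()
    return q.max_depth
```

Without the guard the nesting depth grows with the number of queued I/Os; with it the depth stays constant and the outer loop issues the remaining requests.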

This prevents a double fault panic on slow machines running ZFS on
GELI volumes, which return EOPNOTSUPP directly to BIO_DELETE requests.

MFC after:	1 month
X-MFC-With:	r265152
2014-05-04 14:05:14 +00:00
Steven Hartland
573621a6d6 Don't treat TRIM requests returning ENOTSUP as an unexpected error.
MFC after:	1 month
X-MFC-With:	r265152
2014-05-03 02:30:01 +00:00
Steven Hartland
10138166cf Removed pointless / duplicated call to trim_map_first.
MFC after:	1 month
X-MFC-With:	r265152
2014-05-02 09:31:21 +00:00
Steven Hartland
82ce008538 Reintroduce priority for the TRIM ZIOs instead of using the "NOW" priority
This changes how TRIM requests are generated to use ZIO_TYPE_FREE + a priority
instead of ZIO_TYPE_IOCTL, until processed by vdev_geom; only then is it
translated to the required geom values. This reduces the amount of changes
required for FREE requests to be supported by the new IO scheduler. This
also eliminates the need for a specific DKIOCTRIM.

Also prevented FREE vdev child IOs from running ZIO_STAGE_VDEV_IO_DONE as part
of their schedule.

As the new IO scheduler can cause a request to execute one type of IO to
actually run a different type of IO, zio_trim requests must be processed
without holding the trim map lock (tm->tm_lock). The free request execute
call may result in a write request running, which triggers a
trim_map_write_start call that takes the trim map lock and would therefore
recurse on a non-recursive sx lock.

This is based off avg's original work, so credit to him.

MFC after:	1 month
2014-04-30 17:46:29 +00:00
Steven Hartland
101dfa0ed4 Fix ZIO reordering done by vdev_queue_io causing panics when zio_vdev_io_start
returns ZIO_PIPELINE_CONTINUE from vdev_op_io_start to zio_execute resulting
in the wrong ZIO continuing its pipeline.

This is a serious issue which could cause data loss / corruption but appears
to be limited to error handling such as when vdev_readable(vd) returns false.

MFC after:	2 days
2014-04-28 09:00:00 +00:00
Steven Hartland
c2b2c5fc76 Eliminate duplicate checks in vdev_geom_io_intr error handling
MFC after:	1 month
2014-04-24 15:36:00 +00:00
Steven Hartland
5b245b8ae0 Add the ability to set a minimum ashift size for ZFS pool creation or root level
vdev addition.

Change max_auto_ashift sysctl to error when an invalid value is requested instead
of silently limiting it.
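
The difference between the old and new handling of an out-of-range value can be sketched as below. This is a hedged illustration only: the bounds ASHIFT_MIN and ASHIFT_MAX and the function names are assumptions, not values or identifiers taken from the commit.

```python
import errno

ASHIFT_MIN = 9   # illustrative lower bound (512-byte sectors)
ASHIFT_MAX = 13  # illustrative upper bound (8 KiB sectors)


def set_clamped(requested: int) -> int:
    """Old-style behaviour: silently limit the value to the valid range."""
    return min(max(requested, ASHIFT_MIN), ASHIFT_MAX)


def set_strict(requested: int) -> int:
    """New-style behaviour: refuse an invalid value with an error."""
    if not ASHIFT_MIN <= requested <= ASHIFT_MAX:
        raise OSError(errno.EINVAL, "ashift out of range")
    return requested
```

Reporting an error makes a mistyped setting visible to the administrator instead of leaving a different value in effect than the one requested.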
2014-04-24 01:06:03 +00:00
Xin LI
754180f4ae MFV r264830:
4745 fix AVL code misspellings

MFC after:	2 weeks
2014-04-23 20:32:39 +00:00
Xin LI
f8587167e4 MFV r264829:
3897 zfs filesystem and snapshot limits

MFC after:	2 weeks
2014-04-23 20:29:46 +00:00
Xin LI
18ab4bd8d9 MFV r264668:
4754 io issued to near-full luns even after setting noalloc threshold
4755 mg_alloc_failures is no longer needed

illumos/illumos@b6240e830b

MFC after:	2 weeks
2014-04-18 22:04:58 +00:00
Xin LI
d301d390a7 MFV r264667:
4752 fan out read zio taskqs

illumos/illumos-gate@1b497ab83e
2014-04-18 21:35:23 +00:00
Xin LI
613074ec08 MFV r264666:
4374 dn_free_ranges should use range_tree_t

illumos/illumos-gate@bf16b11e8d

MFC after:	2 weeks
2014-04-18 21:15:12 +00:00
Mark Johnston
38e6967f04 Ensure that all eight syscall arguments are available to dtrace_probe(),
rather than just the first five. This is done by calling dtrace_probe()
through a function pointer, as in illumos.

MFC after:	3 weeks
2014-04-14 00:23:18 +00:00
Mark Johnston
0626f3e435 DTrace's pid provider works by inserting breakpoint instructions at probe
sites and installing a hook at the kernel's trap handler. The fasttrap code
will emulate the overwritten instruction in some common cases, but otherwise
copies it out into some scratch space in the traced process' address space
and ensures that it's executed after returning from the trap.

In Solaris and illumos, this (per-thread) scratch space comes from some
reserved space in TLS, accessible via the fs segment register. This
approach is somewhat unappealing on FreeBSD since it would require some
modifications to rtld and jemalloc (for static TLS) to ensure that TLS is
executable, and would thus introduce dependencies on their implementation
details. I think it would also be impossible to safely trace static binaries
compiled without these modifications.

This change implements the functionality in a different way, by having
fasttrap map pages into the target process' address space on demand. Each
page is divided into 64-byte chunks for use by individual threads, and
fasttrap's process descriptor struct has been extended to keep track of
any scratch space allocated for the corresponding process.
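
The on-demand scheme described above can be sketched as a small allocator model. This is not the fasttrap code: the ScratchSpace class, the 4096-byte page size and the fake map_page callback are assumptions for illustration; only the 64-byte chunk size comes from the description.

```python
from collections import deque
from itertools import count

PAGE_SIZE = 4096  # assumed page size
CHUNK_SIZE = 64   # chunk size given in the commit message


class ScratchSpace:
    """Per-process bookkeeping: hands out one scratch chunk per thread."""

    def __init__(self, map_page):
        self._map_page = map_page  # callable returning a new page's base address
        self._free = deque()       # addresses of unused chunks
        self._owner = {}           # thread id -> chunk address
        self.pages = []            # base addresses of pages mapped so far

    def chunk_for(self, tid):
        """Return the thread's chunk, mapping a new page only when needed."""
        if tid not in self._owner:
            if not self._free:
                base = self._map_page()
                self.pages.append(base)
                self._free.extend(range(base, base + PAGE_SIZE, CHUNK_SIZE))
            self._owner[tid] = self._free.popleft()
        return self._owner[tid]

    def release(self, tid):
        """Return a thread's chunk to the free list, e.g. on thread exit."""
        self._free.append(self._owner.pop(tid))


# Fake page mapper for the model: hands out consecutive page-aligned addresses.
fake_addresses = count(0x10000, PAGE_SIZE)
scratch = ScratchSpace(lambda: next(fake_addresses))
chunks = [scratch.chunk_for(tid) for tid in range(65)]
```

In this model the first 64 threads share one page and the 65th triggers the mapping of a second page.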

With this change it's possible to trace all libc functions in a program,
e.g. with

  pid$target:libc.so.*::entry {@[probefunc] = count();}

Previously this would generally cause the victim process to crash, as
tracing memcpy on amd64 requires the functionality described above.

Tested by:	Prashanth Kumar <pra_udupi@yahoo.co.in> (earlier version)
MFC after:	6 weeks
2014-04-14 00:22:42 +00:00
Davide Italiano
2f9e29745c Fix a panic in zfs_rename().
This is due to a wrong dereference of a vnode when it's not locked and
can be (potentially) recycled. 'sdvp' cannot be locked on zfs_rename()
entry point because the VFS can't be sure that this scenario is
LOR-free (it might violate the parent->child lock acquisition rule).
Dereference 'tdvp' instead, which is already locked on entry, and access
'sdvp' fields only when it's safe, i.e. under ZFS_ENTER scope.

While at it, remove the usage of VOP_REALVP, since it is a NOP
on FreeBSD.

Discussed with:	avg
Reviewed by:	pjd
2014-04-13 01:15:37 +00:00
Alexander Motin
f6e1dc83c3 Create zvol devices on zfs clone.
While the big and shiny patch is not ready, it is better to have something.

PR:		kern/178999
MFC after:	1 week
2014-04-11 11:56:16 +00:00
Alexander Motin
a96fefe042 In addition to r264077, tell GEOM that we do support BIO_DELETE now.
2014-04-06 16:31:28 +00:00
Alexander Motin
537650f54d Add property and sysctl to control how ZVOLs are exposed to OS.
New ZFS property volmode and sysctl vfs.zfs.vol.mode allow switching ZVOL
between three modes:
 geom -- existing fully functional behavior (default);
 dev -- exposing volumes only as raw disk device file in devfs;
 none -- not exposing volumes outside ZFS.

The "dev" mode is less functional (can't be partitioned, mounted, etc.),
but it is faster, and in some scenarios with untrusted consumers safer.
It can be useful for NAS, VM block storage, etc.
The "none" mode may be convenient for backup servers, etc. that don't
need direct data access.

Due to the way ZVOL is integrated with the main ZFS code, this property
and sysctl are checked only during pool import and volume creation.

MFC after:	1 month
Sponsored by:	iXsystems, Inc.
2014-04-05 13:01:44 +00:00
Alexander Motin
89e84aead6 MFV r258922:
3580 Want zvols to return volblocksize when queried for physical block size

illumos/illumos-gate@a0b60564df

It is irrelevant for FreeBSD, just reducing diff.
2014-04-03 20:18:55 +00:00
Alexander Motin
4a03e8b64d Add BIO_DELETE support to ZVOL.
It is an adapted merge from the vendor branch of:
    701 UNMAP support for COMSTAR  (in part related to ZFS)
    2130 zvol DKIOCFREE uses nested DMU transactions
2014-04-03 15:04:32 +00:00
Pedro F. Giffuni
23e4da439c MFV r258379:
4248 dtrace(1M) should never create DOF with empty probes section
4249 Only probes from the first DTrace object file will be included

Illumos Revision:	4a20ab41aadcb81c53e72fc65886e964e9add59

Reference:
https://www.illumos.org/issues/4248
https://www.illumos.org/issues/4249

Obtained from:	Illumos
MFC after:	1 month
2014-04-02 15:32:44 +00:00
Bryan Drewery
44f1c91610 Rename global cnt to vm_cnt to avoid shadowing.
To reduce the diff, the struct pcpu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.

Bump __FreeBSD_version in case some out-of-tree module/code relies on the
global cnt variable.

Exp-run revealed no ports using it directly.

No objection from:	arch@
Sponsored by:	EMC / Isilon Storage Division
2014-03-22 10:26:09 +00:00
Robert Watson
4a14441044 Update kernel inclusions of capability.h to use capsicum.h instead; some
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.

MFC after:	3 weeks
2014-03-16 10:55:57 +00:00