1009 Commits

Author SHA1 Message Date
smh
7648307537 Fix ZIO reordering done by vdev_queue_io causing panics when zio_vdev_io_start
returns ZIO_PIPELINE_CONTINUE from vdev_op_io_start to zio_execute resulting
in the wrong ZIO continuing its pipeline.

This is a serious issue which could cause data loss / corruption but appears
to be limited to error handling such as when vdev_readable(vd) returns false.

MFC after:	2 days
2014-04-28 09:00:00 +00:00
smh
dc51a2f994 Eliminate duplicate checks in vdev_geom_io_intr error handling
MFC after:	1 month
2014-04-24 15:36:00 +00:00
smh
694b1f9e91 Add the ability to set a minimum ashift size for ZFS pool creation or root level
vdev addition.

Change max_auto_ashift sysctl to error when an invalid value is requested instead
of silently limiting it.
2014-04-24 01:06:03 +00:00
delphij
4d48e09fd7 MFV r264830:
4745 fix AVL code misspellings

MFC after:	2 weeks
2014-04-23 20:32:39 +00:00
delphij
4a1b97c2c0 MFV r264829:
3897 zfs filesystem and snapshot limits

MFC after:	2 weeks
2014-04-23 20:29:46 +00:00
delphij
b281166740 MFV r264668:
4754 io issued to near-full luns even after setting noalloc threshold
4755 mg_alloc_failures is no longer needed

illumos/illumos@b6240e830b

MFC after:	2 weeks
2014-04-18 22:04:58 +00:00
delphij
aff2f85dda MFV r264667:
4752 fan out read zio taskqs

illumos/illumos-gate@1b497ab83e
2014-04-18 21:35:23 +00:00
delphij
3344ccfa37 MFV r264666:
4374 dn_free_ranges should use range_tree_t

illumos/illumos-gate@bf16b11e8d

MFC after:	2 weeks
2014-04-18 21:15:12 +00:00
markj
42a27d59d5 Ensure that all eight syscall arguments are available to dtrace_probe(),
rather than just the first five. This is done by calling dtrace_probe()
through a function pointer, as in illumos.

MFC after:	3 weeks
2014-04-14 00:23:18 +00:00
markj
801a01a441 DTrace's pid provider works by inserting breakpoint instructions at probe
sites and installing a hook at the kernel's trap handler. The fasttrap code
will emulate the overwritten instruction in some common cases, but otherwise
copies it out into some scratch space in the traced process' address space
and ensures that it's executed after returning from the trap.

In Solaris and illumos, this (per-thread) scratch space comes from some
reserved space in TLS, accessible via the fs segment register. This
approach is somewhat unappealing on FreeBSD since it would require some
modifications to rtld and jemalloc (for static TLS) to ensure that TLS is
executable, and would thus introduce dependencies on their implementation
details. I think it would also be impossible to safely trace static binaries
compiled without these modifications.

This change implements the functionality in a different way, by having
fasttrap map pages into the target process' address space on demand. Each
page is divided into 64-byte chunks for use by individual threads, and
fasttrap's process descriptor struct has been extended to keep track of
any scratch space allocated for the corresponding process.

With this change it's possible to trace all libc functions in a program,
e.g. with

  pid$target:libc.so.*::entry {@[probefunc] = count();}

Previously this would generally cause the victim process to crash, as
tracing memcpy on amd64 requires the functionality described above.

Tested by:	Prashanth Kumar <pra_udupi@yahoo.co.in> (earlier version)
MFC after:	6 weeks
2014-04-14 00:22:42 +00:00
davide
6865adbf59 Fix a panic in zfs_rename().
this is due to a wrong dereference of a vnode when it's not locked and
can be (potentially) recycled. 'sdvp' cannot be locked on zfs_rename()
entry point because the VFS can't be sure that this scenario is
LOR-free (it might violate the parent->child lock acquisition rule).
Dereference 'tdvp' instead, which is already locked on entry, and access
'sdvp' fields only when it's safe, i.e. under ZFS_ENTER scope.

While at it, remove the usage of VOP_REALVP, as long as this is a NOP
on FreeBSD.

Discussed with:	avg
Reviewed by:	pjd
2014-04-13 01:15:37 +00:00
mav
d92cc72a75 Create zvol devices on zfs clone.
While big and shiny patch is not ready, it is better to have something.

PR:		kern/178999
MFC after:	1 week
2014-04-11 11:56:16 +00:00
mav
2e86f49ec8 In addition to r264077, tell GEOM that we do support BIO_DELETE now. 2014-04-06 16:31:28 +00:00
mav
dd28e6bbc0 Add property and sysctl to control how ZVOLs are exposed to OS.
New ZFS property volmode and sysctl vfs.zfs.vol.mode allow switching ZVOL
between three modes:
 geom -- existing fully functional behavior (default);
 dev -- exposing volumes only as raw disk device file in devfs;
 none -- not exposing volumes outside ZFS.

The "dev" mode is less functional (can't be partitioned, mounted, etc),
but it is faster, and in some scenarios with untrusted consumers safer.
It can be useful for NAS, VM block storages, etc.
The "none" mode may be convenient for backup servers, etc. that don't
need direct data access.

Due to the way ZVOL is integrated with main ZFS code, those property
and sysctl are checked only during pool import and volume creation.

MFC after:	1 month
Sponsored by:	iXsystems, Inc.
2014-04-05 13:01:44 +00:00
mav
d7b75f7a6d MFV r258922:
3580 Want zvols to return volblocksize when queried for physical block size

illumos/illumos-gate@a0b60564df

It is irrelevant for FreeBSD, just reducing diff.
2014-04-03 20:18:55 +00:00
mav
cd80e1706c Add BIO_DELETE support to ZVOL.
It is an adapted merge from the vendor branch of:
    701 UNMAP support for COMSTAR  (in part related to ZFS)
    2130 zvol DKIOCFREE uses nested DMU transactions
2014-04-03 15:04:32 +00:00
pfg
3c7cca5de7 MFV r258379;
4248 dtrace(1M) should never create DOF with empty probes section
4249 Only probes from the first DTrace object file will be included

Illumos Revision:	4a20ab41aadcb81c53e72fc65886e964e9add59

Reference:
https://www.illumos.org/issues/4248
https://www.illumos.org/issues/4249

Obtained from:	Illumos
MFC after:	1 month
2014-04-02 15:32:44 +00:00
bdrewery
6fcf6199a4 Rename global cnt to vm_cnt to avoid shadowing.
To reduce the diff struct pcu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.

Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the
the global cnt variable.

Exp-run revealed no ports using it directly.

No objection from:	arch@
Sponsored by:	EMC / Isilon Storage Division
2014-03-22 10:26:09 +00:00
rwatson
33fdc14c0c Update kernel inclusions of capability.h to use capsicum.h instead; some
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.

MFC after:	3 weeks
2014-03-16 10:55:57 +00:00
mav
a99f629278 Report ZVOL block size as GEOM stripesize.
MFC after:	2 weeks
2014-03-13 19:26:26 +00:00
delphij
b01107b750 MFV r262983:
4638 Panic in ZFS via rfs3_setattr()/rfs3_write(): dirtying snapshot!

illumos/illumos-gate@2144b121c0
2014-03-11 00:23:50 +00:00
delphij
6ab407aa01 All callers of static method load_nvlist() in spa.c handles error case,
so there is no reason to assert that we won't hit an error.  Instead,
just return that error to caller and have the upper layer handle it.

Obtained from:	FreeNAS
Reported by:	rodrigc
Reviewed by:	Matthew Ahrens
MFC after:	2 weeks
2014-03-02 02:41:33 +00:00
markj
9caf62197b Expose a few DTrace parameters as sysctls under kern.dtrace and add
descriptions for several existing sysctls.

PR:		187027
Submitted by:	Fedor Indutny <fedor@indutny.com> (original version)
MFC after:	2 weeks
2014-03-01 19:06:43 +00:00
markj
f6a884d05d Fix emulation of call and jmp instructions on i386 and for 32-bit processes
on amd64.

Submitted by:	Prashanth Kumar <pra_udupi@yahoo.co.in>
MFC after:	2 weeks
2014-03-01 17:55:20 +00:00
markj
a7aada62d5 4478 dtrace_dof_maxsize is far too small
illumos/illumos-gate@d339a29bb4

PR:		187027
MFC after:	1 week
2014-02-28 02:04:41 +00:00
markj
b76098bbce Fix the struct reg mappings for i386 and amd64, which differ between illumos
and FreeBSD.

Submitted by:	Prashanth Kumar <pra_udupi@yahoo.co.in>
MFC after:	2 weeks
2014-02-27 01:24:47 +00:00
markj
451c3aecb6 Move some files that are identical on i386 and amd64 to an x86 subdirectory
rather than keeping duplicate copies.

Discussed with:	avg
MFC after:	1 week
2014-02-27 01:04:35 +00:00
markj
9b658401ea Revert r262466, as it does not compile on PowerPC.
Reported by:	jhibbits
2014-02-26 01:00:00 +00:00
markj
b3fc1da12e Make all 8 syscall arguments available to syscall probes in the same way
that this is done for SDT probes. This fixes the syscall/tst.args.d test,
which was failing because mmap(2)'s sixth argument wasn't available to the
probe.

MFC after:	2 weeks
2014-02-25 02:58:11 +00:00
markj
428d834b5b 1452 DTrace buffer autoscaling should be less violent
illumos/illumos-gate@6fb4854bed

This fixes the tst.resize1.d and tst.resize2.d DTrace tests, which have
been failing since r261122 since they were causing dtrace(1) to attempt to
allocate and use large amounts of memory, and get killed by the OOM killer
as a result.

MFC after:	1 month
2014-02-22 05:18:55 +00:00
markj
95ac3c80e1 Define the KM_NORMALPRI flag for kmem_alloc(), as it is used in some
upstream DTrace code. It indicates that the kernel memory allocator need not
attempt to satisfy non-blocking allocations in low-memory conditions. This
has no direct equivalent in the malloc(9) flags, so it is just defined to 0
for now.
2014-02-22 05:13:35 +00:00
delphij
1dfe96a066 MFV r261619:
4574 get_clones_stat does not call zap_count in non-debug kernel

zap_count(...) is never called in non-DEBUG kernel.
As result "count" variable is always 0, and "goto fail" is always
reached.  This means get_clones_stat function never makes up list
of clones for "clones" properties.

MFC after:	2 weeks
2014-02-08 05:35:36 +00:00
delphij
4b064bf9ac MFV r260834:
Fix memory leak of compressed buffers in l2arc_write_done (Illumos
#3995).
2014-01-18 01:45:39 +00:00
avg
6b143ee35a traverse_visitbp: visit DMU_GROUPUSED_OBJECT before DMU_USERUSED_OBJECT
This is done to ensure that visited object IDs are always increasing.
Also, pass correct object ID to prefetch_dnode_metadata for
os_groupused_dnode.

Without this change we would hit an assert if traversal was paused on
a GROUPUSED object, which is unlikely but possible.

Apparently the same change was independently developed by Deplhix.

Reviewed by:	Matthew Ahrens <mahrens@delphix.com>
MFC after:	10 days
Sponsored by:	HybridCluster
2014-01-17 10:23:46 +00:00
avg
d107399017 fix a build problem with INVARIANTS enabled introduced in r260704
Reported by:	glebius
MFC after:	5 days
X-MFC with:	r260704
2014-01-16 13:44:37 +00:00
avg
e186f564bc fix a bug in ZFS mirror code for handling multiple DVAa
The bug was introduced in r256956 "Improve ZFS N-way mirror read
performance".
The code in vdev_mirror_dva_select erroneously considers already
tried DVAs for the next attempt.  Thus, it is possible that a failing DVA
would be retried forever.
As a secondary effect, if the attempts fail with checksum error, then
checksum error reports are accumulated until the original request
ultimately fails or succeeds.  But because retrying is going on indefinitely
the cheksum reports accumulation will effectively be a memory leak.

Reviewed by:	gibbs
MFC after:	13 days
Sponsored by:	HybridCluster
2014-01-16 13:24:10 +00:00
avg
d1329f5a22 Revert r260705: wrong patch committed by accident
An earlier, less efficient version was committed by accident.
2014-01-16 13:20:20 +00:00
avg
113f9a4f53 zfs_deleteextattr: name buffer from namei is needed by zfs_rename
If we prematurely free the name buffer and it gets quickly recycled,
then zfs_rename may see data from another lookup or even unmapped memory
via cn_nameptr.

MFC after:	6 days
Sponsored by:	HybridCluster
2014-01-16 12:31:27 +00:00
avg
31b7f68d80 fix a bug in ZFS mirror code for handling multiple DVAa
The bug was introduced in r256956 "Improve ZFS N-way mirror read
performance".
The code in vdev_mirror_dva_select erroneously considers already
tried DVAs for the next attempt.  Thus, it is possible that a failing DVA
would be retried forever.
As a secondary effect, if the attempts fail with checksum error, then
checksum error reports are accumulated until the original request
ultimately fails or succeeds.  But because retrying is going on indefinitely
the cheksum reports accumulation will effectively be a memory leak.

Reviewed by:	gibbs
MFC after:	13 days
Sponsored by:	HybridCluster
2014-01-16 12:26:54 +00:00
avg
97986ccb0b zfs: getnewvnode_reserve must be called outside of a zfs transaction
Otherwise we could run into the following deadlock.
A thread has a transaction open and assigned to a transaction group.
That would prevent the transaction group from be quiesced and synced.
The thread is blocked in getnewvnode_reserve waiting for a vnode to
a be reclaimed.  vnlru thread is blocked trying to enter ZFS VOP because
a filesystem is suspended by an ongoing rollback or receive operation.
In its turn the operation is waiting for the current transaction group
to be synced.

zfs_zget is always used outside of active transactions, but zfs_mknode
is always used in a transaction context.  Thus, we hoist
getnewvnode_reserve from zfs_mknode to its callers.

While there, assert that ZFS always calls getnewvnode while having
a vnode reserved.

Reported by:	adrian
Tested by:	adrian
MFC after:	17 days
Sponsored by:	HybridCluster
2014-01-16 12:22:46 +00:00
marcel
5e2984b1f1 In atomic_or_8_nv() load 1 and not 8 bytes from the address
given. Note that atomic_or_8_nv() is not used at this time.
2014-01-06 05:00:58 +00:00
mav
0b0d3d9762 Fix build after r260234 by converting ddi_get_lbolt64() from inline into
a macro.  Otherwise compiler complains that hz variable used there either
undefined or defined twice, thanks to header mess caused by compat shims.
2014-01-05 19:07:42 +00:00
mav
8e1d7be31e In dmu_zfetch_stream_reclaim() replace division with multiplication and
move it out of the loop and lock.
2014-01-03 18:44:37 +00:00
mav
7402ef4897 Remove extra conversion to nanoseconds from ddi_get_lbolt64().
As result this uses one multiplication and shifts instead of one division
and two multiplications.
2014-01-03 18:08:31 +00:00
delphij
37fa7d5554 MFV r260155:
When we encounter an I/O error on a piece of metadata while deleting
a file system or zvol, we don't update the bptree_entry_phys_t's
bookmark.  This would lead to double free of bp's which will lead to
space map corruption.

Instead of tolerating and allowing the corruption, panic immediately.

See Illumos #4390 for more details.

4391 panic system rather than corrupting pool if we hit bug 4390

Illumos/illumos-gate@8b36997aa2

MFC after:	2 weeks
2014-01-02 08:10:35 +00:00
delphij
5137277761 MFV r260154 + 260182:
4369 implement zfs bookmarks
4368 zfs send filesystems from readonly pools

Illumos/illumos-gate@78f1710053

MFC after:	2 weeks
2014-01-02 07:34:36 +00:00
delphij
05d7d24d7e Fix build on platforms where atomic_swap_64 is not available. 2014-01-02 03:24:44 +00:00
delphij
843c1c95ad MFV r260153:
4121 vdev_label_init should treat request as succeeded when pool
     is read only

Illumos/illumos-gate@973c78e94b

MFC after:	2 weeks
2014-01-01 01:26:39 +00:00
delphij
82c2441b7d MFV r259170:
4370 avoid transmitting holes during zfs send

4371 DMU code clean up

illumos/illumos-gate@43466aae47

NOTE: Make sure the boot code is updated if a zpool upgrade is
done on boot zpool.

MFC after:	2 weeks
2014-01-01 00:45:28 +00:00
delphij
302e136d55 MFV r258385:
(Note: this change is not applicable to FreeBSD and the file
is not included in build.  It's integrated for completeness).

4128 disks in zpools never go away when pulled

illumos/illumos-gate@39cddb10a3

MFC after:	2 weeks
2013-12-31 21:24:00 +00:00