If bpobj_space() returned non-zero here, the sublist would have been
left open, along with the bonus buffer hold it requires. This call
does not invoke any calls to bpobj_close() itself.
This bug doesn't have any known vector, but was found on inspection.
MFC after: 1 week
Sponsored by: Spectra Logic
Affects: All ZFS versions starting 21 May 2010 (illumos cde58dbc)
MFSpectraBSD: r1050998 on 2014/03/26
Summary:
Fix the stack tracing for dtrace/powerpc by using the trapexit/asttrapexit
return address sentinels instead of checking within the kernel address space.
As part of this, I had to add new inline functions. FBT traces the kernel, so
we have to have special case handling for this, since a trap will create a full
new trap frame, and there's no way to pass around the 'real' stack. I handle
this by special-casing 'aframes == 0' with the trap frame. If aframes counts
out to the trap frame, then assume we're looking for the full kernel trap frame,
so switch to the real stack pointer.
Test Plan: Tested on powerpc64
Reviewers: rpaulo, markj, nwhitehorn
Reviewed By: markj, nwhitehorn
Differential Revision: https://reviews.freebsd.org/D788
MFC after: 3 week
Relnotes: Yes
* vfs.zfs.vdev.async_write_active_min_dirty_percent
* vfs.zfs.vdev.async_write_active_max_dirty_percent
Added validation of min / max for ZFS sysctl
* vfs.zfs.dirty_data_max_percent
MFC after: 3 days
Correctly report hole at end of file.
When asked to find a hole, the DMU sees that there are no holes in the
object, and returns ESRCH. The ZPL interprets this as "no holes before
the end of the file", and therefore inserts the "virtual hole" at the
end of the file. Because DMU and ZPL have different ideas of where the
end of an object/file is, we will end up returning the end of file,
which is generally larger, instead of returning the end of object.
The fix is to handle the "virtual hole" in the DMU. If no hole is found,
the DMU will return a hole at the end of the file, rather than an error.
Illumos issue:
5139 SEEK_HOLE failed to report a hole at end of file
MFC after: 1 week
In zil_claim, don't issue warning if we get EBUSY (inconsistent) when
opening an objset, instead, ignore it silently.
Illumos issue:
5140 message about "%recv could not be opened" is printed when booting after crash
MFC after: 1 week
Add a new tunable/sysctl, vfs.zfs.free_max_blocks, which can be used to
limit how many blocks can be free'ed before a new transaction group is
created. The default is no limit (infinite), but we should probably have
a lower default, e.g. 100,000.
With this limit, we can guard against the case where ZFS could run out of
memory when destroying large numbers of blocks in a single transaction
group, as the entire DDT needs to be brought into memory.
Illumos issue:
5138 add tunable for maximum number of blocks freed in one txg
MFC after: 2 weeks
Enforce 4K as smallest indirect block size (previously the smallest
indirect block size was 1K but that was never used).
This makes some space estimates more accurate and uses less memory
for some data structures.
Illumos issue:
5141 zfs minimum indirect block size is 4K
MFC after: 2 weeks
In dnode_sync(), do dnode_increase_indirection() before processing
the dn_next_nblkptr.
Illumos issue:
5117 space map reallocation can cause corruption
MFC after: 3 days
Also restore kmem_used() check for i386 as it has KVA limits that the raw
page counts above don't consider
PR: 187594
Reviewed by: peter
X-MFC-With: r270759
Review: D700
Sponsored by: Multiplay
Prior to this change we triggered ARC reclaim when kmem usage passed 3/4
of the total available, as indicated by vmem_size(kmem_arena, VMEM_ALLOC).
This could lead large amounts of unused RAM e.g. on a 192GB machine with
ARC the only major RAM consumer, 40GB of RAM would remain unused.
The old method has also been seen to result in extreme RAM usage under
certain loads, causing poor performance and stalls.
We now trigger ARC reclaim when the number of free pages drops below the
value defined by the new sysctl vfs.zfs.arc_free_target, which defaults
to the value of vm.v_free_target.
Credit to Karl Denninger for the original patch on which this update was
based.
PR: 191510 and 187594
Tested by: dteske
MFC after: 1 week
Relnotes: yes
Sponsored by: Multiplay
tracepoints would continue to generate traps, which would be ignored but
could consume noticeable amounts of CPU if, say, all functions in the kernel
were instrumented.
X-MFC-With: r270067
duplicating the entire implementation for both x86 and powerpc. This makes
it easier to add support for other architectures and has no functional
impact.
Phabric: D613
Reviewed by: gnn, jhibbits, rpaulo
Tested by: jhibbits (powerpc)
MFC after: 2 weeks
In vdev_get_stats, check that the vdev is not a hole before computing the
fragmentation. This fixes a panic when removing log device.
Illumos issue:
5049 panic when removing log device
Author: Alex Reece <alex@delphix.com>
MFC after: 2 weeks
Encode CPU's number by XOR'ing the CPU ID against the 64-bit cpu_ticks().
Reviewed by: mav, gibbs
Differential Revision: https://phabric.freebsd.org/D521
MFC after: 2 weeks
In dnode_children_t, use C99's "[]" idiom for declaring the variable
sized array dnc_children at the end of the structure.
This prevents the compiler from mistakenly optimizing away accesses
beyond the array's defined size.
Illumos issue:
5038 Remove "old-style" flexible array usage in ZFS.
Author: Justin T. Gibbs <justing@spectralogic.com>
MFC after: 2 weeks
This prevents recursion of vdev_queue_io_done as per r265321 but
using a different method as recommended on the openzfs list.
We now use zio_interrupt(zio) and return ZIO_PIPELINE_STOP instead
of returning ZIO_PIPELINE_CONTINUE from vdev_*_io_start methods.
zio_vdev_io_start now ASSERTS the that vdev_op_io_start returns
ZIO_PIPELINE_STOP to ensure future changes don't reintroduce
ZIO_PIPELINE_CONTINUE returns.
Cleanup flow in vdev_geom_io_start while I'm here.
Also fix some cases not using SET_ERROR(..)
MFC after: 2 weeks
X-MFC-With: r265321
nanouptime() instead of getnanouptime(). nanouptime(9) provides more
precise result at expense of being slower.
In r269223, gethrtime() is used as creation time of dbuf, which in turn
acts as portion of lookup key to maintain AVL invariant where there can
not be duplicate items. Before this change, gethrtime() have preferred
better execution time by sacrificing precision, which may lead to panic
on busy systems with:
panic: avl_find() succeeded inside avl_add()
Reported by: allanjude, mav
PR: kern/192284
MFC after: 11 days
X-MFC-with: r269223
Increase default ARC buf_hash_table size. When typical block size is small,
the hash table could be too small, which would lead to long hash chains and
limit performance for cached reads.
A new loader tunable, vfs.zfs.arc_average_blocksize, have been added which
allows users to override the default assumption of average (typical) block
size. Old default was 65536 (64 KiB) and new default is 8192 (8 KiB).
Illumos issue:
5034 ARC's buf_hash_table is too small
MFC after: 2 weeks
Change dn->dn_dbufs from linked list to AVL tree.
Illumos issues:
4873 zvol unmap calls can take a very long time for larger datasets
MFC after: 2 weeks
code behave more like it is on Solaris.
Reported by: avg
Reviewed by: avg, mav (but bugs are mine)
Differential Revision: https://phabric.freebsd.org/D457
postponing it to zfs_vget(). zfs_root() returned vnode with the
default value of v_hash, which caused inconsistent v_hash value when
root vnode was obtained from zfs_vget().
Nullfs allocated two upper vnodes for the root zfs vnode due to
different hashes, causing consistency problems.
Reported and tested by: Harald Schmalzbauer <h.schmalzbauer@omnilan.de>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Import Illumos changes to address the following Illumos issues:
4976 zfs should only avoid writing to a failing non-redundant
top-level vdev
4978 ztest fails in get_metaslab_refcount()
4979 extend free space histogram to device and pool
4980 metaslabs should have a fragmentation metric
4981 remove fragmented ops vector from block allocator
4982 space_map object should proactively upgrade when feature
is enabled
4984 device selection should use fragmentation metric
MFC after: 2 weeks
been transferred from zio_compress_data to its caller. Therefore, passing
the 'minblocksize' down will be a no-op.
Eliminate the parameter to reduce diff against upstream.
MFC after: 2 weeks
read workload by splitting the single teardown rrw lock into
RRM_NUM_LOCKS (17) of them.
Read acquisitions are randomly distributed among these locks based
on curthread pointer. Write acquisitions are going to all the
locks, which for the usage of this type of lock should be rare.
Illumos issue:
5008 lock contention (rrw_exit) while running a read only load
MFC after: 2 weeks
When a sync task is waiting for a txg to complete, we should hurry it along
by increasing the number of outstanding async writes (i.e. make
vdev_queue_max_async_writes() return a larger number).
Illumos issue:
4753 increase number of outstanding async writes when sync task is waiting
MFC after: 2 weeks
Change the interaction between the DMU and ARC so that when the DMU is
shutting down an objset, we do not evict the data from the ARC. Instead
we simply coordinate the destruction of the DMU's data with the ARC.
The only case where we actually need to explicitly evict from the ARC is
when dbuf_rele_and_unlock() determines that the administrator has requested
that it not be kept in memory, via the primarycache/secondarycache properties.
In this case, we evict the data from the ARC by its blkptr_t, the same way
as when a block is freed we explicitly evict it from the ARC.
Illumos issue:
4631 zvol_get_stats triggering too many reads
MFC after: 2 weeks
Instead of asserting all zio's be properly aligned, only assert
on the logical ones.
Cap uberblocks at 8k, otherwise with ashift=17, there would be
only one uberblock.
This fixes a problem that zdb would trip assert on pools with
ashift >= 0xe (8k).
While there, also change the code so it only attempt to condense
space map unless the uncondensed size consumes greater than
zfs_metaslab_condense_block_threshold blocks.
Illumos issue:
4958 zdb trips assert on pools with ashift >= 0xe
MFC after: 2 weeks
Improve extreme rewind import.
When doing an "extreme rewind" import ("zpool import -XF"), we attempt
to verify all data in the pool, essentially scrubbing the entire pool.
The problem is that spa_load_verify_cb() issues an unbounded number of
concurrent scrub i/os. This can lead to all of memory being used for
these zio's, wedging the system. Like normal scrub, we need to put a
cap on the number of outstanding i/os, and have the traverse thread
block when we reach this cap.
For this purpose the cap can be very large (10,000) to optimize the
elevator algorithm. Three kernel tunables have been added:
vfs.zfs.spa_load_verify_maxinflight
vfs.zfs.spa_load_verify_metadata
vfs.zfs.spa_load_verify_data
The latter two tunables controls whether metadata and/or user data
when doing extreme rewind.
Make 'zpool import -T' imply scrub.
Make zpool import -T <txg> accept hexadecimal values for the txg when
prefixed with 0x.
Skip txg's for which there is no uberblock when doing extreme rewind.
Skip reading all user data twice by skipping prefetches when doing
extreme rewinds as we do not access via the ARC.
Illumos issues:
4970 need controls on i/o issued by zpool import -XF
4971 zpool import -T should accept hex values
4972 zpool import -T implies extreme rewind, and thus a scrub
4973 spa_load_retry retries the same txg
4974 spa_load_verify() reads all data twice
MFC after: 2 weeks
the upstream implementation and helps ensure that a trap induced by tracing
fbt::trap:entry is handled without recursively generating another trap.
This makes it possible to run most (but not all) of the DTrace tests under
common/safety/ without triggering a kernel panic.
Submitted by: Anton Rang <anton.rang@isilon.com> (original version)
Phabric: D95
Use reserved space for ZFS administrative commands.
We reserve 1/2^spa_slop_shift = 1/32 or 3.125% of pool space (or 32MB at
least) for system use. Most ZPL operations, e.g. write(2), creat(2), will
fail with ENOSPC if we fall below this.
Certain operations, e.g. file removal and most administrative actions,
still permitted until half of the slop space is used. This would allow
users to use these operations to free up space in the pool when pool is
close to full but half of slop space is still free.
A very restricted set of operations that frees up space or change quota
are always permitted, regardless of the amount of free space.
MFC after: 2 weeks
Explicitly mark file removal transactions as "presumed to result
in a net free of space" so they will not fail with ENOSPC.
Illumos issue: 4950 files sometimes can't be removed from a full
filesystem
MFC after: 2 weeks
While it is possible to create and write file, modify its permissions, etc.
without ever doing sync, it looks odd that it is required for setting
extended file attributes on ZFS. UFS does not do sync there too.
Samba uses those extended attributes to store some its data, and doing it
synchronously by many times reduces file creation performance for systems
without SLOG device.
Reviewed by: delphij, jpaetzel, silence on fs@
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
This includes:
o All directories named *ia64*
o All files named *ia64*
o All ia64-specific code guarded by __ia64__
o All ia64-specific makefile logic
o Mention of ia64 in comments and documentation
This excludes:
o Everything under contrib/
o Everything under crypto/
o sys/xen/interface
o sys/sys/elf_common.h
Discussed at: BSDcan
6679140 asymmetric alloc/dealloc activity can induce dynamic variable drops
6679193 dtrace_dynvar walker produces flood of dtrace_dynhash_sink
This finishes a set of merges from the older OpenSolaris releases.
Still the FreeBSD port has many differences that are difficult to
account for but that seems normal given that the kernels are different.
MFC after: 1 week
6851093 system drops to kmdb with anonymous dtrace probes + kmdb
This has no effect on FreeBSD (code is ifdef'ed) but is useful as
reference for future merges.
MFC after: 1 week
Don't check ZVOL_WCE flag, used in Solaris to control device "write cache".
It is not applicable on FreeBSD and by default set to "disable".
MFC after: 3 days
These have no effect on FreeBSD, in fact they are ifdef'ed,
but make easier future merges:
6699767 panic in spec_open()
6718877 crgetzoneid() use can cause problems when forking processes with
USDT providers in a non global zone
MFC after: 3 days
MFV r260708
4427 pid provider rejects probes with valid UTF-8 names
Use of u8_textprep.c broke the build on powerpc.
Reported by: bz, rpaulo and tinderbox.
Pointyhat: me
These changes prevent sysctl(8) from returning proper output,
such as:
1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.
Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.
MFC after: 2 weeks
Sponsored by: Mellanox Technologies
2915 DTrace in a zone should see "cpu", "curpsinfo", et al
2916 DTrace in a zone should be able to access fds[]
2917 DTrace in a zone should have limited provider access
MFC after: 2 weeks
4427 pid provider rejects probes with valid UTF-8 names
This make use of Solaris' u8_validate() which we happen to
use since r185029 for ZFS.
Illumos Revision: 1444d846b126463eb1059a572ff114d51f7562e5
Reference:
https://www.illumos.org/issues/4427
Obtained from: Illumos
MFC after: 2 weeks
usage from dtrace. The dtrace code already uses cdevpriv(9) since FreeBSD
8, so this change should be quite harmless.
Reviewed by: markj
Approved by: markj
MFC after: never
the 4 byte-aligned dtrace_invop_callsite can be found and that it
immediately follows the call to dtrace_invop(). Secondly, fix some pointer
arithmetic to account for differences between struct i386_frame and illumos'
struct frame. Finally, ensure that dtrace_getarg() isn't inlined. It works
by following a fixed number of frame pointers to the probe site, so inlining
breaks it.
MFC after: 3 weeks
first five for probes entered through a UD fault (i.e. FBT probes).
Specifically, handle the fact that dtrace_invop_callsite must be
16 byte-aligned and thus may not immediately follow the call to
dtrace_invop() in dtrace_invop_start(). Also fetch register arguments and
the stack pointer through a struct trapframe instead of a struct reg.
PR: 191260
Submitted by: luke.tw@gmail.com
MFC after: 3 weeks
defined. This ensures that the sdt:zfs:: probes appear despite the fact
the sdt provider is defined in the kernel rather than in zfs.ko.
Reported by: hiren
Tested by: hiren
MFC after: 2 weeks
selection. gethrtime() in our port updated with HZ rate, so unusable for
this specific purpose, completely draining benefit of multiple taskqueues.
MFC after: 2 weeks
Add a new zfs property, "redundant_metadata" which can have values "all" or
"most". The default will be "all", which is the current behavior. When set
to all, ZFS stores an extra copy of all metadata. If a single on-disk block
is corrupt, at worst a single block of user data (which is recordsize bytes
long) can be lost.
Setting to "most" will cause us to only store 1 copy of level-1 indirect
blocks of user data files. This can improve performance of random writes,
because less metadata has to be written. In practice, at worst about
100 blocks (of recordsize bytes each) of user data can be lost if a single
on-disk block is corrupt.
The exact behavior of which metadata blocks are stored redundantly may change
in future releases.
Illumos issue: 3835 zfs need not store 2 copies of all metadata
MFC after: 2 weeks
This includes decodes of recent Intel instructions, in particular
VT-x and related instructions. This allows the FBT provider to
locate the exit points of routines that include these new
instructions.
Illumos issues:
3414 Need a new word of AT_SUN_HWCAP bits
3415 Add isainfo support for f16c and rdrand
3416 Need disassembler support for rdrand and f16c
3413 isainfo -v overflows 80 columns
3417 mdb disassembler confuses rdtscp for invlpg
1518 dis should support AMD SVM/AMD-V/Pacifica instructions
1096 i386 disassembler should understand complex nops
1362 add kvmstat for monitoring of KVM statistics
1363 add vmregs[] variable to DTrace
1364 need disassembler support for VMX instructions
1365 mdb needs 16-bit disassembler support
This corresponds to Illumos-gate (github) version
eb23829ff08a873c612ac45d191d559394b4b408
Reviewed by: markj
MFC after: 1 week
4730 metaslab group taskq should be destroyed in metaslab_group_destroy()
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Original author: George Wilson
MFC after: 3 days
cause stack overflow for IO's which return ZIO_PIPELINE_CONTINUE
from the zio_vdev_io_start stage and hence don't suspend and complete
in a different thread.
This prevents double fault panic on slow machines running ZFS on
GELI volumes which return EOPNOTSUPP directly to BIO_DELETE requests.
MFC after: 1 month
X-MFC-With: r265152