In dnode_sync(), do dnode_increase_indirection() before processing
the dn_next_nblkptr.
Illumos issue:
5117 space map reallocation can cause corruption
MFC after: 3 days
Also restore kmem_used() check for i386 as it has KVA limits that the raw
page counts above don't consider
PR: 187594
Reviewed by: peter
X-MFC-With: r270759
Review: D700
Sponsored by: Multiplay
Prior to this change we triggered ARC reclaim when kmem usage passed 3/4
of the total available, as indicated by vmem_size(kmem_arena, VMEM_ALLOC).
This could lead large amounts of unused RAM e.g. on a 192GB machine with
ARC the only major RAM consumer, 40GB of RAM would remain unused.
The old method has also been seen to result in extreme RAM usage under
certain loads, causing poor performance and stalls.
We now trigger ARC reclaim when the number of free pages drops below the
value defined by the new sysctl vfs.zfs.arc_free_target, which defaults
to the value of vm.v_free_target.
Credit to Karl Denninger for the original patch on which this update was
based.
PR: 191510 and 187594
Tested by: dteske
MFC after: 1 week
Relnotes: yes
Sponsored by: Multiplay
tracepoints would continue to generate traps, which would be ignored but
could consume noticeable amounts of CPU if, say, all functions in the kernel
were instrumented.
X-MFC-With: r270067
duplicating the entire implementation for both x86 and powerpc. This makes
it easier to add support for other architectures and has no functional
impact.
Phabric: D613
Reviewed by: gnn, jhibbits, rpaulo
Tested by: jhibbits (powerpc)
MFC after: 2 weeks
In vdev_get_stats, check that the vdev is not a hole before computing the
fragmentation. This fixes a panic when removing log device.
Illumos issue:
5049 panic when removing log device
Author: Alex Reece <alex@delphix.com>
MFC after: 2 weeks
Encode CPU's number by XOR'ing the CPU ID against the 64-bit cpu_ticks().
Reviewed by: mav, gibbs
Differential Revision: https://phabric.freebsd.org/D521
MFC after: 2 weeks
In dnode_children_t, use C99's "[]" idiom for declaring the variable
sized array dnc_children at the end of the structure.
This prevents the compiler from mistakenly optimizing away accesses
beyond the array's defined size.
Illumos issue:
5038 Remove "old-style" flexible array usage in ZFS.
Author: Justin T. Gibbs <justing@spectralogic.com>
MFC after: 2 weeks
This prevents recursion of vdev_queue_io_done as per r265321 but
using a different method as recommended on the openzfs list.
We now use zio_interrupt(zio) and return ZIO_PIPELINE_STOP instead
of returning ZIO_PIPELINE_CONTINUE from vdev_*_io_start methods.
zio_vdev_io_start now ASSERTS the that vdev_op_io_start returns
ZIO_PIPELINE_STOP to ensure future changes don't reintroduce
ZIO_PIPELINE_CONTINUE returns.
Cleanup flow in vdev_geom_io_start while I'm here.
Also fix some cases not using SET_ERROR(..)
MFC after: 2 weeks
X-MFC-With: r265321
nanouptime() instead of getnanouptime(). nanouptime(9) provides more
precise result at expense of being slower.
In r269223, gethrtime() is used as creation time of dbuf, which in turn
acts as portion of lookup key to maintain AVL invariant where there can
not be duplicate items. Before this change, gethrtime() have preferred
better execution time by sacrificing precision, which may lead to panic
on busy systems with:
panic: avl_find() succeeded inside avl_add()
Reported by: allanjude, mav
PR: kern/192284
MFC after: 11 days
X-MFC-with: r269223
Increase default ARC buf_hash_table size. When typical block size is small,
the hash table could be too small, which would lead to long hash chains and
limit performance for cached reads.
A new loader tunable, vfs.zfs.arc_average_blocksize, have been added which
allows users to override the default assumption of average (typical) block
size. Old default was 65536 (64 KiB) and new default is 8192 (8 KiB).
Illumos issue:
5034 ARC's buf_hash_table is too small
MFC after: 2 weeks
Change dn->dn_dbufs from linked list to AVL tree.
Illumos issues:
4873 zvol unmap calls can take a very long time for larger datasets
MFC after: 2 weeks
code behave more like it is on Solaris.
Reported by: avg
Reviewed by: avg, mav (but bugs are mine)
Differential Revision: https://phabric.freebsd.org/D457
postponing it to zfs_vget(). zfs_root() returned vnode with the
default value of v_hash, which caused inconsistent v_hash value when
root vnode was obtained from zfs_vget().
Nullfs allocated two upper vnodes for the root zfs vnode due to
different hashes, causing consistency problems.
Reported and tested by: Harald Schmalzbauer <h.schmalzbauer@omnilan.de>
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Import Illumos changes to address the following Illumos issues:
4976 zfs should only avoid writing to a failing non-redundant
top-level vdev
4978 ztest fails in get_metaslab_refcount()
4979 extend free space histogram to device and pool
4980 metaslabs should have a fragmentation metric
4981 remove fragmented ops vector from block allocator
4982 space_map object should proactively upgrade when feature
is enabled
4984 device selection should use fragmentation metric
MFC after: 2 weeks
been transferred from zio_compress_data to its caller. Therefore, passing
the 'minblocksize' down will be a no-op.
Eliminate the parameter to reduce diff against upstream.
MFC after: 2 weeks
read workload by splitting the single teardown rrw lock into
RRM_NUM_LOCKS (17) of them.
Read acquisitions are randomly distributed among these locks based
on curthread pointer. Write acquisitions are going to all the
locks, which for the usage of this type of lock should be rare.
Illumos issue:
5008 lock contention (rrw_exit) while running a read only load
MFC after: 2 weeks
When a sync task is waiting for a txg to complete, we should hurry it along
by increasing the number of outstanding async writes (i.e. make
vdev_queue_max_async_writes() return a larger number).
Illumos issue:
4753 increase number of outstanding async writes when sync task is waiting
MFC after: 2 weeks
Change the interaction between the DMU and ARC so that when the DMU is
shutting down an objset, we do not evict the data from the ARC. Instead
we simply coordinate the destruction of the DMU's data with the ARC.
The only case where we actually need to explicitly evict from the ARC is
when dbuf_rele_and_unlock() determines that the administrator has requested
that it not be kept in memory, via the primarycache/secondarycache properties.
In this case, we evict the data from the ARC by its blkptr_t, the same way
as when a block is freed we explicitly evict it from the ARC.
Illumos issue:
4631 zvol_get_stats triggering too many reads
MFC after: 2 weeks
Instead of asserting all zio's be properly aligned, only assert
on the logical ones.
Cap uberblocks at 8k, otherwise with ashift=17, there would be
only one uberblock.
This fixes a problem that zdb would trip assert on pools with
ashift >= 0xe (8k).
While there, also change the code so it only attempt to condense
space map unless the uncondensed size consumes greater than
zfs_metaslab_condense_block_threshold blocks.
Illumos issue:
4958 zdb trips assert on pools with ashift >= 0xe
MFC after: 2 weeks
Improve extreme rewind import.
When doing an "extreme rewind" import ("zpool import -XF"), we attempt
to verify all data in the pool, essentially scrubbing the entire pool.
The problem is that spa_load_verify_cb() issues an unbounded number of
concurrent scrub i/os. This can lead to all of memory being used for
these zio's, wedging the system. Like normal scrub, we need to put a
cap on the number of outstanding i/os, and have the traverse thread
block when we reach this cap.
For this purpose the cap can be very large (10,000) to optimize the
elevator algorithm. Three kernel tunables have been added:
vfs.zfs.spa_load_verify_maxinflight
vfs.zfs.spa_load_verify_metadata
vfs.zfs.spa_load_verify_data
The latter two tunables controls whether metadata and/or user data
when doing extreme rewind.
Make 'zpool import -T' imply scrub.
Make zpool import -T <txg> accept hexadecimal values for the txg when
prefixed with 0x.
Skip txg's for which there is no uberblock when doing extreme rewind.
Skip reading all user data twice by skipping prefetches when doing
extreme rewinds as we do not access via the ARC.
Illumos issues:
4970 need controls on i/o issued by zpool import -XF
4971 zpool import -T should accept hex values
4972 zpool import -T implies extreme rewind, and thus a scrub
4973 spa_load_retry retries the same txg
4974 spa_load_verify() reads all data twice
MFC after: 2 weeks
the upstream implementation and helps ensure that a trap induced by tracing
fbt::trap:entry is handled without recursively generating another trap.
This makes it possible to run most (but not all) of the DTrace tests under
common/safety/ without triggering a kernel panic.
Submitted by: Anton Rang <anton.rang@isilon.com> (original version)
Phabric: D95
Use reserved space for ZFS administrative commands.
We reserve 1/2^spa_slop_shift = 1/32 or 3.125% of pool space (or 32MB at
least) for system use. Most ZPL operations, e.g. write(2), creat(2), will
fail with ENOSPC if we fall below this.
Certain operations, e.g. file removal and most administrative actions,
still permitted until half of the slop space is used. This would allow
users to use these operations to free up space in the pool when pool is
close to full but half of slop space is still free.
A very restricted set of operations that frees up space or change quota
are always permitted, regardless of the amount of free space.
MFC after: 2 weeks
Explicitly mark file removal transactions as "presumed to result
in a net free of space" so they will not fail with ENOSPC.
Illumos issue: 4950 files sometimes can't be removed from a full
filesystem
MFC after: 2 weeks
While it is possible to create and write file, modify its permissions, etc.
without ever doing sync, it looks odd that it is required for setting
extended file attributes on ZFS. UFS does not do sync there too.
Samba uses those extended attributes to store some its data, and doing it
synchronously by many times reduces file creation performance for systems
without SLOG device.
Reviewed by: delphij, jpaetzel, silence on fs@
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
This includes:
o All directories named *ia64*
o All files named *ia64*
o All ia64-specific code guarded by __ia64__
o All ia64-specific makefile logic
o Mention of ia64 in comments and documentation
This excludes:
o Everything under contrib/
o Everything under crypto/
o sys/xen/interface
o sys/sys/elf_common.h
Discussed at: BSDcan
6679140 asymmetric alloc/dealloc activity can induce dynamic variable drops
6679193 dtrace_dynvar walker produces flood of dtrace_dynhash_sink
This finishes a set of merges from the older OpenSolaris releases.
Still the FreeBSD port has many differences that are difficult to
account for but that seems normal given that the kernels are different.
MFC after: 1 week
6851093 system drops to kmdb with anonymous dtrace probes + kmdb
This has no effect on FreeBSD (code is ifdef'ed) but is useful as
reference for future merges.
MFC after: 1 week