a probe name. When dtrace -G builds up a DOF section for the specified
provider(s), the probe function names are truncated to fit in this limit.
The DOF is later used to build the symbol table for the generated object
file, so the table can end up with truncated references, causing link
errors.
Instead of potentially truncating symbol table entries, write the full
function name to the DOF string table and allow the kernel to enforce the
128-byte function name limit when a process attempts to load its DOF.
PR: 194757
Differential Revision: https://reviews.freebsd.org/D1175
Reviewed by: rpaulo
MFC after: 2 weeks
union must be checked when determine whether two types are equivalent. This
bug could cause ctfmerge(1) to incorrectly merge distinct types.
Reviewed by: Robert Mustacchi <rm@joyent.com>
MFC after: 2 weeks
Sponsored by: EMC / Isilon Storage Division
These would cause ctfconvert(1) to return an error when attempting to
resolve valid C types.
Reviewed by: Robert Mustacchi <rm@joyent.com>
MFC after: 2 weeks
Sponsored by: EMC / Isilon Storage Division
ZFS large block support.
Please note that booting from datasets that have recordsize greater
than 128KB is not supported (but it's Okay to enable the feature on
the pool). This *may* remain unchanged because of memory constraint.
Limited safety belt is provided for mounted root filesystem but use
caution is advised.
Illumos issue:
5027 zfs large block support
MFC after: 1 month
Initialize tqent_flags in the userland taskq implementation. Without
this the assertion of tq->tq_freelist != NULL may fail in taskq_destroy.
The problem is that tqent_flags is never initialized in the userland
implementation while the kernel one does initialize it. Without proper
initialization, the flag may have its lowest bit set, making it treated
as TQENT_FLAG_PREALLOC and never removing taskq_ent_t from tq_freelist.
MFC after: 2 weeks
- Limit ARC for zdb at 256MB. zdb do not typically revisit data
in the ARC.
- Increase default max_inflight from 200 to 1000 (can be overriden
by -I) so we can queue more I/Os when doing scrubbing.
- Print status while loading meataslabs for leak detection.
Illumos issues:
5169 zdb should limit its ARC size
5170 zdb -c should create more scrub i/os by default
5171 zdb should print status while loading metaslabs for leak detection
MFC after: 2 weeks
one to, for example, access the "provider" field of a struct g_consumer,
even though "provider" is a D keyword.
PR: 169657
MFC after: 2 months
Discussed with: Bryan Cantrill
Sponsored by: EMC / Isilon Storage Division
modifications to libproc to support fetching the CTF info for a given file.
With this change, dtrace(1) is able to resolve type info for function and
USDT probe arguments, and function return values. In particular, the args[n]
syntax should now work for referencing arguments of userland probes,
provided that the requisite CTF info is available.
The uctf tests pass if the test programs are compiled with CTF info. The
current infrastructure around the DTrace test suite doesn't support this
yet.
Differential Revision: https://reviews.freebsd.org/D891
MFC after: 1 month
Relnotes: yes
Sponsored by: EMC / Isilon Storage Division
install signal handlers when running in list mode (-l), and acknowledge
interrupts by cleaning up and exiting. This ensures that a command like
$ dtrace -l -P 'pid$target' -p <target PID> | less
won't cause the ptrace(2)d target process to be killed if less(1) exits
before all dtrace output is consumed.
Reported by: Anton Yuzhaninov <citrin+bsd@citrin.ru>
Differential Revision: https://reviews.freebsd.org/D880
Reviewed by: rpaulo
MFC after: 1 month
Sponsored by: EMC / Isilon Storage Division
In the case where new features where enabled by a zpool upgrade -a the
boot code warning wasn't output.
Submitted by: Jan Kokemueller
MFC after: 3 days
This errno value is emitted by dsl_props_set_check() in
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dsl_prop.c, and
is used to mean that the property value is too long. For the record,
the maximum length is ZAP_MAXVALUELEN, which is 8*1024 bytes.
Instead of claiming an unknown error (and abort()ing), provide
something more specific to the scenario involved. As far as I
can tell, E2BIG is not emitted for any other scenario.
MFC after: 1 week
Sponsored by: Spectra Logic
Affects: All ZFS versions starting 27 Feb 2009 (illumos ccba0801)
This change modified the value returned by
dsl_props_set_check(), so that it can distinguish between
a name that's too long and a value that's too long, but
libzfs was not updated accordingly.
MFSpectraBSD: r1051499 on 2014/03/28 11:07:59
This can occur if a spare is being spared, which would yield three
children: the original pool drive, the previous spare, and the spare
that is replacing it.
MFC after: 1 week
Sponsored by: Spectra Logic
Affects: All ZFS versions starting 7 Jun 2006 (illumos 94de1d4c)
MFSpectraBSD: r668345 on 2013/06/04 17:10:43
It seems that if a pragma is used to define a weak alias for a local
function, the pragma must appear after the function is defined.
PR: 193056
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
seems that they would only pass by chance on illumos; on FreeBSD, they still
fail since userland CTF is not yet supported.
MFC after: 1 week
Sponsored by: EMC / Isilon Storage Division
Enable debug printf's when ZFS_DEBUG or debug= is set.
Illumos issue:
5134 if ZFS_DEBUG or debug= is set, libzpool should enable debug prints
MFC after: 2 weeks
doing them in drti during startup. This fixes a number of problems with
using USDT probes in stripped executables and shared libraries, and with
USDT probes in static functions.
Reviewed by: rpaulo
MFC after: 1 month
Sponsored by: EMC / Isilon Storage Division
Phabric: D751
Iterate through all the children instead of returning error when we hit
the first error. This makes the error message give more information
rather than just the first device that causes problem.
Illumos issue:
5118 When verifying or creating a storage pool, error messages only
show one device
MFC after: 2 weeks
Double test device size for ztest(1).
Illumos issue:
5039 ztest should default to larger device sizes
Author: Matthew Ahrens <mahrens@delphix.com>
MFC after: 2 weeks
Change dn->dn_dbufs from linked list to AVL tree.
Illumos issues:
4873 zvol unmap calls can take a very long time for larger datasets
MFC after: 2 weeks
Import Illumos changes to address the following Illumos issues:
4976 zfs should only avoid writing to a failing non-redundant
top-level vdev
4978 ztest fails in get_metaslab_refcount()
4979 extend free space histogram to device and pool
4980 metaslabs should have a fragmentation metric
4981 remove fragmented ops vector from block allocator
4982 space_map object should proactively upgrade when feature
is enabled
4984 device selection should use fragmentation metric
MFC after: 2 weeks
Instead of asserting all zio's be properly aligned, only assert
on the logical ones.
Cap uberblocks at 8k, otherwise with ashift=17, there would be
only one uberblock.
This fixes a problem that zdb would trip assert on pools with
ashift >= 0xe (8k).
While there, also change the code so it only attempt to condense
space map unless the uncondensed size consumes greater than
zfs_metaslab_condense_block_threshold blocks.
Illumos issue:
4958 zdb trips assert on pools with ashift >= 0xe
MFC after: 2 weeks
Improve extreme rewind import.
When doing an "extreme rewind" import ("zpool import -XF"), we attempt
to verify all data in the pool, essentially scrubbing the entire pool.
The problem is that spa_load_verify_cb() issues an unbounded number of
concurrent scrub i/os. This can lead to all of memory being used for
these zio's, wedging the system. Like normal scrub, we need to put a
cap on the number of outstanding i/os, and have the traverse thread
block when we reach this cap.
For this purpose the cap can be very large (10,000) to optimize the
elevator algorithm. Three kernel tunables have been added:
vfs.zfs.spa_load_verify_maxinflight
vfs.zfs.spa_load_verify_metadata
vfs.zfs.spa_load_verify_data
The latter two tunables controls whether metadata and/or user data
when doing extreme rewind.
Make 'zpool import -T' imply scrub.
Make zpool import -T <txg> accept hexadecimal values for the txg when
prefixed with 0x.
Skip txg's for which there is no uberblock when doing extreme rewind.
Skip reading all user data twice by skipping prefetches when doing
extreme rewinds as we do not access via the ARC.
Illumos issues:
4970 need controls on i/o issued by zpool import -XF
4971 zpool import -T should accept hex values
4972 zpool import -T implies extreme rewind, and thus a scrub
4973 spa_load_retry retries the same txg
4974 spa_load_verify() reads all data twice
MFC after: 2 weeks
zpool status -x is used to identify pools that are exhibiting
errors or are otherwise unavailable, therefore non-native
block-size pools shouldn't be reported.
Also update man page to clarify other additional conditions
which won't cause a pool to be displayed under zpool status -x.
Sponsored by: Multiplay
Use reserved space for ZFS administrative commands.
We reserve 1/2^spa_slop_shift = 1/32 or 3.125% of pool space (or 32MB at
least) for system use. Most ZPL operations, e.g. write(2), creat(2), will
fail with ENOSPC if we fall below this.
Certain operations, e.g. file removal and most administrative actions,
still permitted until half of the slop space is used. This would allow
users to use these operations to free up space in the pool when pool is
close to full but half of slop space is still free.
A very restricted set of operations that frees up space or change quota
are always permitted, regardless of the amount of free space.
MFC after: 2 weeks
Refresh zpool list for each interval in order to produce fresh
output.
Illumos issue: 4966 zpool list iterator does not update output
MFC after: 2 weeks
This includes:
o All directories named *ia64*
o All files named *ia64*
o All ia64-specific code guarded by __ia64__
o All ia64-specific makefile logic
o Mention of ia64 in comments and documentation
This excludes:
o Everything under contrib/
o Everything under crypto/
o sys/xen/interface
o sys/sys/elf_common.h
Discussed at: BSDcan
2915 DTrace in a zone should see "cpu", "curpsinfo", et al
2916 DTrace in a zone should be able to access fds[]
2917 DTrace in a zone should have limited provider access
MFC after: 2 weeks
Merge from r258379 missed the tests.
4248 dtrace(1M) should never create DOF with empty probes section
4249 Only probes from the first DTrace object file will be included
Illumos Revision: 54a20ab41aadcb81c53e72fc65886e964e9add59
MFC after: 5 days
Add a new zfs property, "redundant_metadata" which can have values "all" or
"most". The default will be "all", which is the current behavior. When set
to all, ZFS stores an extra copy of all metadata. If a single on-disk block
is corrupt, at worst a single block of user data (which is recordsize bytes
long) can be lost.
Setting to "most" will cause us to only store 1 copy of level-1 indirect
blocks of user data files. This can improve performance of random writes,
because less metadata has to be written. In practice, at worst about
100 blocks (of recordsize bytes each) of user data can be lost if a single
on-disk block is corrupt.
The exact behavior of which metadata blocks are stored redundantly may change
in future releases.
Illumos issue: 3835 zfs need not store 2 copies of all metadata
MFC after: 2 weeks
(4543:12bb2876a62e). Without this, some third party applications
may break because the lack of AVL related symbols.
FreeBSD base system are not affected because the FreeBSD ZFS command
line tools were all linked against libavl and thus hide the underlying
issue.
PR: java/183081
Tested by: jkim
MFC after: 3 days
illumos, rather than using "1.0" everywhere.
Some of the translators use D functions that are not present in version
1.0 (e.g. inet_ntoa()) which can result in libdtrace crashing when running
scripts that restrict themselves to version 1.0
(e.g. with "-x version=1.0").
MFC after: 1 week
FreeBSD ZFS port unlike OpenSolaris does not use device IDs, and does not
implement respective devid_*() fuctions. It is pointless to open devices
just to close them back immediately.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
The thread pool is used by libzfs to implement parallel disk scanning.
Without this change our dummy wrapper made `zpool import ZZZ` command to
scan all disks sequentially from the single thread when searching for pools.
This change makes it use two threads per CPU, same as in OpenSolaris.
On system with 200 HDDs this change reduces ZFS pool import time from 35
to 22 seconds.
fail to attach to stripped binaries. With the _r_debug_postinit symbol,
dtrace(1) can now set a breakpoint in the victim process after it has
registered its DOF table(s) with the kernel. r_debug_state cannot be used
for this purpose since it is called before DOF is made available, in which
case dtrace(1) cannot create USDT probes before the program begins
execution.
MFC after: 2 weeks
This change adds tests/ directories in the source tree to create various
subdirectories in /usr/tests/ and to install placeholder Kyuafiles for
them.
the relevant hierarchies are: cddl, etc, games, gnu and secure.
The reason for this is to simplify the addition of new test programs for
utilities or libraries under any of these directories. Doing so on a
case by case basis is unnecessary and is quite an obscure process.
and zdb(8) by growing the buffer on demand with a cap of 1GB (specified in
spa_history_create_obj()).
PR: bin/186574
Submitted by: Andrew Childs <lorne cons org nz> (with changes)
MFC after: 2 weeks
and finish the job. ncurses is now the only Makefile in the tree that
uses it since it wasn't a simple mechanical change, and will be
addressed in a future commit.
New ZFS property volmode and sysctl vfs.zfs.vol.mode allow switching ZVOL
between three modes:
geom -- existing fully functional behavior (default);
dev -- exposing volumes only as raw disk device file in devfs;
none -- not exposing volumes outside ZFS.
The "dev" mode is less functional (can't be partitioned, mounted, etc),
but it is faster, and in some scenarios with untrusted consumers safer.
It can be useful for NAS, VM block storages, etc.
The "none" mode may be convenient for backup servers, etc. that don't
need direct data access.
Due to the way ZVOL is integrated with main ZFS code, those property
and sysctl are checked only during pool import and volume creation.
MFC after: 1 month
Sponsored by: iXsystems, Inc.
4248 dtrace(1M) should never create DOF with empty probes section
4249 Only probes from the first DTrace object file will be included
Illumos Revision: 4a20ab41aadcb81c53e72fc65886e964e9add59
Reference:
https://www.illumos.org/issues/4248https://www.illumos.org/issues/4249
Obtained from: Illumos
MFC after: 1 month
Fix a memory leak in uu_avl_pool_create: pthread_mutex_init without
a corresponding pthread_mutex_destroy. It shows up, among other
places, when doing "zfs list".
MFC after: 3 weeks
Sponsored by: Spectra Logic Corporation
concatenates the DOF tables into one section. Previously, the USDT init
code in drti.o would only look at the first table in the DOF section; with
this change, it iterates over all the tables, passing each DOF table to
the kernel.
PR: 186821
Submitted by: Fedor Indutny <fedor@indutny.com>
MFC after: 1 month
illumos/illumos-gate@6fb4854bed
This fixes the tst.resize1.d and tst.resize2.d DTrace tests, which have
been failing since r261122 since they were causing dtrace(1) to attempt to
allocate and use large amounts of memory, and get killed by the OOM killer
as a result.
MFC after: 1 month
"Manpages should start a new sentence on a new line. This makes it easier
for translators to track changes." -jhb
Approved by: jhb
MFC after: 3 days
Sponsored by: SupraNet Communications, Inc
hot spares. This should be MFC'd to all STABLE branches.
Upon the availability of zfsd, the zpool manpage on relevant branches should
be updated to remove this caveat and document hot spare's reliance on zfsd.
Approved by: avg
MFC after: 1 week
Sponsored by: SupraNet Communications
The limitation was introduced in r178556 without any note or comment.
It seems pretty artificial and now it leads to problems like the following:
$ dtrace -x bufsize=17m -n ...
dtrace: processing aborted: Memory allocation failure
OpenSolaris and illumos never had this limitation.
Sponsored by: HybridCluster
emitting the DIE for the type of that member. ctfconvert can not
handle this properly and will calculate a wrong member bit offset.
Same struct/union type from different .o file will be treated as
different types when their member bit offsets are different, and
gets added/merged multiple times. This will in turn cause many other
structs/pointers/typedefs that refer to the duplicated struct/union
gets added/merged multiple times and eventually causes numerous
duplicated CTF types in the kernel.debug file.
The simple workaround here is to make use of DW_AT_byte_size attribute
of the member DIE to calculate the bits occupied by the member's type,
without actually resolving the type.
attributes generated by Clang 3.4.
* Document how different compilers generate DW_AT_data_member_location
attributes differently.
* Document the quirks about DW_FORM_data[48].
"__anon__". This hack is used to workaround a issue that compilers
like GCC could generate DW_TAG_base_type DIE without a name.
Note that we didn't need this before because the old libdwarf
internally set all the unnamed DIE's name to "__anon__".
4370 avoid transmitting holes during zfs send
4371 DMU code clean up
illumos/illumos-gate@43466aae47
NOTE: Make sure the boot code is updated if a zpool upgrade is
done on boot zpool.
MFC after: 2 weeks
3306 zdb should be able to issue reads in parallel
3321 'zpool reopen' command should be documented in the man page
and help message
illumos/illumos-gate@31d7e8fa33
FreeBSD porting notes: the kernel part of this changeset depends
on Solaris buf(9S) interfaces and are not really applicable for
our use. vdev_disk.c is patched as-is to reduce diverge from
upstream, but vdev_file.c is left intact.
MFC after: 2 weeks
SHT_RELA sections properly instead of assuming that the relocation section
is of type SHT_REL.
Submitted by: Prashanth Kumar <pra_udupi@yahoo.co.in> (original version)
MFC after: 1 month
4171 clean up spa_feature_*() interfaces
4172 implement extensible_dataset feature for use by other zpool
features
illumos/illumos-gate@2acef22db7
MFC after: 2 weeks
4168 ztest assertion failure in dbuf_undirty
4169 verbatim import causes zdb to segfa
4170 zhack leaves pool in ACTIVE state
illumos/illumos-gate@7fdd916c47
MFC after: 2 weeks
(64MB). Even if we would find one somehow, ZFS kernel code rejects such
devices. It is funny to look on attempts to read 4 256K vdev labels from
1.44MB floppy, though it is not very practical and quite slow.
expect the installed ksh binary to be named "ksh", which is not the case
when it's installed on FreeBSD via the shells/ksh93 port. Allow for it to be
"ksh93" as well so that the tests can actually pass.
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4104 ::spa_space no longer works
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
illumos/illumos-gate@0713e232b7
Note that some tunables have been removed and some new tunables have
been added. Of particular note, FreeBSD-only knob
vfs.zfs.space_map_last_hope is removed as it was a nop for some time now
(after one of the previous merges from upstream).
MFC after: 11 days
Sponsored by: HybridCluster [merge]
illumos/illumos-gate@69962b5647
Please note the following changes:
- zio_ioctl has lost its priority parameter and now TRIM is executed
with 'now' priority
- some knobs are gone and some new knobs are added; not all of them are
exposed as tunables / sysctls yet
MFC after: 10 days
Sponsored by: HybridCluster [merge]
943 zio_interrupt ends up calling taskq_dispatch with TQ_SLEEP
illumos/illumos-gate@5aeb94743e
Essentially FreeBSD taskqueues already operate in a mode that
was added to Illumos with taskq_dispatch_ent change.
We even exposed the superior FreeBSD interface as taskq_dispatch_safe.
Now we just rename taskq_dispatch_safe to taskq_dispatch_ent and
struct struct ostask to taskq_ent_t, so that code differences will be
minimal.
After this change sys/cddl/compat/opensolaris/sys/taskq.h header is no
longer needed.
Note that this commit is not an MFV because the upstream change was not
individually committed to the vendor area.
MFC after: 8 days
On some architectures (powerpc), char is unsigned by default, which means
comparisons against -1 always fail, so the programs get stuck in an
infinite loop.
MFC after: 1 week
to ILP32. Otherwise dtrace -G will attempt to use it on amd64 if it can't
determine which data model to use, which happens when -64 is omitted and
no object files are provided, e.g. with
# dtrace -G -n BEGIN
This would result in a linker error, but now works properly.
Also remove an unnecessary #ifdef.
MFC after: 2 weeks
The compiler will see the non-string literal arguments to the fprintf calls and
omit warnings for them. Quiese these warnings in contrib code:
cddl/contrib/opensolaris/lib/libnvpair/libnvpair.c:743:12: warning: format
string is not a string literal (potentially insecure) [-Wformat-security]
ARENDER(pctl, nvlist_array, nvl, name, val, nelem);
installed. Additionally, remove Solaris-specific sections and references,
and replace example outputs with output from lockstat on FreeBSD, since
lockstat's output contains stack traces.
This change also removes some examples that don't seem to work properly on
FreeBSD. The examples should be re-added when lockstat is fixed.
Reported by: avg
MFC after: 1 week
available under Oracle Solaris 11.
This includes an update to the ZFS(8) man page to reflect all the
available alias (snap, umount, and recv).
Initial changes obtained from ZFS On Linux + fixes for man page and cmd
help:
10b75496bbcf81b00a73
Obtained from: https://github.com/zfsonlinux/zfs
MFC after: 2 weeks
Sponsored by: Multiplay
null-separated strings to a single string. This can be used to print the
full arguments of a process using execsnoop (from the DTrace toolkit) or
with the following one-liner:
dtrace -n 'syscall::execve:return {trace(curpsinfo->pr_psargs);}'
Note that this relies on the process arguments being cached via the struct
proc, which means that it will not work for argvs longer than
kern.ps_arg_cache_limit. However, the following rather non-portable
script can be used to extract any argv at exec time:
fbt::kern_execve:entry
{
printf("%s", memstr(args[1]->begin_argv, ' ',
args[1]->begin_envv - args[1]->begin_argv));
}
The debug.dtrace.memstr_max sysctl limits the maximum argument size to
memstr(). Thanks to Brendan Gregg for helpful comments on freebsd-dtrace.
Tested by: Fabian Keil (earlier version)
MFC after: 2 weeks
don't make sense on FreeBSD. In particular,
- remove the ATTRIBUTES section,
- remove references to the Solaris Dynamic Tracing Guide, except in the
SEE ALSO section,
- update the description of the -A option for FreeBSD's implementation,
- remove references to Solaris-specific programs and configuration files,
and replace them with FreeBSD equivalents where possible.
The content has not changed aside from this.
Approved by: re (joel)
MFC after: 1 week
input path. These probes get some of the fields in host order, whereas the
output probes get them in network order, so a single translator isn't
enough. This workaround ensures that the problem is essentially invisble
to users: none of the probe arguments or their fields have changed.
Approved by: re (hrs)
Add support of Illumos dumps on zvol over RAID-Z.
Note that this only adds the features. FreeBSD would
still need more work to support dumping on zvols.
Illumos ZFS issues:
2932 support crash dumps to raidz, etc. pools
MFC after: 1 month
Approved by: re (ZFS blanket)
Illumos ZFS issues:
3582 zfs_delay() should support a variable resolution
3584 DTrace sdt probes for ZFS txg states
Provide a compatibility shim for Solaris's cv_timedwait_hires
to help aid future porting.
Approved by: re (ZFS blanket)
These programs and everything using libzpool rely on the embedded asserts to
verify the correctness of operations. Given that, the core dumps would be
useless without debug symbols.
2942 CTF tools need to handle files which legitimately lack data
2978 ctfconvert still needs to ignore legitimately dataless files on SPARC
Illumos Revisions: 13745:6b3106b4250f
13754:7231b684c18b
Reference:
https://www.illumos.org/issues/2942https://www.illumos.org/issues/2978
MFC after: 3 weeks
dynamic translation so that their arguments match the definitions for
these providers in Solaris and illumos. Thus, existing scripts for these
providers should work unmodified on FreeBSD.
Tested by: gnn, hiren
MFC after: 1 month
Don't treat the parameter as a number (pool GUID) when there is
error converting it from string, instead, treat it as the pool
name.
Illumos ZFS issues:
1765 assert triggered in libzfs_import.c trying to import pool
name beginning with a number
To quote original Illumos ticket:
libctf thinks that any ELF file containing more than 65536 sections is
corrupt, because it doesn't understand the SHN_XINDEX magic.
Illumos DTrace issues:
4005 libctf can't deal with extended sections
Illumos DTrace issues:
3089 want ::typedef
3094 libctf should support removing a dynamic type
3095 libctf does not validate arrays correctly
3096 libctf does not validate function types correctly
minimum allocation size for devices. Use this information to
automatically increase ZFS's minimum allocation size for new top-level
vdevs to a value that more closely matches the optimum device
allocation size.
Use GEOM's stripesize attribute, if set, as the physical sector
size of the GEOM.
Calculate the minimum blocksize of each metaslab class. Use the
calculated value instead of SPA_MINBLOCKSIZE (512b) when determining
the likelyhood of compression yeilding a reduction in physical space
usage.
Report devices with sub-optimal block size configuration in "zpool
status". Also properly fail attempts to attach devices with a
logical block size greater than 8kB, since this will cause corruption
to ZFS's label area.
Sponsored by: Spectra Logic Corporaion
MFC after: 2 weeks
Background
==========
Many modern devices use physical allocation units that are much
larger than the minimum logical allocation size accessible by
external commands. Two prevalent examples of this are 512e disk
drives (512b logical sector, 4K physical sector) and flash devices
(512b logical sector, 4K or larger allocation block size, and 128k
or larger erase block size). Operations that modify less than the
physical sector size result in a costly read-modify-write or garbage
collection sequence on these devices.
Simply exporting the true physical sector of the device to ZFS would
yield optimal performance, but has two serious drawbacks:
1) Existing pools created with devices that have different logical
and physical block sizes, but were configured to use the logical
block size (e.g. because the OS version used for pool construction
reported the logical block size instead of the physical block
size) will suddenly find that the vdev allocation size has
increased. This can be easily tolerated for active members of
the array, but ZFS would prevent replacement of a vdev with
another identical device because it now appears that the smaller
allocation size required by the pool is not supported by the new
device.
2) The device's physical block size may be too large to be supported
by ZFS. The optimal allocation size for the vdev may be quite
large. For example, a RAID controller may export a vdev that
requires read-modify-write cycles unless accessed using 64k
aligned/sized requests. ZFS currently has an 8k minimum block
size limit.
Reporting both the logical and physical allocation sizes for vdevs
solves these problems. A device may be used so long as the logical
block size is compatible with the configuration. By comparing the
logical and physical block sizes, new configurations can be optimized
and administrators can be notified of any existing pools that are
sub-optimal.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa.h:
Add the SPA_ASHIFT constant. ZFS currently has a hard upper
limit of 13 (8k) for ashift and this constant is used to
both document and enforce this limit.
sys/cddl/contrib/opensolaris/uts/common/sys/fs/zfs.h:
Add the VDEV_AUX_ASHIFT_TOO_BIG error code.
Add fields for exporting the configured, logical, and
physical ashift to the vdev_stat_t structure.
Add VDEV_STAT_VALID() macro which can be used to verify the
presence of required vdev_stat_t fields in nvlist data.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c:
Provide a SYSCTL_PROC handler for "max_auto_ashift". Since
the limit is only referenced long after boot when a create
operation occurs, there's no compelling need for it to be
a boot time configurable tunable. This also allows the
validation code for the max_auto_ashift value to be contained
within the sysctl handler.
Populate the new fields in the vdev_stat_t structure.
Fail vdev opens if the vdev reports an ashift larger than
SPA_MAXASHIFT.
Propogate vdev_logical_ashift and vdev_physical_ashift between
child and parent vdevs as is done for vdev_ashift.
In vdev_open(), restore code that fails opens for devices
where vdev_ashift grows. This can only happen now if the
device's logical ashift grows, which means it really isn't
safe to use the device.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_file.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_mirror.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_missing.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_root.c:
Update the vdev_open() API so that both logical (what was
just ashift before) and physical ashift are reported.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h:
Add two new fields, vdev_physical_ashift and vdev_logical_ashift,
to vdev_t.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_config.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:
Add vdev_ashift_optimize(). Call it anytime a new top-level
vdev is allocated.
cddl/contrib/opensolaris/cmd/zpool/zpool_main.c:
Add text for the VDEV_AUX_ASHIFT_TOO_BIG error.
For each sub-optimally configured leaf vdev, report configured
and native block sizes.
cddl/contrib/opensolaris/cmd/zpool/zpool_main.c:
cddl/contrib/opensolaris/lib/libzfs/common/libzfs.h:
cddl/contrib/opensolaris/lib/libzfs/common/libzfs_status.c:
Introduce a new zpool status: ZPOOL_STATUS_NON_NATIVE_ASHIFT.
This status is reported on healthy pools containing vdevs
configured to use a block size smaller than their reported
physical block size.
cddl/contrib/opensolaris/lib/libzfs/common/libzfs_status.c:
Update find_vdev_problem() and supporting functions to
provide the full vdev_stat_t structure to problem checking
routines, and to allow decent into replacing vdevs.
Add a vdev_non_native_ashift() validator which is used on
the full vdev tree to check for ZPOOL_STATUS_NON_NATIVE_ASHIFT.
cddl/contrib/opensolaris/lib/libzpool/common/kernel.c:
cddl/contrib/opensolaris/lib/libzpool/common/sys/zfs_context.h:
Enhance sysctl userland stubs now that a SYSCTL_PROC handler
is used in vdev.c.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab_impl.h:
When the group membership of a metaslab class changes (i.e.
when a vdev is added or removed from a pool), walk the group
list to determine the smallest block size currently available
and record this in the metaslab class.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c:
Add the metaslab_class_get_minblocksize() accessor.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio_compress.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio_compress.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:
In zio_compress_data(), take the minimum blocksize as an
input parameter instead of assuming SPA_MINBLOCKSIZE.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:
In l2arc_compress_buf(), pass SPA_MINBLOCKSIZE as the minimum
blocksize of the device. The l2arc code performs has it's own
code for deciding if compression is worth while, so this
effectively disables zio_compress_data() from second guessing
the original decision.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:
In zio_write_bp_init(), use the minimum blocksize of the
normal metaslab class when compressing data.
Illumos ZFS issues:
3957 ztest should update the cachefile before killing itself
3958 multiple scans can lead to partial resilvering
3959 ddt entries are not always resilvered
3960 dsl_scan can skip over dedup-ed blocks if
physical birth != logical birth
3961 freed gang blocks are not resilvered and can cause pool to suspend
3962 ztest should print out zfs debug buffer before exiting
Fix a regression introduced by fix for Illumos bug #3834. Quote from
Matthew Ahrens on the Illumos issue:
ztest fails this assertion because ztest_dmu_read_write() does
dmu_tx_hold_free(tx, bigobj, bigoff, bigsize);
and then
dmu_object_set_checksum(os, bigobj,
(enum zio_checksum)ztest_random_dsl_prop(ZFS_PROP_CHECKSUM), tx);
If the region to free is past the end of the file, the DMU assumes that there
will be nothing to do for this object. However, ztest does set_checksum(),
which must modify the dnode. The fix is for ztest to also call
dmu_tx_hold_bonus(tx, bigobj);
so we can account for the dirty data associated with setting the checksum
Illumos ZFS issues:
3955 ztest failure: assertion refcount_count(&tx->tx_space_written)
+ delta <= tx->tx_space_towrite
Merge vendor bugfix for ZFS test suite that triggers false positives.
Illumos ZFS issues:
3949 ztest fault injection should avoid resilvering devices
3950 ztest: deadman fires when we're doing a scan
3951 ztest hang when running dedup test
3952 ztest: ztest_reguid test and ztest_fault_inject don't place nice together
Fix zfs send -D hang after processing requiring a CTRL+C to interrupt due to
pthread_join prior to fd close.
This was introduced by r251646 (MFV r251644)
Illumos ZFS issue:
3909 "zfs send -D" does not work
MFC after: 1 day
Bring back some important fixes from Illumos:
3022 DTrace: keys should not affect the sort order when sorting by value
3023 it should be possible to dereference dynamic variables
3024 D integer narrowing needs some work
We particularly avoid the LD_NOLAZYLOAD changes that Illumos made
as those don't apply to FreeBSD and were causing problems in
interactive mode.
Illumos Revision: 13758:23432da34147
Reference:
https://www.illumos.org/issues/3022https://www.illumos.org/issues/3023https://www.illumos.org/issues/3024
MFC after: 1 month
Tested by: markj
GCC can generate bogus dwarf attributes with DW_AT_byte_size
set to 0xFFFFFFFF.
The issue was originaly detected in NetBSD but it has been
adapted for portability and to avoid compiler warnings.
Reference:
https://www.illumos.org/issues/3776
Obtained from: NetBSD
MFC after: 1 month
Restore a previous behavior before r251646, where when destructing
ZFS snapshot, the ioctl would return ENOENT when it hit any of
them in the errlist (the new behavior was only return ENOENT when
all returns error).
Illumos ZFS issues:
3829 fix for 3740 changed behavior of zfs destroy/hold/release ioctl
MFC after: 1 week