minimum allocation size for devices. Use this information to
automatically increase ZFS's minimum allocation size for new top-level
vdevs to a value that more closely matches the optimum device
allocation size.
Use GEOM's stripesize attribute, if set, as the physical sector
size of the GEOM.
Calculate the minimum blocksize of each metaslab class. Use the
calculated value instead of SPA_MINBLOCKSIZE (512b) when determining
the likelyhood of compression yeilding a reduction in physical space
usage.
Report devices with sub-optimal block size configuration in "zpool
status". Also properly fail attempts to attach devices with a
logical block size greater than 8kB, since this will cause corruption
to ZFS's label area.
Sponsored by: Spectra Logic Corporaion
MFC after: 2 weeks
Background
==========
Many modern devices use physical allocation units that are much
larger than the minimum logical allocation size accessible by
external commands. Two prevalent examples of this are 512e disk
drives (512b logical sector, 4K physical sector) and flash devices
(512b logical sector, 4K or larger allocation block size, and 128k
or larger erase block size). Operations that modify less than the
physical sector size result in a costly read-modify-write or garbage
collection sequence on these devices.
Simply exporting the true physical sector of the device to ZFS would
yield optimal performance, but has two serious drawbacks:
1) Existing pools created with devices that have different logical
and physical block sizes, but were configured to use the logical
block size (e.g. because the OS version used for pool construction
reported the logical block size instead of the physical block
size) will suddenly find that the vdev allocation size has
increased. This can be easily tolerated for active members of
the array, but ZFS would prevent replacement of a vdev with
another identical device because it now appears that the smaller
allocation size required by the pool is not supported by the new
device.
2) The device's physical block size may be too large to be supported
by ZFS. The optimal allocation size for the vdev may be quite
large. For example, a RAID controller may export a vdev that
requires read-modify-write cycles unless accessed using 64k
aligned/sized requests. ZFS currently has an 8k minimum block
size limit.
Reporting both the logical and physical allocation sizes for vdevs
solves these problems. A device may be used so long as the logical
block size is compatible with the configuration. By comparing the
logical and physical block sizes, new configurations can be optimized
and administrators can be notified of any existing pools that are
sub-optimal.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/spa.h:
Add the SPA_ASHIFT constant. ZFS currently has a hard upper
limit of 13 (8k) for ashift and this constant is used to
both document and enforce this limit.
sys/cddl/contrib/opensolaris/uts/common/sys/fs/zfs.h:
Add the VDEV_AUX_ASHIFT_TOO_BIG error code.
Add fields for exporting the configured, logical, and
physical ashift to the vdev_stat_t structure.
Add VDEV_STAT_VALID() macro which can be used to verify the
presence of required vdev_stat_t fields in nvlist data.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c:
Provide a SYSCTL_PROC handler for "max_auto_ashift". Since
the limit is only referenced long after boot when a create
operation occurs, there's no compelling need for it to be
a boot time configurable tunable. This also allows the
validation code for the max_auto_ashift value to be contained
within the sysctl handler.
Populate the new fields in the vdev_stat_t structure.
Fail vdev opens if the vdev reports an ashift larger than
SPA_MAXASHIFT.
Propogate vdev_logical_ashift and vdev_physical_ashift between
child and parent vdevs as is done for vdev_ashift.
In vdev_open(), restore code that fails opens for devices
where vdev_ashift grows. This can only happen now if the
device's logical ashift grows, which means it really isn't
safe to use the device.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_file.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_mirror.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_missing.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_raidz.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_root.c:
Update the vdev_open() API so that both logical (what was
just ashift before) and physical ashift are reported.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/vdev_impl.h:
Add two new fields, vdev_physical_ashift and vdev_logical_ashift,
to vdev_t.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa_config.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/spa.c:
Add vdev_ashift_optimize(). Call it anytime a new top-level
vdev is allocated.
cddl/contrib/opensolaris/cmd/zpool/zpool_main.c:
Add text for the VDEV_AUX_ASHIFT_TOO_BIG error.
For each sub-optimally configured leaf vdev, report configured
and native block sizes.
cddl/contrib/opensolaris/cmd/zpool/zpool_main.c:
cddl/contrib/opensolaris/lib/libzfs/common/libzfs.h:
cddl/contrib/opensolaris/lib/libzfs/common/libzfs_status.c:
Introduce a new zpool status: ZPOOL_STATUS_NON_NATIVE_ASHIFT.
This status is reported on healthy pools containing vdevs
configured to use a block size smaller than their reported
physical block size.
cddl/contrib/opensolaris/lib/libzfs/common/libzfs_status.c:
Update find_vdev_problem() and supporting functions to
provide the full vdev_stat_t structure to problem checking
routines, and to allow decent into replacing vdevs.
Add a vdev_non_native_ashift() validator which is used on
the full vdev tree to check for ZPOOL_STATUS_NON_NATIVE_ASHIFT.
cddl/contrib/opensolaris/lib/libzpool/common/kernel.c:
cddl/contrib/opensolaris/lib/libzpool/common/sys/zfs_context.h:
Enhance sysctl userland stubs now that a SYSCTL_PROC handler
is used in vdev.c.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab_impl.h:
When the group membership of a metaslab class changes (i.e.
when a vdev is added or removed from a pool), walk the group
list to determine the smallest block size currently available
and record this in the metaslab class.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/metaslab.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/metaslab.c:
Add the metaslab_class_get_minblocksize() accessor.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/sys/zio_compress.h:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio_compress.c:
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:
In zio_compress_data(), take the minimum blocksize as an
input parameter instead of assuming SPA_MINBLOCKSIZE.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/arc.c:
In l2arc_compress_buf(), pass SPA_MINBLOCKSIZE as the minimum
blocksize of the device. The l2arc code performs has it's own
code for deciding if compression is worth while, so this
effectively disables zio_compress_data() from second guessing
the original decision.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zio.c:
In zio_write_bp_init(), use the minimum blocksize of the
normal metaslab class when compressing data.
Illumos ZFS issues:
3957 ztest should update the cachefile before killing itself
3958 multiple scans can lead to partial resilvering
3959 ddt entries are not always resilvered
3960 dsl_scan can skip over dedup-ed blocks if
physical birth != logical birth
3961 freed gang blocks are not resilvered and can cause pool to suspend
3962 ztest should print out zfs debug buffer before exiting
Fix a regression introduced by fix for Illumos bug #3834. Quote from
Matthew Ahrens on the Illumos issue:
ztest fails this assertion because ztest_dmu_read_write() does
dmu_tx_hold_free(tx, bigobj, bigoff, bigsize);
and then
dmu_object_set_checksum(os, bigobj,
(enum zio_checksum)ztest_random_dsl_prop(ZFS_PROP_CHECKSUM), tx);
If the region to free is past the end of the file, the DMU assumes that there
will be nothing to do for this object. However, ztest does set_checksum(),
which must modify the dnode. The fix is for ztest to also call
dmu_tx_hold_bonus(tx, bigobj);
so we can account for the dirty data associated with setting the checksum
Illumos ZFS issues:
3955 ztest failure: assertion refcount_count(&tx->tx_space_written)
+ delta <= tx->tx_space_towrite
Merge vendor bugfix for ZFS test suite that triggers false positives.
Illumos ZFS issues:
3949 ztest fault injection should avoid resilvering devices
3950 ztest: deadman fires when we're doing a scan
3951 ztest hang when running dedup test
3952 ztest: ztest_reguid test and ztest_fault_inject don't place nice together
Fix zfs send -D hang after processing requiring a CTRL+C to interrupt due to
pthread_join prior to fd close.
This was introduced by r251646 (MFV r251644)
Illumos ZFS issue:
3909 "zfs send -D" does not work
MFC after: 1 day
Bring back some important fixes from Illumos:
3022 DTrace: keys should not affect the sort order when sorting by value
3023 it should be possible to dereference dynamic variables
3024 D integer narrowing needs some work
We particularly avoid the LD_NOLAZYLOAD changes that Illumos made
as those don't apply to FreeBSD and were causing problems in
interactive mode.
Illumos Revision: 13758:23432da34147
Reference:
https://www.illumos.org/issues/3022https://www.illumos.org/issues/3023https://www.illumos.org/issues/3024
MFC after: 1 month
Tested by: markj
GCC can generate bogus dwarf attributes with DW_AT_byte_size
set to 0xFFFFFFFF.
The issue was originaly detected in NetBSD but it has been
adapted for portability and to avoid compiler warnings.
Reference:
https://www.illumos.org/issues/3776
Obtained from: NetBSD
MFC after: 1 month
Restore a previous behavior before r251646, where when destructing
ZFS snapshot, the ioctl would return ENOENT when it hit any of
them in the errlist (the new behavior was only return ENOENT when
all returns error).
Illumos ZFS issues:
3829 fix for 3740 changed behavior of zfs destroy/hold/release ioctl
MFC after: 1 week
zpool create should treat -O mountpoint and -m the same
Illumos ZFS issues:
3745 zpool create should treat -O mountpoint and -m the same
MFC after: 2 weeks
seven arguments.
The original test uses Solaris' uadmin system call to trigger the test
probe; this change adds a sysctl to the dtrace_test module and gets the test
program to trigger the test probe via the sysctl handler.
The test is currently failing on amd64 because of some bugs in the way that
probe arguments beyond the first five are obtained - these bugs will be
fixed in a separate change.
dt_cg_ptrsize() and generally cleans up some of the error handling around
register allocation.
This change corresponds to part of illumos-gate commit e5803b76927480:
3025 register leak in D code generation
Reviewed by: pfg
Obtained from: illumos
MFC after: 1 month
FreeBSD. In the IPv6 case, try each interface before returning an error;
each IPv6-enabled interface will have a link-local address even if the link
isn't up.
MFC after: 1 week
users to guarantee that the output of DTrace scripts will be time-ordered.
This option is enabled by adding the line
#pragma D option temporal
to the beginning of a script, or by adding '-x temporal' to the arguments of
dtrace(1).
This change fixes a bug in the original port of the temporal option. This
bug was causing some assertions to fail, so they had been disabled; in this
revision the assertions are working properly and are enabled.
The DTrace version number has been bumped from 1.9.0 to 1.9.1 to reflect
the language change that's being introduced.
This change corresponds to part of illumos-gate commit e5803b76927480:
3021 option for time-ordered output from dtrace(1M)
Reviewed by: pfg
Obtained from: illumos
MFC after: 1 month
not be installed in /usr/lib32 on systems with compat-32 support.
This fix has two parts. First, the build is forced by linking drti.o
into a dummy internal library. Second, the object file is installed
manually in the LIBRARIES_ONLY case.
MFC after: 3 days
USDT probes are advertised to kernel by initialization code with
atrribute((constructor))). It seems that on Solaris the .init-ish code
of the main object is executed before RD_PREINIT point is hit. On
FreeBSD that is not the case. And because on FreeBSD there is no other
well-defined point between RD_PREINIT and main() we have to parse a
DTrace script when main is hit, for time being.
A footnote: currently we actually post RD_POSTINIT event, but that's a
bug because the event is triggered by hitting r_debug_state which
happens before any init code is executed.
Reported by: markj
snapshot.
Related illumos ZFS issue:
3699 zfs hold or release of a non-existent snapshot does not output error
Reported by: Steven Hartland <smh@FreeBSD.org>
MFC after: 3 days
The following change from illumos brought caused DTrace to
pause in an interactive environment:
3026 libdtrace should set LD_NOLAZYLOAD=1 to help the pid provider
This was not detected during testing because it doesn't
affect scripts.
We shouldn't be changing the environment, especially since the
LD_NOLAZYLOAD option doesn't apply to our (GNU) ld.
Unfortunately the change from upstream was made in such a way
that it is very difficult to separate this change from the
others so, at least for now, it's better to just revert
everything.
Reference:
https://www.illumos.org/issues/3026
Reported by: Navdeep Parhar and Mark Johnston
Merge changes from illumos:
3675 DTrace print() should try to resolve function pointers
3676 dt_print_enum hardcodes a value of zero
Illumos Revision: b1fa6326238973aeaf12c34fcda75985b6c06be1
Reference:
https://www.illumos.org/issues/3675https://www.illumos.org/issues/3676
Obtained from: Illumos
MFC after: 1 month
Merge change from illumos:
1731 DTrace NFS translators should be split into client/server pieces
Illumos Revision: 13523:6763769941d2
This code seems to be currently unused on FreeBSD.
Reference:
https://www.illumos.org/issues/1731
Obtained from: Illumos
MFC after: 1 week
Merge changes from illumos:
3021 option for time-ordered output from dtrace(1M)
3022 DTrace: keys should not affect the sort order when sorting by value
3023 it should be possible to dereference dynamic variables
3024 D integer narrowing needs some work
3025 register leak in D code generation
3026 libdtrace should set LD_NOLAZYLOAD=1 to help the pid provider
This brings yet another feature implemented in upstream DTrace.
A complete description is available here:
http://dtrace.org/blogs/ahl/2012/07/28/my-new-dtrace-favorite/
This change bumps the DT_VERS_* number to 1.9.1 in
accordance to what is done in illumos.
This change was somewhat complicated because upstream is mixed many
changes in an individual commit and some of the tests don't really
apply to us.
There are also appear to be differences in timestamping with Solaris
so we had to workaround some assertions making sure no regression
happened.
Special thanks to Fabian Keil for changes and testing.
Illumos Revisions: 13758:23432da34147
Reference:
https://www.illumos.org/issues/3021https://www.illumos.org/issues/3022https://www.illumos.org/issues/3023https://www.illumos.org/issues/3024https://www.illumos.org/issues/3025https://www.illumos.org/issues/1694
Tested by: Fabian Keil
Obtained from: Illumos
MFC after: 1 months
release a non-existing snapshot of a existing dataset. In recursive case
error is reported if no snapshots with the requested name have been found.
Problem and proposed solution reported to illumos:
3699 zfs hold or release of a non-existent snapshot does not output error
MFC after: 8 days
doesn't copyout in this case.
To solve this issue a new struct zfs_iocparm_t is introduced consisting of:
- zfs_ioctl_version (future backwards compatibility purposes)
- user space pointer to zfs_cmd_t (copyin and copyout)
- size of zfs_cmd_t (verification purposes)
The copyin and copyout of zfs_cmd_t is now done the illumos (vendor) way
what makes porting of new changes easier and ensures correct behavior if
returning an error.
MFC after: 10 days
Merge change from vendor to reduce diff only.
ZFS dtrace probes are not supported on FreeBSD yet.
Illumos ZFS issues:
3598 want to dtrace when errors are generated in zfs
MFC after: 3 weeks
only other case where STT_FILE symbols are used, in symit_next() in
cddl/contrib/opensolaris/tools/ctf/cvt/input.c, save the basename of the
symbol, instead of the full pathname.
Reported by: avg
Tested by: avg, jimharris
MFC after: 1 week
Merge change from illumos:
1368 enablings on defunct providers prevent providers from unregistering
We try to address some underlying differences between the Solaris
and FreeBSD implementations: dtrace_attach() / dtrace_detach() are
currently unimplemented in FreeBSD but the new code from illumos
makes use of taskq so some adaptations were made to dtrace_open()
and dtrace_close() to handle them appropriately.
Illumos Revision: r13430:8e6add739e38
Reference:
https://www.illumos.org/issues/1368
Reviewed by: gnn
Tested by: Fabian Keil
Obtained from: Illumos
MFC after: 3 weeks
for and loading dependent modules. This addresses a bug seen with
io.d where it was being doubly included.
PR: 171678
Submitted by: Mark Johnston
MFC after: 2 weeks