clear vd->vdev_tsd in vdev_geom_close_locked instead of vdev_geom_detach.
In the latter function, it would fail to happen in certain circumstances
where cp->private was unset. Ideally, the latter should never happen, but
it can happen when vdev open fails, or where spares are involved.
MFC after: 4 weeks
X-MFC-With: 298786
Sponsored by: Spectra Logic Corp
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
Move checks for provider's sectorsize and mediasize into a single
location in vdev_geom_attach. Remove the zfs::vdev::taste class;
it's ok to use the regular vdev class for tasting. Consolidate guid
checks into a single location in vdev_attach_ok. Consolidate some
error handling code from vdev_geom_attach into vdev_geom_detach,
closing a resource leak of geom consumers in the process.
Reviewed by: avg
MFC after: 4 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D5974
This allows for the long function components encountered in www/firefox.
This constant is part of DTrace's userland ABI, so this change may not be
MFC'ed.
PR: 207735
Without this change, DTrace will refuse to load a DOF section if the
function component of any of its probes exceeds DTRACE_FUNCNAMELEN (128).
Probes in C++ programs can have very long function components. Rather than
rejecting all probes if a single probe exceeds the limit, simply skip the
invalid probe and emit a warning. This ensures that valid probes are
instantiated.
PR: 207735
MFC after: 2 weeks
When this flag is turned on, DOF and DIF validation errors are printed to
the kernel message buffer. This is useful for debugging.
Also remove the debug.dtrace.debug sysctl, which has no effect.
While the instructions were not included into the original instruction
set, their support can be indicated by a special feature bit.
For example:
CPU: AMD Phenom(tm) II X4 955 Processor (3214.71-MHz K8-class CPU)
...
AMD Features2=0x37ff<LAHF, ...>
Clang 3.8 uses lahf/sahf as a faster alternative to pushf/popf where
possible.
MFC after: 2 weeks
illumos/illumos-gate@26455f9efc26455f9efchttps://www.illumos.org/issues/6052
At the moment type parameter of lzc_create() is of dmu_objset_type_t type.
That exposes an implementation detail and requires sys/fs/zfs.h to be included
in libzfs_core.h creating unnecessary coupling between libzfs_core interface
and ZFS internals.
I think that dmu_objset_type_t should be replaced with a libzfs_core
enumeration of supported dataset types.
For ABI reasons the new enumeration could be bit-compatible with
dmu_objset_type_t.
For example:
typedef enum {
LZC_DST_ZFS = 2,
LZC_DST_ZVOL
} lzc_dataset_type_t;
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Andriy Gapon <andriy.gapon@clusterhq.com>
MFC after: 2 weeks
Sponsored by: ClusterHQ
Currently this argument is a pointer into the stack which is used by FBT
to fetch the first five probe arguments. On all non-x86 architectures it's
simply the trapframe address, so this change has no functional impact. On
amd64 it's a pointer into the trapframe such that stack[1 .. 5] gives the
first five argument registers, which are deliberately grouped together in
the amd64 trapframe definition.
A trapframe argument simplifies the invop handlers on !x86 and makes the
x86 FBT invop handler easier to understand. Moreover, it allows for invop
handlers that may want to modify the register set of the interrupted thread.
Note that now we have to account for possible partial writes
in dmu_write_uio_dbuf(). It seems that on illumos either all or none
of the data are expected to be written. But the partial writes are
quite expected when vn_io_fault support is enabled.
Reviewed by: kib
MFC after: 7 weeks
Differential Revision: https://reviews.freebsd.org/D2790
Prior to this change, vdev_geom_open_by_path would call vdev_geom_attach
prior to verifying the device's GUIDs. vdev_geom_attach calls
vdev_geom_attrchange to set the physpath in the vdev object. The result is
that if the disk could not be found, then the labels for other disks in the
same TLD would overwrite the missing disk's physpath with the physpath of
whichever disk currently has the same devname as the missing one used to
have.
MFC after: 4 weeks
Sponsored by: Spectra Logic Corp
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/vdev_geom.c
Don't drop the g_topology_lock before freeing old_physpath. That
opens up a race where one thread can call vdev_geom_attrchanged,
set old_physpath, drop the g_topology_lock, then block trying to
acquire the SCL_STATE lock. Then another thread can come into
vdev_geom_attrchanged, set old_physpath to the same value, and
proceed to free it. When the first thread resumes, it will free
the same location.
It turns out that the SCL_STATE lock isn't needed. It was
originally added by gibbs to protect vd->vdev_physpath while
updating the same. However, the update process subsequently was
switched to an atomic operation (a pointer swap). Now, there is
no need for the SCL_STATE lock, and hence no need to drop the
g_topology_lock.
Reviewed by: delphij
MFC after: 4 weeks
Sponsored by: Spectra Logic Corp
Differential Revision: https://reviews.freebsd.org/D5413
Previously uncompressed buffers did not obey that rule.
Type of b_asize is changed to uint64_t for consistency,
given that this is a zeta-byte filesystem.
l2arc_compress_buf is renamed to l2arc_transform_buf to better reflect
its new utility. Now not only we ensure that a compressed buffer has
a size aligned to ashift, but we also allocate a properly sized
temporary buffer if the original buffer is not compressed and it has
an odd size. This ensures that all I/O to the cache device is always
ashift-aligned, in terms of both a request offset and a request size.
If the aligned data is larger than the original data, then we have to use
a temporary buffer when reading it as well.
Also, enhance physical zio alignment checks using vdev_logical_ashift.
On FreeBSD we have this information, so we can make stricter assertions.
Reviewed by: smh, mav
MFC after: 1 month
Sponsored by: ClusterHQ
Differential Revision: https://reviews.freebsd.org/D2789
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Author: Alexander Motin <mav@FreeBSD.org>
Improve speculative prefetch of indirect blocks.
Scalability of many operations on wide ZFS pool can be limited by
requirement to prefetch indirect blocks first. Recently added
asynchronous indirect block read partially helped, but did not
solve the problem completely. This patch extends existing prefetcher
functionality to explicitly work with indirect blocks.
Before this change prefetcher issued reads for up to 8MB of data in
advance. With this change it also issues indirect block reads
for up to 64MB of data in advance, so that when it will be time to
actually read those data, it can be done immediately. Alike effect
can be achieved by just increasing maximal data prefetch distance,
but at higher memory cost.
Also this change introduces indirect block prefetch for rewrite
operations, that was never done before. Previously ARC miss for
Indirect blocks regularly blocked rewrites, converting perfectly
aligned asynchronous operations into synchronous read-write pairs,
significantly reducing maximal rewrite speed.
While being there this issue was also fixed:
- prefetch was done always, even if caching for the dataset was
completely disabled.
Testing on FreeBSD with zvol on top of 6x striped 2x mirrored pool
of 12 assorted HDDs shown me such performance numbers:
------- BEFORE --------
Write 491363677 bytes/sec
Read 312430631 bytes/sec
Rewrite 97680464 bytes/sec
-------- AFTER --------
Write 493524146 bytes/sec
Read 438598079 bytes/sec
Rewrite 277506044 bytes/sec
Closes#65Closes#80openzfs/openzfs@792fd28ac0
Only include sysctl in kernel builds fixing warning about implicit
declaration of function 'sysctl_handle_int'.
PR: 204140
MFC after: 1 week
X-MFC-With: r297813
Sponsored by: Multiplay
At the moment no ZFS buffers are included into a crash dump unless
ZFS_DEBUG (or INVARIANTS) kernel option is enabled. That's not very
helpful for debugging of ZFS problems, because important information
often resides in metadata buffers.
This change switches the dumping behavior when UMA is used from the
illumos behavior to a more useful behavior that we have on FreeBSD
when ZFS buffers are allocated via malloc.
Reviewed by: smh, mav
MFC after: 3 weeks
Differential Revision: https://reviews.freebsd.org/D5892
This allows one to enable DTrace probes relatively early during boot,
during SI_SUB_DTRACE_ANON, before dtrace(1) can invoked. The desired
enabling is created using dtrace -A, which writes a /boot/dtrace.dof
file and uses nextboot(8) to ensure that DTrace kernel modules are loaded
and that the DOF file describing the enabling is loaded by loader(8)
during the subsequent boot. The trace output can then be fetched with
dtrace -a.
With this commit, boot-time DTrace is only functional on i386 and amd64: on
other architectures, the high-resolution timer frequency is initialized
during SI_SUB_CLOCKS and is thus not available when the anonymous
tracing state is initialized. On x86, the TSC is used and is thus available
earlier.
MFC after: 1 month
Relnotes: yes
This allows the hrtimer to be used earlier during boot. This is required
for boot-time DTrace: anonymous enablings are created during
SI_SUB_DTRACE_ANON, which runs before APs are started. In particular,
the DTrace deadman timer requires that the hrtimer be functional.
MFC after: 2 weeks
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Author: Will Andrews <will@firepipe.net>
Closes#83Closes#32openzfs/openzfs@9663688425
FreeBSD already had `zpool labelclear` functionality, so this is mostly
just a diff reduction.
MFC after: 1 month
This is because they might do data compression which is quite CPU
expensive. The original code is correct for illumos, because there
a higher priority corresponds to a greater number.
MFC after: 2 weeks
This made impossible spare disk open by known path, which kind of worked
only because the same fix was applied to vdev_geom_attach_by_guids() in
r293708.
MFC after: 1 week
for limiting disk (actually filesystem) IO.
Note that in some cases these limits are not quite precise. It's ok,
as long as it's within some reasonable bounds.
Testing - and review of the code, in particular the VFS and VM parts - is
very welcome.
MFC after: 1 month
Relnotes: yes
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D5080
Allow using DTRACE for performance analysis of userspace
applications - the function call stack can be captured.
This is almost an exact copy of AMD64 solution.
Obtained from: Semihalf
Sponsored by: Cavium
Reviewed by: emaste, gnn, jhibbits
Differential Revision: https://reviews.freebsd.org/D5779
On FreeBSD VFS_HOLD/VN_RELE were mapped to MNT_REF/MNT_REL that
manipulate mnt_ref. But the job of properly maintaining the reference
count is already automatically performed by insmntque(9) and
delmntque(9). So, in effect all ZFS vnodes referenced the corresponding
mountpoint twice.
That was completely harmless, but we want to be very explicit about what
FreeBSD VFS APIs are used, because illumos VFS_HOLD and FreeBSD MNT_REF
provide quite different guarantees with respect to the held vfs_t /
mountpoint. On illumos VFS_HOLD is sufficient to guarantee that
vfs_t.vfs_data stays valid. On the other hand, on FreeBSD MNT_REF does
*not* provide the same guarantee about mnt_data. We have to use
vfs_busy() to get that guarantee.
Thus, the calls to VFS_HOLD/VFS_RELE on vnode init and fini are removed.
VFS_HOLD calls are replaced with vfs_busy in the ioctl handlers.
And because vfs_busy has a richer interface that can not be dumbed down
in all cases it's better to explicitly use it rather than trying to mask
it behind VFS_HOLD.
This change fixes a panic that could result from a race between
zfs_umount() and zfs_ioc_rollback(). We observed a case where
zfsvfs_free() tried to destroy data that zfsvfs_teardown() was still
using. That happened because there was nothing to prevent unmounting of
a ZFS filesystem that was in between zfs_suspend_fs() and
zfs_resume_fs().
Reviewed by: kib, smh
MFC after: 3 weeks
Sponsored by: ClusterHQ
Differential Revision: https://reviews.freebsd.org/D2794
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Eli Rosenthal <eli.rosenthal@delphix.com>
illumos/illumos-gate@c20404ff77
Reviewed by: Patrick Mooney <patrick.mooney@joyent.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Alex Wilson <alex.wilson@joyent.com>
illumos/illumos-gate@d09e4475f6
Unlike Illumos FreeBSD has concept of logical ashift, that specifies
really minimal vdev block size that can be accessed. This knowledge
allows properly pad physical I/O and correctly assert its alignment.
This change fixes L2ARC write errors when device has logical sector
size above 512 bytes.
MFC after: 1 month
This fixes creation of zvol devices for snapshots during zfs receive,
that previously failed with "ZFS WARNING: Unable to create ZVOL" message.
This solution is not perfect, but IMHO better then it was before.
MFC after: 2 weeks
If device has stripe size bigger then maximal sector size supported by
ZFS, there is nothing can be done to avoid read-modify-write cycles.
Taking that stripe size into account will only reduce space efficiency
and pointlessly bother user with warnings that can not be fixed.
Discussed with: smh
Use of misaligned or non-power-of-2 stripes is not really useful for ZFS,
since increased ashift won't help to avoid read-modify-write cycles, and
only reduce pool space efficiency and compression rates.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: Stefan Ring <stefanrin@gmail.com>
Reviewed by: Steven Burgess <sburgess@datto.com>
Reviewed by: Arne Jansen <sensille@gmx.net>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Paul Dagnelie <pcd@delphix.com>
In certain circumstances, "zfs send -i" (incremental send) can produce a
stream which will result in incorrect sparse file contents on the
target.
The problem manifests as regions of the received file that should be
sparse (and read a zero-filled) actually contain data from a file that
was deleted (and which happened to share this file's object ID).
Note: this can happen only with filesystems (not zvols, because they do
not free (and thus can not reuse) object IDs).
Note: This can happen only if, since the incremental source (FromSnap),
a file was deleted and then another file was created, and the new file
is sparse (i.e. has areas that were never written to and should be
implicitly zero-filled).
We suspect that this was introduced by 4370 (applies only if hole_birth
feature is enabled), and made worse by 5243 (applies if hole_birth
feature is disabled, and we never send any holes).
The bug is caused by the hole birth feature. When an object is deleted
and replaced, all the holes in the object have birth time zero. However,
zfs send cannot tell that the holes are new since the file was replaced,
so it doesn't send them in an incremental. As a result, you can end up
with invalid data when you receive incremental send streams. As a
short-term fix, we can always send holes with birth time 0 (unless it's
a zvol or a dataset where we can guarantee that no objects have been
reused).
Closes#37openzfs/openzfs@adef853162
6672 arc_reclaim_thread() should use gethrtime() instead of ddi_get_lbolt()
6673 want a macro to convert seconds to nanoseconds and vice-versa
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Eli Rosenthal <eli.rosenthal@delphix.com>
illumos/illumos-gate@a8f6344fa0
in the dedup property value
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: ilovezfs <ilovezfs@icloud.com>
illumos/illumos-gate@971640e6aa
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Dan McDonald <danmcd@omniti.com>
illumos/illumos-gate@5f7a8e6d75