These have no effect on FreeBSD, in fact they are ifdef'ed,
but make easier future merges:
6699767 panic in spec_open()
6718877 crgetzoneid() use can cause problems when forking processes with
USDT providers in a non global zone
MFC after: 3 days
MFV r260708
4427 pid provider rejects probes with valid UTF-8 names
Use of u8_textprep.c broke the build on powerpc.
Reported by: bz, rpaulo and tinderbox.
Pointyhat: me
These changes prevent sysctl(8) from returning proper output,
such as:
1) no output from sysctl(8)
2) erroneously returning ENOMEM with tools like truss(1)
or uname(1)
truss: can not get etype: Cannot allocate memory
there is an environment variable which shall initialize the SYSCTL
during early boot. This works for all SYSCTL types both statically and
dynamically created ones, except for the SYSCTL NODE type and SYSCTLs
which belong to VNETs. A new flag, CTLFLAG_NOFETCH, has been added to
be used in the case a tunable sysctl has a custom initialisation
function allowing the sysctl to still be marked as a tunable. The
kernel SYSCTL API is mostly the same, with a few exceptions for some
special operations like iterating childrens of a static/extern SYSCTL
node. This operation should probably be made into a factored out
common macro, hence some device drivers use this. The reason for
changing the SYSCTL API was the need for a SYSCTL parent OID pointer
and not only the SYSCTL parent OID list pointer in order to quickly
generate the sysctl path. The motivation behind this patch is to avoid
parameter loading cludges inside the OFED driver subsystem. Instead of
adding special code to the OFED driver subsystem to post-load tunables
into dynamically created sysctls, we generalize this in the kernel.
Other changes:
- Corrected a possibly incorrect sysctl name from "hw.cbb.intr_mask"
to "hw.pcic.intr_mask".
- Removed redundant TUNABLE statements throughout the kernel.
- Some minor code rewrites in connection to removing not needed
TUNABLE statements.
- Added a missing SYSCTL_DECL().
- Wrapped two very long lines.
- Avoid malloc()/free() inside sysctl string handling, in case it is
called to initialize a sysctl from a tunable, hence malloc()/free() is
not ready when sysctls from the sysctl dataset are registered.
- Bumped FreeBSD version to indicate SYSCTL API change.
MFC after: 2 weeks
Sponsored by: Mellanox Technologies
2915 DTrace in a zone should see "cpu", "curpsinfo", et al
2916 DTrace in a zone should be able to access fds[]
2917 DTrace in a zone should have limited provider access
MFC after: 2 weeks
4427 pid provider rejects probes with valid UTF-8 names
This make use of Solaris' u8_validate() which we happen to
use since r185029 for ZFS.
Illumos Revision: 1444d846b126463eb1059a572ff114d51f7562e5
Reference:
https://www.illumos.org/issues/4427
Obtained from: Illumos
MFC after: 2 weeks
usage from dtrace. The dtrace code already uses cdevpriv(9) since FreeBSD
8, so this change should be quite harmless.
Reviewed by: markj
Approved by: markj
MFC after: never
the 4 byte-aligned dtrace_invop_callsite can be found and that it
immediately follows the call to dtrace_invop(). Secondly, fix some pointer
arithmetic to account for differences between struct i386_frame and illumos'
struct frame. Finally, ensure that dtrace_getarg() isn't inlined. It works
by following a fixed number of frame pointers to the probe site, so inlining
breaks it.
MFC after: 3 weeks
first five for probes entered through a UD fault (i.e. FBT probes).
Specifically, handle the fact that dtrace_invop_callsite must be
16 byte-aligned and thus may not immediately follow the call to
dtrace_invop() in dtrace_invop_start(). Also fetch register arguments and
the stack pointer through a struct trapframe instead of a struct reg.
PR: 191260
Submitted by: luke.tw@gmail.com
MFC after: 3 weeks
defined. This ensures that the sdt:zfs:: probes appear despite the fact
the sdt provider is defined in the kernel rather than in zfs.ko.
Reported by: hiren
Tested by: hiren
MFC after: 2 weeks
selection. gethrtime() in our port updated with HZ rate, so unusable for
this specific purpose, completely draining benefit of multiple taskqueues.
MFC after: 2 weeks
Add a new zfs property, "redundant_metadata" which can have values "all" or
"most". The default will be "all", which is the current behavior. When set
to all, ZFS stores an extra copy of all metadata. If a single on-disk block
is corrupt, at worst a single block of user data (which is recordsize bytes
long) can be lost.
Setting to "most" will cause us to only store 1 copy of level-1 indirect
blocks of user data files. This can improve performance of random writes,
because less metadata has to be written. In practice, at worst about
100 blocks (of recordsize bytes each) of user data can be lost if a single
on-disk block is corrupt.
The exact behavior of which metadata blocks are stored redundantly may change
in future releases.
Illumos issue: 3835 zfs need not store 2 copies of all metadata
MFC after: 2 weeks
This includes decodes of recent Intel instructions, in particular
VT-x and related instructions. This allows the FBT provider to
locate the exit points of routines that include these new
instructions.
Illumos issues:
3414 Need a new word of AT_SUN_HWCAP bits
3415 Add isainfo support for f16c and rdrand
3416 Need disassembler support for rdrand and f16c
3413 isainfo -v overflows 80 columns
3417 mdb disassembler confuses rdtscp for invlpg
1518 dis should support AMD SVM/AMD-V/Pacifica instructions
1096 i386 disassembler should understand complex nops
1362 add kvmstat for monitoring of KVM statistics
1363 add vmregs[] variable to DTrace
1364 need disassembler support for VMX instructions
1365 mdb needs 16-bit disassembler support
This corresponds to Illumos-gate (github) version
eb23829ff08a873c612ac45d191d559394b4b408
Reviewed by: markj
MFC after: 1 week
4730 metaslab group taskq should be destroyed in metaslab_group_destroy()
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Original author: George Wilson
MFC after: 3 days
cause stack overflow for IO's which return ZIO_PIPELINE_CONTINUE
from the zio_vdev_io_start stage and hence don't suspend and complete
in a different thread.
This prevents double fault panic on slow machines running ZFS on
GELI volumes which return EOPNOTSUPP directly to BIO_DELETE requests.
MFC after: 1 month
X-MFC-With: r265152
The changes how TRIM requests are generated to use ZIO_TYPE_FREE + a priority
instead of ZIO_TYPE_IOCTL, until processed by vdev_geom; only then is it
translated the required geom values. This reduces the amount of changes
required for FREE requests to be supported by the new IO scheduler. This
also eliminates the need for a specific DKIOCTRIM.
Also fixed FREE vdev child IO's from running ZIO_STAGE_VDEV_IO_DONE as part
of their schedule.
As the new IO scheduler can result in a request to execute one type of IO to
actually run a different type of IO it requires that zio_trim requests are
processed without holding the trim map lock (tm->tm_lock), as the free request
execute call may result in write request running hence triggering a
trim_map_write_start call, which takes the trim map lock and hence would result
in recused on no-recursive sx lock.
This is based off avg's original work, so credit to him.
MFC after: 1 month
returns ZIO_PIPELINE_CONTINUE from vdev_op_io_start to zio_execute resulting
in the wrong ZIO continuing its pipeline.
This is a serious issue which could cause data loss / corruption but appears
to be limited to error handling such as when vdev_readable(vd) returns false.
MFC after: 2 days
4754 io issued to near-full luns even after setting noalloc threshold
4755 mg_alloc_failures is no longer needed
illumos/illumos@b6240e830b
MFC after: 2 weeks
sites and installing a hook at the kernel's trap handler. The fasttrap code
will emulate the overwritten instruction in some common cases, but otherwise
copies it out into some scratch space in the traced process' address space
and ensures that it's executed after returning from the trap.
In Solaris and illumos, this (per-thread) scratch space comes from some
reserved space in TLS, accessible via the fs segment register. This
approach is somewhat unappealing on FreeBSD since it would require some
modifications to rtld and jemalloc (for static TLS) to ensure that TLS is
executable, and would thus introduce dependencies on their implementation
details. I think it would also be impossible to safely trace static binaries
compiled without these modifications.
This change implements the functionality in a different way, by having
fasttrap map pages into the target process' address space on demand. Each
page is divided into 64-byte chunks for use by individual threads, and
fasttrap's process descriptor struct has been extended to keep track of
any scratch space allocated for the corresponding process.
With this change it's possible to trace all libc functions in a program,
e.g. with
pid$target:libc.so.*::entry {@[probefunc] = count();}
Previously this would generally cause the victim process to crash, as
tracing memcpy on amd64 requires the functionality described above.
Tested by: Prashanth Kumar <pra_udupi@yahoo.co.in> (earlier version)
MFC after: 6 weeks
this is due to a wrong dereference of a vnode when it's not locked and
can be (potentially) recycled. 'sdvp' cannot be locked on zfs_rename()
entry point because the VFS can't be sure that this scenario is
LOR-free (it might violate the parent->child lock acquisition rule).
Dereference 'tdvp' instead, which is already locked on entry, and access
'sdvp' fields only when it's safe, i.e. under ZFS_ENTER scope.
While at it, remove the usage of VOP_REALVP, as long as this is a NOP
on FreeBSD.
Discussed with: avg
Reviewed by: pjd
New ZFS property volmode and sysctl vfs.zfs.vol.mode allow switching ZVOL
between three modes:
geom -- existing fully functional behavior (default);
dev -- exposing volumes only as raw disk device file in devfs;
none -- not exposing volumes outside ZFS.
The "dev" mode is less functional (can't be partitioned, mounted, etc),
but it is faster, and in some scenarios with untrusted consumers safer.
It can be useful for NAS, VM block storages, etc.
The "none" mode may be convenient for backup servers, etc. that don't
need direct data access.
Due to the way ZVOL is integrated with main ZFS code, those property
and sysctl are checked only during pool import and volume creation.
MFC after: 1 month
Sponsored by: iXsystems, Inc.
3580 Want zvols to return volblocksize when queried for physical block size
illumos/illumos-gate@a0b60564df
It is irrelevant for FreeBSD, just reducing diff.
It is an adapted merge from the vendor branch of:
701 UNMAP support for COMSTAR (in part related to ZFS)
2130 zvol DKIOCFREE uses nested DMU transactions
4248 dtrace(1M) should never create DOF with empty probes section
4249 Only probes from the first DTrace object file will be included
Illumos Revision: 4a20ab41aadcb81c53e72fc65886e964e9add59
Reference:
https://www.illumos.org/issues/4248https://www.illumos.org/issues/4249
Obtained from: Illumos
MFC after: 1 month
To reduce the diff struct pcu.cnt field was not renamed, so
PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in
kvm(3) and vmstat(8). The goal was to not affect externally used KPI.
Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the
the global cnt variable.
Exp-run revealed no ports using it directly.
No objection from: arch@
Sponsored by: EMC / Isilon Storage Division
further refinement is required as some device drivers intended to be
portable over FreeBSD versions rely on __FreeBSD_version to decide whether
to include capability.h.
MFC after: 3 weeks
so there is no reason to assert that we won't hit an error. Instead,
just return that error to caller and have the upper layer handle it.
Obtained from: FreeNAS
Reported by: rodrigc
Reviewed by: Matthew Ahrens
MFC after: 2 weeks
that this is done for SDT probes. This fixes the syscall/tst.args.d test,
which was failing because mmap(2)'s sixth argument wasn't available to the
probe.
MFC after: 2 weeks
illumos/illumos-gate@6fb4854bed
This fixes the tst.resize1.d and tst.resize2.d DTrace tests, which have
been failing since r261122 since they were causing dtrace(1) to attempt to
allocate and use large amounts of memory, and get killed by the OOM killer
as a result.
MFC after: 1 month
upstream DTrace code. It indicates that the kernel memory allocator need not
attempt to satisfy non-blocking allocations in low-memory conditions. This
has no direct equivalent in the malloc(9) flags, so it is just defined to 0
for now.
4574 get_clones_stat does not call zap_count in non-debug kernel
zap_count(...) is never called in non-DEBUG kernel.
As result "count" variable is always 0, and "goto fail" is always
reached. This means get_clones_stat function never makes up list
of clones for "clones" properties.
MFC after: 2 weeks
This is done to ensure that visited object IDs are always increasing.
Also, pass correct object ID to prefetch_dnode_metadata for
os_groupused_dnode.
Without this change we would hit an assert if traversal was paused on
a GROUPUSED object, which is unlikely but possible.
Apparently the same change was independently developed by Deplhix.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
MFC after: 10 days
Sponsored by: HybridCluster
The bug was introduced in r256956 "Improve ZFS N-way mirror read
performance".
The code in vdev_mirror_dva_select erroneously considers already
tried DVAs for the next attempt. Thus, it is possible that a failing DVA
would be retried forever.
As a secondary effect, if the attempts fail with checksum error, then
checksum error reports are accumulated until the original request
ultimately fails or succeeds. But because retrying is going on indefinitely
the cheksum reports accumulation will effectively be a memory leak.
Reviewed by: gibbs
MFC after: 13 days
Sponsored by: HybridCluster
If we prematurely free the name buffer and it gets quickly recycled,
then zfs_rename may see data from another lookup or even unmapped memory
via cn_nameptr.
MFC after: 6 days
Sponsored by: HybridCluster
The bug was introduced in r256956 "Improve ZFS N-way mirror read
performance".
The code in vdev_mirror_dva_select erroneously considers already
tried DVAs for the next attempt. Thus, it is possible that a failing DVA
would be retried forever.
As a secondary effect, if the attempts fail with checksum error, then
checksum error reports are accumulated until the original request
ultimately fails or succeeds. But because retrying is going on indefinitely
the cheksum reports accumulation will effectively be a memory leak.
Reviewed by: gibbs
MFC after: 13 days
Sponsored by: HybridCluster
Otherwise we could run into the following deadlock.
A thread has a transaction open and assigned to a transaction group.
That would prevent the transaction group from be quiesced and synced.
The thread is blocked in getnewvnode_reserve waiting for a vnode to
a be reclaimed. vnlru thread is blocked trying to enter ZFS VOP because
a filesystem is suspended by an ongoing rollback or receive operation.
In its turn the operation is waiting for the current transaction group
to be synced.
zfs_zget is always used outside of active transactions, but zfs_mknode
is always used in a transaction context. Thus, we hoist
getnewvnode_reserve from zfs_mknode to its callers.
While there, assert that ZFS always calls getnewvnode while having
a vnode reserved.
Reported by: adrian
Tested by: adrian
MFC after: 17 days
Sponsored by: HybridCluster
When we encounter an I/O error on a piece of metadata while deleting
a file system or zvol, we don't update the bptree_entry_phys_t's
bookmark. This would lead to double free of bp's which will lead to
space map corruption.
Instead of tolerating and allowing the corruption, panic immediately.
See Illumos #4390 for more details.
4391 panic system rather than corrupting pool if we hit bug 4390
Illumos/illumos-gate@8b36997aa2
MFC after: 2 weeks
4370 avoid transmitting holes during zfs send
4371 DMU code clean up
illumos/illumos-gate@43466aae47
NOTE: Make sure the boot code is updated if a zpool upgrade is
done on boot zpool.
MFC after: 2 weeks
(Note: this change is not applicable to FreeBSD and the file
is not included in build. It's integrated for completeness).
4128 disks in zpools never go away when pulled
illumos/illumos-gate@39cddb10a3
MFC after: 2 weeks
3306 zdb should be able to issue reads in parallel
3321 'zpool reopen' command should be documented in the man page
and help message
illumos/illumos-gate@31d7e8fa33
FreeBSD porting notes: the kernel part of this changeset depends
on Solaris buf(9S) interfaces and are not really applicable for
our use. vdev_disk.c is patched as-is to reduce diverge from
upstream, but vdev_file.c is left intact.
MFC after: 2 weeks
aggregation IDs, as is done in the upstream illumos code. This still
requires some FreeBSD-specific code, as our vmem API is not identical to the
one in illumos.
Submitted by: Mike Ma <mikemandarine@gmail.com>
4171 clean up spa_feature_*() interfaces
4172 implement extensible_dataset feature for use by other zpool
features
illumos/illumos-gate@2acef22db7
MFC after: 2 weeks
4168 ztest assertion failure in dbuf_undirty
4169 verbatim import causes zdb to segfa
4170 zhack leaves pool in ACTIVE state
illumos/illumos-gate@7fdd916c47
MFC after: 2 weeks
child process that were inherited from its parent. However, this should
not be done in the case of a vfork, since the fork handler ends up removing
the tracepoints from the shared vm space, and userland DTrace probes in the
parent will no longer fire as a result.
Now the child of a vfork may trigger userland DTrace probes enabled in its
parent, so modify the fasttrap probe handler to handle this case and handle
the child process in the same way that it would handle the traced process.
In particular, if once traces function foo() in a process that vforks, and
the child calls foo(), fasttrap will treat this call as having come from the
parent. This is the behaviour of the upstream code.
While here, add #ifdef guards to some code that isn't present upstream.
MFC after: 1 month
When a da or ada device dissappears, outstanding IOs fail with
ENXIO, not EIO. The check for EIO was probably copied from Illumos,
where that is indeed the correct errno.
Without this change, pulling a busy drive from a zpool would usually
turn it into UNAVAIL, even though pulling an idle drive would turn
it into REMOVED. With this change, it is REMOVED every time.
Also, vdev_geom_io_intr shouldn't do zfs_post_remove, because that
results in devd getting two resource.fs.zfs.removed events. The
comment said that the event had to be sent directly instead of
through the async removal thread because "the DE engine is using
this information to discard prevoius I/O errors". However, the fact
that vdev_geom_io_intr was never actually sending the events until
now, and that vdev_geom_orphan never sent them at all, and that
vdev_geom_orphan usually gets called about 2 seconds after the
actual removal, means that FreeBSD's userland can cope with a late
event just fine.
Approved by: ken (mentor)
Sponsored by: Spectra Logic Corporation
MFC after: 4 weeks
(64MB). Even if we would find one somehow, ZFS kernel code rejects such
devices. It is funny to look on attempts to read 4 256K vdev labels from
1.44MB floppy, though it is not very practical and quite slow.
The freebsd variant of dmu_write_pages is hidden under _KERNEL
to avoid needlessly pulling in vm_page_t declaration.
Besides, this function seems to be useless for ZFS userland counterpart.
MFC after: 15 days
ZFS never partially validates or invalidates a page.
The higher level VM should not do that either.
mappedread_sf correct operation depends on a page being either fully
valid or invalid.
MFC after: 7 days
4101 metaslab_debug should allow for fine-grained control
4102 space_maps should store more information about themselves
4103 space map object blocksize should be increased
4104 ::spa_space no longer works
4105 removing a mirrored log device results in a leaked object
4106 asynchronously load metaslab
illumos/illumos-gate@0713e232b7
Note that some tunables have been removed and some new tunables have
been added. Of particular note, FreeBSD-only knob
vfs.zfs.space_map_last_hope is removed as it was a nop for some time now
(after one of the previous merges from upstream).
MFC after: 11 days
Sponsored by: HybridCluster [merge]
In illumos all ioctl zio-s are "global" at the moment. That is they act
on a whole disk, e.g. a cache flush command, and thus do not need either
offset or size parameters.
FreeBSD, on the other hand, has support for TRIM command and that
command requires proper offset and size parameters.
Without this fix all TRIM commands act on the start of any disk or
partition used by ZFS destroying any data there.
Pointyhat to: avg
Tested by: sbruno
MFC after: 3 days
X-MFC with: r258632
Sponsored by: HybridCluster
4080 zpool clear fails to clear pool
4081 need zfs_mg_noalloc_threshold
illumos/illumos-gate@22e30981d8
MFC after: 10 days
Sponsored by: HybridCluster [merge]
illumos/illumos-gate@69962b5647
Please note the following changes:
- zio_ioctl has lost its priority parameter and now TRIM is executed
with 'now' priority
- some knobs are gone and some new knobs are added; not all of them are
exposed as tunables / sysctls yet
MFC after: 10 days
Sponsored by: HybridCluster [merge]
943 zio_interrupt ends up calling taskq_dispatch with TQ_SLEEP
illumos/illumos-gate@5aeb94743e
Essentially FreeBSD taskqueues already operate in a mode that
was added to Illumos with taskq_dispatch_ent change.
We even exposed the superior FreeBSD interface as taskq_dispatch_safe.
Now we just rename taskq_dispatch_safe to taskq_dispatch_ent and
struct struct ostask to taskq_ent_t, so that code differences will be
minimal.
After this change sys/cddl/compat/opensolaris/sys/taskq.h header is no
longer needed.
Note that this commit is not an MFV because the upstream change was not
individually committed to the vendor area.
MFC after: 8 days
The new macros are implemented in terms of SDT_PROBE_DEFINE and SDT_PROBE.
Probes defined in this way will appear under SDT provider named "sdt".
Parameter types are exposed via SDT_PROBE_ARGTYPE.
This is something that illumos does not have by default.
This kind of SDT probes is already present in ZFS code, so those probes
will now be available if KDTRACE_HOOKS options is enabled.
A potential future illumos compatibility enhancement is to encode a provider
name as a prefix in a probe name.
Reviewed by: markj
MFC after: 3 weeks
X-MFC after: r258622
In its stead use the Solaris / illumos approach of emulating '-' (dash)
in probe names with '__' (two consecutive underscores).
Reviewed by: markj
MFC after: 3 weeks
option, unbreak the lock tracing release semantic by embedding
calls to LOCKSTAT_PROFILE_RELEASE_LOCK() direclty in the inlined
version of the releasing functions for mutex, rwlock and sxlock.
Failing to do so skips the lockstat_probe_func invokation for
unlocking.
- As part of the LOCKSTAT support is inlined in mutex operation, for
kernel compiled without lock debugging options, potentially every
consumer must be compiled including opt_kdtrace.h.
Fix this by moving KDTRACE_HOOKS into opt_global.h and remove the
dependency by opt_kdtrace.h for all files, as now only KDTRACE_FRAMES
is linked there and it is only used as a compile-time stub [0].
[0] immediately shows some new bug as DTRACE-derived support for debug
in sfxge is broken and it was never really tested. As it was not
including correctly opt_kdtrace.h before it was never enabled so it
was kept broken for a while. Fix this by using a protection stub,
leaving sfxge driver authors the responsibility for fixing it
appropriately [1].
Sponsored by: EMC / Isilon storage division
Discussed with: rstone
[0] Reported by: rstone
[1] Discussed with: philip
This is a fix for a regression introduced in r246293.
vm_page_clear_dirty expects the range to have DEV_BSIZE aligned boundaries,
otherwise it extends them. Thus it can happen that the whole page is
marked clean while actually having some small dirty region(s).
This commit makes the range properly aligned and ensures that only
the clean data is marked as such.
It would interesting to evaluate how much benefit clearing with DEV_BSIZE
granularity produces. Perhaps instead we should clear the whole page
when it is completely overwritten and don't bother clearing any bits
if only a portion a page is written.
Reported by: George Hartzell <hartzell@alerce.com>,
Richard Todd <rmtodd@servalan.servalan.com>
Tested by: George Hartzell <hartzell@alerce.com>,
Reviewed by: kib
MFC after: 5 days
On machines with seveal CPUs and enough RAM this can easily twice improve
ZFS performance or twice reduce CPU usage. It was disabled three years
ago due to memory and KVA exhaustion reports, but our VM subsystem got
improved a lot since that time, hopefully enough to make another try.
Fix several problems that can cause panics on kldload and kldunload.
* kproc_create(fasttrap_pid_cleanup_cb, ...) gets called before
fasttrap_provs.fth_table gets allocated. This can lead to a panic
on module load, because fasttrap_pid_cleanup_cb references
fasttrap_provs.fth_table. Move kproc_create down after the point
that fasttrap_provs.fth_table gets allocated, and modify the error
handling accordingly.
* dtrace_fasttrap_{fork,exec,exit} weren't getting NULLed until
after fasttrap_provs.fth_table got freed. That caused panics on
module unload because fasttrap_exec_exit calls
fasttrap_provider_retire, which references
fasttrap_provs.fth_table. NULL those function pointers earlier.
* There wasn't any code to destroy the
fasttrap_{tpoints,provs,procs}.fth_table mutexes on module unload,
leading to a resource leak when WITNESS is enabled. Destroy those
mutexes during fasttrap_unload().
Reviewed by: markj
Approved by: ken (mentor)
Sponsored by: Spectra Logic
MFC after: 4 weeks
It was not being correctly copied into the kernel on FreeBSD, and as a
result, probes with multiple probe sites were not being created properly.
To fix this, change the ioctl definition so that the fasttrap ioctl handler
is responsible for copying in userland data.
Submitted by: Prashanth Kumar <pra_udupi@yahoo.co.in>
MFC after: 1 month
VM subsystem twice for every written record.
Tests on 24-core system show double reduction of CPU time spent on copying
single large well-compressed file.
This patch is not really needed on illumos (while not harm either) since
their memory allocator by default uses caching for all requests up to 128K.
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
emulation of the call instruction caused by reversing the uaddr and kaddr
arguments when copying data out to userland: the suword* functions take the
uaddr as the first argument whereas copyout(9) takes the kaddr as the first
argument. This also partially undoes the fixes from r257143.
Submitted by: Prashanth Kumar <pra_udupi@yahoo.co.in> (original version)
MFC after: 1 month
corresponding x86 trap type. Userland DTrace probes are currently handled
by the other fasttrap hooks (dtrace_pid_probe_ptr and
dtrace_return_probe_ptr).
Discussed with: rpaulo
* Remove the unused sdt cdev.
* Don't bother keeping a list of probes in struct sdt_prov; it's not needed.
* Invoke sdt_load and sdt_unload from the module handler instead of
registering separate SYSINITs.
* Keep to within 80 columns.
* Check for errors from dtrace_unregister().
the code was trying to save the stack pointer rather than the frame pointer,
and the arguments to copyout(9) were reversed, so nothing ended up being
saved on the stack. This would cause process crashes when the pid provider
was being used to instrument calls of a function starting with this
instruction.
Reported by: symbolics@gmx.com
Tested by: symbolics@gmx.com (earlier version)
MFC after: 2 weeks
information.
The existing algorithm selects a preferred leaf vdev based on offset of the zio
request modulo the number of members in the mirror. It assumes the devices are
of equal performance and that spreading the requests randomly over both drives
will be sufficient to saturate them. In practice this results in the leaf vdevs
being under utilized.
The new algorithm takes into the following additional factors:
* Load of the vdevs (number outstanding I/O requests)
* The locality of last queued I/O vs the new I/O request.
Within the locality calculation additional knowledge about the underlying vdev
is considered such as; is the device backing the vdev a rotating media device.
This results in performance increases across the board as well as significant
increases for predominantly streaming loads and for configurations which don't
have evenly performing devices.
The following are results from a setup with 3 Way Mirror with 2 x HD's and
1 x SSD from a basic test running multiple parrallel dd's.
With pre-fetch disabled (vfs.zfs.prefetch_disable=1):
== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 161 seconds @ 95 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 297 seconds @ 51 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 54 seconds @ 284 MB/s
With pre-fetch enabled (vfs.zfs.prefetch_disable=0):
== Stripe Balanced (default) ==
Read 15360MB using bs: 1048576, readers: 3, took 91 seconds @ 168 MB/s
== Load Balanced (zfslinux) ==
Read 15360MB using bs: 1048576, readers: 3, took 108 seconds @ 142 MB/s
== Load Balanced (locality freebsd) ==
Read 15360MB using bs: 1048576, readers: 3, took 48 seconds @ 320 MB/s
In addition to the performance changes the code was also restructured, with
the help of Justin Gibbs, to provide a more logical flow which also ensures
vdevs loads are only calculated from the set of valid candidates.
The following additional sysctls where added to allow the administrator
to tune the behaviour of the load algorithm:
* vfs.zfs.vdev.mirror.rotating_inc
* vfs.zfs.vdev.mirror.rotating_seek_inc
* vfs.zfs.vdev.mirror.rotating_seek_offset
* vfs.zfs.vdev.mirror.non_rotating_inc
* vfs.zfs.vdev.mirror.non_rotating_seek_inc
These changes where based on work started by the zfsonlinux developers:
https://github.com/zfsonlinux/zfs/pull/1487
Reviewed by: gibbs, mav, will
MFC after: 2 weeks
Sponsored by: Multiplay
zio_compress_data(..) when compressing l2arc buffers.
This eliminates l2arc I/O errors, which resulted in very poor performance on
vdev's configured with block size greater than 512b due to compression
assuming a smaller min block size than the vdev supports.
MFC after: 2 days
When safety requirements are met, it allows to avoid passing I/O requests
to GEOM g_up/g_down thread, executing them directly in the caller context.
That allows to avoid CPU bottlenecks in g_up/g_down threads, plus avoid
several context switches per I/O.
The defined now safety requirements are:
- caller should not hold any locks and should be reenterable;
- callee should not depend on GEOM dual-threaded concurency semantics;
- on the way down, if request is unmapped while callee doesn't support it,
the context should be sleepable;
- kernel thread stack usage should be below 50%.
To keep compatibility with GEOM classes not meeting above requirements
new provider and consumer flags added:
- G_CF_DIRECT_SEND -- consumer code meets caller requirements (request);
- G_CF_DIRECT_RECEIVE -- consumer code meets callee requirements (done);
- G_PF_DIRECT_SEND -- provider code meets caller requirements (done);
- G_PF_DIRECT_RECEIVE -- provider code meets callee requirements (request).
Capable GEOM class can set them, allowing direct dispatch in cases where
it is safe. If any of requirements are not met, request is queued to
g_up or g_down thread same as before.
Such GEOM classes were reviewed and updated to support direct dispatch:
CONCAT, DEV, DISK, GATE, MD, MIRROR, MULTIPATH, NOP, PART, RAID, STRIPE,
VFS, ZERO, ZFS::VDEV, ZFS::ZVOL, all classes based on g_slice KPI (LABEL,
MAP, FLASHMAP, etc).
To declare direct completion capability disk(9) KPI got new flag equivalent
to G_PF_DIRECT_SEND -- DISKFLAG_DIRECT_COMPLETION. da(4) and ada(4) disk
drivers got it set now thanks to earlier CAM locking work.
This change more then twice increases peak block storage performance on
systems with manu CPUs, together with earlier CAM locking changes reaching
more then 1 million IOPS (512 byte raw reads from 16 SATA SSDs on 4 HBAs to
256 user-level threads).
Sponsored by: iXsystems, Inc.
MFC after: 2 months
the register based on the argument index rather than relying on the fields
in struct reg to be in the right order. This assumption is incorrect on
FreeBSD and generally led to bogus argument values for the sixth argument
of PID and USDT probes; the first five are passed directly to dtrace_probe()
via the fasttrap trap handler and so were correctly handled.
MFC after: 2 weeks
null-separated strings to a single string. This can be used to print the
full arguments of a process using execsnoop (from the DTrace toolkit) or
with the following one-liner:
dtrace -n 'syscall::execve:return {trace(curpsinfo->pr_psargs);}'
Note that this relies on the process arguments being cached via the struct
proc, which means that it will not work for argvs longer than
kern.ps_arg_cache_limit. However, the following rather non-portable
script can be used to extract any argv at exec time:
fbt::kern_execve:entry
{
printf("%s", memstr(args[1]->begin_argv, ' ',
args[1]->begin_envv - args[1]->begin_argv));
}
The debug.dtrace.memstr_max sysctl limits the maximum argument size to
memstr(). Thanks to Brendan Gregg for helpful comments on freebsd-dtrace.
Tested by: Fabian Keil (earlier version)
MFC after: 2 weeks
illumos change 14172:be36a38bac3d:
illumos ZFS issues:
4082 zfs receive gets EFBIG from dmu_tx_hold_free()
Please note that this change is slightly different from r255257, because
it is merged out of order with other (larger) upstream changes.
PR: kern/182570
Reported by: Keith White <kwhite@site.uottawa.ca>
Tested by: Keith White <kwhite@site.uottawa.ca>
Approved by: re (glebius)
MFC after: 1 week
X-MFC after: r254753
handlers rather than in the dtrace device open/close methods. The current
approach can cause a panic if the device is closed which the taskqueue
thread is active, or if a kernel module containing a provider is unloaded
while retained enablings are present and the dtrace device isn't opened.
Submitted by: gibbs (original version)
Reviewed by: gibbs
Approved by: re (glebius)
MFC after: 2 weeks
Add support of Illumos dumps on zvol over RAID-Z.
Note that this only adds the features. FreeBSD would
still need more work to support dumping on zvols.
Illumos ZFS issues:
2932 support crash dumps to raidz, etc. pools
MFC after: 1 month
Approved by: re (ZFS blanket)
where 'fdvp' is a directory, 'tvp' is an already existing directory
and they have different mount points.
Reported by: avg, pjd
Reviewed by: pjd
Approved by: re (rodrigc)