The purpose of this change is to group kernelthreads specific to a
particular ZFS pool under a kernel process. There can be many dozens of
threads per pool. This change improves observability of those threads.
This change consists of several subchanges:
1. illumos taskq_create_proc can now pass its process parameter to
taskqueue. Also, use zfsproc instead of NULL for taskq_create. Caveat:
zfsproc might not be initialized yet. But in that case it is still NULL,
so not worse than before.
2. illumos sys/proc.h: kthread id is stored in t_did field, not t_tid.
3. zfs: enable SPA_PROCESS on the kernel side. The change is a bit hairy
as newproc() is implemented privately to spa.c. I couldn't think of a
better way to populate process name than to poke inside the argument for
the process routine.
4. illumos thread_create: allow assigning thread to process other than
zfsproc.
5. zfs: expose spa_proc to other users, assign sync and quiesce threads
to it.
Pool-specific threads created using (relatively new) zthr mechanism are
still assigned to the zfskern process rather than to a respective
zpool-xxx process. I am going to address this a bit later.
Reviewed by: no one
MFC after: 5 weeks
Relnotes: perhaps
Differential Revision: https://reviews.freebsd.org/D9720
Since physical device asize is calculated from psize and the asize is stored
in pool label, we can use asize to set the value of psize, which is used to
calculate the location of the pool labels.
MFC after: 1 week
Port illumos change: https://www.illumos.org/issues/11667
Move lz4.c out of zfs tree to opensolaris/common/lz4, adjust it to be
usable from kernel/stand/userland builds, so we can use just one single
source. Add lz4.h to declare lz4_compress() and lz4_decompress().
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D22037
The previous code came from OpenSolaris, which in my understanding require
allocation size to be known to free memory. To store that size previous
code allocated additional 8 byte header. But I have noticed that zlib
with present settings allocates 64KB context buffers for each call, that
could be efficiently cached by UMA, but addition of those 8 bytes makes
them fall back to physical RAM allocations, that cause huge overhead and
lock congestion on small blocks. Since FreeBSD's free() does not have
the size argument, switching to it solves the problem, increasing write
speed to ZVOLs with 4KB block size and GZIP compression on my 40-threads
test system from ~60MB/s to ~600MB/s.
MFC after: 1 week
Sponsored by: iXsystems, Inc.
This patch modifies the zfs_ioc_snapshot_list_next() ioctl to enable it
to take input parameters that alter the way looping through the list of
snapshots is performed. The idea here is to restrict functions that
throw away some of the snapshots returned by the ioctl to a range of
snapshots that these functions actually use. This improves efficiency
and execution speed for some rollback and send operations.
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
Closes#8077zfsonlinux/zfs@4c0883fb4a
MFC after: 2 weeks
except for filesystems that set the MNTK_VMSETSIZE_BUG, Set the flag for ZFS.
Tested by: pho
Sponsored by: The FreeBSD Foundation
MFC after: 2 weeks
Differential revision: https://reviews.freebsd.org/D21883
illumos/illumos-gate@c4ab0d3f46c4ab0d3f46https://www.illumos.org/issues/10809
Port ZoL ee36c709c3 Performance optimization of AVL tree comparator functions
This is a followup to r337567 that imported the ZoL commit directly into
FreeBSD. It seems that at the time we did not have some of the earlier
changes, so some pieces of the ZoL change were not applicable. Also,
the illumos version got a few style cleanups. Some changes were missed
or incorrectly merged (e.g., vdev_cache_lastused_compare and
metaslab_rangesize_compare).
Obtained from: ZoL, illumos
MFC after: 25 days
X-MFC after: r353634
illumos/illumos-gate@7931524763
FreeBSD note: some tweaking was needed to avoid a conflict with
sys/rangelock.h.
Author: Matthew Ahrens <mahrens@delphix.com>
Obtained from: illumos
MFC after: 3 weeks
illumos/illumos-gate@52abb70e0752abb70e07https://www.illumos.org/issues/9691
When iterating over a ZAP object, we're almost always certain to
iterate over the entire object. If there are multiple leaf blocks, we
can realize a performance win by issuing reads for all the leaf blocks
in parallel when the iteration begins.
For example, if we have 10,000 snapshots, "zfs destroy -nv
pool/fs@1%9999" can take 30 minutes when the cache is cold. This
change provides a >3x performance improvement, by issuing the reads
for all ~64 blocks of each ZAP object in parallel.
Author: Matthew Ahrens <mahrens@delphix.com>
Obtained from: illumos
MFC after: 2 weeks
illumos/illumos-gate@d0cb1fb926d0cb1fb926https://www.illumos.org/issues/9425
Problem Statement
ZFS Channel program scripts currently require a timeout, so that hung
or long-running scripts return a timeout error instead of causing ZFS
to get wedged. This limit can currently be set up to 100 million Lua
instructions. Even with a limit in place, it would be desirable to
have a sys admin (support engineer) be able to cancel a script that is
taking a long time.
Proposed Solution
Make it possible to abort a channel program by sending an interrupt
signal.In the underlying txg_wait_sync function, switch the cv_wait to
a cv_wait_sig to catch the signal. Once a signal is encountered, the
dsl_sync_task function can install a Lua hook that will get called
before the Lua interpreter executes a new line of code. The
dsl_sync_task can resume with a standard txg_wait_sync call and wait
for the txg to complete. Meanwhile, the hook will abort the script and
indicate that the channel program was canceled. The kernel returns a
EINTR to indicate that the channel program run was canceled.
FreeBSD note: the return value of cv_wait_sig() has inverted meaning
between us and illumos.
Author: Don Brady <don.brady@delphix.com>
Obtained from: illumos
MFC after: 4 weeks
illumos/illumos-gate@a0b03b161ca0b03b161chttps://www.illumos.org/issues/10330
3 recent ZoL changes in the vdev and metaslab code which we can pull over:
PR 8324 c853f382db 8324 Change target size of metaslabs from 256GB to 16GB
PR 8290 b194fab0fb 8290 Factor metaslab_load_wait() in metaslab_load()
PR 8286 419ba59145 8286 Update vdev_is_spacemap_addressable() for new spacemap
encoding
Author: Serapheim Dimitropoulos <serapheimd@gmail.com>
Obtained from: illumos, ZoL
MFC after: 2 weeks
illumos/illumos-gate@e914ace2e9e914ace2e9https://www.illumos.org/issues/10343
On the openzfs feature/porting matrix, this is listed as:
prefix to refcount funcs/types
Having these changes will make it easier to share other work across the
different ZFS operating systems.
PR 7963 424fd7c3e Prefix all refcount functions with zfs_
PR 7885 & 7932 c13060e47 Linux 4.19-rc3+ compat: Remove refcount_t compat
PR 5823 & 5842 4859fe796 Linux 4.11 compat: avoid refcount_t name conflict
Author: Tim Schumacher <timschumi@gmx.de>
Obtained from: illumos, ZoL
MFC after: 3 weeks
illumos/illumos-gate@aa02ea0194aa02ea0194
10572 Fix race in dnode_check_slots_free()
https://www.illumos.org/issues/10572
The Fix from ZoL:
Currently, dnode_check_slots_free() works by checking dn->dn_type
in the dnode to determine if the dnode is reclaimable. However,
there is a small window of time between dnode_free_sync() in the
first call to dsl_dataset_sync() and when the useraccounting code
is run when the type is set DMU_OT_NONE, but the dnode is not yet
evictable, leading to crashes. This patch adds the ability for
dnodes to track which txg they were last dirtied in and adds a
check for this before performing the reclaim.
This patch also corrects several instances when dn_dirty_link was
treated as a list_node_t when it is technically a multilist_node_t.
10579 Don't allow dnode allocation if dn_holds != 0
https://www.illumos.org/issues/10579
The fix from ZoL:
This patch simply fixes a small bug where dnode_hold_impl() could
attempt to allocate a dnode that was in the process of being freed,
but which still had active references. This patch simply adds the
required check.
Author: Tom Caputi <tcaputi@datto.com>
Reported by: delphij
MFC after: 2 weeks
X-MFC with: r353176
illumos/illumos-gate@946342a260946342a260https://www.illumos.org/issues/10452
illumos is missing a few small follow up ZoL bug fixes for the large dnode
feature. We should pull those in.
Those commits are in the ZoL tree as (newest to oldest):
PR 8435 - 75d6b7ddca - Add missing copyright
notice to large_dnode tests
PR 7433 - e14a32b1c8 - Fix object reclaim when
using large dnodes
PR 6616 - 48fbb9ddbf - Free objects when
receiving full stream as clone
PR 6695 - 39f56627ae - receive_freeobjects()
skips freeing some object
Portions contributed by: Ned Bass <bass6@llnl.gov>
Portions contributed by: Tom Caputi <tcaputi@datto.com>
Author: Fabian Grünbichler <f.gruenbichler@proxmox.com>
Obtained from: illumos, ZoL
MFC after: 2 weeks
X-MFC with: r353176
Atomics are used for page busy and valid state when the shared busy is
held. The details of the locking protocol and valid and dirty
synchronization are in the updated vm_page.h comments.
Reviewed by: kib, markj
Tested by: pho
Sponsored by: Netflix, Intel
Differential Revision: https://reviews.freebsd.org/D21594
fcmpset can have two kinds of semantics, weak and strong.
For practical purposes, strong semantics means that if fcmpset fails
then the reported current value is always different from the expected
value. Weak semantics means that the reported current value may be the
same as the expected value even though fcmpset failed. That's a so
called "sporadic" failure.
I originally implemented atomic_cas expecting strong semantics, but many
platforms actually have weak one.
Reported by: pkubaj (not confirmed if same issue)
Discussed with: kib, mjg
MFC after: 19 days
X-MFC with: r353340
This patch fixes 2 issues with the DMU free throttle implemented
in dmu_free_long_range(). The first issue is that get_next_chunk()
was calculating the number of L1 blocks the free would dirty
incorrectly. In some cases involving extremely large files, this
code would greatly overestimate the number of affected L1 blocks,
causing excessive calls to txg_wait_open(). This patch corrects
the calculation.
The second issue is that the free throttle uses the total number
of free'd blocks in all (open, quiescing, and syncing) txgs to
determine whether to throttle. This causes large frees (such as
those created by the first issue) to cause 4 txg syncs before
any further frees were allowed to proceed. This patch ensures
that the accounting is done entirely in a per-txg fashion, so
that frees from a given txg don't affect those that immediately
follow it.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Tom Caputi <tcaputi@datto.com>
zfsonlinux/zfs@f4c594da94
Freeing throttle should account for holes
Deletion throttle currently does not account for holes in a file.
This means that it can activate when it shouldn't.
To fix it we switch the throttle to be based on the number of
L1 blocks we will have to dirty when freeing
Reviewed-by: Tom Caputi <tcaputi@datto.com>
Reviewed-by: Matt Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alek Pinchuk <apinchuk@datto.com>
zfsonlinux/zfs@65282ee9e0
Submitted by: Alek Pinchuk <pinchuk.alek@gmail.com>
Reviewed by: allanjude
MFC after: 2 weeks
Sponsored by: Axcient
Differential Revision: https://reviews.freebsd.org/D21895
membar_producer is supposed to be a store-store barrier.
Also, in the code that FreeBSD has ported from illumos membar_producer
is used only with regular stores to regular memory (with respect to
caching).
We do not have an MI primitive for the store-store barrier, so
atomic_thread_fence_rel is the closest we have as it provides
(load | store) -> store barrier.
Previously, membar_producer was an empty function call on all 32-bit
arm-s, 32-bit powerpc, riscv and all mips variants. I think that it was
inadequate.
On other platforms, such as amd64, arm64, i386, powerpc64, sparc64,
membar_producer was implemented using stronger primitives than required
for a store-store barrier with respect to regular memory access.
For example, it used sfence on amd64 and lock-ed nop in i386 (despite TSO).
On powerpc64 we now use recommended lwsync instead of eieio.
On sparc64 FreeBSD uses TSO mode.
On arm64/aarch64 we now use dmb sy instead of dmb ish. Not sure if this
is an improvement, actually.
After this change we can drop opensolaris_atomic.S for aarch64, amd64,
powerpc64 and sparc64 as all required atomic operations have either
direct or light-weight mapping to FreeBSD native atomic operations.
Discussed with: kib
MFC after: 4 weeks
atomic_cas_32 is implemented using atomic_fcmpset_32 on all platforms.
Ditto for atomic_cas_64 and atomic_fcmpset_64 on platforms that have it.
The only exception is sparc64 that provides MD atomic_cas_32 and
atomic_cas_64.
This is slightly inefficient as fcmpset reports whether the operation
updated the target and that information is not needed for cas.
Nevertheless, there is less code to maintain and to add for new platforms.
Also, the operations are done inline now as opposed to function calls before.
atomic_add_64_nv is implemented using atomic_fetchadd_64 on platforms
that provide it.
casptr, cas32, atomic_or_8, atomic_or_8_nv are completely removed as they
have no users.
atomic_mtx that is used to emulate 64-bit atomics on platforms that lack
them is defined only on those platforms.
As a result, platform specific opensolaris_atomic.S files have lost most of
their code. The only exception is i386 where the compat+contrib code
provides 64-bit atomics for userland use. That code assumes availability of
cmpxchg8b instruction. FreeBSD does not have that assumption for i386
userland and does not provide 64-bit atomics. Hopefully, this can and will
be fixed.
MFC after: 3 weeks
As long as we support ZFS on 32-bit platforms we should do this for all
64-bit variables that are modified in a lockless fashion using atomic
operations. Otherwise, there is a risk of a reading a torn value.
Here is a rationale for why I am doing this in dmu_object_alloc_impl:
- it's very recent code
- the code deals with object IDs and a number of objects in a file
system can overflow 32 bits
- incorrect allocation of an object ID may result in hard to debug
problems
- fixing all plain reads of 64-bit atomic variables is not a trivial
undertaking to do in one shot, so I chose to do it incrementally
MFC after: 3 weeks
X-MFC after: r353301, r353176
The compatibility code for the atomic operations in ZFS code is a bit
messy. In some cases the native definitions are directly made
available, in some cases there are emulated operations in
opensolaris_atomic.c and in yet other cases there are atomic operations
implemented in assembly that were obtained from OpenSolaris / illumos.
This commit adds atomic_swap_64 for use with i386 userland.
The code is copied from illumos.
I am not sure why FreeBSD does not provide that operation natively.
Maybe because we try (or pretend) to support processors that did not
have the necessary instructions.
While here I also added atomic_load_64 for the same reasons.
This is original code based on iilumos atomic_swap_64 and FreeBSD
atomic_load_acq_64_i586.
Pointyhat to: avg
MFC after: 1 week
8423 8199 7432 Implement large_dnode pool feature
7432 Large dnode pool feature
8199 multi-threaded dmu_object_alloc()
8423 Implement large_dnode pool feature
10406 large_dnode changes broke zfs recv of legacy stream
llumos/illumos-gate@54811da5ac54811da5achttps://www.illumos.org/issues/8423https://www.illumos.org/issues/8199https://www.illumos.org/issues/7432illumos/illumos-gate@811964cd9f811964cd9fhttps://www.illumos.org/issues/10406
ZoL issues:
Improved dnode allocation #6564
Clean up large dnode code #6262
Fix dnode_hold() freeing dnode behavior #8172
Fix dnode allocation race #6414, #6439
Partial: Raw sends must be able to decrease nlevels #6821, #6864
Remove unnecessary txg syncs from receive_object() Closes#7197
This updates FreeBSD large_dnode code (that was imported from ZoL) to a
version that was committed to illumos. It has some cleanups,
improvements and fixes comparing to what we have in FreeBSD now.
I think that the most significant update is 8199 multi-threaded
dmu_object_alloc().
This commit reverts r351077 that was a revert of r351074 and r351076 and
restores those changes. Required atomic operations should be available
now on all platforms where we build ZFS.
Obtained from: illumos
MFC after: 3 weeks
Previously, the code used a plain store on platforms that lacked
atomic_swap_64 and possibly some other platforms as the condition worked
only if atomic_swap_64 was a macro.
MFC after: 1 week
X-MFC after: r353166, r353167
Some 32-bit platforms do not provide 64-bit atomic operations that ZFS
requires, either in userland or at all. We emulate those operations for
those platforms using a mutex. That is not entirely correct and it's
very efficient. Besides, the loads are plain loads, so torn values are
possible.
Nevertheless, the emulation seems to work for some definition of work.
This change adds atomic_swap_64, which is already used in ZFS code, and
atomic_load_64 that can be used to prevent torn reads.
MFC after: 1 week
The registers in ilumos and FreeBSD have a different number.
In the illumos, last 32-bits register defined is SS an in FreeBSD is GS.
While translating register we should comper it to the highest one.
PR: 240358
Reported by: lwhsu@
MFC after: 2 weeks
The feature is implemented as an extension of the existing
ZFS_IOC_RENAME ioctl. Both the userland and the DSL interfaces support
renaming only a single bookmark at a time. As of now, there is no ZCP
interface to the new functionality. I am going to add it once the DSL
interface passes a test of time.
This change picks up support for zfs_ioc_namecheck_t::ENTITY_NAME that
was added to ZoL as part of Redacted Send/Receive feature by Paul
Dagnelie <pcd@delphix.com>. This is needed to allow a bookmark name in
zc_name.
Discussed with: mahrens
Reviewed by: bcr (man page)
Sponsored by: CyberSecure
Differential Revision: https://reviews.freebsd.org/D21795
Before my ZIL space optimization few years ago 128KB writes were logged
as two 64KB+ records in two 128KB log blocks. After that change it became
~124KB+/4KB+ in two 128KB log blocks to free space in the second block
for another record. Unfortunately in case of 128KB only writes, when space
in the second block remained unused, that change increased write latency by
imbalancing checksum computation time between parallel threads.
This change introduces new 68KB log block size, used for both writes below
67KB and 128KB-sharp writes. Writes of 68-127KB are still using one 128KB
block to not increase processing overhead. Writes above 131KB are still
using full 128KB blocks, since possible saving there is small. Mixed loads
will likely also fall back to previous 128KB, since code uses maximum of
the last 10 requested block sizes.
On a simple 128KB write test with queue depth of 1 this change demonstrates
~15-20% performance improvement.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
There is no reason for these routines to be written in assembly. In
the ports of DTrace to other platforms, they are already written in C.
No functional change intended.
MFC after: 1 week
Sponsored by: Netflix
Add a small wrapper around libzfs_core's lzc_send_space() to libzfs so
that every legacy ZFS_IOC_SEND consumer, along with their userland
counterpart estimate_ioctl(), can leverage ZFS_IOC_SEND_SPACE to
request send space estimation.
The legacy functionality in zfs_ioc_send() is left untouched for
compatibility purposes.
Obtained from: ZoL
Obtained from: zfsonlinux/zfs@cf7684bc8d
Author: loli10K <ezomori.nozomu@gmail.com>
MFC after: 2 weeks
It was incorrect with respect to swapping dataset IDs both in the
on-disk ZAP object and the in-memory queue.
In both cases, if only ds1 was already present, then it would be first
replaced with ds2 and then ds2 would be replaced back with ds1. Also,
both cases did not properly handle a situation where both ds1 and ds2
are already queued. A duplicate insertion would be attempted and its
failure would result in a panic.
This change has also been submitted to ZoL as zfsonlinux/zfs@dd262c9
PR: 239566
Reported by: pascal.guitierrez@gmail.com
MFC after: 4 days
Sponsored by: CyberSecure
New primitive is introduced to denote sections can operate locklessly
on aspects of struct mount, but which can also be disabled if necessary.
This provides an opportunity to start scaling common case modifications
while providing stable state of the struct when facing unmount, write
suspendion or other events.
mnt_ref is the first counter to start being managed in this manner with
the intent to make it per-cpu.
Reviewed by: kib, jeff
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21425
- Remove a dead variable from the amd64 pmap_extract_and_hold().
- Fix grammar in the vm_page_wire man page.
Reported by: alc
Reviewed by: alc, kib
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21639
- VM_ALLOC_NOCREAT will grab without creating a page.
- vm_page_grab_valid() will grab and page in if necessary.
- vm_page_busy_acquire() automates some busy acquire loops.
Discussed with: alc, kib, markj
Tested by: pho (part of larger branch)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21546
races with page busy state. The object lock is still used as an interlock
to ensure that the identity stays valid. Most callers should use
vm_page_sleep_if_busy() to handle the locking particulars.
Reviewed by: alc, kib, markj
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D21255
There are several mechanisms by which a vm_page reference is held,
preventing the page from being freed back to the page allocator. In
particular, holding the page's object lock is sufficient to prevent the
page from being freed; holding the busy lock or a wiring is sufficent as
well. These references are protected by the page lock, which must
therefore be acquired for many per-page operations. This results in
false sharing since the page locks are external to the vm_page
structures themselves and each lock protects multiple structures.
Transition to using an atomically updated per-page reference counter.
The object's reference is counted using a flag bit in the counter. A
second flag bit is used to atomically block new references via
pmap_extract_and_hold() while removing managed mappings of a page.
Thus, the reference count of a page is guaranteed not to increase if the
page is unbusied, unmapped, and the object's write lock is held. As
a consequence of this, the page lock no longer protects a page's
identity; operations which move pages between objects are now
synchronized solely by the objects' locks.
The vm_page_wire() and vm_page_unwire() KPIs are changed. The former
requires that either the object lock or the busy lock is held. The
latter no longer has a return value and may free the page if it releases
the last reference to that page. vm_page_unwire_noq() behaves the same
as before; the caller is responsible for checking its return value and
freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is
introduced for use in pmap_extract_and_hold(). It fails if the page is
concurrently being unmapped, typically triggering a fallback to the
fault handler. vm_page_wire() no longer requires the page lock and
vm_page_unwire() now internally acquires the page lock when releasing
the last wiring of a page (since the page lock still protects a page's
queue state). In particular, synchronization details are no longer
leaked into the caller.
The change excises the page lock from several frequently executed code
paths. In particular, vm_object_terminate() no longer bounces between
page locks as it releases an object's pages, and direct I/O and
sendfile(SF_NOCACHE) completions no longer require the page lock. In
these latter cases we now get linear scalability in the common scenario
where different threads are operating on different files.
__FreeBSD_version is bumped. The DRM ports have been updated to
accomodate the KPI changes.
Reviewed by: jeff (earlier version)
Tested by: gallatin (earlier version), pho
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20486