There are several mechanisms by which a vm_page reference is held,
preventing the page from being freed back to the page allocator. In
particular, holding the page's object lock is sufficient to prevent the
page from being freed; holding the busy lock or a wiring is sufficent as
well. These references are protected by the page lock, which must
therefore be acquired for many per-page operations. This results in
false sharing since the page locks are external to the vm_page
structures themselves and each lock protects multiple structures.
Transition to using an atomically updated per-page reference counter.
The object's reference is counted using a flag bit in the counter. A
second flag bit is used to atomically block new references via
pmap_extract_and_hold() while removing managed mappings of a page.
Thus, the reference count of a page is guaranteed not to increase if the
page is unbusied, unmapped, and the object's write lock is held. As
a consequence of this, the page lock no longer protects a page's
identity; operations which move pages between objects are now
synchronized solely by the objects' locks.
The vm_page_wire() and vm_page_unwire() KPIs are changed. The former
requires that either the object lock or the busy lock is held. The
latter no longer has a return value and may free the page if it releases
the last reference to that page. vm_page_unwire_noq() behaves the same
as before; the caller is responsible for checking its return value and
freeing or enqueuing the page as appropriate. vm_page_wire_mapped() is
introduced for use in pmap_extract_and_hold(). It fails if the page is
concurrently being unmapped, typically triggering a fallback to the
fault handler. vm_page_wire() no longer requires the page lock and
vm_page_unwire() now internally acquires the page lock when releasing
the last wiring of a page (since the page lock still protects a page's
queue state). In particular, synchronization details are no longer
leaked into the caller.
The change excises the page lock from several frequently executed code
paths. In particular, vm_object_terminate() no longer bounces between
page locks as it releases an object's pages, and direct I/O and
sendfile(SF_NOCACHE) completions no longer require the page lock. In
these latter cases we now get linear scalability in the common scenario
where different threads are operating on different files.
__FreeBSD_version is bumped. The DRM ports have been updated to
accomodate the KPI changes.
Reviewed by: jeff (earlier version)
Tested by: gallatin (earlier version), pho
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D20486
This fixes a hole in the situation where the resume state is left from
receiving a new dataset and, so, the state is set on the dataset itself
(as opposed to %recv child).
Additionally, distinguish incremental and resume streams in error
messages.
This was also committed to ZoL:
zfsonlinux/zfs@ebeb6f23bf
MFC after: 2 weeks
Sponsored by: CyberSecure
vnodes have 2 reference counts - holdcnt to keep the vnode itself from getting
freed and usecount to denote it is actively used.
Previously all operations bumping usecount would also bump holdcnt, which is
not necessary. We can detect if usecount is already > 1 (in which case holdcnt
is also > 1) and utilize it to avoid bumping holdcnt on our own. This saves
on atomic ops.
Reviewed by: kib
Tested by: pho (previous version)
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D21471
The page daemon periodically invokes uma_reclaim() to reclaim cached
items from each zone when the system is under memory pressure. This
is important since the size of these caches is unbounded by default.
However it also results in bursts of high latency when allocating from
heavily used zones as threads miss in the per-CPU caches and must
access the keg in order to allocate new items.
With r340405 we maintain an estimate of each zone's usage of its
(per-NUMA domain) cache of full buckets. Start making use of this
estimate to avoid reclaiming the entire cache when under memory
pressure. In particular, introduce TRIM, DRAIN and DRAIN_CPU
verbs for uma_reclaim() and uma_zone_reclaim(). When trimming, only
items in excess of the estimate are reclaimed. Draining a zone
reclaims all of the cached full buckets (the previous behaviour of
uma_reclaim()), and may further drain the per-CPU caches in extreme
cases.
Now, when under memory pressure, the page daemon will trim zones
rather than draining them. As a result, heavily used zones do not incur
bursts of bucket cache misses following reclamation, but large, unused
caches will be reclaimed as before.
Reviewed by: jeff
Tested by: pho (an earlier version)
MFC after: 2 months
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D16667
Current implementation of vnode_create_vobject() and
vnode_destroy_vobject() is written so that it prepared to handle the
vm object destruction for live vnode. Practically, no filesystems use
this, except for some remnants that were present in UFS till today.
One of the consequences of that model is that each filesystem must
call vnode_destroy_vobject() in VOP_RECLAIM() or earlier, as result
all of them get rid of the v_object in reclaim.
Move the call to vnode_destroy_vobject() to vgonel() before
VOP_RECLAIM(). This makes v_object stable: either the object is NULL,
or it is valid vm object till the vnode reclamation. Remove code from
vnode_create_vobject() to handle races with the parallel destruction.
Reviewed by: markj
Tested by: pho
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D21412
Previously, the permissions were checked on the pool which was obviously
incorrect.
After this change, zfs_check_userprops() only validates the properties
without any permission checks. The permissions are checked individually
for each snapshotted dataset.
This was also committed to ZoL: zfsonlinux/zfs@e6203d2
Reported by: CyberSecure
MFC after: 1 week
Sponsored by: CyberSecure
If vn_lock() failed, then the function returned the error but the vnode
obtained via zfs_zget() was never released.
MFC after: 10 days
Sponsored by: Panzura
illumos/illumos-gate@811964cd9f811964cd9fhttps://www.illumos.org/issues/10406
The large dnode changes from 8423 caused problems in zfs recv for a legacy
stream. This manifests when attempting to mount the received stream, but the
problem is in the receive code. We missed the following commit from ZoL which
fixes this.
commit da2feb42fb
Author: Tom Caputi <tcaputi@datto.com>
Date: Thu Jun 28 17:55:11 2018 -0400
Fix 'zfs recv' of non large_dnode send streams
Currently, there is a bug where older send streams without the
DMU_BACKUP_FEATURE_LARGE_DNODE flag are not handled correctly.
The code in receive_object() fails to handle cases where
drro->drr_dn_slots is set to 0, which is always the case when the
sending code does not support this feature flag. This patch fixes
the issue by ensuring that that a value of 0 is treated as
DNODE_MIN_SLOTS.
Author: Tom Caputi <tcaputi@datto.com>
MFC after: 3 weeks
X-MFC after: r351074
8423 8199 7432 Implement large_dnode pool feature
8423 Implement large_dnode pool feature
8199 multi-threaded dmu_object_alloc()
7432 Large dnode pool feature
llumos/illumos-gate@54811da5ac54811da5achttps://www.illumos.org/issues/8423https://www.illumos.org/issues/8199https://www.illumos.org/issues/7432
ZoL issues:
Improved dnode allocation #6564
Clean up large dnode code #6262
Fix dnode_hold() freeing dnode behavior #8172
Fix dnode allocation race #6414, #6439
Partial: Raw sends must be able to decrease nlevels #6821, #6864
Remove unnecessary txg syncs from receive_object() Closes#7197
This updates FreeBSD large_dnode code (that was imported from ZoL) to a version
that was committed to illumos. It has some cleanups, improvements and fixes
comparing to what we have in FreeBSD now. I think that the most significant
update is 8199 multi-threaded dmu_object_alloc().
Obtained from: illumos
MFC after: 3 weeks
illumos/illumos-gate@892586e8a1892586e8a1https://www.illumos.org/issues/6585
In any pool without the extensible dataset feature flag already enabled,
creating a dataset with dedup set to use one of the new checksums would result
in the following panic as soon as any data was added:
panic[cpu0]/thread=ffffff0006761c40: feature_get_refcount(spa, feature,
&refcount) != 48 (0x30 != 0x30), file: ../../common/fs/zfs/zfeature.c line 390
ffffff0006761830 fffffffffba8fbdd ()
ffffff0006761890 zfs:feature_do_action+11a ()
ffffff00067618c0 zfs:spa_feature_incr+1e ()
ffffff0006761920 zfs:dmu_object_zapify+b7 ()
ffffff00067619b0 zfs:dsl_dataset_activate_feature+97 ()
ffffff0006761a20 zfs:dsl_dataset_sync+ba ()
ffffff0006761ab0 zfs:dsl_pool_sync+153 ()
ffffff0006761b70 zfs:spa_sync+26e ()
ffffff0006761c20 zfs:txg_sync_thread+227 ()
ffffff0006761c30 unix:thread_start+8 ()
Inspection showed that feature->fi_feature was 7, which is the value of
SPA_FEATURE_EXTENSIBLE_DATASET in the spa_feature enum.
Testing shows that the panic can be prevented by explicitly setting extensible
dataset as a dependency for the sha512, edonr, and skein feature flags.
Alternatively, the new checksums code could possibly be changed to obviate the
need for the dependency.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Richard Laager <rlaager@wiktel.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: ilovezfs <ilovezfs@icloud.com>
Note that FreeBSD does not support ednor yet.
MFC after: 2 weeks
The race was introduced in r337669, the large dnode feature import from
ZoL. The problem was debugged by ZoL developers and then,
independently, on FreeBSD.
The fix is an early proposal by Brian Behlendorf:
50f32ed74e
This fix never went into ZoL. A larger change that was committed later
included a different solution because of the re-worked code.
Ideally, we want to revert this fix and re-synchronize FreeBSD large
dnode code with that in illumos (or newer ZoL). illumos has a later
import of the feature from ZoL that does not have the bug.
PR: 236480
Obtained from: Brian Behlendorf <behlendorf1@llnl.gov>
Submitted by: ncrogers@gmail.com (patch adaptation)
Reported by: ncrogers@gmail.com
Tested by: ncrogers@gmail.com,
Dennis Noordsij <dennis.noordsij@alumni.helsinki.fi>,
Julien Cigar <julien@perdition.city>
MFC after: 10 days
We should support removing vdev from boot pool. Update loader zfs reader
to support com.delphix:removing.
Reviewed by: allanjude
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D18901
with an eventual goal to convert all legacl zlib callers to the new zlib
version:
* Move generic zlib shims that are not specific to zlib 1.0.4 to
sys/dev/zlib.
* Connect new zlib (1.2.11) to the zlib kernel module, currently built
with Z_SOLO.
* Prefix the legacy zlib (1.0.4) with 'zlib104_' namespace.
* Convert sys/opencrypto/cryptodeflate.c to use new zlib.
* Remove bundled zlib 1.2.3 from ZFS and adapt it to new zlib and make
it depend on the zlib module.
* Fix Z_SOLO build of new zlib.
PR: 229763
Submitted by: Yoshihiro Ota <ota j email ne jp>
Reviewed by: markm (sys/dev/zlib/zlib_kmod.c)
Relnotes: yes
Differential Revision: https://reviews.freebsd.org/D19706
This ioctl is used when a breakpoint is encountered while disassembling
a symbol in the target process. Since only one DTrace consumer can
toggle or enumerate fasttrap probes from a given process at time, this
ioctl does not appear to be used in practice.
The hold_count and wire_count fields of struct vm_page are separate
reference counters with similar semantics. The remaining essential
differences are that holds are not counted as a reference with respect
to LRU, and holds have an implicit free-on-last unhold semantic whereas
vm_page_unwire() callers must explicitly determine whether to free the
page once the last reference to the page is released.
This change removes the KPIs which directly manipulate hold_count.
Functions such as vm_fault_quick_hold_pages() now return wired pages
instead. Since r328977 the overhead of maintaining LRU for wired pages
is lower, and in many cases vm_fault_quick_hold_pages() callers would
swap holds for wirings on the returned pages anyway, so with this change
we remove a number of page lock acquisitions.
No functional change is intended. __FreeBSD_version is bumped.
Reviewed by: alc, kib
Discussed with: jeff
Discussed with: jhb, np (cxgbe)
Tested by: pho (previous version)
Sponsored by: Netflix
Differential Revision: https://reviews.freebsd.org/D19247
DMU sync code calls taskq_dispatch() for each sublist of os_dirty_dnodes
and os_synced_dnodes. Since the number of sublists by default is equal
to number of CPUs, it will dispatch equal, potentially large, number of
tasks, waking up many CPUs to handle them, even if only one or few of
sublists actually have any work to do.
This change adds check for empty sublists to avoid this.
For busy ARC situation when arc_size close to arc_c is desired. But
then it is quite likely that aggsum_compare(&arc_size, arc_c) will need
to flush per-CPU buckets to find exact comparison result. Doing that
often in a hot path penalizes whole idea of aggsum usage there, since it
replaces few simple atomic additions with dozens of lock acquisitions.
Replacing aggsum_compare() with aggsum_upper_bound() in code increasing
arc_p when ARC is growing (arc_size < arc_c) according to PMC profiles
allows to save ~5% of CPU time in aggsum code during sequential write
to 12 ZVOLs with 16KB block size on large dual-socket system.
I suppose there some minor arc_p behavior change due to lower precision
of the new code, but I don't think it is a big deal, since it should
affect only very small window in time (aggsum buckets are flushed every
second) and in ARC size (buckets are limited to 10 average ARC blocks
per CPU).
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
It is too generous to collect in production debug traces that can only
be read with kernel debugger. Illumos includes special code in their
mdb debugger to read it, we don't.
MFC after: 1 week
Sponsored by: iXsystems, Inc.
Memory copy is too heavy operation to do under the congested lock.
Moving it out reduces congestion by many times to almost invisible.
Since the original zio removed from the queue, and the child zio is
not executed yet, I don't see why would the copy need protection.
My guess it just remained like this from the time when lock was not
dropped here, which was added later to fix lock ordering issue.
Multi-threaded sequential write tests with both HDD and SSD pools
with ZVOL block sizes of 4KB, 16KB, 64KB and 128KB all show major
reduction of lock congestion, saving from 15% to 35% of CPU time
and increasing throughput from 10% to 40%.
Reviewed by: ahrens, behlendorf, ryao
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
When ARC size is very small, aggsum_lower_bound(&arc_size) may return
negative values, that due to unsigned comparison caused delays, waiting
for arc_adjust() to "fix" it by calling aggsum_value(&arc_size). Use
of signed comparison there fixes the problem.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
While formally it is not necessary, but the sooner it start, the sooner it
finish, and supposedly less disturbing for workload it will be.
MFC after: 2 weeks
Before r305323 (MFV r302991: 6950 ARC should cache compressed data)
arc_read() code did this for access to a ghost buffer:
arc_adapt() (from arc_get_data_buf())
arc_access(hdr, hash_lock)
I.e., we first checked access to the MFU ghost/MRU ghost buffer and
adapt MFU/MRU sizes (in arc_adapt()) and next move buffer from the ghost
state to regular.
After r305323 the sequence is different:
arc_access(hdr, hash_lock);
arc_hdr_alloc_pabd(hdr);
I.e., we first move the buffer from the ghost state in arc_access() and
then we check access to buffer in ghost state (in arc_hdr_alloc_pabd()
-> arc_get_data_abd() -> arc_get_data_impl() -> arc_adapt()). This is
incorrect: arc_adapt() never see access to the ghost buffer because
arc_access() already migrated the buffer from the ghost state to
regular.
So, the fix is to restore a call to arc_adapt() before arc_access() and
to suppress the call to arc_adapt() after arc_access().
Submitted by: Slawa Olhovchenkov <slw@zxy.spb.ru>
MFC after: 2 weeks
Sponsored by: Integros [integros.com]
Differential Revision: https://reviews.freebsd.org/D19094
When disabling the last enabled userspace probe, fasttrap clears the
function pointers which hook in to the breakpoint handler. If a traced
thread hit a fasttrap breakpoint before it was removed, we must ensure
that it is able to call the hook; otherwise fasttrap will not consume
the trap and SIGTRAP will be delievered to the thread. Synchronize
with such threads by ensuring that they load the hook pointer with
interrupts disabled, and by completing an SMP rendezvous after removing
breakpoints and before clearing the pointers.
Reported by: Alexander Alexeev <Alexander.Alexeev@dell.com>
Tested by: Alexander Alexeev (earlier version)
Reviewed by: cem, kib
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D20526
The registers in ilumos and FreeBSD have a different number.
In the illumos, last 32-bits register defined is SS an in FreeBSD is GS.
This off-by-one caused the uregs array to returns the wrong 64-bits register
on amd64.
Reviewed by: markj
MFC after: 2 weeks
Differential Revision: https://reviews.freebsd.org/D20363
illumos/illumos-gate@6fe4f3002c
Reviewed by: Sara Hartse <sara.hartse@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Pavel Zakharov <pavel.zakharov@delphix.com>
This is irrelevant to FreeBSD, just to reduce divergence.
illumos/illumos-gate@17fb938fd6
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@2258ad0b75
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Matt Ahrens <matt@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: George Wilson <george.wilson@delphix.com>
illumos/illumos-gate@84927f52bd
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Allan Jude <allanjude@freebsd.org>
illumos/illumos-gate@047c81d31d
Reviewed by: Jerry Jelinek <jerry.jelinek@joyent.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Saso Kiselkov <saso.kiselkov@nexenta.com>
This is irrelevant to FreeBSD, just a diff reduction.
illumos/illumos-gate@7928f4baf4
Reviewed by: Serapheim Dimitropoulos <serapheim.dimitro@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@7341a7de4f
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Brad Lewis <brad.lewis@delphix.com>
Due to an attempt to check two conditions at once in a macro not designed
as such, the assertion would always evaluate to true.
#define VERIFY3_IMPL(LEFT, OP, RIGHT, TYPE) do { \
const TYPE __left = (TYPE)(LEFT); \
const TYPE __right = (TYPE)(RIGHT); \
if (!(__left OP __right)) \
assfail3(#LEFT " " #OP " " #RIGHT, \
(uintmax_t)__left, #OP, (uintmax_t)__right, \
__FILE__, __LINE__); \
_NOTE(CONSTCOND) } while (0)
#define ASSERT3U(x, y, z) VERIFY3_IMPL(x, y, z, uint64_t)
Mean that we compared:
left = (type == ZIO_TYPE_FREE || psize)
OP = "<="
right = (SPA_MAXBLOCKSIZE)
If the type was not FREE, 0 is less than SPA_MAXBLOCKSIZE (16MB)
If the type is ZIO_TYPE_FREE, 1 is less than SPA_MAXBLOCKSIZE
The constraint on psize (physical size of the FREE operation) is never
checked against SPA_MAXBLOCKSIZE
Reported by: Ka Ho Ng <khng300@gmail.com>
Reviewed by: kevans
MFC after: 2 weeks
Sponsored by: Klara Systems
'.' function names exist only in ELFv1. ELFv2 does away with function
descriptors, and look more like they do on powerpc(32) and most other
platforms, as direct function pointers. Stop blacklisting regular function
names in ELFv2.
Submitted by: Brandon Bergren
Differential Revision: https://reviews.freebsd.org/D20346
This allows to reduce memory waste by letting UMA to put multiple small
buffers into one memory page slab. The page sharing means that UMA
may not be able to free memory page when some of buffers are freed, but
alternatively memory used by that buffer would just be wasted from the
beginning.
This change follows alike change in ZoL, but unlike Linux (according to
my understanding of it from comments) FreeBSD never shares slabs bigger
then one memory page, so this should be even less invasive then there.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
This allows the user to enable, disable, and adjust the I/O deadman at
runtime. This can be especially useful when a pool is backed by remote
storage (such as iscsi, ggated, etc).
PR: 221906
Submitted by: Fabian Keil <fk@fabiankeil.de>
Obtained from: ElectroBSD
MFC after: 1 week
Sponsored by: Klara Systems
Event: Waterloo Hackathon 2019
This allows replacing "sys/eventfilter.h" includes with "sys/_eventfilter.h"
in other header files (e.g., sys/{bus,conf,cpu}.h) and reduces header
pollution substantially.
EVENTHANDLER_DECLARE and EVENTHANDLER_LIST_DECLAREs were moved out of .c
files into appropriate headers (e.g., sys/proc.h, powernv/opal.h).
As a side effect of reduced header pollution, many .c files and headers no
longer contain needed definitions. The remainder of the patch addresses
adding appropriate includes to fix those files.
LOCK_DEBUG and LOCK_FILE_LINE_ARG are moved to sys/_lock.h, as required by
sys/mutex.h since r326106 (but silently protected by header pollution prior
to this change).
No functional change (intended). Of course, any out of tree modules that
relied on header pollution for sys/eventhandler.h, sys/lock.h, or
sys/mutex.h inclusion need to be fixed. __FreeBSD_version has been bumped.
Fix stack unwinding such that requesting N stack frames in lockstat will
actually give you N frames, not anywhere from 0-3 as had been before.
lockstat prints the mutex function instead of the caller as the reported
locker, but the stack frame is detailed enough to find the real caller.
MFC after: 2 weeks
In all practical situations, the resolver visibility is static.
Requested by: markj
Sponsored by: The FreeBSD Foundation
MFC after: 1 week
Approved by: so (emaste)
Differential revision: https://reviews.freebsd.org/D20281
I believe previous ifdef caused NULL dereference in later zfs_log_create()
on attempt to create file inside directory belonging to ephemeral group
created on illumos, trying to write to log information about GID domain
of the newly created file, inheriting the ephemeral GID.
This patch reuses original illumos SGID code with exception that due to
lack of ID mapping code on FreeBSD ephemeral GID will turn into GID_NOBODY
by another ifdef inside zfs_fuid_map_id().
MFC after: 1 month
Sponsored by: iXsystems, Inc.
Fix some execution bugs in the dtrace powerpc asm. addme pulls in the carry
flag which we don't want, and the result wasn't recorded anyways, so the
following beq to check for exit condition wasn't checking the right
condition.
Simplify the stack walking in dtrace_isa.c, so there's only a single walker
that handles both pc and sp. This should make it easier to follow, and any
bugfix that may be needed for walking only needs to be made in one place
instead of two now.
MFC after: 2 weeks
the file associated with the given file descriptor.
Reviewed by: kib, asomers
Reviewed by: cem, jilles, brooks (they reviewed previous version)
Discussed with: pjd, and many others
Differential Revision: https://reviews.freebsd.org/D14567
I overlooked the fact that that VOP_FSYNC() call is not a FreeBSD VFS
call, but a macro that provides an illumos-compatible wrapper for the
FreeBSD operation.
PR: 236475
Reported by: lwhsu
Pointyhat to: avg
It simply doesn't work in general since VCPUs may migrate between
physical cores. The approach used to measure skew also doesn't
make much sense in a VM.
PR: 218452
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
`arc_reclaim_thread()` calls `arc_adjust()` after calling
`arc_kmem_reap_now()`; `arc_adjust()` signals `arc_get_data_buf()` to
indicate that we may no longer be `arc_is_overflowing()`.
The problem is, `arc_kmem_reap_now()` can take several seconds to
complete, has no impact on `arc_is_overflowing()`, but due to how the
code is structured, can impact how long the ARC will remain in the
`arc_is_overflowing()` state.
The fix is to use seperate threads to:
1. keep `arc_size` under `arc_c`, by calling `arc_adjust()`, which
improves `arc_is_overflowing()`
2. keep enough free memory in the system, by calling
`arc_kmem_reap_now()` plus `arc_shrink()`, which improves
`arc_available_memory()`.
illumos/illumos-gate@de753e34f9
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan McDonald <danmcd@joyent.com>
Reviewed by: Tim Kordas <tim.kordas@joyent.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Brad Lewis <brad.lewis@delphix.com>
I tried to save some CPU time on hopeless aggregation attempts, but it seems
the condition I added is overly strict, blocking also aggregation of optional
I/Os in cases which previously were possible. Revert just to be safe.
MFC after: 1 month
The LBA weighting makes sense on rotational media where the outer tracks
have twice the bandwidth of the inner tracks. However, it is detrimental
on nonrotational media such as solid state disks, where the only effect
is to ensure that metaslabs enter the best-fit allocation behavior
sooner, which is detrimental to performance. It also makes no sense on
files where the underlying filesystem can arrange things however it
wants.
Author: Richard Yao <ryao@gentoo.org>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3712zfsonlinux/zfs@fb40095f5f
To reduce code divergence this merge replaces equivalent but different
FreeBSD code detecting non-rotating medium vdevs.
MFC after: 1 month
Before sequential scrub patches ZFS never aggregated I/Os above 128KB.
Sequential scrub bumped that to 1MB, which motivation I understand for
spinning disks, since it should reduce number of head seeks. But for
SSDs it makes much less sense to me, especially on FreeBSD, where due
to MAXPHYS limitation device will likely still see bunch of 128KB I/Os
instead of one large. Having more strict aggregation limit allows to
avoid allocation of large memory buffer and memcpy to/from it, that is
a serious problem when bandwidth reaches few GB/s.
MFC after: 1 month
Sponsored by: iXsystems, Inc.
Update the bounds checking for zfs_vdev_aggregation_limit so that
it has a floor of zero and a maximum value of the supported block
size for the pool.
Additionally add an early return when zfs_vdev_aggregation_limit
equals zero to disable aggregation. For very fast solid state or
memory devices it may be more expensive to perform the aggregation
than to issue the IO immediately.
Author: Brian Behlendorf <behlendorf1@llnl.gov>
zfsonlinux/zfs@a58df6f536
MFV/ZoL: Cap maximum aggregate IO size
Commit 8542ef8 allowed optional IOs to be aggregated beyond
the specified aggregation limit. Since the aggregation limit
was also used to enforce the maximum block size, setting
`zfs_vdev_aggregation_limit=16777216` could result in an
attempt to allocate an ABD larger than 16M.
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#6259Closes#6270zfsonlinux/zfs@2d678f779a
I just found that at least on Skylake CPUs cpu_ticks() never returns odd
values, only even, and possibly has even bigger step (176/2?), that makes
its lower bits very bad entropy source, leaving half of taskqueues unused.
Switch to sbinuptime(), closer to upstreams, mitigates the problem by the
rate conversion working as kind of hash function. In case that is somehow
not enough (timer rate is too low or too divisible) mix in curcpu.
MFC after: 1 week
- Don't leak the ksiginfo structure.
- Hold the proc lock when sending a signal in fasttrap_sigsegv().
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
The file has not been touched upstream in over a decade, and the nature
of the code means that a lot of FreeBSD-specific bits are required. Remove
the dead code to improve readability. No functional change intended.
Discussed with: cem
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
No platforms except i386, amd64 and powerpc implement fasttrap; the
fasttrap files for other arches do not contain any code and bloat
the output from cscope, so just remove them.
MFC after: 1 week
fasttrap hooks the userspace breakpoint handler; the hook looks up the
breakpoint address in a hash table of tracepoints. It is possible for
the tracepoint to be removed by a different thread in between the
breakpoint trap and the hash table lookup, in which case SIGTRAP gets
delivered to the target process. Fix the problem by adding a
per-process generation counter that gets incremented when a tracepoint
belonging to that process is removed. Then, when a lookup fails, the
trapping instruction is restarted if the thread's counter doesn't match
that of the process.
Reviewed by: cem
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19273
a vdev that has the same name as the one stored in metadata and that has
all VDEV labels in place. If it cannot find a GEOM provider with the given
name and all VDEV labels it will scan all GEOM providers for the best match
(the most VDEV labels available), but here the name is ignored.
In case the ZFS pool is created, eg. using GPT partition label:
# zpool create tank /dev/gpt/tank
everything works, and on every import ZFS will pick /dev/gpt/tank and
not /dev/da0p4.
The problem occurs when da0p4 is extended and ZFS is unable to find all
VDEV labels in /dev/gpt/tank anymore (the VDEV labels stored at the end
of the partition are now somewhere else). In this case it will scan all
GEOM providers and will pick the first one with the best match, ie. da0p4.
Fix this problem by checking the VDEV/provider name even if we get the same
match. If the name is the same as the one we have in pool's metadata, prefer
this GEOM provider.
Reported by: oshogbo, Michal Mroz <m.mroz@fudosecurity.com>
Tested by: Michal Mroz <m.mroz@fudosecurity.com>
Obtained from: Fudo Security
In all cases where ZFS sends BIO_FLUSH, it first waits for all related
writes to complete, so its BIO_FLUSH does not care about strict ordering.
Removal of one makes life much easier at least for NVMe driver, which
hardware has no concept of request ordering, relying completely on software.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
There is no reason for this variable to be tunable.
This variable is used as a barrier in few places.
Discussed with: pjd
MFC after: 2 weeks
Sponsored by: Fudo Security
UFS will return EINVAL when quotas are not enabled on a filesystem; ZFS'
equivalent involves not having quotas (there is not way to enable or disable
quotas as such). My initial implementation had it return ENOENT, but
quotactl(2) indicates EINVAL is more appropriate.
MFC after: 2 weeks
Approved by: mav
Reviewed by: markj
Reported by: Emrion <kmachine@free.fr>
Sponsored by: iXsystems Inc
PR: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=234413
declare v3 objset size/layout to fix userboot and possibly other loader issues
- fix for userboot assertion failure in zfs_dev_close in free due to out of bounds write
- fix for zfs_alloc / zfs_free mismatch assertion failure when booting GPT on BIOS
Note that this commit brings only formatting changes that were done
during the final review of the illumos change, because FreeBSD got the
main changes before illumos.
illumos/illumos-gate@04e563565204e5635652https://www.illumos.org/issues/5882
This is an import of the temporary pool names functionality from ZoL:
e2282ef57e26b42f3f9d2f3ec9006100d2a8c92f83e9986f6e023bbe6f01
It is intended to assist the creation and management of virtual machines
that have their rootfs on ZFS on hosts that also have their rootfs on
ZFS. These situations cause SPA namespace collisions when the standard
name rpool is used in both cases. The solution is either to give each
guest pool a name unique to the host, which is not always desireable, or
boot a VM environment containing an ISO image to install it, which is
cumbersome.
MFC after: 1 week
Sponsored by: Panzura
dtrace has its own routines which were not updated after SMAP support got
implemented. Use ifunc just like for other routines.
This in particular fixes ustack().
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18542
loader has been supporting large_dnode for some time, no need to block the
feature for boot dataset.
Reviewed by: avg
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D18391
Directory entries must be padded to maintain alignment; in many
filesystems the padding was not initialized, resulting in stack
memory being copied out to userspace. With the ino64 work there
are also some explicit pad fields in struct dirent. Add a subroutine
to clear these bytes and use it in the in-tree filesystems. The
NFS client is omitted for now as it was fixed separately in r340787.
Reported by: Thomas Barabosch, Fraunhofer FKIE
Reviewed by: kib
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
It was reported, and I easily reproduced it, that this change triggers panic
when receiving replication stream with enabled embedded blocks, when short
file compressing into one embedded block changes its block size. I am not
sure that the problem is in this particuler patch, not just triggered by it,
but since investigation and fix will take some time, I've decided to revert
this for now.
PR: 198457, 233277
The FBT fuction boundary prober was setting one return probe marker value,
but the dtrace handler was expecting another. This causes a hang when
tracing return probes.
The d_off field has been added to the dirent structure recently.
Currently filesystems don't support this feature. Support has been
added and tested for zfs, ufs, ext2fs, fdescfs, msdosfs and unionfs.
A stub implementation is available for cd9660, nandfs, udf and
pseudofs but hasn't been tested.
Motivation for this feature: our usecase is for a userspace nfs server
(nfs-ganesha) with zfs. At the moment we cache direntry offsets by
calling lseek once per entry, with this patch we can get the offset
directly from getdirentries(2) calls which provides a significant
speedup.
Submitted by: Jack Halford <jack@gandi.net>
Reviewed by: mckusick, pfg, rmacklem (previous versions)
Sponsored by: Gandi.net
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D17917