It simply doesn't work in general since VCPUs may migrate between
physical cores. The approach used to measure skew also doesn't
make much sense in a VM.
PR: 218452
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
`arc_reclaim_thread()` calls `arc_adjust()` after calling
`arc_kmem_reap_now()`; `arc_adjust()` signals `arc_get_data_buf()` to
indicate that we may no longer be `arc_is_overflowing()`.
The problem is, `arc_kmem_reap_now()` can take several seconds to
complete, has no impact on `arc_is_overflowing()`, but due to how the
code is structured, can impact how long the ARC will remain in the
`arc_is_overflowing()` state.
The fix is to use seperate threads to:
1. keep `arc_size` under `arc_c`, by calling `arc_adjust()`, which
improves `arc_is_overflowing()`
2. keep enough free memory in the system, by calling
`arc_kmem_reap_now()` plus `arc_shrink()`, which improves
`arc_available_memory()`.
illumos/illumos-gate@de753e34f9
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Dan McDonald <danmcd@joyent.com>
Reviewed by: Tim Kordas <tim.kordas@joyent.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Brad Lewis <brad.lewis@delphix.com>
I tried to save some CPU time on hopeless aggregation attempts, but it seems
the condition I added is overly strict, blocking also aggregation of optional
I/Os in cases which previously were possible. Revert just to be safe.
MFC after: 1 month
The LBA weighting makes sense on rotational media where the outer tracks
have twice the bandwidth of the inner tracks. However, it is detrimental
on nonrotational media such as solid state disks, where the only effect
is to ensure that metaslabs enter the best-fit allocation behavior
sooner, which is detrimental to performance. It also makes no sense on
files where the underlying filesystem can arrange things however it
wants.
Author: Richard Yao <ryao@gentoo.org>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#3712zfsonlinux/zfs@fb40095f5f
To reduce code divergence this merge replaces equivalent but different
FreeBSD code detecting non-rotating medium vdevs.
MFC after: 1 month
Before sequential scrub patches ZFS never aggregated I/Os above 128KB.
Sequential scrub bumped that to 1MB, which motivation I understand for
spinning disks, since it should reduce number of head seeks. But for
SSDs it makes much less sense to me, especially on FreeBSD, where due
to MAXPHYS limitation device will likely still see bunch of 128KB I/Os
instead of one large. Having more strict aggregation limit allows to
avoid allocation of large memory buffer and memcpy to/from it, that is
a serious problem when bandwidth reaches few GB/s.
MFC after: 1 month
Sponsored by: iXsystems, Inc.
Update the bounds checking for zfs_vdev_aggregation_limit so that
it has a floor of zero and a maximum value of the supported block
size for the pool.
Additionally add an early return when zfs_vdev_aggregation_limit
equals zero to disable aggregation. For very fast solid state or
memory devices it may be more expensive to perform the aggregation
than to issue the IO immediately.
Author: Brian Behlendorf <behlendorf1@llnl.gov>
zfsonlinux/zfs@a58df6f536
MFV/ZoL: Cap maximum aggregate IO size
Commit 8542ef8 allowed optional IOs to be aggregated beyond
the specified aggregation limit. Since the aggregation limit
was also used to enforce the maximum block size, setting
`zfs_vdev_aggregation_limit=16777216` could result in an
attempt to allocate an ABD larger than 16M.
Author: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Giuseppe Di Natale <dinatale2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#6259Closes#6270zfsonlinux/zfs@2d678f779a
I just found that at least on Skylake CPUs cpu_ticks() never returns odd
values, only even, and possibly has even bigger step (176/2?), that makes
its lower bits very bad entropy source, leaving half of taskqueues unused.
Switch to sbinuptime(), closer to upstreams, mitigates the problem by the
rate conversion working as kind of hash function. In case that is somehow
not enough (timer rate is too low or too divisible) mix in curcpu.
MFC after: 1 week
- Don't leak the ksiginfo structure.
- Hold the proc lock when sending a signal in fasttrap_sigsegv().
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
The file has not been touched upstream in over a decade, and the nature
of the code means that a lot of FreeBSD-specific bits are required. Remove
the dead code to improve readability. No functional change intended.
Discussed with: cem
MFC after: 1 week
Sponsored by: The FreeBSD Foundation
No platforms except i386, amd64 and powerpc implement fasttrap; the
fasttrap files for other arches do not contain any code and bloat
the output from cscope, so just remove them.
MFC after: 1 week
fasttrap hooks the userspace breakpoint handler; the hook looks up the
breakpoint address in a hash table of tracepoints. It is possible for
the tracepoint to be removed by a different thread in between the
breakpoint trap and the hash table lookup, in which case SIGTRAP gets
delivered to the target process. Fix the problem by adding a
per-process generation counter that gets incremented when a tracepoint
belonging to that process is removed. Then, when a lookup fails, the
trapping instruction is restarted if the thread's counter doesn't match
that of the process.
Reviewed by: cem
MFC after: 2 weeks
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D19273
a vdev that has the same name as the one stored in metadata and that has
all VDEV labels in place. If it cannot find a GEOM provider with the given
name and all VDEV labels it will scan all GEOM providers for the best match
(the most VDEV labels available), but here the name is ignored.
In case the ZFS pool is created, eg. using GPT partition label:
# zpool create tank /dev/gpt/tank
everything works, and on every import ZFS will pick /dev/gpt/tank and
not /dev/da0p4.
The problem occurs when da0p4 is extended and ZFS is unable to find all
VDEV labels in /dev/gpt/tank anymore (the VDEV labels stored at the end
of the partition are now somewhere else). In this case it will scan all
GEOM providers and will pick the first one with the best match, ie. da0p4.
Fix this problem by checking the VDEV/provider name even if we get the same
match. If the name is the same as the one we have in pool's metadata, prefer
this GEOM provider.
Reported by: oshogbo, Michal Mroz <m.mroz@fudosecurity.com>
Tested by: Michal Mroz <m.mroz@fudosecurity.com>
Obtained from: Fudo Security
In all cases where ZFS sends BIO_FLUSH, it first waits for all related
writes to complete, so its BIO_FLUSH does not care about strict ordering.
Removal of one makes life much easier at least for NVMe driver, which
hardware has no concept of request ordering, relying completely on software.
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
There is no reason for this variable to be tunable.
This variable is used as a barrier in few places.
Discussed with: pjd
MFC after: 2 weeks
Sponsored by: Fudo Security
UFS will return EINVAL when quotas are not enabled on a filesystem; ZFS'
equivalent involves not having quotas (there is not way to enable or disable
quotas as such). My initial implementation had it return ENOENT, but
quotactl(2) indicates EINVAL is more appropriate.
MFC after: 2 weeks
Approved by: mav
Reviewed by: markj
Reported by: Emrion <kmachine@free.fr>
Sponsored by: iXsystems Inc
PR: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=234413
declare v3 objset size/layout to fix userboot and possibly other loader issues
- fix for userboot assertion failure in zfs_dev_close in free due to out of bounds write
- fix for zfs_alloc / zfs_free mismatch assertion failure when booting GPT on BIOS
Note that this commit brings only formatting changes that were done
during the final review of the illumos change, because FreeBSD got the
main changes before illumos.
illumos/illumos-gate@04e563565204e5635652https://www.illumos.org/issues/5882
This is an import of the temporary pool names functionality from ZoL:
e2282ef57e26b42f3f9d2f3ec9006100d2a8c92f83e9986f6e023bbe6f01
It is intended to assist the creation and management of virtual machines
that have their rootfs on ZFS on hosts that also have their rootfs on
ZFS. These situations cause SPA namespace collisions when the standard
name rpool is used in both cases. The solution is either to give each
guest pool a name unique to the host, which is not always desireable, or
boot a VM environment containing an ISO image to install it, which is
cumbersome.
MFC after: 1 week
Sponsored by: Panzura
dtrace has its own routines which were not updated after SMAP support got
implemented. Use ifunc just like for other routines.
This in particular fixes ustack().
Reviewed by: markj
Sponsored by: The FreeBSD Foundation
Differential Revision: https://reviews.freebsd.org/D18542
loader has been supporting large_dnode for some time, no need to block the
feature for boot dataset.
Reviewed by: avg
MFC after: 1 month
Differential Revision: https://reviews.freebsd.org/D18391
Directory entries must be padded to maintain alignment; in many
filesystems the padding was not initialized, resulting in stack
memory being copied out to userspace. With the ino64 work there
are also some explicit pad fields in struct dirent. Add a subroutine
to clear these bytes and use it in the in-tree filesystems. The
NFS client is omitted for now as it was fixed separately in r340787.
Reported by: Thomas Barabosch, Fraunhofer FKIE
Reviewed by: kib
MFC after: 3 days
Sponsored by: The FreeBSD Foundation
It was reported, and I easily reproduced it, that this change triggers panic
when receiving replication stream with enabled embedded blocks, when short
file compressing into one embedded block changes its block size. I am not
sure that the problem is in this particuler patch, not just triggered by it,
but since investigation and fix will take some time, I've decided to revert
this for now.
PR: 198457, 233277
The FBT fuction boundary prober was setting one return probe marker value,
but the dtrace handler was expecting another. This causes a hang when
tracing return probes.
The d_off field has been added to the dirent structure recently.
Currently filesystems don't support this feature. Support has been
added and tested for zfs, ufs, ext2fs, fdescfs, msdosfs and unionfs.
A stub implementation is available for cd9660, nandfs, udf and
pseudofs but hasn't been tested.
Motivation for this feature: our usecase is for a userspace nfs server
(nfs-ganesha) with zfs. At the moment we cache direntry offsets by
calling lseek once per entry, with this patch we can get the offset
directly from getdirentries(2) calls which provides a significant
speedup.
Submitted by: Jack Halford <jack@gandi.net>
Reviewed by: mckusick, pfg, rmacklem (previous versions)
Sponsored by: Gandi.net
MFC after: 1 week
Differential revision: https://reviews.freebsd.org/D17917
This covers scenario when ARC may not shrink as fast as it could:
1. arc_size < arc_c and arc_adjust() does not evict anything, returning
zero to arc_reclaim_thread();
2. arc_available_memory() reports memory pressure, which can not be
satisfied by arc_kmem_reap_now();
3. arc_shrink() reduces arc_c and calls arc_adjust(), return of which is
ignored;
4. even if the last arc_adjust() could not satisfy arc_size < arc_c,
arc_reclaim_thread() will still go to sleep, since the first one
returned zero.
Reviewed by: allanjude, markj, sef
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
Differential Revision: https://reviews.freebsd.org/D17927
Replication code in receive_object() falsely assumes that if received
object block size is different from local, then it must be a new object
and calls dmu_object_reclaim() to wipe it out. In most cases it is not a
problem, since all dnode, bonus buffer and data block(s) are immediately
rewritten any way, but the problem is that spill block (if used) is not.
This means loss of ACLs, extended attributes, etc.
This issue can be triggered in very simple way:
1. create 4KB file with 10+ ACL entries;
2. take snapshot and send it to different dataset;
3. append another 4KB to the file;
4. take another snapshot and send incrementally;
5. witness ACL loss on receive side.
PR: 198457
Discussed with: mahrens
MFC after: 2 weeks
Sponsored by: iXsystems, Inc.
This will enable callers to take const paths as part of syscall
decleration improvements.
Where doing so is easy and non-distruptive carry the const through
implementations. In UFS the value is passed to an interface that must
take non-const values. In ZFS, const poisoning would touch code shared
with upstream and it's not worth adding diffs.
Bump __FreeBSD_version for external API consumers.
Reviewed by: kib (prior version)
Obtained from: CheriBSD
Sponsored by: DARPA, AFRL
Differential Revision: https://reviews.freebsd.org/D17805
Flags prevent open(2) and *at(2) vfs syscalls name lookup from
escaping the starting directory. Supposedly the interface is similar
to the same proposed Linux flags.
Reviewed by: jilles (code, previous version of manpages), 0mp (manpages)
Discussed with: allanjude, emaste, jonathan
Sponsored by: The FreeBSD Foundation
Differential revision: https://reviews.freebsd.org/D17547
There seems to be a race in CI, such that dtrace_asm.S might be assembled
before the genassym is completed. This causes a build failure when PSL_EE
doesn't exist, and is read as 0. Get around this by explicitly specifying
the bits in the mask instead.
Device removal code uses zio_vdev_child_io() with ZIO_TYPE_NULL parent,
that never happened before. It confused FreeBSD-specific TRIM code,
which does not use VDEV_IO_DONE for logical ZIO_TYPE_FREE ZIOs. As
result of that stage being skipped device removal ZIOs leaked references
and memory that supposed to be freed by VDEV_IO_DONE, making it stuck.
It is a quick patch rather then a nice fix, but hopefully we'll be able
to drop it all together when alternative TRIM implementation finally get
landed.
PR: 228750, 229007
Discussed with: allanjude, avg, smh
Approved by: re (delphij)
MFC after: 5 days
Sponsored by: iXsystems, Inc.
- Remove the arm64-specific cpu_*cache* and cpu_tlb_flush* functions.
Instead, add RISC-V specific inline functions in cpufunc.h for the
fence.i and sfence.vma instructions.
- Catch up to changes in the arm64 pmap and remove all the cpu_dcache_*
calls, pmap_is_current, pmap_l3_valid_cacheable, and PTE_NEXT bits from
pmap.
- Remove references to the unimplemented riscv_setttb().
- Remove unused cpu_nullop.
- Add a link to the SBI doc to sbi.h.
- Add support for a 4th argument in SBI calls. It's not documented but
it seems implied for the asid argument to SBI_REMOVE_SFENCE_VMA_ASID.
- Pass the arguments from sbi_remote_sfence*() to the SEE. BBL ignores
them so this is just cosmetic.
- Flush icaches on other CPUs when they resume from kdb in case the
debugger wrote any breakpoints while the CPUs were paused in the IPI_STOP
handler.
- Add SMP vs UP versions of pmap_invalidate_* similar to amd64. The
UP versions just use simple fences. The SMP versions use the
sbi_remove_sfence*() functions to perform TLB shootdowns. Since we
don't have a valid pm_active field in the riscv pmap, just IPI all
CPUs for all invalidations for now.
- Remove an extraneous TLB flush from the end of pmap_bootstrap().
- Don't do a TLB flush when writing new mappings in pmap_enter(), only if
modifying an existing mapping. Note that for COW faults a TLB flush is
only performed after explicitly clearing the old mapping as is done in
other pmaps.
- Sync the i-cache on all harts before updating the PTE for executable
mappings in pmap_enter and pmap_enter_quick. Previously the i-cache was
only sync'd after updating the PTE in pmap_enter.
- Use sbi_remote_fence() instead of smp_rendezvous in pmap_sync_icache().
Reviewed by: markj
Approved by: re (gjb, kib)
Sponsored by: DARPA
Differential Revision: https://reviews.freebsd.org/D17414
r338927("zfs: depessimize zfs_root with rmlocks") failed to error check
the mount before caching root vnode.
Results in crashes in rrw_enter_read_impl tracing back to zfs_mount.
Reported by: Mike Tancsa
Tested by: allanjude
Approved by: re (kib)
The device evacuation code adds a dependency that
vdev_compact_children() be able to properly empty the vdev_child
array by setting it to NULL and zeroing vdev_children. Under Linux,
kmem_alloc() and related functions return a sentinel pointer rather
than NULL for zero-sized allocations.
This is a part of ZoL port of device removal patch:
commit a1d477c24c7badc89c60955995fd84d311938486
Author: Matthew Ahrens <mahrens@delphix.com>
Ported-by: Tim Chase <tim@chase2k.com>
Approved by: re (kib)
MFC after: 1 week
Upstream code expects only ZIO_TYPE_READ and some ZIO_TYPE_WRITE
requests to removed (indirect) vdevs, while on FreeBSD there is also
ZIO_TYPE_FREE (TRIM). ZIO_TYPE_FREE requests do not have the data
buffers, so don't need the pointer adjustment.
PR: 228750, 229007
Reviewed by: allanjude, sef
Approved by: re (kib)
MFC after: 1 week
Differential Revision: https://reviews.freebsd.org/D17523