4142 Commits

Author SHA1 Message Date
Alexander Motin
190290a9ac
Fix two abd_gang_add_gang() issues.
 - There is no reason to assert that the added gang is not empty.  It
   may be weird to add an empty gang, but it is legal.
 - When moving the chain list from the added gang, clear its size, or it
   will trigger an assertion in abd_verify() when that gang is freed.
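
The two points above can be illustrated outside of the ABD code. What follows is a minimal sketch, not the abd_gang_add_gang() implementation: chunk_t, gang_t, gang_append(), gang_move_all() and gang_is_consistent() are hypothetical names, and the consistency check only mimics the kind of size invariant a verify routine might enforce.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for a gang and its children. */
typedef struct chunk {
	size_t		c_size;
	struct chunk	*c_next;
} chunk_t;

typedef struct gang {
	chunk_t	*g_head;
	chunk_t	*g_tail;
	size_t	g_size;		/* must equal the sum of all c_size values */
} gang_t;

/* Return nonzero when the recorded size matches the chained chunks. */
static int
gang_is_consistent(const gang_t *g)
{
	size_t sum = 0;

	for (const chunk_t *c = g->g_head; c != NULL; c = c->c_next)
		sum += c->c_size;
	return (sum == g->g_size);
}

/* Append one chunk to the tail of a gang and account for its size. */
static void
gang_append(gang_t *g, chunk_t *c)
{
	c->c_next = NULL;
	if (g->g_tail != NULL)
		g->g_tail->c_next = c;
	else
		g->g_head = c;
	g->g_tail = c;
	g->g_size += c->c_size;
}

/* Move every chunk from src to dst. An empty src is legal and is a no-op. */
static void
gang_move_all(gang_t *dst, gang_t *src)
{
	assert(dst != src);

	if (src->g_head == NULL)
		return;
	if (dst->g_tail != NULL)
		dst->g_tail->c_next = src->g_head;
	else
		dst->g_head = src->g_head;
	dst->g_tail = src->g_tail;
	dst->g_size += src->g_size;

	/* The donor no longer owns any chunks, so its size must drop to 0. */
	src->g_head = src->g_tail = NULL;
	src->g_size = 0;
}
```

In this sketch, leaving src->g_size untouched after the move would make gang_is_consistent(src) fail, which is the same class of problem as the second item above.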

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #14816
2023-05-05 09:17:55 -07:00
Pawel Jakub Dawidek
599df82049
Plug memory leak in zfsdev_state.
On kernel module unload, free all zfsdev state structures, except for
zfsdev_state_listhead, which is statically allocated.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14824
2023-05-05 08:51:41 -07:00
Ameer Hamza
82ac409acc
zpool import -m also removing spare and cache when log device is missing
spa_import() relies on a pool config fetched by spa_tryimport() for
spare/cache devices. Import flags are not passed to spa_tryimport(),
which makes it return early due to a missing log device and skip
retrieving the cache and spare devices. Passing
ZFS_IMPORT_MISSING_LOG to spa_tryimport() makes it fetch the correct
configuration regardless of the missing log device.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #14794
2023-05-03 15:10:32 -07:00
George Amanakis
9de5300c7f
Optimize check_filesystem() and process_error_log()
Integrate check_clones() into check_filesystem() and implement a list
instead of iterating recursively over the clones, thus eliminating the
risk of a stack overflow.

Also use kmem_zalloc() to allocate large structures in
process_error_log() reducing its stack size from ~700 to ~128 bytes.
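
The general technique of replacing recursion with an explicit list can be sketched as follows. This is not the check_filesystem() code: ds_node_t, work_item_t, push() and visit_all() are hypothetical names, and malloc()/free() stand in for the kernel allocator.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical dataset node: each filesystem may have several clones. */
typedef struct ds_node {
	int		dn_id;
	struct ds_node	**dn_clones;
	size_t		dn_nclones;
} ds_node_t;

/* Pending-work entry, heap allocated so stack use stays constant. */
typedef struct work_item {
	const ds_node_t		*wi_ds;
	struct work_item	*wi_next;
} work_item_t;

/* Push one dataset onto the pending list; returns -1 if allocation fails. */
static int
push(work_item_t **head, const ds_node_t *ds)
{
	work_item_t *wi = malloc(sizeof (*wi));

	if (wi == NULL)
		return (-1);
	wi->wi_ds = ds;
	wi->wi_next = *head;
	*head = wi;
	return (0);
}

/*
 * Visit root and every nested clone without recursion. Returns the number
 * of datasets visited, or -1 on allocation failure.
 */
static int
visit_all(const ds_node_t *root)
{
	work_item_t *head = NULL;
	int visited = 0;
	int err = 0;

	assert(root != NULL);
	if (push(&head, root) != 0)
		return (-1);

	while (head != NULL) {
		work_item_t *wi = head;
		const ds_node_t *ds = wi->wi_ds;

		head = wi->wi_next;
		free(wi);

		if (err != 0)
			continue;	/* drain the list after a failure */

		visited++;
		for (size_t i = 0; i < ds->dn_nclones; i++) {
			if (push(&head, ds->dn_clones[i]) != 0) {
				err = -1;
				break;
			}
		}
	}
	return (err != 0 ? err : visited);
}
```

Stack depth here is independent of how deeply clones are nested; the cost moves to one small heap allocation per pending dataset.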

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14744
2023-05-03 09:00:14 -07:00
Pawel Jakub Dawidek
d96e29576c
Use correct block pointer in block cloning case.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14806
2023-05-02 09:24:26 -07:00
Mateusz Guzik
e2a92d726e
blake3: fix up bogus checksums in face of cpu migration
This is a temporary measure until a better fix is sorted out.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Closes #14785
Closes #14808
2023-05-01 17:21:27 -07:00
Serapheim Dimitropoulos
0c93d86f01
Correct ABD size for split block ZIOs
Currently when layering the ABD buffer of each split block on top of
an indirect vdev's ZIO ABD we don't specify the split block's ABD.
This results in those ABDs being incorrectly sized by inheriting
the size of their parent ABD which is larger than what each split
block needs.

The above behavior isn't causing any bugs currently but can lead
to unexpected ABD sizes for people analyzing and/or working on
the ZIO codepath. This patch fixes this behavior by properly setting
the ABD size for split block ZIOs.

Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Serapheim Dimitropoulos <serapheim@delphix.com>
Closes #14804
2023-05-01 17:18:42 -07:00
Justin Hibbits
5a83f761c7
powerpc64: Support ELFv2 asm on Big Endian
FreeBSD/powerpc64 is all ELFv2 since FreeBSD 13, even big endian.  The
existing sha256 and sha512 asm code assumes that BE is all ELFv1, and LE
is ELFv2.  Minor changes to add ELFv2 on the BE side get this working
correctly on FreeBSD with the latest OpenZFS import.

Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Justin Hibbits <chmeeedalf@gmail.com>
Closes #14779
2023-04-27 12:49:21 -07:00
Alexander Motin
2fd1c30423
Mark TX_COMMIT transaction with TXG_NOTHROTTLE.
TX_COMMIT has no on-disk representation and does not produce any more
dirty data.  It should not wait for anything, and even just skipping
the checks when not waiting gives an improvement noticeable in the profiler.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #14798
2023-04-27 12:32:58 -07:00
Tino Reichardt
ee728008a4
Fix BLAKE3 aarch64 assembly for FreeBSD and macOS
The x18 register isn't usable within FreeBSD kernel space, so we
have to fix the BLAKE3 aarch64 assembly to avoid using it.

The source files are here: https://github.com/mcmilk/BLAKE3-tests

Reviewed-by: Kyle Evans <kevans@FreeBSD.org>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes #14728
2023-04-26 12:40:26 -07:00
Brian Behlendorf
b5411618f7
Fix checkstyle warning
Resolve a missed checkstyle warning.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Mateusz Guzik <mjguzik@gmail.com>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14799
2023-04-26 11:49:16 -07:00
Alexander Motin
bba7cbf0a4
Fix positive ABD size assertion in abd_verify().
Gang ABDs without children are legal, and they do have zero size.
For other ABD types zero size doesn't make much sense and likely
does not work correctly now.

Reviewed-by: Igor Kozhukhov <igor@dilos.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by:	Alexander Motin <mav@FreeBSD.org>
Sponsored by:	iXsystems, Inc.
Closes #14795
2023-04-26 09:20:43 -07:00
Mateusz Guzik
e37a89d5d0
FreeBSD: fix up EINVAL from getdirentries on .zfs
Without the change:
/.zfs
/.zfs/snapshot
find: /.zfs: Invalid argument

Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #14774
2023-04-26 09:16:37 -07:00
Mateusz Guzik
88b8777159
FreeBSD: add missing vn state transition for .zfs
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #14774
2023-04-26 09:16:09 -07:00
Brian Behlendorf
0e8a42bbee
Revert "Fix data race between zil_commit() and zil_suspend()"
This reverts commit 4c856fb333 to
resolve a newly introduced deadlock which in practice is more
disruptive than the issue this commit intended to address.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #14775
Closes #14790
2023-04-25 16:40:55 -07:00
Han Gao
6d59d5df98
Add loongarch64 support
Add loongarch64 definitions & lua module setjmp asm

LoongArch is a new RISC ISA, which is a bit like MIPS or RISC-V.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Han Gao <gaohan@uniontech.com>
Signed-off-by: WANG Xuerui <xen0n@gentoo.org>
Closes #13422
2023-04-25 16:05:45 -07:00
Mateusz Guzik
81a2b2e6a6
FreeBSD: add missing vop_fplookup assignments
It became illegal to not have them as of
5f6df17775 ("vfs: validate that vop
vectors provide all or none fplookup vops") upstream.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Closes #14788
2023-04-24 16:15:42 -07:00
Mateusz Guzik
ff0e135e25
FreeBSD: try to fallback early if can't do optimized copy
Not complete, but already shaves on some locking.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Closes #14723
2023-04-24 16:13:52 -07:00
Mateusz Guzik
a7982d5d30
FreeBSD: fix up EXDEV handling for clone_range
API contract requires VOPs to handle EXDEV internally, worst case by
falling back to the generic copy routine. This broke with the recent
changes.

While here whack custom loop to lock 2 vnodes with vn_lock_pair, which
provides the same functionality internally. write start/finish around
it plays no role so got eliminated.

One difference is that vn_lock_pair always takes an exclusive lock on
both vnodes. I did not patch around it because current code takes an
exclusive lock on the target vnode. zfs supports shared-locking for
writes, so this serializes different calls to the routine as is, despite
range locking inside. At the same time you may notice the source vnode
can get some traffic if only shared-locked, thus once more this goes
the safer route of exclusive-locking. Note this should be patched to
use shared-locking for both once the feature is considered stable.

Technically the switch to vn_lock_pair should be a separate change, but
it would only introduce churn immediately whacked by the rest of the
patch.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mateusz Guzik <mjguzik@gmail.com>
Sponsored by:	Rubicon Communications, LLC ("Netgate")
Closes #14723
2023-04-24 16:13:09 -07:00
Dimitry Andric
62cc9d4f6b
FreeBSD: make zfs_vfs_held() definition consistent with declaration
Noticed while attempting to change FreeBSD's boolean_t into an actual
bool: in include/sys/zfs_ioctl_impl.h, zfs_vfs_held() is declared to
return a boolean_t, but in module/os/freebsd/zfs/zfs_ioctl_os.c it is
defined to return an int. Make the definition match the declaration.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Dimitry Andric <dimitry@andric.com>
Closes #14776
2023-04-21 10:22:52 -07:00
Allan Jude
8eae2d214c
Add support for zpool user properties
Usage:

    zpool set org.freebsd:comment="this is my pool" poolname

Tests are based on zfs_set's user property tests.

Also stop truncating property values at MAXNAMELEN, use ZFS_MAXPROPLEN.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Mateusz Piotrowski <mateusz.piotrowski@klarasystems.com>
Sponsored-by: Beckhoff Automation GmbH & Co. KG.
Sponsored-by: Klara Inc.
Closes #11680
2023-04-21 10:20:36 -07:00
Richard Yao
ab71b24d20
Linux: zfs_zaccess_trivial() should always call generic_permission()
Building with Clang on Linux generates a warning that err could be
uninitialized if mnt_ns is a NULL pointer. However, mnt_ns should never
be NULL, so there is no need to put this behind an if statement.  Taking
it outside of the if statement means that err is always initialized in a
way that the compiler can verify, rather than in a way that it could
not.

Sponsored-By: Wasabi Technology, Inc.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Youzhong Yang <yyang@mathworks.com>
Signed-off-by: Richard Yao <richard.yao@klarasystems.com>
Closes #14738
2023-04-20 10:29:44 -07:00
rob-wing
3e4ed4213d
Create zap for root vdev
And add it to the AVZ, this is not backwards compatible with older pools
due to an assertion in spa_sync() that verifies the number of ZAPs of
all vdevs matches the number of ZAPs in the AVZ.

Granted, the assertion only applies to #DEBUG builds - still, a feature
flag is introduced to avoid the assertion, com.klarasystems:vdev_zaps_v2

Notably, this allows getting/setting properties on the root vdev:

    % zpool set user:prop=value <pool> root-0

Before this commit, it was already possible to get/set properties on
top-level vdevs with the syntax <type>-<vdev_id> (e.g. mirror-0):

    % zpool set user:prop=value <pool> mirror-0

This syntax also applies to the root vdev, as it is of type 'root'
with a vdev_id of 0, root-0. The keyword 'root' works as an alias for
'root-0'.

The following tests have been added:

    - zpool get all properties from root vdev
    - zpool set a property on root vdev
    - verify root vdev ZAP is created

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Sponsored-by: Seagate Technology
Submitted-by: Klara, Inc.
Closes #14405
2023-04-20 10:07:56 -07:00
Herb Wartens
71d191ef25
Allow MMP to bypass waiting for other threads
At our site we have seen cases when multi-modifier protection is enabled
(multihost=on) on our pool and the pool gets suspended due to a single
disk that is failing and responding very slowly. Our pools have 90 disks
in them and we expect disks to fail. The current version of MMP requires
that we wait for other writers before moving on. When a disk is
responding very slowly, we observed that waiting here was bad enough to
cause the pool to suspend. This change allows the MMP thread to bypass
waiting for other threads and reduces the chances the pool gets
suspended.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Herb Wartens <hawartens@gmail.com>
Closes #14659
2023-04-19 13:22:59 -07:00
Ameer Hamza
719534ca8e
Fix "Detach spare vdev in case if resilvering does not happen"
Spare vdev should detach from the pool when a disk is reinserted.
However, spare detachment depends on the completion of resilvering,
and if a resilver is not scheduled, the spare vdev stays attached to
the pool until the next resilvering. When a zfs pool contains
several disks (25+ mirror), resilvering does not always happen when
a disk is reinserted. In this patch, spare vdev is manually detached
from the pool when resilvering does not occur and it has been tested
on both Linux and FreeBSD.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #14722
2023-04-19 09:04:32 -07:00
Tony Hutter
accfdeb948
Revert "ZFS_IOC_COUNT_FILLED does unnecessary txg_wait_synced()"
This reverts commit 4b3133e671.

Users identified this commit as a possible source of data
corruption:
https://github.com/openzfs/zfs/issues/14753

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tony Hutter <hutter2@llnl.gov>
Issue #14753 
Closes #14761
2023-04-18 08:41:52 -07:00
Pawel Jakub Dawidek
3b5af20139
Fix VERIFY(!zil_replaying(zilog, tx)) panic
The zfs_log_clone_range() function is never called from the
zfs_clone_range_replay() function, so I assumed it is safe to assert
that zil_replaying() is never TRUE here. It turns out zil_replaying()
also returns TRUE when the sync property is set to disabled.

Fix the problem by just returning if zil_replaying() returns TRUE.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reported by: Florian Smeets
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14758
2023-04-17 16:42:09 -07:00
Pawel Jakub Dawidek
c71fe71640
Fix data corruption when cloning embedded blocks
Don't overwrite blk_phys_birth, as for embedded blocks it is part of
the payload.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Issue #13392 
Closes #14739
2023-04-12 16:15:05 -07:00
George Amanakis
574e09d8c6
Fix in check_filesystem()
Fix the code in case of missing snapshots. Previously the check was in
a conditional that would be executed if the filesystem had snapshots.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14735
2023-04-12 08:53:53 -07:00
Alan Somers
678a3b8f99
Trim needless zeroes from checksum events
The ereport.fs.zfs.checksum event contains histograms of the bits that
were wrongly set or cleared according to their bit position in a 64-bit
word.  So the maximum value that any histogram bucket could have would
be 64.  But ZFS currently uses a uint32_t to hold each bucket.  As a
result, the event report is full of needless zeroes.

Change the bucket size to uint8_t, stripping 768 needless zeros from
each event.

Original event format:
```
 class=ereport.fs.zfs.checksum ena=639460469834258433 pool=testpool.1933 pool_guid=4979719877084416563 pool_state=0 pool_context=0 pool_failmode=wait vdev_guid=4136721804819128578 vdev_type=file vdev_path=/tmp/kyua.1TxP3A/2/work/file1.1933 vdev_ashift=9 vdev_complete_ts=609837019678 vdev_delta_ts=33450 vdev_read_errors=0 vdev_write_errors=0 vdev_cksum_errors=20 vdev_delays=0 parent_guid=2751977006639883417 parent_type=raidz vdev_spare_guids= zio_err=0 zio_flags=1048752 zio_stage=4194304 zio_pipeline=65011712 zio_delay=0 zio_timestamp=0 zio_delta=0 zio_priority=4 zio_offset=702976 zio_size=1024 zio_objset=24 zio_object=0 zio_level=3 zio_blkid=0 bad_ranges=0000000000000400 bad_ranges_min_gap=8 bad_range_sets=0000079e bad_range_clears=00000854 bad_set_histogram=000000210000001a000000150000001d000000240000001b000000220000001b000000210000002100000018000000260000002300000025000000210000001e000000250000001b0000001d0000001e0000001600000025000000180000001b000000240000001b000000240000001b0000001c000000210000001b0000001e000000210000001a0000001e000000220000001d0000001b000000200000001f0000001a000000250000001f0000001d0000001b0000001d000000240000001d0000001b0000001b0000001f00000024000000190000001a0000001f0000001e000000240000001e0000002400000021000000200000001d0000001d00000021 bad_cleared_histogram=000000220000002700000021000000210000001b0000001a000000250000001f0000001c0000001e0000002400000022000000220000002400000022000000240000002200000021000000220000001b0000002100000021000000190000001b000000240000002400000020000000290000002a00000028000000250000002400000020000000270000002500000016000000270000001c000000210000001f000000240000001c0000002100000022000000240000002100000023000000210000002700000022000000240000001b00000022000000210000001c00000023000000150000002600000020000000270000001e0000001d0000002400000026 time=00000016806457270000000323406839 eid=458
```

New format:
```
 class=ereport.fs.zfs.checksum ena=96599319807790081 pool=testpool.1933 pool_guid=1236902063710799041 pool_state=0 pool_context=0 pool_failmode=wait vdev_guid=2774253874431514999 vdev_type=file vdev_path=/tmp/kyua.6Temlq/2/work/file1.1933 vdev_ashift=9 vdev_complete_ts=92124283803 vdev_delta_ts=46670 vdev_read_errors=0 vdev_write_errors=0 vdev_cksum_errors=20 vdev_delays=0 parent_guid=8090931855087882905 parent_type=raidz vdev_spare_guids= zio_err=0 zio_flags=1048752 zio_stage=4194304 zio_pipeline=65011712 zio_delay=0 zio_timestamp=0 zio_delta=0 zio_priority=4 zio_offset=1028608 zio_size=512 zio_objset=0 zio_object=0 zio_level=0 zio_blkid=4 bad_ranges=0000000000000200 bad_ranges_min_gap=8 bad_range_sets=0000061f bad_range_clears=000001f4 bad_set_histogram=1719161c1c1c101618171a151a1a19161e1c171d1816161c191f1a18192117191c131d171b1613151a171419161a1b1319101b14171b18151e191a1b141a1c17 bad_cleared_histogram=06090a0808070a0b020609060506090a01090a050a0a0509070609080d050d0607080d060507080c04070807070a0608020c080c080908040808090a05090a07 time=00000016806477050000000604157480 eid=62
```
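
The 768-character figure can be checked with a little arithmetic. The sketch below assumes 64 buckets per histogram (one per bit position in a 64-bit word) and two histograms per event (set and cleared), as in the samples above; HIST_BUCKETS, hist_hex_len() and hex_digits_saved() are hypothetical names for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define	HIST_BUCKETS	64	/* one bucket per bit position in a 64-bit word */

/* Hex digits needed to print one histogram whose buckets are width bytes. */
static size_t
hist_hex_len(size_t width)
{
	assert(width > 0);
	return (HIST_BUCKETS * width * 2);
}

/* Hex digits saved per event across both histograms (set and cleared). */
static size_t
hex_digits_saved(void)
{
	return (2 * (hist_hex_len(sizeof (uint32_t)) -
	    hist_hex_len(sizeof (uint8_t))));
}
```

Each histogram shrinks from 512 to 128 hex digits, so the two together drop 768 digits per event.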

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Alan Somers <asomers@FreeBSD.org>
Sponsored-by: Axcient
Closes #14716
2023-04-10 14:24:27 -07:00
youzhongyang
d4dc53dad2
Linux 6.3 compat: idmapped mount API changes
Linux kernel 6.3 changed a bunch of APIs to use the dedicated idmap 
type for mounts (struct mnt_idmap), we need to detect these changes 
and make zfs work with the new APIs.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes #14682
2023-04-10 14:15:36 -07:00
Kyle Evans
dee77f45d0
module: resync part of Makefile.bsd
sha256-armv8.S and sha512-armv8.S need the same treatment as the sse
bits; removal of -mgeneral-regs-only from flags.

This fixes errors about requiring NEON, which is a difference in clang
vs. gcc treatment of -mgeneral-regs-only being specified on asm files.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Signed-off-by: Kyle Evans <kevans@FreeBSD.org>
Closes #14715
2023-04-10 12:39:43 -07:00
Rob N
ff73574cd8
vdev: expose zfs_vdev_max_ms_shift as a module parameter
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Seagate Technology LLC
Closes #14719
2023-04-06 10:52:50 -07:00
George Amanakis
a8a127e2c9
Fix typo in check_clones()
Run kmem_free() after zap_cursor_fini().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Adam Moss <c@yotes.com>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14702
2023-04-06 10:46:18 -07:00
Martin Matuška
a3f82aec93
Miscellaneous FreeBSD compilation bugfixes
Add missing machine/md_var.h to spl/sys/simd_aarch64.h and
spl/sys/simd_arm.h

In spl/sys/simd_x86.h, PCB_FPUNOSAVE exists only on amd64, use PCB_NPXNOSAVE
on i386

FreeBSD's sys/elf_common.h redefines AT_UID and AT_GID, so we need
a hack in vnode.h similar to Linux. sys/simd.h needs to be included early.

In zfs_freebsd_copy_file_range() we pass a (size_t *)lenp to
zfs_clone_range() that expects a (uint64_t *)

Allow compiling armv6 world by limiting ARM macros in sha256_impl.c and
sha512_impl.c to __ARM_ARCH > 6

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Reviewed-by: Signed-off-by: WHR <msl0000023508@gmail.com>
Signed-off-by: Martin Matuska <mm@FreeBSD.org>
Closes #14674
2023-04-06 10:35:02 -07:00
Rob N
ece7ab7e7d
vdev: expose zfs_vdev_def_queue_depth as a module parameter
It was previously available only to FreeBSD.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Rob Norris <rob.norris@klarasystems.com>
Sponsored-by: Klara, Inc.
Sponsored-by: Seagate Technology LLC
Closes #14718
2023-04-06 10:31:19 -07:00
Alexander Motin
1038f87c4e
Fix some signedness issues in arc_evict()
It may happen that "wanted total ARC size" (wt) is negative, which was
expected.  But the product of it and unsigned fractions is an unsigned
value, which was incorrectly shifted right with a sign loss.
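
The underlying C behavior can be reproduced in isolation. The sketch below is not the arc_evict() code: scale_buggy() and scale_fixed() are hypothetical helpers, and the 32-bit fixed-point fraction (frac / 2^32) is an assumption chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Scale a possibly negative byte count by an unsigned fixed-point
 * fraction (frac / 2^32). Hypothetical helpers, not OpenZFS functions.
 */

/* Buggy: int64_t * uint64_t is evaluated as uint64_t, so the shift is logical. */
static int64_t
scale_buggy(int64_t wt, uint64_t frac)
{
	return ((int64_t)((wt * frac) >> 32));
}

/* Fixed: keep the product signed, then divide instead of shifting. */
static int64_t
scale_fixed(int64_t wt, uint64_t frac)
{
	assert(frac <= ((uint64_t)1 << 32));
	return ((wt * (int64_t)frac) / ((int64_t)1 << 32));
}
```

With wt = -1024 and a fraction of one half (1 << 31), scale_fixed() yields -512, while scale_buggy() yields a large positive value because the sign bits are shifted out of the unsigned product. For non-negative wt both helpers agree.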

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Paul Dagnelie <pcd@delphix.com>
Signed-off-by:  Alexander Motin <mav@FreeBSD.org>
Sponsored by:   iXsystems, Inc.
Closes #14692
2023-04-05 10:42:22 -07:00
youzhongyang
8eb2f26057
Linux 6.3 compat: writepage_t first arg struct folio*
The type def of writepage_t in kernel 6.3 is changed to take
struct folio* as the first argument. We need to detect this
change and pass correct function to write_cache_pages().

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Youzhong Yang <yyang@mathworks.com>
Closes #14699
2023-04-05 10:01:38 -07:00
Brian Behlendorf
1142362ff6
Use vmem_zalloc to silence allocation warning
The kmem allocation in zfs_prune_aliases() will trigger a large
allocation warning on systems with 64K pages.  Resolve this by
switching to vmem_zalloc() which internally uses kvmalloc() so the
right allocator will be used based on the allocation size.

Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #8491
Closes #14694
2023-03-31 09:43:54 -07:00
George Amanakis
431083f75b
Fixes in persistent error log
Address the following bugs in persistent error log:

1) Check nested clones, eg "fs->snap->clone->snap2->clone2".

2) When deleting files containing error blocks in those clones (from
   "clone" in the example above), do not break the check chain.

3) When deleting files in the originating fs before syncing the errlog
   to disk, do not break the check chain. This happens because at the
   time of introducing the error block in the error list, we do not have
   its birth txg and the head filesystem. If the original file is
   deleted before the error list is synced to the error log (which is
   when we actually lookup the birth txg and the head filesystem), then
   we do not have access to this info anymore and break the check chain.

The most prominent change is related to achieving (3). We expand the
spa_error_entry_t structure to accommodate the newly introduced
zbookmark_err_phys_t structure (containing the birth txg of the error
block). Due to compatibility reasons we cannot remove the
zbookmark_phys_t structure and we also need to place the new structure
after se_avl, so it is not accounted for in avl_find(). Then we modify
spa_log_error() to also provide the birth txg of the error block. With
these changes in place we simplify the previously introduced function
get_head_and_birth_txg() (now named get_head_ds()).

We chose not to follow the same approach for the head filesystem (thus
completely removing get_head_ds()) to avoid introducing new lock
contentions.

The stack sizes of nested functions (as measured by checkstack.pl in the
linux kernel) are:
check_filesystem [zfs]: 272 (was 912)
check_clones [zfs]: 64

We also introduced two new tests covering the above changes.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes #14633
2023-03-28 16:51:58 -07:00
Kevin Jin
65d10bd87c
Fix short-lived txg caused by autotrim
Current autotrim causes short-lived txg through:

1. calling txg_wait_synced() in metaslab_enable()
2. calling txg_wait_open() with should_quiesce = true

This patch addresses all the issues mentioned above.

A new cv, vdev_autotrim_kick_cv is added to kick autotrim activity.
It will be signaled once a txg is synced so that it does not change 
the original autotrim pace. Also because it is a cv, the wait is 
interruptible which speeds up the vdev_autotrim_stop_wait() call.

Finally, combined with a large zfs_txg_timeout, txg_wait_open() also
causes a delay when exporting a pool.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: jxdking <lostking2008@hotmail.com>
Issue #8993
Closes #12194
2023-03-28 08:43:41 -07:00
Brian Behlendorf
64bfa6bae3
Additional limits on hole reporting
Holding the zp->z_rangelock as a RL_READER over the range
0-UINT64_MAX is sufficient to prevent the dnode from being
re-dirtied by concurrent writers.  To avoid potentially
looping multiple times for external callers which do not
take the rangelock, holes are not reported after the first
sync.  While not optimal, this is always functionally correct.

This change adds the missing rangelock calls on FreeBSD to
zvol_cdev_ioctl().

Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #14512
Closes #14641
2023-03-28 08:19:03 -07:00
George Wilson
a604d3243b
Revert "Do not hold spa_config in ZIL while blocked on IO"
This reverts commit 7d638df09b.

Reviewed-by: Prakash Surya <prakash.surya@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Wilson <gwilson@delphix.com>
Closes #14678
2023-03-28 08:13:32 -07:00
Ameer Hamza
a05263b7aa
Update vdev state for spare vdev
zfsd fetches new pool configuration through ZFS_IOC_POOL_STATS but
it does not get updated nvlist configuration for spare vdev since
the configuration is read by spa_spares->sav_config. This commit
updates the vdev state for the spare vdev that is consumed by zfsd on
spare disk hotplug.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes #14653
2023-03-24 10:30:38 -07:00
Rich Ercolani
0ad5f43442
Drop lying to the compiler in the fletcher4 code
This is probably the uncontroversial part of #13631, which fixes
a real problem people are having.

There's still things to improve in our code after this is merged,
but it should stop the breakage that people have reported, where
we lie about a type always being aligned and then pass in stack
objects with no alignment requirement and hope for the best.

Of course, our SIMD code was written with unaligned accesses, so it
doesn't care if we drop this...but some auto-vectorized code that
gcc emits sure does, since we told it it can assume they're aligned.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Rich Ercolani <rincebrain@gmail.com>
Closes #14649
2023-03-24 10:29:19 -07:00
George Wilson
460d887c43
panic loop when removing slog device
There is a window in the slog removal code where a panic loop could
ensue if the system crashes during that operation. The original design
of slog removal did not persist any state because the removal happened
synchronously. This was changed by a later commit which persisted the
vdev_removing flag and exposed this bug. If a slog removal is in
progress and happens to crash after persisting the vdev_removing flag to
the label but before the vdev is removed from the spa config, then the
pool will continue to panic on import. Here's a sample of the panic:

[  134.387411] VERIFY0(0 == dmu_buf_hold_array(os, object, offset, size,
FALSE, FTAG, &numbufs, &dbp)) failed (0 == 22)
[  134.393865] PANIC at dmu.c:1135:dmu_write()
[  134.396035] Kernel panic - not syncing: VERIFY0(0 ==
dmu_buf_hold_array(os, object, offset, size, FALSE, FTAG, &numbufs,
&dbp)) failed (0 == 22)
[  134.397857] CPU: 2 PID: 5914 Comm: txg_sync Kdump: loaded Tainted:
P           OE     5.4.0-1100-dx2023020205-b3751f8c2-azure #106
[  134.407938] Hardware name: Microsoft Corporation Virtual
Machine/Virtual Machine, BIOS 090008  12/07/2018
[  134.407938] Call Trace:
[  134.407938]  dump_stack+0x57/0x6d
[  134.407938]  panic+0xfb/0x2d7
[  134.407938]  spl_panic+0xcf/0x102 [spl]
[  134.407938]  ? traverse_impl+0x1ca/0x420 [zfs]
[  134.407938]  ? dmu_object_alloc_impl+0x3b4/0x3c0 [zfs]
[  134.407938]  ? dnode_hold+0x1b/0x20 [zfs]
[  134.407938]  dmu_write+0xc3/0xd0 [zfs]
[  134.407938]  ? space_map_alloc+0x55/0x80 [zfs]
[  134.407938]  metaslab_sync+0x61a/0x830 [zfs]
[  134.407938]  ? queued_spin_unlock+0x9/0x10 [zfs]
[  134.407938]  vdev_sync+0x72/0x190 [zfs]
[  134.407938]  spa_sync_iterate_to_convergence+0x160/0x250 [zfs]
[  134.407938]  spa_sync+0x2f7/0x670 [zfs]
[  134.407938]  txg_sync_thread+0x22d/0x2d0 [zfs]
[  134.407938]  ? txg_dispatch_callbacks+0xf0/0xf0 [zfs]
[  134.407938]  thread_generic_wrapper+0x83/0xa0 [spl]
[  134.407938]  kthread+0x104/0x140
[  134.407938]  ? kasan_check_write.constprop.0+0x10/0x10 [spl]
[  134.407938]  ? kthread_park+0x90/0x90
[  134.457802]  ret_from_fork+0x1f/0x40

This change no longer persists the vdev_removing flag when removing slog
devices and also cleans up some code that was added which is not used.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Signed-off-by: George Wilson <gwilson@delphix.com>
Closes #14652
2023-03-24 10:27:07 -07:00
Matthew Ahrens
d2d4f8554f
Fix prefetching of indirect blocks while destroying
When traversing a tree of block pointers (e.g. for `zfs destroy <fs>` or
`zfs send`), we prefetch the indirect blocks that will be needed, in
`traverse_prefetch_metadata()`.  In the case of `zfs destroy <fs>`, we
do a little traversing each txg, and resume the traversal the next txg.
So the indirect blocks that will be needed, and thus are candidates for
prefetching, do not include blocks that are before the resume point.

The problem is that the logic for determining if the indirect blocks are
before the resume point is incorrect, causing the (up to 1024) L1
indirect blocks that are inside the first L2 to not be prefetched.  In
practice, if we are able to read many more than 1024 blocks per txg,
then this will be inconsequential.  But if i/o latency is more than a
few milliseconds, almost no L1's will be prefetched, so they will be
read serially, and thus the destroying will be very slow.  This can be
observed as `zpool get freeing` decreasing very slowly.

Specifically: When we first examine the L2 that contains the block we'll
be resuming from, we have not yet resumed, so `td_resume` is nonzero.
At this point, all calls to `traverse_prefetch_metadata()` will fail,
even if the L1 in question is after the resume point.  It isn't until
the callback is issued for the resume point that we zero out
`td_resume`, but by this point we've already attempted and failed to
prefetch everything under this L2 indirect block.

This commit addresses the issue by reusing the existing
`resume_skip_check()` to determine if the L1's bookmark is before or
after the resume point.  To do so, this function is made non-mutating
(the caller now zeros `td_resume`).

Note, this bug likely predates (was not introduced by) #11803.

Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes #14603
2023-03-24 10:20:07 -07:00
Pawel Jakub Dawidek
ce0e1cc402
Fix cloning into already dirty dbufs.
Undirty the dbuf and destroy its buffer when cloning into it.

Coverity ID: CID-1535375
Reported-by: Richard Yao
Reported-by: Benjamin Coddington
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14655
2023-03-24 10:18:35 -07:00
Pawel Jakub Dawidek
9fa007d35d
Fix build on FreeBSD
Constify some variables after d1807f168e.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by:  Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes #14656
2023-03-22 09:24:41 -07:00
Alexander Motin
d520f64342
FreeBSD: Remove extra arc_reduce_target_size() call
Remove arc_reduce_target_size() call from arc_prune_task().  The idea
of arc_prune_task() is to remove external references on ARC metadata,
such as vnodes. Since arc_prune_async() is called only from ARC itself,
it makes no sense to create a parasitic loop between ARC eviction and
the pruning, threatening to drop ARC to its minimum.  I can't guess why
it was added as part of FreeBSD to OpenZFS integration.

Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Brian Atkinson <batkinson@lanl.gov>
Signed-off-by:  Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes #14639
2023-03-17 17:31:08 -07:00