freebsd-dev

Author	SHA1	Message	Date
George Wilson	a604d3243b	Revert "Do not hold spa_config in ZIL while blocked on IO" This reverts commit `7d638df09b`. Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Wilson <gwilson@delphix.com> Closes #14678	2023-03-28 08:13:32 -07:00
Ameer Hamza	a05263b7aa	Update vdev state for spare vdev zfsd fetches new pool configuration through ZFS_IOC_POOL_STATS but it does not get updated nvlist configuration for spare vdev since the configuration is read by spa_spares->sav_config. In this commit, updating the vdev state for spare vdev that is consumed by zfsd on spare disk hotplug. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Ameer Hamza <ahamza@ixsystems.com> Closes #14653	2023-03-24 10:30:38 -07:00
Rich Ercolani	0ad5f43442	Drop lying to the compiler in the fletcher4 code This is probably the uncontroversial part of #13631, which fixes a real problem people are having. There's still things to improve in our code after this is merged, but it should stop the breakage that people have reported, where we lie about a type always being aligned and then pass in stack objects with no alignment requirement and hope for the best. Of course, our SIMD code was written with unaligned accesses, so it doesn't care if we drop this...but some auto-vectorized code that gcc emits sure does, since we told it it can assume they're aligned. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #14649	2023-03-24 10:29:19 -07:00
George Wilson	460d887c43	panic loop when removing slog device There is a window in the slog removal code where a panic loop could ensue if the system crashes during that operation. The original design of slog removal did not persist any state because the removal happened synchronously. This was changed by a later commit which persisted the vdev_removing flag and exposed this bug. If a slog removal is in progress and happens to crash after persisting the vdev_removing flag to the label but before the vdev is removed from the spa config, then the pool will continue to panic on import. Here's a sample of the panic: [ 134.387411] VERIFY0(0 == dmu_buf_hold_array(os, object, offset, size, FALSE, FTAG, &numbufs, &dbp)) failed (0 == 22) [ 134.393865] PANIC at dmu.c:1135:dmu_write() [ 134.396035] Kernel panic - not syncing: VERIFY0(0 == dmu_buf_hold_array(os, object, offset, size, FALSE, FTAG, &numbufs, &dbp)) failed (0 == 22) [ 134.397857] CPU: 2 PID: 5914 Comm: txg_sync Kdump: loaded Tainted: P OE 5.4.0-1100-dx2023020205-b3751f8c2-azure #106 [ 134.407938] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090008 12/07/2018 [ 134.407938] Call Trace: [ 134.407938] dump_stack+0x57/0x6d [ 134.407938] panic+0xfb/0x2d7 [ 134.407938] spl_panic+0xcf/0x102 [spl] [ 134.407938] ? traverse_impl+0x1ca/0x420 [zfs] [ 134.407938] ? dmu_object_alloc_impl+0x3b4/0x3c0 [zfs] [ 134.407938] ? dnode_hold+0x1b/0x20 [zfs] [ 134.407938] dmu_write+0xc3/0xd0 [zfs] [ 134.407938] ? space_map_alloc+0x55/0x80 [zfs] [ 134.407938] metaslab_sync+0x61a/0x830 [zfs] [ 134.407938] ? queued_spin_unlock+0x9/0x10 [zfs] [ 134.407938] vdev_sync+0x72/0x190 [zfs] [ 134.407938] spa_sync_iterate_to_convergence+0x160/0x250 [zfs] [ 134.407938] spa_sync+0x2f7/0x670 [zfs] [ 134.407938] txg_sync_thread+0x22d/0x2d0 [zfs] [ 134.407938] ? txg_dispatch_callbacks+0xf0/0xf0 [zfs] [ 134.407938] thread_generic_wrapper+0x83/0xa0 [spl] [ 134.407938] kthread+0x104/0x140 [ 134.407938] ? kasan_check_write.constprop.0+0x10/0x10 [spl] [ 134.407938] ? kthread_park+0x90/0x90 [ 134.457802] ret_from_fork+0x1f/0x40 This change no longer persists the vdev_removing flag when removing slog devices and also cleans up some code that was added which is not used. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Signed-off-by: George Wilson <gwilson@delphix.com> Closes #14652	2023-03-24 10:27:07 -07:00
Matthew Ahrens	d2d4f8554f	Fix prefetching of indirect blocks while destroying When traversing a tree of block pointers (e.g. for `zfs destroy <fs>` or `zfs send`), we prefetch the indirect blocks that will be needed, in `traverse_prefetch_metadata()`. In the case of `zfs destroy <fs>`, we do a little traversing each txg, and resume the traversal the next txg. So the indirect blocks that will be needed, and thus are candidates for prefetching, do not include blocks that are before the resume point. The problem is that the logic for determining if the indirect blocks are before the resume point is incorrect, causing the (up to 1024) L1 indirect blocks that are inside the first L2 to not be prefetched. In practice, if we are able to read many more than 1024 blocks per txg, then this will be inconsequential. But if i/o latency is more than a few milliseconds, almost no L1's will be prefetched, so they will be read serially, and thus the destroying will be very slow. This can be observed as `zpool get freeing` decreasing very slowly. Specifically: When we first examine the L2 that contains the block we'll be resuming from, we have not yet resumed, so `td_resume` is nonzero. At this point, all calls to `traverse_prefetch_metadata()` will fail, even if the L1 in question is after the resume point. It isn't until the callback is issued for the resume point that we zero out `td_resume`, but by this point we've already attempted and failed to prefetch everything under this L2 indirect block. This commit addresses the issue by reusing the existing `resume_skip_check()` to determine if the L1's bookmark is before or after the resume point. To do so, this function is made non-mutating (the caller now zeros `td_resume`). Note, this bug likely predates (was not introduced by) #11803. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #14603	2023-03-24 10:20:07 -07:00
Pawel Jakub Dawidek	ce0e1cc402	Fix cloning into already dirty dbufs. Undirty the dbuf and destroy its buffer when cloning into it. Coverity ID: CID-1535375 Reported-by: Richard Yao Reported-by: Benjamin Coddington Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #14655	2023-03-24 10:18:35 -07:00
Pawel Jakub Dawidek	9fa007d35d	Fix build on FreeBSD Constify some variables after `d1807f168e`. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #14656	2023-03-22 09:24:41 -07:00
Alexander Motin	d520f64342	FreeBSD: Remove extra arc_reduce_target_size() call Remove arc_reduce_target_size() call from arc_prune_task(). The idea of arc_prune_task() is to remove external references on ARC metadata, such as vnodes. Since arc_prune_async() is called only from ARC itself, it makes no sense to create a parasitic loop between ARC eviction and the pruning, threatening to drop ARC to its minimum. I can't guess why it was added as part of FreeBSD to OpenZFS integration. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14639	2023-03-17 17:31:08 -07:00
Richard Yao	fa46802585	Fix possible bad bit shift in dnode_next_offset_level() `031d7c2fe6` did not handle reverse iteration, such that the original issue theoretically could still occur. Note that contrary to the claim in the ZFS disk format specification that a maximum of 6 levels are possible, 9 levels are possible with recordsize=512 and an indirect block size of 16KB. In this unusual configuration, span will be 65. The maximum size of span at 70 can be reached at recordsize=16K and an indirect blocksize of 16KB. When we are at this indirection level and are traversing backward, the minimum value is start, but we cannot calculate that with 64-bit arithmetic, so we avoid the calculation and instead rely on the earlier statement that did `*offset = start;`. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reported-by: Coverity (CID-1466214) Closes #14618	2023-03-16 14:27:49 -07:00
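The span values quoted in the commit above (65 and 70) can be checked with a few lines of arithmetic. The sketch below is a Python model, not the ZFS C code. It assumes block pointers are 128 bytes (a shift of 7) and that the shift at a given level is `(level - 1) * (indblkshift - 7) + datablkshift`, a formula inferred from the numbers in the message. The function names are hypothetical.

```python
# Python model of the span arithmetic described above (the real code is C).
# Assumptions: block pointers are 128 bytes (shift 7); names are illustrative.
BLKPTR_SHIFT = 7


def span_bits(level, datablkshift, indblkshift):
    """Shift amount at `level` (1-based) under the assumed formula."""
    epbs = indblkshift - BLKPTR_SHIFT  # log2(block pointers per indirect block)
    return (level - 1) * epbs + datablkshift


def levels_needed(datablkshift, indblkshift, offset_bits=64):
    """Smallest level count whose top-level shift reaches `offset_bits`."""
    levels = 1
    while span_bits(levels, datablkshift, indblkshift) < offset_bits:
        levels += 1
    return levels


for recordsize_shift in (9, 14):  # recordsize=512 and recordsize=16K
    n = levels_needed(recordsize_shift, 14)  # 16KB indirect blocks
    print(recordsize_shift, n, span_bits(n, recordsize_shift, 14))
```

Under these assumptions both configurations need 9 levels, and the top-level shift is 65 or 70 bits. A shift that wide cannot be applied to a 64-bit integer in C, which is why the commit avoids the calculation.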
naivekun	60cfd3bbc2	QAT: Fix uninitialized seed in QAT compression CpaDcRqResults has to be initialized with checksum=1 for adler32. Otherwise, when the error CPA_DC_OVERFLOW occurs, the next compress operation continues on previously part-compressed data and writes invalid checksum data. When ZFS decompresses the compressed data, an invalid checksum occurs, leading to #14463. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Weigang Li <weigang.li@intel.com> Reviewed-by: Chengfei Zhu <chengfeix.zhu@intel.com> Signed-off-by: naivekun <naivekun0817@gmail.com> Closes #14632 Closes #14463	2023-03-16 11:54:10 -07:00
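Why the Adler-32 seed matters can be shown without any QAT hardware. The sketch below uses Python's standard `zlib.adler32`, not the QAT API, and models a wrong seed with the value 0 purely as an illustration. The commit itself only states that the checksum field must start at 1.

```python
import zlib

first, second = b"part one, ", b"part two"
data = first + second

# An Adler-32 running checksum starts at 1, which is also the checksum of
# empty input.
empty = zlib.adler32(b"")

# Continuing from a correctly seeded running value matches a one-shot checksum.
running = zlib.adler32(first, 1)
running = zlib.adler32(second, running)
one_shot = zlib.adler32(data)

# Seeding with 0 instead of 1 (the illustrative mistake) gives another value.
bad_seed = zlib.adler32(data, 0)

print(empty, running == one_shot, bad_seed == one_shot)
```

Any stale or arbitrary starting value has the same effect as the 0 used here: the stored checksum no longer matches what a decompressor computes.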
Tino Reichardt	fe6a7b787f	Remove unused Edon-R variants This commit removes the edonr_byteorder.h file and all unused variants of Edon-R. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #13618	2023-03-14 15:59:58 -07:00
Richard Yao	d1807f168e	nvpair: Constify string functions After addressing coverity complaints involving `nvpair_name()`, the compiler started complaining about dropping const. This led to a rabbit hole where not only `nvpair_name()` needed to be constified, but also `nvpair_value_string()`, `fnvpair_value_string()` and a few other static functions, plus variable pointers throughout the code. The result became a fairly big change, so it has been split out into its own patch. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14612	2023-03-14 15:25:50 -07:00
Richard Yao	27ff18cd43	Fix possible NULL pointer dereference in nvlist_lookup_nvpair_ei_sep() Clang's static analyzer complains about a possible NULL pointer dereference in nvlist_lookup_nvpair_ei_sep() because it unconditionally dereferences a pointer initialized by `nvpair_value_nvlist_array()` under the assumption that `nvpair_value_nvlist_array()` will always initialize the pointer without checking to see if an error was returned to indicate otherwise. This itself is improper error handling, so we fix it. However, fixing it to properly respond to errors is not enough to avoid a NULL pointer dereference, since we can receive NULL when the array is empty, so we also add a NULL check. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14612	2023-03-14 15:25:35 -07:00
Richard Yao	47b994049f	Silence clang static analyzer warnings about stored stack addresses Clang's static analyzer complains that nvs_xdr() and nvs_native() functions return pointers to stack memory. That is technically true, but the pointers are stored in stack memory from the caller's stack frame, are not read by the caller and are deallocated when the caller returns, so this is harmless. We set the pointers to NULL to silence the warnings. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14612	2023-03-14 15:25:01 -07:00
Richard Yao	3cb293a6f8	Fix possible NULL pointer dereference in dbuf_verify() Coverity reported a dereference after a NULL check in dbuf_verify(). If `dn` is `NULL`, we can just assume that !dn->dn_free_txg, so we change `!dn->dn_free_txg` to `(dn == NULL || !dn->dn_free_txg)`. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reported-by: Coverity (CID-992298) Closes #14619	2023-03-14 15:00:54 -07:00
Tino Reichardt	3a03c96381	Replace dead opensolaris.org license links The commit replaces all findings of the link: http://www.opensolaris.org/os/licensing with this one: https://opensource.org/licenses/CDDL-1.0 Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: WHR <msl0000023508@gmail.com> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #14625	2023-03-14 14:44:01 -07:00
Richard Yao	24e61911f0	Zero zio_prop_t in flush_write_batch_impl() After `67a1b03791` was merged, coverity started complaining about an uninitialized scalar variable in flush_write_batch_impl() due to the new field zp.zp_brtwrite. Upon inspection, it appears that uninitialized memory was being copied for non-raw streams, so this is a pre-existing issue. The addition of zp_brtwrite by the block cloning commit caused Coverity to begin to notice it. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reported-by: Coverity (CID-1535378) Closes #14607	2023-03-14 14:41:45 -07:00
Richard Yao	1c212d1b7c	Fix uninitialized scalar value read regression in dmu_recv_begin() `da19d919a8` changed this in a way that permits execution to reach `if (err == 0)` without initializing err. This could randomly cause the sync task to not execute. We fix that by initializing err to zero. Reviewed-by: George Melikov <mail@gmelikov.ru> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reported-by: Coverity (CID-1535377) Closes #14607	2023-03-14 14:40:49 -07:00
Matthew Ahrens	519851122b	ZFS_IOC_COUNT_FILLED does unnecessary txg_wait_synced() `lseek(SEEK_DATA | SEEK_HOLE)` are only accurate when the on-disk blocks reflect all writes, i.e. when there are no dirty data blocks. To ensure this, if the target dnode is dirty, they wait for the open txg to be synced, so we can call them "stabilizing operations". If they cause txg_wait_synced often, it can be detrimental to performance. Typically, a group of files are all modified, and then SEEK_DATA/HOLE are performed on them. In this case, the first SEEK does a txg_wait_synced(), and subsequent SEEKs don't need to wait, so performance is good. However, if a workload involves an interleaved metadata modification, the subsequent SEEK may do a txg_wait_synced() unnecessarily. For example, if we do a `read()` syscall to each file before we do its SEEK. This applies even with `relatime=on`, when the `read()` is the first read after the last write. The txg_wait_synced() is unnecessary because the SEEK operations only care that the structure of the tree of indirect and data blocks is up to date on disk. They don't care about metadata like the contents of the bonus or spill blocks. (They also don't care if an existing data block is modified, but this would be more involved to filter out.) This commit changes the behavior of SEEK_DATA/HOLE operations such that they do not call txg_wait_synced() if there is only a pending change to the bonus or spill block. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Closes #13368 Issue #14594 Issue #14512 Issue #14009	2023-03-14 14:30:29 -07:00
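For readers unfamiliar with the SEEK_DATA/SEEK_HOLE operations discussed in the commit above, the following is a minimal user-space sketch of how a program issues them. It uses Python's `os.lseek` on a temporary file and assumes a platform that defines `os.SEEK_DATA` and `os.SEEK_HOLE` (Linux, for example). It illustrates the caller's side only and says nothing about how ZFS implements the lookup.

```python
import os
import tempfile

# Write 4 KiB of data, then extend the file to 1 MiB so the tail is a hole
# on filesystems that support sparse files.
with tempfile.NamedTemporaryFile() as f:
    f.write(b"x" * 4096)
    f.flush()
    f.truncate(1 << 20)
    fd = f.fileno()
    size = os.fstat(fd).st_size
    data_off = os.lseek(fd, 0, os.SEEK_DATA)  # first data at or after offset 0
    hole_off = os.lseek(fd, 0, os.SEEK_HOLE)  # first hole at or after 0, or EOF

print(data_off, hole_off, size)
```

On a filesystem without hole tracking, the whole file is reported as data and the hole offset equals the file size. On ZFS, the answer depends on on-disk state, which is why these calls may wait for a txg sync when the dnode is dirty.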
Attila Fülöp	78289b8458	zcommon: Refactor FPU state handling in fletcher4 Currently calls to kfpu_begin() and kfpu_end() are split between the init() and fini() functions of the particular SIMD implementation. This was done in #14247 as an optimization measure for the ABD adapter. Unfortunately the split complicates FPU handling on platforms that use a local FPU state buffer, like Windows and macOS. To ease porting, we introduce a boolean struct member in fletcher_4_ops_t, indicating use of the FPU, and move the FPU state handling from the SIMD implementations to the call sites. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Jorgen Lundman <lundman@lundman.net> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #14600	2023-03-14 09:45:28 -07:00
Pawel Jakub Dawidek	67a1b03791	Implementation of block cloning for ZFS Block Cloning allows to manually clone a file (or a subset of its blocks) into another (or the same) file by just creating additional references to the data blocks without copying the data itself. Those references are kept in the Block Reference Tables (BRTs). The whole design of block cloning is documented in module/zfs/brt.c. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rich Ercolani <rincebrain@gmail.com> Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net> Closes #13392	2023-03-10 11:59:53 -08:00
Paul Dagnelie	da19d919a8	Fix incremental receive silently failing for recursive sends The problem occurs because dmu_recv_begin pulls in the payload and next header from the input stream in order to use the contents of the begin record's nvlist. However, the change to do that before the other checks in dmu_recv_begin occur caused a regression where an empty send stream in a recursive send could have its END record consumed by this, which broke the logic of recv_skip. A test is also included to protect against this case in the future. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #12661 Closes #14568	2023-03-10 09:52:44 -08:00
Richard Yao	7316fdd1c0	txg_sync should handle write errors in ZIL The txg_sync thread will see certain buffers in a DR_IN_DMU_SYNC state when ZIL is writing them out. Then it waits until the state changes, but has an assertion to check that they were not DR_NOT_OVERRIDDEN. If the data write failed with an error, ZIL will put it into the DR_NOT_OVERRIDDEN state. It looks like the code will handle that state without an issue, so we can just delete the assertion. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@klarasystems.com> Sponsored-By: Wasabi Technology, Inc. Closes #14283	2023-03-10 09:34:00 -08:00
Richard Yao	950980b4c4	Suppress clang static analyzer warning in vdev_stat_update() `63652e1546` added unnecessary branches in `vdev_stat_update()` to suppress an ASAN false positive that breaks ztest. This had the downside of causing false positive reports in both Coverity and Clang's static analyzer. vd is never NULL, so we add a preprocessor check to only apply the workaround when compiling with ASAN support. Reported-by: Coverity (CID-1524583) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14575	2023-03-08 13:52:24 -08:00
Richard Yao	08641d9007	Suppress static analyzer warning in dmu_objset_create_impl_dnstats() `ae7e700650` added an assertion to suppress a complaint from Clang's static analyzer. Unfortunately, it missed another way for Clang to complain about this function. This adds another assertion to handle that. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14575	2023-03-08 13:52:15 -08:00
Richard Yao	703283fabd	Linux: Fix octal detection in define_ddi_strtox() Clang Tidy reported this as a misc-redundant-expression because writing `8` instead of `'8'` meant that the condition could never be true. The only place where we have a chance of this being a bug would be in nvlist_lookup_nvpair_ei_sep(). I am not sure if we ever pass an octal to that, but if we ever do, it should work properly now instead of failing. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14575	2023-03-08 13:52:09 -08:00
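The `8` versus `'8'` mistake described in the commit above is easy to model. The sketch below is Python, not the C macro, and the exact expression used in `define_ddi_strtox()` is an assumption here. The commit only states that the integer `8` was written where the character `'8'` was intended. Both function names are hypothetical.

```python
def is_octal_buggy(ch):
    # Models `ch >= '0' && ch < 8`: the upper bound is the integer 8, not the
    # character '8' (code 56), so no character can satisfy both conditions.
    return ord(ch) >= ord("0") and ord(ch) < 8


def is_octal_fixed(ch):
    # Models `ch >= '0' && ch < '8'`: true exactly for the digits 0 through 7.
    return ord("0") <= ord(ch) < ord("8")


print([c for c in "0789" if is_octal_buggy(c)])
print([c for c in "0789" if is_octal_fixed(c)])
```

Because the buggy condition is a contradiction, a static checker can flag it as a redundant expression, which matches the Clang Tidy report mentioned in the message.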
Richard Yao	66a38fd10a	Linux: Suppress clang static analyzer warning in zfs_remove() Clang's static analyzer points out that if we fail to find an extended attribute directory, but somehow find it when calculating delete_now and delete_now is true, we will have a NULL pointer dereference when we try to unlink the extended attribute directory. I am not sure if this is possible, but if it is, I do not see a sane way of handling this other than rolling back the transaction and retrying. For now, let us do a VERIFY_IMPLY(). If this trips, it will stop the transaction from committing, which will prevent an attribute directory leak. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14575	2023-03-08 13:52:04 -08:00
Richard Yao	c2550a136e	Linux: Silence static analyzer warning in crypto_create_ctx_template() A CodeChecker report from Clang's CTU analysis indicated that we were assigning uninitialized values in crypto_create_ctx_template() when we call it from zio_crypt_key_init(). This occurs because the ->cm_param and ->cm_param_len fields are uninitialized. Thankfully, the uninitialized values are only used in the skein via KCF_PROV_CREATE_CTX_TEMPLATE() -> skein_create_ctx_template() -> skein_mac_ctx_build() -> skein_get_digest_bitlen(), but that should not be called from here. We fix this to avoid a possible trap should this code change in the future. The FreeBSD version of zio_crypt_key_init() is unaffected. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14575	2023-03-08 13:51:59 -08:00
Richard Yao	51f55742f6	Suppress Clang Static Analyzer warning in bpobj_enqueue() scan-build does not do cross translation unit analysis to realize that `dmu_buf_hold()` will always set `bpo->bpo_cached_dbuf` to a non-NULL pointer, so we add an assertion to make it realize this. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14575	2023-03-08 13:51:55 -08:00
Richard Yao	45c446308a	Suppress Clang Static Analyzer warning in dsl_dir_rename_sync() Clang's static analyzer reports that if we try to rename a root dataset in `dsl_dir_rename_sync()`, we will have a NULL pointer passed to strlcpy(). This is impossible because `dsl_dir_rename_check()` will prevent us from doing this. We add an assertion to silence this warning. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14575	2023-03-08 13:51:50 -08:00
Richard Yao	17443e0b20	Cleanup: Remove constant comparisons reported by CodeQL CodeQL's cpp/constant-comparison query from its security-and-extended query set reported 4 instances where we have comparisons that always evaluate the same way. In `draid_config_by_type()`, we have an early `if (nparity == 0)` check that returns `EINVAL`, making a later `if (nparity == 0 || nparity > VDEV_DRAID_MAXPARITY)` partially redundant. The later check prints an error message when parity is 0, but the early check does not. This is not useful feedback, so we move the later check to the place where the early check runs to replace the early check. In `perform_thread_merge()`, we return when `num_threads == 0`. After that block, we do `if (num_threads > 0) {`, which will always be true. We remove the `if` statement. In `sa_modify_attrs()`, we have a loop condition that is `k != 2`, but at the end of the loop, we have `if (k == 0 && hdl->sa_spill)` followed by an else that does a break. The result is that k != 2 will never be evaluated when it is false. We drop the comparison. In `zap_leaf_array_read()`, we have a for loop condition that is `i < ZAP_LEAF_ARRAY_BYTES && len > 0`. However, that loop itself is in a loop that is `while (len > 0)` and while the value of len is decremented inside the loop, when `len == 0`, it will return, such that `len > 0` inside the loop condition will always be true. We drop that part of the condition. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14575	2023-03-08 13:51:46 -08:00
Richard Yao	5dd0f019cd	Linux cleanup: zvol_discard() should only call blk_queue_io_stat() once Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14575	2023-03-08 13:51:40 -08:00
Richard Yao	f9e109223b	Suppress Clang Static Analyzer warning in dbuf_dnode_findbp() Clang's static analyzer reports that if a `blkid == DMU_SPILL_BLKID` is passed, then we can have a NULL pointer dereference when either ->dn_have_spill or `DNODE_FLAG_SPILL_BLKPTR` is not set. This should not happen. We add an `ASSERT()` to suppress reports about NULL pointer dereferences. Originally, I wanted to use one or two IMPLY statements on pre-conditions before the call to `dbuf_findbp()`, but Clang's static analyzer did not understand it. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14575	2023-03-08 13:51:36 -08:00
Richard Yao	399bb81607	Suppress Clang Static Analyzer warning in vdev_split() Clang's static analyzer pointed out that we can have a NULL pointer dereference if we ever attempt to split a vdev that has only 1 child. If that happens, we are left with zero children, but then try to access a non-existent child. Calling vdev_split() on a vdev with only 1 child should be impossible due to how the code is structured. If this ever happens, it would be best to stop execution immediately even in a production environment to allow for the best possible chance of recovery by an expert, so we use `VERIFY3U()` instead of `ASSERT3U()`. Unfortunately, while that defensive assertion will prevent execution from ever reaching the NULL pointer dereference, Clang's static analyzer does not realize that, so we add an `ASSERT()` to inform it of this. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14575	2023-03-08 13:51:31 -08:00
Richard Yao	a4240a8ac7	Suppress Clang Static Analyzer false positive in the AVL tree code. This has been filed as llvm/llvm-project#60694. Switching from a copy through a C pointer dereference to an explicit memcpy() is a workaround that prevents a false positive. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14575	2023-03-08 13:51:21 -08:00
Richard Yao	8b72dfed11	Suppress Clang Static Analyzer defect report in abd_get_size() Clang's static analyzer reports a possible NULL pointer dereference in abd_get_size() when called from vdev_draid_map_alloc_write() called from vdev_draid_map_alloc_row() and vdc->vdc_nparity == 0. This should be impossible, so we add an assertion to silence the defect report. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14575	2023-03-08 13:51:09 -08:00
Alexander Motin	a8d83e2a24	More adaptive ARC eviction Traditionally ARC adaptation was limited to MRU/MFU distribution. But for years people with metadata-centric workloads demanded mechanisms to also manage data/metadata distribution, which in original ZFS was just a FIFO. As a result ZFS effectively got separate states for data and metadata, minimum and maximum metadata limits etc, but it all required manual tuning, was not adaptive and in its heart remained a bad FIFO. This change removes most of the existing eviction logic, rewriting it from scratch. This makes MRU/MFU adaptation individual for data and metadata, same as the distribution between data and metadata themselves. Since most of the required state separation was already done, it only required making the arcs_size state field specific per data/metadata. The adaptation logic is still based on the previous concept of ghost hits, just now it balances ARC capacity between 4 states: MRU data, MRU metadata, MFU data and MFU metadata. To simplify arc_c changes, instead of arc_p measured in bytes, this code uses 3 variables arc_meta, arc_pd and arc_pm, representing ARC balance between metadata and data, MRU and MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed point fractions. Since we care about the math result only when we need to evict, this moves all the logic from arc_adapt() to arc_evict(), which reduces per-block overhead, since per-block operations are limited to stats collection, now moved from arc_adapt() to arc_access() and using cheaper wmsums. This also allows removing the ugly ARC_HDR_DO_ADAPT flag from many places. This change also removes a number of metadata specific tunables, part of which were actually not functioning correctly, since not all metadata are equal and some (like L2ARC headers) are not really evictable. Instead it introduces a single opaque knob zfs_arc_meta_balance, tuning ARC's reaction to ghost hits, allowing the administrator to give more or less preference to metadata without setting strict limits. Some of the old code parts like arc_evict_meta() are just removed, because since the introduction of ABD ARC they really make no sense: only headers referenced by a small number of buffers are not evictable, and they are really not evictable no matter what this code does. Instead just call arc_prune_async() if too much metadata appears not evictable. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Allan Jude <allan@klarasystems.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14359	2023-03-08 11:17:23 -08:00
Attila Fülöp	8d9752569b	ICP: AES-GCM: Unify gcm_init_ctx() and gmac_init_ctx() gmac_init_ctx() duplicates most of the code in gcm_init_ctx() while it just needs to set its own IV length and AAD tag length. Introduce gcm_init_ctx_impl() which handles the GCM and GMAC differences while reusing the duplicated code. While here, fix a flaw where the AVX implementation would accept a context using a byte swapped key schedule which it could not handle. Also constify the IV and AAD pointers passed to gcm_init{,_avx}(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #14529	2023-03-08 11:12:15 -08:00
Richard Yao	7d638df09b	Do not hold spa_config in ZIL while blocked on IO Otherwise, we can get a deadlock that looks like this: 1. fsync() grabs spa_config_enter(zilog->zl_spa, SCL_STATE, lwb, RW_READER) as part of zil_lwb_write_issue(). It then blocks on the txg_sync when a flush fails from a drive power cycling. 2. The txg_sync then blocks on the pool suspending due to the loss of too many disks. 3. zpool clear then blocks on spa_config_enter(spa, SCL_STATE | SCL_L2ARC | SCL_ZIO, spa, RW_WRITER) because it is a writer. The disks cannot be brought online due to fsync() holding that lock and the user gets upset since fsync() is uninterruptibly blocked inside the kernel. We need to grab the lock for vdev_lookup_top(), but we do not need to hold it while there is outstanding IO. This fixes a regression introduced by `1ce23dcaff`. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@klarasystems.com> Sponsored-By: Wasabi Technology, Inc. Closes #14519	2023-03-07 16:12:28 -08:00
Rob N	b988f32c70	Better handling for future crypto parameters The intent is that this is like ENOTSUP, but specifically for when something can't be done because we have no support for the requested crypto parameters; e.g. unlocking a dataset or receiving a stream encrypted with a suite we don't support. It's not intended to be recoverable without upgrading ZFS itself. If the request could be made to work by enabling a feature or modifying some other configuration item, then some other code should be used. load-key: In the future we might have more crypto suites (i.e. new values for the `encryption` property). Right now trying to load a key on such a future crypto suite will look up suite parameters off the end of the crypto table, resulting in misbehaviour and/or crashes (or, with debug enabled, trip the assertion in `zio_crypt_key_unwrap`). Instead, let's check the value we got from the dataset, and if we can't handle it, abort early. recv: When receiving a raw stream encrypted with an unknown crypto suite, `zfs recv` would report a generic `invalid backup stream` (EINVAL). While technically correct, it's not super helpful, so let's ship a more specific error code and message. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Rob Norris <robn@despairlabs.com> Closes #14577	2023-03-07 14:05:14 -08:00
Andriy Gapon	a55254be7a	[FreeBSD] fix false assert in cache_vop_rmdir when replaying ZIL The assert is enabled when DEBUG_VFS_LOCKS kernel option is set. The exact panic is: panic: condition seqc_in_modify(_vp->v_seqc) not met It happens because seqc protocol is not followed for ZIL replay. But we actually do not need to make any namecache calls at that stage, because the namecache use is not enabled until after the replay is completed. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Andriy Gapon <avg@FreeBSD.org> Closes #14566	2023-03-07 13:48:43 -08:00
Tino Reichardt	84a1c48c86	Fix detection of IBM Power8 machines (ISA 2.07) An IBM POWER7 system with Power ISA 2.06 tried to execute zfs_sha256_power8() - which should only be run on ISA 2.07 machines. The detection is implemented via the zfs_isa207_available() call, but this check was not used. This pull request will fix this. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Low-power <msl0000023508@gmail.com> Closes #14576	2023-03-06 17:01:01 -08:00
Andriy Gapon	28bf26acb6	[FreeBSD] zfs_znode_alloc: lock the vnode earlier This is needed because of a possible error path where zfs_vnode_forget() is called. That function calls vgone() and vput(), the former requires the vnode to be exclusively locked and the latter expects it to be locked. It should be safe to lock the vnode as early as possible because it is not yet visible, so there is no interaction with other locks. While here, remove a tautological assignment to 'vp'. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Andriy Gapon <avg@FreeBSD.org> Closes #14565	2023-03-06 16:30:54 -08:00
George Amanakis	ca9e32d3a7	Optimize the is_l2cacheable functions by placing the most common use case (no special vdevs) first and avoid allocating new variables. Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #14494 Closes #14563	2023-03-06 16:13:05 -08:00
Richard Yao	b79e7114bb	Add missing increment to dsl_deadlist_move_bpobj() `dc5c8006f6` was recently merged to prefetch up to 128 deadlists. Unfortunately, a loop was missing an increment, such that it will prefetch all deadlists. The performance properties of that patch probably should be re-evaluated. This was caught by CodeQL's cpp/constant-comparison check in an experimental branch where I am testing the security-and-extended queries. It complained about the `i < 128` part of the loop condition always evaluating to the same thing. The standard CodeQL configuration we use missed this because it does not include that check. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14573	2023-03-06 15:28:26 -08:00
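The missing-increment bug described in `b79e7114bb` can be illustrated with a minimal sketch. The `struct entry` type and `prefetch_count()` function below are hypothetical stand-ins, not the actual `dsl_deadlist_move_bpobj()` code; they only show why a bound such as `i < 128` has no effect unless the counter is advanced inside the loop.

```c
#include <stddef.h>

/* Hypothetical stand-in for a deadlist entry; not the real ZFS type. */
struct entry {
	struct entry *next;
};

/*
 * Walk a linked list and count how many entries would be prefetched,
 * stopping once `limit` entries have been visited.
 */
size_t
prefetch_count(const struct entry *head, size_t limit)
{
	size_t i = 0;

	for (const struct entry *e = head; e != NULL && i < limit;
	    e = e->next) {
		/* A real implementation would issue a prefetch for e here. */
		i++;	/* Without this, `i < limit` never becomes false. */
	}
	return (i);
}
```

If the `i++` is dropped, the loop condition degenerates to `e != NULL` and every entry is visited, which is the kind of constant comparison a static analysis check can flag.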
Richard Yao	8846139b45	SHA2Init() should use signed assertions when checking an enum The recent `4c5fec01a4` commit caused Coverity to report that ASSERT3U(algotype, >=, SHA256_MECH_INFO_TYPE); is always true. That is because the signed algotype and signed SHA256_MECH_INFO_TYPE values were cast to unsigned types. To fix this, we switch the assertions to use ASSERT3S(), which retains the signedness of the original values for the comparison. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reported-by: Coverity (CID-1535300) Closes #14573	2023-03-06 15:26:43 -08:00
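The signedness issue in `8846139b45` can be sketched in plain C. The helpers `ge_unsigned()` and `ge_signed()` are hypothetical and only mimic, in spirit, what an unsigned versus a signed comparison assertion does; the sketch also assumes the lower-bound enumerator has the value 0, which the commit message does not state.

```c
#include <stdbool.h>
#include <stdint.h>

/* Compare after casting both operands to an unsigned 64-bit type. */
bool
ge_unsigned(int a, int b)
{
	return ((uint64_t)a >= (uint64_t)b);
}

/* Compare after casting both operands to a signed 64-bit type. */
bool
ge_signed(int a, int b)
{
	return ((int64_t)a >= (int64_t)b);
}
```

With a lower bound of 0, the unsigned comparison holds for every input, including negative ones, because a negative value converts to a very large unsigned number. The signed comparison rejects negative values, so it is the one that actually checks something.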
Jorgen Lundman	47119d60ef	Restore ASMABI and other Unify work Make sure all SHA2 transform functions have wrappers. For ASMABI to work, the calling convention must be consistent. Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Joergen Lundman <lundman@lundman.net> Closes #14569	2023-03-06 15:24:05 -08:00
Tino Reichardt	620a977f22	Use SECTION_STATIC macro for sha2 x86_64 assembly - instead of ".section .rodata" we should use SECTION_STATIC Tested-by: Rich Ercolani <rincebrain@gmail.com> Tested-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #13741	2023-03-02 13:52:33 -08:00
Tino Reichardt	f9f9bef22f	Update BLAKE3 for using the new impl handling This commit changes the BLAKE3 implementation handling and also the calls to it from the ztest command. Tested-by: Rich Ercolani <rincebrain@gmail.com> Tested-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #13741	2023-03-02 13:52:27 -08:00
Tino Reichardt	4c5fec01a4	Add generic implementation handling and SHA2 impl The skeleton file module/icp/include/generic_impl.c can be used for iterating over different implementations of algorithms. It is used by SHA256, SHA512 and BLAKE3 currently. The Solaris SHA2 implementation got replaced with a version which is based on public domain code of cppcrypto v0.10. These assembly files are taken from current openssl master: - sha256-x86_64.S: x64, SSSE3, AVX, AVX2, SHA-NI (x86_64) - sha512-x86_64.S: x64, AVX, AVX2 (x86_64) - sha256-armv7.S: ARMv7, NEON, ARMv8-CE (arm) - sha512-armv7.S: ARMv7, NEON (arm) - sha256-armv8.S: ARMv7, NEON, ARMv8-CE (aarch64) - sha512-armv8.S: ARMv7, ARMv8-CE (aarch64) - sha256-ppc.S: Generic PPC64 LE/BE (ppc64) - sha512-ppc.S: Generic PPC64 LE/BE (ppc64) - sha256-p8.S: Power8 ISA Version 2.07 LE/BE (ppc64) - sha512-p8.S: Power8 ISA Version 2.07 LE/BE (ppc64) Tested-by: Rich Ercolani <rincebrain@gmail.com> Tested-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #13741	2023-03-02 13:52:21 -08:00
Tino Reichardt	3e254aaad0	Remove old or redundant SHA2 files We had three sha2.h headers in different places. The FreeBSD version, the Linux version and the generic solaris version. The only assembly used for acceleration was some old x86-64 openssl implementation for sha256 within the icp module. For FreeBSD the whole SHA2 files of FreeBSD were copied into OpenZFS, these files got removed also. Tested-by: Rich Ercolani <rincebrain@gmail.com> Tested-by: Sebastian Gottschall <s.gottschall@dd-wrt.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de> Closes #13741	2023-03-02 13:50:21 -08:00
Alexander Motin	5f42d1dbf2	System-wide speculative prefetch limit. With some pathological access patterns it is possible to make ZFS accumulate an almost unlimited number of speculative prefetch ZIOs. Combined with linear ABD allocations in RAIDZ code, it appears to be possible to exhaust system KVA, triggering a kernel panic. Address this by introducing a system-wide counter of active prefetch requests and blocking prefetch distance doubling per stream hits if the number of active requests is higher than ~6% of ARC size. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14516	2023-03-01 15:27:40 -08:00
Richard Yao	4c856fb333	Fix data race between zil_commit() and zil_suspend() openzfsonwindows/openzfs#206 found that it is possible to trip `VERIFY(list_is_empty(&lwb->lwb_itxs))` when a `zil_commit()` is delayed by the scheduler long enough for a parallel `zil_suspend()` operation to exit `zil_commit_impl()`. This is a data race. To prevent this, we introduce a `zilog->zl_suspend_lock` rwlock to ensure that all outstanding `zil_commit()` operations finish before `zil_suspend()` begins and that subsequent operations fall back to `txg_wait_synced()` after `zil_suspend()` has begun. On `PREEMPT_RT` Linux kernels, the `rw_enter()` implementation suffers from writer starvation. This means that a ZIL-intensive system can delay `zil_suspend()` indefinitely. This is a pre-existing problem that affects everything that uses rw locks, so it needs to be addressed in the SPL. However, builds against `PREEMPT_RT` Linux kernels are currently broken due to a GPL symbol issue (#11097), so we can safely disregard that issue for now. Reported-by: Arun KV <arun.kv@datacore.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14514	2023-03-01 13:23:09 -08:00
Richard Yao	9fd5fedc67	Remove bad kmem_free() oversight from previous zfsdev_state_list patch I forgot to remove the corresponding kmem_free() from zfs_kmod_fini() in `9a14ce43c3`. Clang's static analyzer did not complain, but the Coverity scan that was run after the patch was merged did. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reported-by: Coverity (CID-1535275) Closes #14556	2023-03-01 13:20:53 -08:00
Richard Yao	dd108f5d73	Linux: zfs_fillpage() should handle partial pages from end of file After `89cd2197b9` was merged, Clang's static analyzer began complaining about a dead assignment in `zfs_fillpage()`. Upon inspection, I noticed that the dead assignment was because we are not using the calculated io_len that we should use to avoid asking the DMU to read past the end of a file. This should result in `dmu_buf_hold_array_by_dnode()` calling `zfs_panic_recover()`. This issue predates `89cd2197b9`, but its simplification of zfs_fillpage() eliminated the only use of the assignment to io_len, which made Clang's static analyzer complain about the issue. Also, as a precaution, we add an assertion that io_offset < i_size. If this ever fails, bad things will happen. Otherwise, we are blindly trusting the kernel not to give us invalid offsets. We continue to blindly trust it on non-debug kernels. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14534	2023-03-01 13:19:47 -08:00
Richard Yao	3a7c35119e	Handle unexpected errors in zil_lwb_commit() without ASSERT() We tripped `ASSERT(error == ENOENT || error == EEXIST || error == EALREADY)` in `zil_lwb_commit()` at Klara when doing robustness testing of ZIL against drive power cycles. That assertion presumably exists because when this code was written, the only errors expected from here were EIO, ENOENT, EEXIST and EALREADY, with EIO having its own handling before the assertion. However, upon doing a manual depth first search traversal of the source tree, it turns out that a large number of unexpected errors are possible here. In theory, EINVAL and ENOSPC can come from dnode_hold_impl(). However, most unexpected errors originate in the block layer and come to us from zio_wait() in various ways. One way is ->zl_get_data() -> dmu_buf_hold() -> dbuf_read() -> zio_wait(). From vdev_disk.c on Linux alone, zio_wait() can return the unexpected errors ENXIO, ENOTSUP, EOPNOTSUPP, ETIMEDOUT, ENOSPC, ENOLINK, EREMOTEIO, EBADE, ENODATA, EILSEQ and ENOMEM. This was only observed after what were likely over 1000 test iterations, so we do not expect to reproduce this again to find out what the error code was. However, circumstantial evidence suggests that the error was ENXIO. When ENXIO or any other unexpected error occurs, the `fsync()` or equivalent operation that called zil_commit() will return success, when in fact, dirty data has not been committed to stable storage. This is a violation of the Single UNIX Specification. The code should be able to handle this and any other unknown error by calling `txg_wait_synced()`. In addition to changing the code to call txg_wait_synced() on unexpected errors instead of returning, we modify it to print information about unexpected errors to dmesg. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@klarasystems.com> Sponsored-By: Wasabi Technology, Inc. Closes #14532	2023-03-01 09:39:41 -08:00
Allan Jude	cbd76b4032	Makefile.bsd: cleanup and sync with FreeBSD xxhash.c was not being compiled, so when FreeBSD's kernel switched to a newer version of ZSTD a few weeks ago, out-of-tree ZFS failed to build. Sync module/Makefile.bsd with FreeBSD's sys/modules/zfs/Makefile and restore the alphabetical sort in a number of places. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@klarasystems.com> Signed-off-by: Allan Jude <allan@klarasystems.com> Sponsored-by: Klara, Inc. Closes #14508	2023-02-28 17:33:59 -08:00
Richard Yao	ae7e700650	Suppress static analyzer warning in dmu_objset_create_impl_dnstats() Clang's static analyzer claims that dereferencing ds in dmu_objset_create_impl_dnstats() could cause a NULL pointer dereference when a previous NULL check confirms that it is NULL. It is only NULL on the MOS, for which dmu_objset_userused_enabled(os) should always return false, so ds will never be dereferenced when it is NULL. We add an assertion to suppress this warning. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14470	2023-02-28 17:31:58 -08:00
Richard Yao	5cc4950901	Suppress static analyzer warning in dbuf_hold_copy() Clang's static analyzer claims that dbuf_hold_copy() will have a NULL pointer dereference in data->b_data when called by dbuf_hold_impl(). This is impossible because data is dr->dt.dl.dr_data, which is non-NULL whenever db->db_level == 0, which is always the case whenever dbuf_hold_impl() calls dbuf_hold_copy(). We add an assertion to suppress the complaint. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14470	2023-02-28 17:31:50 -08:00
Richard Yao	9a14ce43c3	Statically allocate first node of zfsdev_state_list This avoids a call to kmem_alloc() during module load. It also suppresses a defect report from Clang's static analyzer that claims that we will have a NULL pointer dereference in zfsdev_state_init() because it does not understand that this has already been allocated in zfs_kmod_init(). Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14470	2023-02-28 17:31:41 -08:00
Richard Yao	7fc48f8378	Suppress static analyzer warning in sa_attr_iter() Clang's static analyzer points out that when IS_SA_BONUSTYPE(type) is true and .sa_length is 0 for an attribute, we have a NULL pointer dereference. We suppress this with an IMPLY() statement. This was also identified by Coverity. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reported-by: Coverity (CID-1017954) Closes #14470	2023-02-28 17:31:30 -08:00
Richard Yao	4d9bb5514c	Suppress static analyzer warnings in zio_checksum_error_impl() Clang's static analyzer informs us of multiple NULL pointer dereferences involving zio_checksum_error_impl(). The first is a NULL pointer dereference if bp is NULL and ci->ci_flags & ZCHECKSUM_FLAG_EMBEDDED is false, but bp is NULL implies that ci->ci_flags & ZCHECKSUM_FLAG_EMBEDDED is true, so we add an IMPLY() statement to suppress the report. The second and third are identical, and are duplicated because while the NULL pointer dereference occurs in zio_checksum_gang_verifier(), it is called by zio_checksum_error_impl() and there is a report for each of the two functions. The reports state that when bp is NULL, ci->ci_flags & ZCHECKSUM_FLAG_EMBEDDED is true and checksum is not ZIO_CHECKSUM_LABEL, we also have a NULL pointer dereference. bp is NULL should imply that checksum == ZIO_CHECKSUM_LABEL, so we add an IMPLY() statement to suppress the second report. The two reports are functionally identical. A fourth variation of this was also reported by Coverity. It occurs when checksum == ZIO_CHECKSUM_ZILOG2. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reported-by: Coverity (CID-1524672) Closes #14470	2023-02-28 17:31:08 -08:00
Richard Yao	d634d20d1b	icp: Prevent compilers from optimizing away memset() in gcm_clear_ctx() The recently merged `f58e513f74` was intended to zero sensitive data before exit from encryption functions to harden the code against theoretical information leaks. Unfortunately, the method by which it did that is optimized away by the compiler, so some information still leaks. This was confirmed by counting function calls in disassembly. After studying how the OpenBSD, FreeBSD and Linux kernels handle this, and looking at our disassembly, I decided on a two-factor approach to protect us from compiler dead store elimination passes. The first factor is to stop trying to inline gcm_clear_ctx(). GCC does not actually inline it in the first place, and testing suggests that dead store elimination passes appear to become more powerful in a bad way when inlining is forced, so we recognize that and move gcm_clear_ctx() to a C file. The second factor is to implement an explicit_memset() function based on the technique used by `secure_zero_memory()` in FreeBSD's blake2 implementation, which coincidentally is functionally identical to the one used by Linux. The source for this appears to be a LLVM bug: https://llvm.org/bugs/show_bug.cgi?id=15495 Unlike both FreeBSD and Linux, we explicitly avoid the inline keyword, based on my observations that GCC's dead store elimination pass becomes more powerful when inlining is forced, under the assumption that it will be equally powerful when the compiler does decide to inline function calls. Disassembly of GCC's output confirms that all 6 memset() calls are executed with this patch applied. Reviewed-by: Attila Fülöp <attila@fueloep.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14544	2023-02-28 17:28:50 -08:00
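One common way to keep a compiler from eliminating a "dead" memset(), in the spirit of the approach described in `d634d20d1b`, is to call memset() through a volatile function pointer. The sketch below is an illustration under that assumption, not the code from the commit; the name `explicit_memset_sketch` is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/*
 * Because the pointer is volatile, the compiler must load it at run time
 * and cannot assume it still points at memset(), so it cannot prove the
 * store is dead and remove the call.
 */
static void *(*const volatile memset_v)(void *, int, size_t) = memset;

/* Deliberately not marked inline, mirroring the reasoning in the commit. */
void *
explicit_memset_sketch(void *buf, int c, size_t len)
{
	return (memset_v(buf, c, len));
}
```

A caller would use it to clear key material right before freeing or leaving scope, for example `explicit_memset_sketch(key, 0, sizeof (key));`. Whether a given compiler actually keeps the call should still be confirmed in the disassembly, as the commit author did.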
George Amanakis	13ff72ba0a	Revert zfeature_active() to static Commit `34ce4c4` made zfeature_active() non-static. This is not required. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #14546	2023-02-28 14:03:52 -08:00
John Poduska	73c383f541	Prevent incorrect datasets being mounted During a mount, zpl_mount_impl(), uses sget() with the callback zpl_test_super() to find a super_block with a matching objset, stored in z_os. It does so without taking the teardown lock on the zfsvfs. The problem is that operations like rollback will replace the z_os. And, there is a window where the objset in the rollback is freed, but z_os still points to it. Then, a mount like operation, for instance a clone, can reallocate that exact same pointer and zpl_test_super() will then match the super_block associated with the rollback as opposed to the clone. This fix tests for a match and if so, takes the teardown lock before doing the final match test. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: John Poduska <jpoduska@datto.com> Closes #14518	2023-02-27 16:49:34 -08:00
Richard Yao	bff26b0220	Skip memory allocation when compressing holes Hole detection in the zio compression code allows us to opportunistically skip compression on holes. We can go a step further by not doing memory allocations on holes either. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Richard Yao <richard.yao@klarasystems.com> Sponsored-by: Wasabi Technology, Inc. Closes #14500	2023-02-27 14:41:02 -08:00
Attila Fülöp	f58e513f74	ICP: AES-GCM: Refactor gcm_clear_ctx() Currently the temporary buffer in which decryption takes place isn't cleared on context destruction. Further in some routines we fail to call gcm_clear_ctx() on error exit. Both flaws may result in leaking sensitive data. We follow best practices and zero out the plaintext buffer before freeing the memory holding it. Also move all cleanup into gcm_clear_ctx() and call it on any context destruction. The performance impact should be negligible. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Rob Norris <robn@despairlabs.com> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #14528	2023-02-27 14:38:12 -08:00
Mariusz Zaborski	3b9309aabe	Move zap_attribute_t to the heap in dsl_deadlist_merge In the case of a regular compilation, the compiler raises a warning for the dsl_deadlist_merge function, that the stack size is too large. In a debug build this can generate an error. Move large structures to the heap. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Mariusz Zaborski <mariusz.zaborski@klarasystems.com> Sponsored-by: Klara, Inc. Sponsored-by: Wasabi Technology, Inc. Closes #14524	2023-02-27 14:27:58 -08:00
George Amanakis	d816bc5ec7	Move dmu_buf_rele() after dsl_dataset_sync_done() Otherwise the dataset may be freed after the last dmu_buf_rele() leading to a panic. Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #14522 Closes #14523	2023-02-23 18:14:52 -07:00
Brian Behlendorf	89cd2197b9	Fix buffered/direct/mmap I/O race When a page is faulted in for memory mapped I/O the page lock may be dropped before it has been read and marked up to date. If a buffered read encounters such a page in mappedread() it must wait until the page has been updated. Failure to do so will result in a panic on debug builds and incorrect data on production builds. The critical part of this change is in mappedread() where pages which are not up to date are now handled. Additionally, it includes the following simplifications. - zfs_getpage() and zfs_fillpage() could be passed an array of pages. This could be more efficient if it was used but in practice only a single page was ever provided. These interfaces were simplified to acknowledge that. - update_pages() was modified to correctly set the PG_error bit on a page when it cannot be read by dmu_read(). - Setting PG_error and PG_uptodate was moved to zfs_fillpage() from zpl_readpage_common(). This is consistent with the handling in update_pages() and mappedread(). - Minor additional refactoring to comments and variable declarations to improve readability. - Add a test case to exercise concurrent buffered, direct, and mmap IO to the same file. - Reduce the mmap_sync test case default run time. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #13608 Closes #14498	2023-02-23 10:57:24 -08:00
Richard Yao	7cb67d627c	Fix NULL pointer dereference in zio_ready() Clang's static analyzer correctly identified a NULL pointer dereference in zio_ready() when ZIO_FLAG_NODATA has been set on a zio that is missing a block pointer. The NULL pointer dereference occurs because we have logic intended to disable ZIO_FLAG_NODATA when it has been set on a gang block. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14469	2023-02-23 10:19:08 -08:00
Richard Yao	c9e39da9a4	Use rw_tryupgrade() in dmu_bonus_hold_by_dnode() When dn->dn_bonus == NULL, dmu_bonus_hold_by_dnode() will unlock its read lock on dn->dn_struct_rwlock and grab a write lock. This can be micro-optimized by calling rw_tryupgrade(). Linux will not benefit from this since it does not support rwlock upgrades, but FreeBSD will. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14517	2023-02-22 16:33:23 -08:00
rob-wing	28251d81d7	FreeBSD: don't verify recycled vnode for zfs control directory Under certain loads, the following panic is hit: panic: page fault KDB: stack backtrace: #0 0xffffffff805db025 at kdb_backtrace+0x65 #1 0xffffffff8058e86f at vpanic+0x17f #2 0xffffffff8058e6e3 at panic+0x43 #3 0xffffffff808adc15 at trap_fatal+0x385 #4 0xffffffff808adc6f at trap_pfault+0x4f #5 0xffffffff80886da8 at calltrap+0x8 #6 0xffffffff80669186 at vgonel+0x186 #7 0xffffffff80669841 at vgone+0x31 #8 0xffffffff8065806d at vfs_hash_insert+0x26d #9 0xffffffff81a39069 at sfs_vgetx+0x149 #10 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4 #11 0xffffffff8065a28c at lookup+0x45c #12 0xffffffff806594b9 at namei+0x259 #13 0xffffffff80676a33 at kern_statat+0xf3 #14 0xffffffff8067712f at sys_fstatat+0x2f #15 0xffffffff808ae50c at amd64_syscall+0x10c #16 0xffffffff808876bb at fast_syscall_common+0xf8 The page fault occurs because vgonel() will call VOP_CLOSE() for active vnodes. For this reason, define vop_close for zfsctl_ops_snapshot. While here, define vop_open for consistency. 
After adding the necessary vop, the bug progresses to the following panic: panic: VERIFY3(vrecycle(vp) == 1) failed (0 == 1) cpuid = 17 KDB: stack backtrace: #0 0xffffffff805e29c5 at kdb_backtrace+0x65 #1 0xffffffff8059620f at vpanic+0x17f #2 0xffffffff81a27f4a at spl_panic+0x3a #3 0xffffffff81a3a4d0 at zfsctl_snapshot_inactive+0x40 #4 0xffffffff8066fdee at vinactivef+0xde #5 0xffffffff80670b8a at vgonel+0x1ea #6 0xffffffff806711e1 at vgone+0x31 #7 0xffffffff8065fa0d at vfs_hash_insert+0x26d #8 0xffffffff81a39069 at sfs_vgetx+0x149 #9 0xffffffff81a39c54 at zfsctl_snapdir_lookup+0x1e4 #10 0xffffffff80661c2c at lookup+0x45c #11 0xffffffff80660e59 at namei+0x259 #12 0xffffffff8067e3d3 at kern_statat+0xf3 #13 0xffffffff8067eacf at sys_fstatat+0x2f #14 0xffffffff808b5ecc at amd64_syscall+0x10c #15 0xffffffff8088f07b at fast_syscall_common+0xf8 This is caused by a race condition that can occur when allocating a new vnode and adding that vnode to the vfs hash. If the newly created vnode loses the race when being inserted into the vfs hash, it will not be recycled as its usecount is greater than zero, hitting the above assertion. Fix this by dropping the assertion. FreeBSD-issue: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=252700 Reviewed-by: Andriy Gapon <avg@FreeBSD.org> Reviewed-by: Mateusz Guzik <mjguzik@gmail.com> Reviewed-by: Alek Pinchuk <apinchuk@axcient.com> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Rob Wing <rob.wing@klarasystems.com> Co-authored-by: Rob Wing <rob.wing@klarasystems.com> Submitted-by: Klara, Inc. Sponsored-by: rsync.net Closes #14501	2023-02-21 17:26:33 -08:00
Allan Jude	1d56c6d017	Fix per-jail zfs.mount_snapshot setting When jail.conf set the nopersist flag during startup, it was incorrectly destroying the per-jail ZFS settings. Reported-by: Martin Matuska <mm@FreeBSD.org> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Signed-off-by: Allan Jude <allan@klarasystems.com> Sponsored-by: Modirum MDPay Sponsored-by: Klara, Inc. Closes #14509	2023-02-21 17:23:01 -08:00
George Amanakis	0f32b1f728	Partially revert `eee9362a7` With commit `34ce4c42f` applied, there is no need for `eee9362a7`. Revert that aside from the test. All tests introduced in those commits pass. Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #14502	2023-02-21 09:36:22 -08:00
Richard Yao	9f08b6e31f	Sync thread should avoid holding the spa config write lock when possible spa_sync() currently grabs the write lock due to an old hack that is documented by a comment: We need the write lock here because, for aux vdevs, calling vdev_config_dirty() modifies sav_config. This is ugly and will become unnecessary when we eliminate the aux vdev wart by integrating all vdevs into the root vdev tree. This has led to deadlocks in rare edge cases from holding the write lock. We can reduce the incidence of these deadlocks by not grabbing the write lock on pools without auxiliary vdevs. Sponsored-By: Wasabi Technology, Inc. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@klarasystems.com> Closes #14282	2023-02-16 14:10:52 -08:00
Paul Dagnelie	dc72c60ec1	zfs redact fails when dnodesize=auto Add handling to dmu_object_next for the case where *objectp == 0. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Paul Dagnelie <pcd@delphix.com> Closes #14479	2023-02-16 09:23:39 -08:00
Richard Yao	f04cb31e7c	Suppress Clang static analyzer complaint in zfs_replay_create() Clang's static analyzer incorrectly complains about an undefined value here when lr->lr_common.lrc_txtype == TX_SYMLINK and txtype == TX_CREATE. This is impossible, because of this line: txtype = (lr->lr_common.lrc_txtype & ~TX_CI((uint64_t)0x1 << 63)); Changing the code to compare against txtype suppresses the report. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14472	2023-02-14 11:05:41 -08:00
Brian Behlendorf	3fc92adc40	Linux: use filemap_range_has_page() As of the 4.13 kernel filemap_range_has_page() can be used to check if there is a page mapped in a given file range. When available this interface should be used which eliminates the need for the zp->z_is_mapped boolean. Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #14493	2023-02-14 11:04:34 -08:00
Richard Yao	ab672133a9	Give strlcat() full buffer lengths rather than smaller buffer lengths strlcat() is supposed to be given the length of the destination buffer, including the existing contents. Unfortunately, I had been overzealous when I wrote `a51288aabb`, since I gave it the length of the destination buffer, minus the existing contents. This likely caused a regression on large strings. On the topic of being overzealous, the use of strlcat() in dmu_send_estimate_fast() was unnecessary because recv_clone_name is a fixed length string. We continue using strlcat() mostly as defensive programming, in case the string length is ever changed, even though it is unnecessary. Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14476	2023-02-14 11:03:42 -08:00
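The size argument issue described in `ab672133a9` can be shown with a small sketch. Since strlcat() is not available in every C library, the block defines a hypothetical `sketch_strlcat()` that follows the usual strlcat() contract (the size is the full destination buffer size); it is an illustration, not the implementation ZFS uses.

```c
#include <stddef.h>
#include <string.h>

/*
 * Append src to dst, where dstsize is the FULL size of the dst buffer
 * (existing contents included). Returns the length the combined string
 * would have had without truncation.
 */
size_t
sketch_strlcat(char *dst, const char *src, size_t dstsize)
{
	size_t dlen = 0;
	size_t slen = strlen(src);

	/* Find the end of dst without reading past dstsize bytes. */
	while (dlen < dstsize && dst[dlen] != '\0')
		dlen++;
	if (dlen == dstsize)
		return (dstsize + slen);	/* dst is not terminated */

	size_t room = dstsize - dlen - 1;	/* space left, minus NUL */
	size_t n = (slen < room) ? slen : room;

	memcpy(dst + dlen, src, n);
	dst[dlen + n] = '\0';
	return (dlen + slen);
}
```

Passing `sizeof (buf)` appends as much as fits. Passing `sizeof (buf) - strlen(buf)` makes the helper believe the buffer is smaller than it is, so long strings are truncated earlier than necessary, which matches the regression the commit describes.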
Rich Ercolani	cfd57573ff	quick fix for lingering snapdir unmount problems Unfortunately, even after `e79b6807`, I still, much more rarely, tripped asserts when playing with many ctldir mounts at once. Since this appears to happen if we dispatched twice too fast, just ignore it. We don't actually need to do anything if someone already started doing it for us. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Rich Ercolani <rincebrain@gmail.com> Closes #14462	2023-02-13 16:40:13 -08:00
George Amanakis	34ce4c42ff	Fix a race condition in dsl_dataset_sync() when activating features The zio returned from arc_write() in dmu_objset_sync() uses zio_nowait(). However we may reach the end of dsl_dataset_sync() which checks if we need to activate features in the filesystem without knowing if that zio has even run through the ZIO pipeline yet. In that case we will flag features to be activated in dsl_dataset_block_born() but dsl_dataset_sync() has already completed its run and those features will not actually be activated. Mitigate this by moving the feature activation code in dsl_dataset_sync_done(). Also add new ASSERTs in dsl_scan_visitbp() checking if a block contradicts any filesystem flags. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Ryan Moeller <ryan@iXsystems.com> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: George Amanakis <gamanakis@gmail.com> Closes #13816	2023-02-13 16:37:46 -08:00
Prakash Surya	13312e2fa1	Reduce need for contiguous memory for ioctls We've had cases where we trigger an OOM despite having memory freely available on the system. For example, here, we had about 21GB free: kernel: Node 0 Normal: 2418758*4kB (UME) 1549533*8kB (UE) 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 22071296kB The problem being, all the memory is in 4K and 8K contiguous regions, but the allocation request was for a 16K contiguous region: kernel: SafeExecutors-4 invoked oom-killer: gfp_mask=0x42dc0(GFP_KERNEL|__GFP_NOWARN|__GFP_COMP|__GFP_ZERO), order=2, oom_score_adj=0 The offending allocation came from this call trace: kernel: Call Trace: kernel: dump_stack+0x57/0x7a kernel: dump_header+0x4f/0x1e1 kernel: oom_kill_process.cold.33+0xb/0x10 kernel: out_of_memory+0x1ad/0x490 kernel: __alloc_pages_slowpath+0xd55/0xe40 kernel: __alloc_pages_nodemask+0x2df/0x330 kernel: kmalloc_large_node+0x42/0x90 kernel: __kmalloc_node+0x25a/0x320 kernel: ? spl_kmem_free_impl+0x21/0x30 [spl] kernel: spl_kmem_alloc_impl+0xa5/0x100 [spl] kernel: spl_kmem_zalloc+0x19/0x20 [spl] kernel: zfsdev_ioctl+0x2b/0xe0 [zfs] kernel: do_vfs_ioctl+0xa9/0x640 kernel: ? __audit_syscall_entry+0xdd/0x130 kernel: ksys_ioctl+0x67/0x90 kernel: __x64_sys_ioctl+0x1a/0x20 kernel: do_syscall_64+0x5e/0x200 kernel: entry_SYSCALL_64_after_hwframe+0x44/0xa9 kernel: RIP: 0033:0x7fdca3674317 The problem is, for each ioctl that ZFS makes, it has to allocate a zfs_cmd_t structure, which is 13744 bytes in size (on my system): sdb> sizeof zfs_cmd (size_t)13744 This size, coupled with the fact that we currently allocate it with kmem_zalloc, means we need a 16K contiguous region of memory to satisfy the request. The solution taken by this change, is to use "vmem" instead of "kmem" to do the allocation, such that we don't necessarily need a contiguous 16K memory region to satisfy the allocation. Arguably, a better solution would be not to require such a large allocation to begin with (e.g.
reduce the size of the zfs_cmd_t structure), but that'd be a much larger change than this "one liner". Thus, I've opted for this approach for now; we can always circle back and attempt to reduce the size of the structure in the future. Reviewed-by: Matthew Ahrens <mahrens@delphix.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Reviewed-by: Mark Maybee <mark.maybee@delphix.com> Reviewed-by: Don Brady <don.brady@delphix.com> Signed-off-by: Prakash Surya <prakash.surya@delphix.com> Closes #14474	2023-02-13 16:35:59 -08:00
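The arithmetic behind the "16K contiguous region" in `13312e2fa1` can be sketched as below, assuming 4K pages and power-of-two page allocations. `DEMO_PAGE_SIZE` and `demo_alloc_order()` are hypothetical names for this illustration, not kernel or SPL interfaces.

```c
#include <stddef.h>

#define	DEMO_PAGE_SIZE	4096u	/* assumption: 4K pages */

/*
 * Smallest allocation order (power-of-two page count) whose contiguous
 * region can hold `bytes`.
 */
static unsigned
demo_alloc_order(size_t bytes)
{
	size_t pages = (bytes + DEMO_PAGE_SIZE - 1) / DEMO_PAGE_SIZE;
	unsigned order = 0;

	while (((size_t)1 << order) < pages)
		order++;
	return (order);
}
```

For the 13744-byte `zfs_cmd_t` quoted in the message, this gives four pages and therefore order 2, matching the `order=2` in the OOM log. A free list made up only of 4K and 8K regions cannot satisfy such a request.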
Alexander Motin	87a4dfa561	Improve arc_read() error reporting While debugging a reported NULL dereference panic in dnode_hold_impl(), I found that for certain types of errors arc_read() may only return an error code, but not properly report it via the done and pio arguments. Lack of done calls may result in reference and/or memory leaks in higher level code. Lack of error reporting via pio may result in unnoticed errors there. For example, dbuf_read(), where dbuf_read_impl() ignores the arc_read() return value, relies completely on the pio mechanism and missed the errors. This patch makes arc_read() always call the done callback and always propagate errors to the parent zio, if either is provided. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14454	2023-02-13 13:21:53 -08:00
Richard Yao	66953686c0	Add assertion and make variables unsigned in abd_alloc_chunks() Clang's static analyzer pointed out that if alloc_pages >= nr_pages before the loop, the value of page will be undefined and will be used anyway. This should not be possible, but as cleanup, we add an assertion. We also recognize that the local variables should be unsigned in the first place, so we make them unsigned. This is not enough to avoid the need for the assertion, since there is still the case that alloc_pages == nr_pages and nr_pages == 0, which the assertion implicitly checks. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14456	2023-02-06 11:10:50 -08:00
Richard Yao	cfb49616cd	Cleanup: spa vdev processing should check NULL pointers The PVS Studio 2016 FreeBSD kernel report stated: \contrib\opensolaris\uts\common\fs\zfs\spa.c (1341): error V595: The 'spa->spa_spares.sav_vdevs' pointer was utilized before it was verified against nullptr. Check lines: 1341, 1342. \sys\cddl\contrib\opensolaris\uts\common\fs\zfs\spa.c (1355): error V595: The 'spa->spa_l2cache.sav_vdevs' pointer was utilized before it was verified against nullptr. Check lines: 1355, 1357. \sys\cddl\contrib\opensolaris\uts\common\fs\zfs\spa.c (1398): error V595: The 'spa->spa_spares.sav_vdevs' pointer was utilized before it was verified against nullptr. Check lines: 1398, 1408. \sys\cddl\contrib\opensolaris\uts\common\fs\zfs\spa.c (1583): error V595: The 'oldvdevs' pointer was utilized before it was verified against nullptr. Check lines: 1583, 1595. In practice, all of these uses were safe because a NULL pointer implied a 0 vdev count, which kept us from iterating over vdevs. However, rearranging the code to check the pointer first is not a terrible micro-optimization and makes it more readable, so let us do that. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14456	2023-02-06 11:09:28 -08:00
Richard Yao	3a7d2a0ce0	zfs_get_temporary_prop() should not pass NULL to strcpy() `dsl_dir_activity_in_progress()` can call `zfs_get_temporary_prop()` with the fourth value set to NULL, which will pass NULL to `strcpy()` when there is a match. Clang's static analyzer caught this with the help of CodeChecker for Cross Translation Unit analysis. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Brian Atkinson <batkinson@lanl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14456	2023-02-06 11:08:57 -08:00
Matthew Ahrens	14872aaa4f	EIO caused by encryption + recursive gang Encrypted blocks can not have 3 DVAs, because they use the space of the 3rd DVA for the IV+salt. zio_write_gang_block() takes this into account, setting `gbh_copies` to no more than 2 in this case. Gang members BP's do not have the X (encrypted) bit set (nor do they have the DMU level and type fields set), because encryption is not handled at this level. The gang block is reassembled, and then encryption (and compression) are handled. To check if this gang block is encrypted, the code in zio_write_gang_block() checks `pio->io_bp`. This is normally fine, because the block that's being ganged is typically the encrypted BP. The problem is that if there is "recursive ganging", where a gang member is itself a gang block, then when zio_write_gang_block() is called to create a gang block for a gang member, `pio->io_bp` is the gang member's BP, which doesn't have the X bit set, so the number of DVA's is not restricted to 2. It should instead be looking at the "gang leader", i.e. the top-level gang block, to determine how many DVA's can be used, to avoid a "NDVA's inversion" (where a child has more DVA's than its parent). gang leader BP: X (encrypted) bit set, 2 DVA's, IV+salt in 3rd DVA's space: ``` DVA[0]=<1:...:100400> DVA[1]=<0:...:100400> salt=... iv=... [L0 ZFS plain file] fletcher4 uncompressed encrypted LE gang unique double size=100000L/100000P birth=... fill=1 cksum=... ``` leader's GBH contains a BP with gang bit set and 3 DVA's: ``` DVA[0]=<1:...:55600> DVA[1]=<0:...:55600> [L0 unallocated] fletcher4 uncompressed unencrypted LE contiguous unique double size=55600L/55600P birth=... fill=0 cksum=... DVA[0]=<1:...:55600> DVA[1]=<0:...:55600> [L0 unallocated] fletcher4 uncompressed unencrypted LE contiguous unique double size=55600L/55600P birth=... fill=0 cksum=... 
DVA[0]=<1:...:55600> DVA[1]=<0:...:55600> DVA[2]=<1:...:200> [L0 unallocated] fletcher4 uncompressed unencrypted LE gang unique double size=55400L/55400P birth=... fill=0 cksum=... ``` On nondebug bits, having the 3rd DVA in the gang block works for the most part, because it's true that all 3 DVA's are available in the gang member BP (in the GBH). However, for accounting purposes, gang block DVA's ASIZE include all the space allocated below them, i.e. the 512-byte gang block header (GBH) as well as the gang members below that. We see that above where the gang leader BP is 1MB logical (and after compression: 0x`100000P`), but the ASIZE of each DVA is 2 sectors (1KB) more than 1MB (0x`100400`). Since there are 3 copies of a block below it, we increment the ASIZE of the 3rd DVA of the gang leader by the space used by the 3rd DVA of the child (1 sector, in this case). But there isn't really a 3rd DVA of the parent; the salt is stored in place of the 3rd DVA's ASIZE. So when zio_write_gang_member_ready() increments the parent's BP's `DVA[2]`'s ASIZE, it's actually incrementing the parent's salt. When we later try to read the encrypted recursively-ganged block, the salt doesn't match what we used to write it, so MAC verification fails and we get an EIO. ``` zio_encrypt(): encrypted 515/2/0/403 salt: 25 25 bb 9d ad d6 cd 89 zio_decrypt(): decrypting 515/2/0/403 salt: 26 25 bb 9d ad d6 cd 89 ``` This commit addresses the problem by not increasing the number of copies of the GBH beyond 2 (even for non-encrypted blocks). This simplifies the logic while maintaining the ability to traverse all metadata (including gang blocks) even if one copy is lost. (Note that 3 copies of the GBH will still be created if requested, e.g. for `copies=3` or MOS blocks.) Additionally, the code that increments the parent's DVA's ASIZE is made to check the parent DVA's NDVAS even on nondebug bits. 
So if there's a similar bug in the future, it will cause a panic when trying to write, rather than corrupting the parent BP and causing an error when reading. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Matthew Ahrens <mahrens@delphix.com> Caused-by: #14356 Closes #14440 Closes #14413	2023-02-06 09:37:06 -08:00
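The salt corruption described in `14872aaa4f` comes from two fields sharing the same storage. A deliberately simplified sketch follows; it treats the salt bytes as one 64-bit word and the ASIZE update as a plain addition to that word, which is an assumption made for illustration and not the real `blkptr_t` layout. `demo_bump_third_dva()` is a hypothetical name.

```c
#include <stdint.h>
#include <string.h>

/*
 * Simplified model: the 8 salt bytes occupy the storage that an
 * unencrypted BP would use for part of DVA[2]. Adding to the "ASIZE"
 * held in that word therefore rewrites the salt in place.
 */
static void
demo_bump_third_dva(uint8_t salt[8], uint64_t sectors)
{
	uint64_t word;

	memcpy(&word, salt, sizeof (word));	/* reinterpret salt as a DVA word */
	word += sectors;			/* the accounting update */
	memcpy(salt, &word, sizeof (word));	/* the salt has now changed */
}
```

Starting from the salt in the log above and adding one sector leaves the stored bytes different from what was used at encryption time. On a little-endian machine the first byte goes from `25` to `26`, as in the `zio_decrypt()` line.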
Jorgen Lundman	dffd40b3b6	Unify assembly files with macOS The remaining changes needed to make the assembly files work with macOS. Reviewed-by: Attila Fülöp <attila@fueloep.org> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Jorgen Lundman <lundman@lundman.net> Closes #14451	2023-02-06 09:27:55 -08:00
Allan Jude	c799866b97	Resolve WS-2021-0184 vulnerability in zstd Pull in d40f55cd950919d7eac951b122668e55e33e5202 from upstream Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Allan Jude <allan@klarasystems.com> Closes #14439	2023-02-02 15:12:51 -08:00
Brian Behlendorf	973934b965	Increase default zfs_rebuild_vdev_limit to 64MB When testing distributed rebuild performance with more capable hardware it was observed that increasing the zfs_rebuild_vdev_limit to 64M reduced the rebuild time by 17%. Beyond 64MB there was some improvement (~2%) but it was not significant when weighed against the increased memory usage. Memory usage is capped at 1/4 of arc_c_max. Additionally, vr_bytes_inflight_max has been moved so it's updated per-metaslab to allow the size to be adjusted while a rebuild is running. Reviewed-by: Akash B <akash-b@hpe.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #14428	2023-01-27 10:02:24 -08:00
Brian Behlendorf	c0aea7cf4e	Increase default zfs_scan_vdev_limit to 16MB For HDD based pools the default zfs_scan_vdev_limit of 4M per-vdev can significantly limit the maximum scrub performance. Increasing the default to 16M can double the scrub speed from 80 MB/s per disk to 160 MB/s per disk. This does increase the memory footprint during scrub/resilver but given the performance win this is a reasonable trade off. Memory usage is capped at 1/4 of arc_c_max. Note that number of outstanding I/Os has not changed and is still limited by zfs_vdev_scrub_max_active. Reviewed-by: Akash B <akash-b@hpe.com> Reviewed-by: Tony Nguyen <tony.nguyen@delphix.com> Reviewed-by: Alexander Motin <mav@FreeBSD.org> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #14428	2023-01-27 10:01:13 -08:00
Alexander Motin	dc5c8006f6	Prefetch on deadlists merge During snapshot deletion ZFS may issue several reads for each deadlist to merge them into next snapshot's or pool's bpobj. Number of the dead lists increases with number of snapshots. On HDD pools it may take significant time during which sync thread is blocked. This patch introduces prescient prefetch of required blocks for up to 128 deadlists ahead. Tests show reduction of time required to delete dataset with 720 snapshots with randomly overwritten file on wide HDD pool from 75-85 to 22-28 seconds. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Issue #14276 Closes #14402	2023-01-25 11:30:24 -08:00
Brian Behlendorf	c85ac731a0	Improve resilver ETAs When resilvering, the estimated time remaining is calculated using the average issue rate over the current pass, where the current pass starts when a scan was started or, if the pool was exported/imported, restarted. For dRAID pools in particular this can result in wildly optimistic estimates since the issue rate will be very high while non-degraded regions of the pool are scanned. Once repair I/O starts being issued performance drops to a realistic number but the estimated performance is still significantly skewed. To address this we redefine a pass such that it starts after a scanning phase completes so the issue rate is more reflective of recent performance. Additionally, the zfs_scan_report_txgs module option can be set to reset the pass statistics more often. Reviewed-by: Akash B <akash-b@hpe.com> Reviewed-by: Tony Hutter <hutter2@llnl.gov> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Closes #14410	2023-01-25 11:28:54 -08:00
Coleman Kane	9cd71c8604	linux 6.2 compat: zpl_set_acl arg2 is now struct dentry Linux 6.2 changes the second argument of the set_acl operation to be a "struct dentry *" rather than a "struct inode *". The inode* parameter is still available as dentry->d_inode, so adjust the call to the _impl function call to dereference and pass that pointer to it. Also document that the get_acl -> get_inode_acl member name change from commit `884a693` was an API change also introduced in Linux 6.2. Reviewed-by: Tony Hutter <hutter2@llnl.gov> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Coleman Kane <ckane@colemankane.org> Closes #14415	2023-01-24 11:20:50 -08:00
Alexander Motin	0f740a4f1d	Introduce minimal ZIL block commit delay Despite all optimizations, tests on actual hardware show that FreeBSD kernel can't sleep for less than ~2us. Similar tests on Linux show ~50us delay at least from nanosleep() (haven't tested inside kernel). It means that on very fast log device ZIL may not be able to satisfy zfs_commit_timeout_pct block commit timeout, increasing log latency more than desired. Handle that by introduction of zil_min_commit_timeout parameter, specifying minimal timeout value where additional delays to aggregate writes may be skipped. Also skip delays if the LWB is more than 7/8 full, that often happens if I/O sizes are constant and match one of LWB sizes. Both things are applied only if there were no already outstanding log blocks, that may indicate single-threaded workload, that by definition can not benefit from the commit delays. While there, add short time moving average to zl_last_lwb_latency to make it more stable. Tests of single-threaded 4KB writes to NVDIMM SLOG on FreeBSD show IOPS increase by 9% instead of expected 5%. For zfs_commit_timeout_pct of 1 there IOPS increase by 5.5% instead of expected 1%. Reviewed-by: Allan Jude <allan@klarasystems.com> Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Reviewed-by: Prakash Surya <prakash.surya@delphix.com> Signed-off-by: Alexander Motin <mav@FreeBSD.org> Sponsored by: iXsystems, Inc. Closes #14418	2023-01-24 09:20:32 -08:00
Attila Fülöp	037e4f2536	x86 asm: Replace .align with .balign The .align directive used to align storage locations is ambiguous. On some platforms and assemblers it takes a byte count, on others the argument is interpreted as a shift value. The current usage expects the first interpretation. Replace it with the unambiguous .balign directive which always expects a byte count, regardless of platform and assembler. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #14422	2023-01-24 09:04:39 -08:00
Attila Fülöp	58ca7b1011	ICP: blake3 x86 asm: fix placement of .size directives The .size directive used by the SET_SIZE C macro uses the special dot symbol to calculate the size of a function. The dot symbol refers to the current address, so for the calculation to be meaningful the SET_SIZE macro must be placed immediately after the end of the function the size is calculated for. Reviewed-by: Jorgen Lundman <lundman@lundman.net> Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de> Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Signed-off-by: Attila Fülöp <attila@fueloep.org> Closes #14422	2023-01-24 09:03:31 -08:00
David Hedberg	37a27b4306	Wait for txg sync if the last DRR_FREEOBJECTS might result in a hole If we receive a DRR_FREEOBJECTS as the first entry in an object range, this might end up producing a hole if the freed objects were the only existing objects in the block. If the txg starts syncing before we've processed any following DRR_OBJECT records, this leads to a possible race where the backing arc_buf_t gets its psize set to 0 in the arc_write_ready() callback while still being referenced from a dirty record in the open txg. To prevent this, we insert a txg_wait_synced call if the first record in the range was a DRR_FREEOBJECTS that actually resulted in one or more freed objects. Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: David Hedberg <david.hedberg@findity.com> Sponsored by: Findity AB Closes #11893 Closes #14358	2023-01-23 13:19:43 -08:00
Richard Yao	73968defdd	Reject streams that set ->drr_payloadlen to unreasonably large values In the zstream code, Coverity reported: "The argument could be controlled by an attacker, who could invoke the function with arbitrary values (for example, a very high or negative buffer size)." It did not report this in the kernel. This is likely because the userspace code stored this in an int before passing it into the allocator, while the kernel code stored it in a uint32_t. However, this did reveal a potentially real problem. On 32-bit systems and systems with only 4GB of physical memory or less in general, it is possible to pass a large enough value that the system will hang. Even worse, on Linux systems, the kernel memory allocator is not able to support allocations up to the maximum 4GB allocation size that this allows. This had already been limited in userspace to 64MB by `ZFS_SENDRECV_MAX_NVLIST`, but we need a hard limit in the kernel to protect systems. After some discussion, we settle on 256MB as a hard upper limit. Attempting to receive a stream that requires more memory than that will result in E2BIG being returned to user space. Reported-by: Coverity (CID-1529836) Reported-by: Coverity (CID-1529837) Reported-by: Coverity (CID-1529838) Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov> Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu> Closes #14285	2023-01-23 13:16:22 -08:00
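The hard limit described in `73968defdd` amounts to a bounds check before allocating. A minimal sketch, assuming the 256MB figure from the message; `DEMO_MAX_PAYLOAD` and `demo_check_payloadlen()` are hypothetical names, not the identifiers used in the actual change.

```c
#include <errno.h>
#include <stdint.h>

/* 256MB upper bound taken from the commit message; the macro name is made up. */
#define	DEMO_MAX_PAYLOAD	(256u * 1024 * 1024)

/*
 * Reject a stream-supplied payload length before any memory is
 * allocated for it. Returns 0 if acceptable, E2BIG otherwise.
 */
static int
demo_check_payloadlen(uint32_t payloadlen)
{
	if (payloadlen > DEMO_MAX_PAYLOAD)
		return (E2BIG);
	return (0);
}
```

Checking the untrusted length first means an oversized value fails with an error code instead of reaching the allocator, which is the behaviour the message describes for user space.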

4150 Commits