Block Cloning allows to manually clone a file (or a subset of its
blocks) into another (or the same) file by just creating additional
references to the data blocks without copying the data itself.
Those references are kept in the Block Reference Tables (BRTs).
The whole design of block cloning is documented in module/zfs/brt.c.
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Christian Schwarz <christian.schwarz@nutanix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Rich Ercolani <rincebrain@gmail.com>
Signed-off-by: Pawel Jakub Dawidek <pawel@dawidek.net>
Closes#13392
The current loop triggers a complaint that we are using an array offset
prior to a range check from cpp/offset-use-before-range-check when we
are actually calculating maximum and minimum values. I was about to file
a false positive report with CodeQL, but after looking at how the code
is structured, I really cannot blame CodeQL for mistaking this for a
range check.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#14575
CodeQL's cpp/constant-comparison query from its security-and-extended
query set reported 4 instances where we have comparions that always
evaluate the same way.
In `draid_config_by_type()`, we have an early `if (nparity == 0)` check
that returns `EINVAL`, making a later `if (nparity == 0 || nparity >
VDEV_DRAID_MAXPARITY)` partially redundant. The later check prints an
error message when parity is 0, but the early check does not. This is
not useful feedback, so we move the later check to the place where the
early check runs to replace the early check.
In `perform_thread_merge()`, we return when `num_threads == 0`. After
that block, we do `if (num_threads > 0) {`, which will always be true.
We remove the `if` statement.
In `sa_modify_attrs()`, we have a loop condition that is `k != 2`, but
at the end of the loop, we have `if (k == 0 && hdl->sa_spill)` followed
by an else that does a break. The result is that k != 2 will never be
evaluated when it is false. We drop the comparison.
In `zap_leaf_array_read()`, we have a for loop condition that is `i <
ZAP_LEAF_ARRAY_BYTES && len > 0`. However, that loop itself is in a loop
that is `while (len > 0)` and while the value of len is decremented
inside the loop, when `len == 0`, it will return, such that `len > 0`
inside the loop condition will always be true. We drop that part of the
condition.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#14575
Coverity reported a TOCTOU race in `zpool_do_labelclear()`. This is not
believed to be a real security issue, but fixing it reduces the number
of syscalls we do and will prevent other static analyzers from
complaining about this.
The code is expected to be equivalent. However, under rare
circumstances, such as ELOOP, ENAMETOOLONG, ENOMEM, ENOTDIR and
EOVERFLOW, we will display the error message that we currently display
for the `open()` syscall rather than the one that we currently display
for the `stat()` syscall. This is considered to be an improvement.
Reported-by: Coverity (CID-1524188)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#14575
Traditionally ARC adaptation was limited to MRU/MFU distribution. But
for years people with metadata-centric workload demanded mechanisms to
also manage data/metadata distribution, that in original ZFS was just
a FIFO. As result ZFS effectively got separate states for data and
metadata, minimum and maximum metadata limits etc, but it all required
manual tuning, was not adaptive and in its heart remained a bad FIFO.
This change removes most of existing eviction logic, rewriting it from
scratch. This makes MRU/MFU adaptation individual for data and meta-
data, same as the distribution between data and metadata themselves.
Since most of required states separation was already done, it only
required to make arcs_size state field specific per data/metadata.
The adaptation logic is still based on previous concept of ghost hits,
just now it balances ARC capacity between 4 states: MRU data, MRU
metadata, MFU data and MFU metadata. To simplify arc_c changes instead
of arc_p measured in bytes, this code uses 3 variable arc_meta, arc_pd
and arc_pm, representing ARC balance between metadata and data, MRU and
MFU for data, and MRU and MFU for metadata respectively as 32-bit fixed
point fractions. Since we care about the math result only when need to
evict, this moves all the logic from arc_adapt() to arc_evict(), that
reduces per-block overhead, since per-block operations are limited to
stats collection, now moved from arc_adapt() to arc_access() and using
cheaper wmsums. This also allows to remove ugly ARC_HDR_DO_ADAPT flag
from many places.
This change also removes number of metadata specific tunables, part of
which were actually not functioning correctly, since not all metadata
are equal and some (like L2ARC headers) are not really evictable.
Instead it introduced single opaque knob zfs_arc_meta_balance, tuning
ARC's reaction on ghost hits, allowing administrator give more or less
preference to metadata without setting strict limits.
Some of old code parts like arc_evict_meta() are just removed, because
since introduction of ABD ARC they really make no sense: only headers
referenced by small number of buffers are not evictable, and they are
really not evictable no matter what this code do. Instead just call
arc_prune_async() if too much metadata appear not evictable.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Closes#14359
The intent is that this is like ENOTSUP, but specifically for when
something can't be done because we have no support for the requested
crypto parameters; eg unlocking a dataset or receiving a stream
encrypted with a suite we don't support.
Its not intended to be recoverable without upgrading ZFS itself.
If the request could be made to work by enabling a feature or modifying
some other configuration item, then some other code should be used.
load-key: In the future we might have more crypto suites (ie new values
for the `encryption` property. Right now trying to load a key on such
a future crypto suite will look up suite parameters off the end of the
crypto table, resulting in misbehaviour and/or crashes (or, with debug
enabled, trip the assertion in `zio_crypt_key_unwrap`).
Instead, lets check the value we got from the dataset, and if we can't
handle it, abort early.
recv: When receiving a raw stream encrypted with an unknown crypto
suite, `zfs recv` would report a generic `invalid backup stream`
(EINVAL). While technically correct, its not super helpful, so lets
ship a more specific error code and message.
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#14577
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#14585Closes#14592
This is tripping LeakSanitizer, which causes zloop test failures on pull
requests.
Reviewed-by: Tino Reichardt <milky-zfs@mcmilk.de>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#14583
This commit changes the BLAKE3 implementation handling and
also the calls to it from the ztest command.
Tested-by: Rich Ercolani <rincebrain@gmail.com>
Tested-by: Sebastian Gottschall <s.gottschall@dd-wrt.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tino Reichardt <milky-zfs@mcmilk.de>
Closes#13741
The approach is straightforward: for dataset ops, if a key was offered,
find the encryption root and the various encryption parameters, derive a
wrapping key if necessary, and then unlock the encryption root. After
that all the regular dataset ops will return unencrypted data, and
that's kinda the whole thing.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#11551Closes#12707Closes#14503
We improve the error message of zfs redact by checking if the target
snapshot exists, and if all the redaction snapshots exist. As a
future improvement we could iterate over every snapshot provided and
use that to determine which one specifically doesn't exist.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#11426Closes#14496
The leading zeroes are part of the checksum so we should show them.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Norris <robn@despairlabs.com>
Closes#14464
With the persistent error log feature we need to account for
spa_errlog_{scrub, last} containing mappings to other error log objects,
which need to be marked as in-use as well.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: George Amanakis <gamanakis@gmail.com>
Closes#14442Closes#14434
zfs_setproctitle_init() is stubbed out on FreeBSD.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Rob Wing <rob.fx907@gmail.com>
Closes#14441
After commit 19d3961, progress reporting (-v) with replication flag
enabled does not report the progress on the console. This commit
fixes the issue by updating the logic to check for pa->progress
instead of pa_verbosity in send_progress_thread().
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#14448
When resilvering the estimated time remaining is calculated using
the average issue rate over the current pass. Where the current
pass starts when a scan was started, or restarted, if the pool
was exported/imported.
For dRAID pools in particular this can result in wildly optimistic
estimates since the issue rate will be very high while scanning
when non-degraded regions of the pool are scanned. Once repair
I/O starts being issued performance drops to a realistic number
but the estimated performance is still significantly skewed.
To address this we redefine a pass such that it starts after a
scanning phase completes so the issue rate is more reflective of
recent performance. Additionally, the zfs_scan_report_txgs
module option can be set to reset the pass statistics more often.
Reviewed-by: Akash B <akash-b@hpe.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#14410
In the zstream code, Coverity reported:
"The argument could be controlled by an attacker, who could invoke the
function with arbitrary values (for example, a very high or negative
buffer size)."
It did not report this in the kernel. This is likely because the
userspace code stored this in an int before passing it into the
allocator, while the kernel code stored it in a uint32_t.
However, this did reveal a potentially real problem. On 32-bit systems
and systems with only 4GB of physical memory or less in general, it is
possible to pass a large enough value that the system will hang. Even
worse, on Linux systems, the kernel memory allocator is not able to
support allocations up to the maximum 4GB allocation size that this
allows.
This had already been limited in userspace to 64MB by
`ZFS_SENDRECV_MAX_NVLIST`, but we need a hard limit in the kernel to
protect systems. After some discussion, we settle on 256MB as a hard
upper limit. Attempting to receive a stream that requires more memory
than that will result in E2BIG being returned to user space.
Reported-by: Coverity (CID-1529836)
Reported-by: Coverity (CID-1529837)
Reported-by: Coverity (CID-1529838)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#14285
Introduce four new vdev properties:
checksum_n
checksum_t
io_n
io_t
These properties can be used for configuring the thresholds of zed's
diagnosis engine and are interpeted as <N> events in T <seconds>.
When this property is set to a non-default value on a top-level vdev,
those thresholds will also apply to its leaf vdevs. This behavior can be
overridden by explicitly setting the property on the leaf vdev.
Note that, these properties do not persist across vdev replacement. For
this reason, it is advisable to set the property on the top-level vdev
instead of the leaf vdev.
The default values for zed's diagnosis engine (10 events, 600 seconds)
remains unchanged.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Sponsored-by: Seagate Technology LLC
Closes#13805
This allows parsing of zfs send progress by checking the process
title.
Doing so requires some changes to the send code in libzfs_sendrecv.c;
primarily these changes move some of the accounting around, to allow
for the code to be verbose as normal, or set the process title. Unlike
BSD, setproctitle() isn't standard in Linux; thus, borrowed it from
libbsd with slight modifications.
Authored-by: Sean Eric Fagan <sef@FreeBSD.org>
Co-authored-by: Ryan Moeller <ryan@iXsystems.com>
Co-authored-by: Ameer Hamza <ahamza@ixsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#14376
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Sponsored-by: Seagate Technology
Submitted-by: Klara, Inc.
Closes#14310
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Sponsored-by: Seagate Technology
Submitted-by: Klara, Inc.
Closes#14310
is_vdev_cb() uses string comparison to find a matching vdev and
will fallback to comparing the guid via a string. These changes
drop the string comparison and compare the guids instead.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Allan Jude <allan@klarasystems.com>
Signed-off-by: Rob Wing <rob.wing@klarasystems.com>
Co-authored-by: Rob Wing <rob.wing@klarasystems.com>
Sponsored-by: Seagate Technology
Submitted-by: Klara, Inc.
Closes#14311
ECHRNG is returned when the channel program encounters a runtime
error. For example, this can happen when a snapshot doesn't exist.
We handle this error the same way as the existing EEXIST and ENOENT
error checks.
Additionally, improve the internal debug message to include the
error describing why a pool couldn't be opened.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#14351
It is possible for ztest_dsl_prop_set_uint64() to fail with ENOSPC
and this needs to be handled consistently.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#14351
There's no need to so aggressively test splitting a pool. Reduce
the occurence of this test to once every 10 seconds.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#14351
Commit c23738c70e modified the expected
behavior of attach to prevent hot spares from being used as special
vdev replacements. We update ztest's expectations accordingly to
prevent it from failing when testing the updated behavior.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#14351
This commit supports for spare vdev hotplug. The
spare vdev associated with all the pools will be
marked as "Removed" when the drive is physically
detached and will become "Available" when the
drive is reattached. Currently, the spare vdev
status does not change on the drive removal and
the same is the case with reattachment.
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#14295
Recent ARC commits added new statistic counters, such as iohits,
uncached state, etc. Represent those. Also some of previously
reported numbers were confusing or even made no sense. Cleanup
and restructure existing reports.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Alexander Motin <mav@FreeBSD.org>
Sponsored by: iXsystems, Inc.
Issue #14115
Issue #14123
Issue #14243Closes#14320
There is a lock order inversion deadlock between `spa_errlog_lock` and
`dp_config_rwlock`:
A thread in `spa_delete_dataset_errlog()` is running from a sync task.
It is holding the `dp_config_rwlock` for writer (see
`dsl_sync_task_sync()`), and waiting for the `spa_errlog_lock`.
A thread in `dsl_pool_config_enter()` is holding the `spa_errlog_lock`
(see `spa_get_errlog_size()`) and waiting for the `dp_config_rwlock` (as
reader).
Note that this was introduced by #12812.
This commit address this by defining the lock ordering to be
dp_config_rwlock first, then spa_errlog_lock / spa_errlist_lock.
spa_get_errlog() and spa_get_errlog_size() can acquire the locks in this
order, and then process_error_block() and get_head_and_birth_txg() can
verify that the dp_config_rwlock is already held.
Additionally, a buffer overrun in `spa_get_errlog()` is corrected. Many
code paths didn't check if `*count` got to zero, instead continuing to
overwrite past the beginning of the userspace buffer at `uaddr`.
Tested by having some errors in the pool (via `zinject -t data
/path/to/file`), one thread running `zpool iostat 0.001`, and another
thread runs `zfs destroy` (in a loop, although it hits the first time).
This reproduces the problem easily without the fix, and works with the
fix.
Reviewed-by: Mark Maybee <mark.maybee@delphix.com>
Reviewed-by: George Amanakis <gamanakis@gmail.com>
Reviewed-by: George Wilson <gwilson@delphix.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Matthew Ahrens <mahrens@delphix.com>
Closes#14239Closes#14289
This reverts commit d30db519af. With
this change applied zloop.sh fails reliably with the following ASSERT.
zio_wait(zio_claim(NULL, zcb->zcb_spa, refcnt ? 0 : spa_min_claim_txg(
zcb->zcb_spa), bp, NULL, NULL, ZIO_FLAG_CANFAIL)) == 0 (0x2 == 0x0)
ASSERT at cmd/zdb/zdb.c:5452:zdb_count_block()
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#14306
If the fields to be listed and sorted by are constrained to those
populated by dsl_dataset_fast_stat(), then zfs list is much faster,
as it does not need to open each objset and reads its properties.
A previous optimization by Pawel Dawidek
(0cee24064a) took advantage
of this to make listing snapshot names sorted only by name much faster.
However, it was limited to `-o name -s name`, this work extends this
optimization to work with:
- name
- guid
- createtxg
- numclones
- inconsistent
- redacted
- origin
and could be further extended to any other properties supported by
dsl_dataset_fast_stat() or similar, that do not require extra locking
or reading from disk.
This was committed before (9a9e2e343dfa2af28bf7910de77ae73aa006de62),
but was reverted due to a regression when used with an older kernel.
If the kernel does not populate zc->zc_objset_stats, we now fallback
to getting the properties via the slower interface, to avoid problems
with newer userland and older kernels.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Allan Jude <allan@klarasystems.com>
Closes#14110
Outgoing mails for ZFS pool events include the pool GUID,
but not the actual pool name. Let's change this for better
readability, as it is already done in the mails for finished
pool resilvers.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Reviewed-by: George Melikov <mail@gmelikov.ru>
Reviewed-by Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Marcel Menzel <mail@mcl.gg>
Closes#14272
Coverity has long complained about the checksum being uninitialized if
an END record is processed before its BEGIN record. This should not
happen, but there was no code to check for it. I had left this unfixed
since it was a low priority issue, but then
9f4ede63d2 added another instance of this.
I am making an effort to "hold the line" to keep new coverity defect
reports from going unaddressed, so I find myself forced to fix this much
earlier than I had originally planned to address it.
The solution is to maintain a counter and a flag. Then use VERIFY
statements to verify the following runtime constraints:
* Every record either has a corresponding BEGIN record, is a BEGIN
record or is the end of stream END record for replication streams.
* BEGIN records cannot be nested. i.e. There must be an END record
before another BEGIN record may be seen.
Failure to meet these constraints will cause the program to exit.
This is sufficient to ensure that the checksum is never accessed when
uninitialized.
Reported-by: Coverity (CID 1524578)
Reported-by: Coverity (CID 1524633)
Reported-by: Coverity (CID 1527295)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#14176
CodeQL pointed out that for extreme floating point values, `sprintf()`
will overwrite a 32 character buffer. It cited 1e304 as an example,
which causes `sprintf()` to print 308 characters.
In practice, the numbers should never exceed 100, so this should not
happen. To silence the warning and also handle unexpected situations, we
change the code to use `snprintf()`.
This was missed during my audit of our use of `sprintf()`, since I did
not think to consider extreme floating point representations. It also
really should not happen, so this change is purely defensive
programming.
This was found by CodeQL's cpp/overrunning-write-with-float check.
Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#14264
This was written by Jeff Bonick and was committed to OpenSolaris on
November 1, 2009. It appears that Jeff meant to continue the outer loop
iteration when `ddp->ddp_phys_birth == 0`, but put his check inside the
inner loop. This causes a pointer to uninitialized memory to be passed
to ddt_lookup() inside a VERIFY() statement whenever that condition is
true.
Reported-by: Coverity (CID 1524462)
Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#14264
888914486e introduced this regression.
I used cscope to verify that there are no other instances of this in the
codebase. This is the one of the few bugs that are extremely easy to
identify using cscope.
Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#14264
The ZPOOL_SCRIPTS_PATH environment variable can be passed here. This
allows for arbitrarily long strings to be passed to sprintf(), which can
overflow the buffer.
I missed this in my earlier audit of the codebase. CodeQL's
cpp/unbounded-write check caught this.
Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#14264
This is not in performance critical code, but static analyzers will
complain about it, so lets switch to pass by pointer here.
Reported-by: Coverity (CID-1524384)
Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#14263
Instead, it should include the proper header.
CodeQL caught this.
Reviewed-by: Damian Szuberski <szuberskidamian@gmail.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#14263
Squelch false positives reported by GCC 12 with UBSan.
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: szubersk <szuberskidamian@gmail.com>
Closes#14150
svb is declared on the stack. We then set parts of svb.svb_dva with
DVA_SET_VDEV(), DVA_SET_OFFSET() and DVA_SET_ASIZE(). However, the DVA
contains other fields for pad, GRID and G. When setting the fields we
use, we technically read uninitialized bits from the fields we do not
use. This makes Coverity and Clang's Static Analyzer complain.
Presumably, other static analyzers might complain too.
There is no real bug here, but we are still technically reading
undefined data and unless we stop doing that, static analyzers will
complain about it in perpetuum and this could obscure real issues. We
silence the static analyzer complaints by using a 0 struct initializer.
Reported by: Coverity (CID 1524627)
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#14210
If the attached disk already contains a vdev GUID, it
means the disk is not clean. In such a scenario, the
physical path would be a match that makes the disk
faulted when trying to online it. So, we would only
want to proceed if either GUID matches with the last
attached disk or the disk is in a clean state.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Tony Hutter <hutter2@llnl.gov>
Signed-off-by: Ameer Hamza <ahamza@ixsystems.com>
Closes#14181
As new compression algorithms are added to ZFS, it could be useful for
people to recompress data with new algorithms. There is currently no
mechanism to do this aside from copying the data manually into a new
filesystem with the new algorithm enabled. This tool allows the
transformation to happen through zfs send, allowing it to be done
efficiently to remote systems and in an incremental fashion.
A new zstream command is added that decompresses WRITE records and
then recompresses them with a provided algorithm, and then re-emits
the modified send stream. It may also be possible to re-compress
embedded block pointers, but that was not attempted for the initial
version.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Signed-off-by: Paul Dagnelie <pcd@delphix.com>
Closes#14106
This allows for printing a machine-readable, accurate to the second,
hold creation time in the form of a unix epoch timestamp.
Additionally, updates relevant documentation and man pages accordingly.
Reviewed-by: Allan Jude <allan@klarasystems.com>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Mohamed Tawfik <m_tawfik@aucegypt.edu>
Closes#13690Closes#14152
* The complaint in ztest_replay_write() is only possible if something
went horribly wrong. An assertion will silence this and if it goes
off, we will know that something is wrong.
* The complaint in spa_estimate_metaslabs_to_flush() is not impossible,
but seems very unlikely. We resolve this by passing the value from
the `MIN()` that does not go to infinity when the variable is zero.
There was a third report from Clang's scan-build, but that was a
definite false positive and disappeared when checked again through
Clang's static analyzer with Z3 refution via CodeChecker.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#14124
Cast the integer type to (u)intptr_t before casting to "void *". In
CHERI C/C++ we warn on bare casts from integers to pointers to catch
attempts to create pointers our of thin air. We allow the warning to be
supressed with a suitable cast through (u)intptr_t.
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Reviewed-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Signed-off-by: Brooks Davis <brooks.davis@sri.com>
Closes#14131
`snprintf()` is meant to protect against buffer overflows, but operating
on the buffer using its return value, possibly by calling it again, can
cause a buffer overflow, because it will return how many characters it
would have written if it had enough space even when it did not. In a
number of places, we repeatedly call snprintf() by successively
incrementing a buffer offset and decrementing a buffer length, by its
return value. This is a potentially unsafe usage of `snprintf()`
whenever the buffer length is reached. CodeQL complained about this.
To fix this, we introduce `kmem_scnprintf()`, which will return 0 when
the buffer is zero or the number of written characters, minus 1 to
exclude the NULL character, when the buffer was too small. In all other
cases, it behaves like snprintf(). The name is inspired by the Linux and
XNU kernels' `scnprintf()`. The implementation was written before I
thought to look at `scnprintf()` and had a good name for it, but it
turned out to have identical semantics to the Linux kernel version.
That lead to the name, `kmem_scnprintf()`.
CodeQL only catches this issue in loops, so repeated use of snprintf()
outside of a loop was not caught. As a result, a thorough audit of the
codebase was done to examine all instances of `snprintf()` usage for
potential problems and a few were caught. Fixes for them are included in
this patch.
Unfortunately, ZED is one of the places where `snprintf()` is
potentially used incorrectly. Since using `kmem_scnprintf()` in it would
require changing how it is linked, we modify its usage to make it safe,
no matter what buffer length is used. In addition, there was a bug in
the use of the return value where the NULL format character was not
being written by pwrite(). That has been fixed.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#14098
Assertions are meant to check assumptions, but the way that this
assertion is written does not check an assumption, since it is provably
always true. Removing the assertion will cause a compiler warning (made
into an error by -Werror) about printing up to 512 bytes to a 256-byte
buffer, so instead, we change the assertion to verify the assumption
that we never do a snprintf() that is truncated to avoid overrunning the
256-byte buffer.
This was caught by an audit of the codebase to look for misuse of
`snprintf()` after CodeQL reported that we had misused `snprintf()`. An
explanation of how snprintf() can be misused is here:
https://www.redhat.com/en/blog/trouble-snprintf
This particular instance did not misuse `snprintf()`, but it was caught
by the audit anyway.
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <richard.yao@alumni.stonybrook.edu>
Closes#14098
Implement support for Linux's RENAME_* flags (for renameat2). Aside from
being quite useful for userspace (providing race-free ways to exchange
paths and implement mv --no-clobber), they are used by overlayfs and are
thus required in order to use overlayfs-on-ZFS.
In order for us to represent the new renameat2(2) flags in the ZIL, we
create two new transaction types for the two flags which need
transactional-level support (RENAME_EXCHANGE and RENAME_WHITEOUT).
RENAME_NOREPLACE does not need any ZIL support because we know that if
the operation succeeded before creating the ZIL entry, there was no file
to be clobbered and thus it can be treated as a regular TX_RENAME.
Reviewed-by: Ryan Moeller <ryan@iXsystems.com>
Reviewed-by: Alexander Motin <mav@FreeBSD.org>
Reviewed-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed-by: Pavel Snajdr <snajpa@snajpa.net>
Signed-off-by: Aleksa Sarai <cyphar@cyphar.com>
Closes#12209Closes#14070