illumos/illumos-gate@b11fe8c014b11fe8c014https://www.illumos.org/issues/8508
Mounting a zpool on a 32-bit system triggers a panic in spa_history_log_version
() due to a type format mismatch for ZPL_VERSION. ZPL_VERSION is a unsigned
long long, but the format expects an integer. On 64-bit platforms this may not
be an issue due to word size and alignment. On 32-bit platforms a word size is
half that of a long long, causing the second word of the long long to be seen
as the string pointer for utsname.nodename.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Justin Hibbits <chmeeedalf@gmail.com>
illumos/illumos-gate@d28671a3b0d28671a3b0https://www.illumos.org/issues/8373
The code that writes ZIL blocks uses dmu_tx_assign(TXG_WAIT) to assign a
transaction to a transaction group.
That seems to be logically incorrect as writing of the ZIL block does not
introduce any new dirty data.
Also, when there is a lot of dirty data, the call can introduce significant
delays into the ZIL commit path,
thus affecting all synchronous writes. Additionally, ARC throttling may affect
the ZIL writing.
We probably need a new mechanism similar to dmu_tx_create_assigned to assign
ZIL transactions.
(Ab)using TXG_WAITED does not seem to be sufficient.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>
illumos/illumos-gate@79c2b812ee79c2b812eehttps://www.illumos.org/issues/8491
The zpool checkpoint feature in DxOS added a new field in the uberblock.
The Multi-Modifier Protection Pull Request from ZoL adds two new fields in the
uberblock (Reference: https://github.com/zfsonlinux/zfs/pull/6279).
As these two changes come from two different sources and once upstreamed and
deployed will introduce an incompatibility with each other we want
to upstream a change that will reserve the padding for both of them so
integration goes smoothly and everyone gets both features.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Olaf Faaland <faaland1@llnl.gov>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
illumos/illumos-gate@267ae6c3a8267ae6c3a8https://www.illumos.org/issues/7915
l2arc_evict() is strictly serialized with respect to l2arc_write_buffers() and
l2arc_write_done().
Normally, l2arc_evict() and l2arc_write_buffers() are called from the same
thread, so they can not be concurrent.
Also, l2arc_write_buffers() uses zio_wait() on the parent zio of all cache zio-
s.
That ensures that l2arc_write_done() is completed before l2arc_write_buffers()
returns.
Finally, if a cache device is removed, then l2arc_evict() is called under
SCL_ALL in the exclusive mode.
That ensures that it can not be concurrent with the normal L2ARC accesses to
the device (including writing and evicting buffers).
Given the above, some checks and actions in l2arc_evict() do not make sense.
For instance, it must never encounter the write head header let alone remove it
from the buffer list.
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Approved by: Matthew Ahrens <mahrens@delphix.com>
Author: Andriy Gapon <avg@FreeBSD.org>
illumos/illumos-gate@dcb6872c56dcb6872c56https://www.illumos.org/issues/8126
The sync thread is concurrently modifying dn_phys->dn_nlevels
while dbuf_dirty() is trying to assert something about it, without
holding the necessary lock. We need to move this assertion further down
in the function, after we have acquired the dn_struct_rwlock.
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@4923c69fdd4923c69fddhttps://www.illumos.org/issues/8067
Add an option to zdb to print a literal embedded block pointer supplied on the
command line:
zdb -E [-A] word0:word1:...:word15
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@9b195260e29b195260e2https://www.illumos.org/issues/8426
abd_copy_from_buf and abd_cmp_buf do not modify their void *buf arguments, so
qualify them with const.
abd_copy_from_buf_off and abd_cmp_buf_off already had that type for the
corresponding arguments.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>
illumos/illumos-gate@77b171372e77b171372ehttps://www.illumos.org/issues/7600
At present, the kernel side code seems to blindly rollback to whatever happens
to be the latest snapshot at the time when the rollback task is processed.
The expected target's name should be passed to the kernel driver and the sync
task should validate that the target exists and that it is the latest snapshot
indeed.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>
illumos/illumos-gate@42418f9e7342418f9e73https://www.illumos.org/issues/8377
The problem is that when dsl_bookmark_destroy_check() is executed from open
context (the pre-check), it fills in dbda_success based on the existence of the
bookmark.
But the bookmark (or containing filesystem as in this case) can be destroyed
before we get to syncing context. When we re-run dsl_bookmark_destroy_check()
in syncing
context, it will not add the deleted bookmark to dbda_success, intending for
dsl_bookmark_destroy_sync() to not process it. But because the bookmark is
still in dbda_success
from the open-context call, we do try to destroy it.
The fix is that dsl_bookmark_destroy_check() should not modify dbda_success
when called from open context.
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@b7edcb9408b7edcb9408https://www.illumos.org/issues/8378
The problem is that zfs_get_data() supplies a stale zgd_bp to dmu_sync(), which
we then nopwrite against.
zfs_get_data() doesn't hold any DMU-related locks, so after it copies db_blkptr
to zgd_bp, dbuf_write_ready()
could change db_blkptr, and dbuf_write_done() could remove the dirty record.
dmu_sync() then sees the stale
BP and that the dbuf it not dirty, so it is eligible for nop-writing.
The fix is for dmu_sync() to copy db_blkptr to zgd_bp after acquiring the
db_mtx. We could still see a stale
db_blkptr, but if it is stale then the dirty record will still exist and thus
we won't attempt to nopwrite.
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@16a7e5ac1116a7e5ac11https://www.illumos.org/issues/7910
It seems that the change in issue #6950 resurrected the problem that was
earlier fixed by the change in issue #5219.
Please also see the following FreeBSD bug report: https://bugs.freebsd.org/
bugzilla/show_bug.cgi?id=216178
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>
illumos/illumos-gate@5e2a0747255e2a074725https://www.illumos.org/issues/8416
A C++ compiler fails to compile abd_is_linear(), which is an inline function
defined in abd.h, with the following error:
error: cannot initialize return object of type 'boolean_t' with an
rvalue of type 'bool'
That happens because a bool can not be converted to an enum in C++.
That's a problem because abd.h can be visible through other header files that a
C++ program that works with ZFS can include.
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>
illumos/illumos-gate@2889ec41c02889ec41c0https://www.illumos.org/issues/8311
Description:
There was a misunderstanding about the enforcement details of the "Read-only"
flag introduced for SMB/CIFS compatibility, way back in 2007 in the Sun PSARC
2007/315 case.
The original authors thought enforcement of the READONLY flag should work
similarly as the IMMUTABLE flag. Unfortunately, that enforcement is
incompatible with the expectations of Windows applications using this feature
through the SMB service. Applications assume (and the MS File System Algorithms
MS-FSA confirms they should) that an SMB client can:
(a) Open an SMB handle on a file with read/write access,
(b) Set the DOS attributes to include the READONLY flag,
(c) continue to have write access via that handle.
This access model is essentially the same as a Unix/POSIX application that
creates a file (with read/write access), uses fchmod() to change the file mode
to something not granting write access (i.e. 0444), and then continues to write
that file using the open handle it got before the mode change.
Currently, the SMB server works-around this problem in a way that will become
difficult to maintain as we implement support for SMB3 persistent handles, so
SMB depends on this fix.
I've written a test program that can be used to demonstrate this problem, and
added it to zfs-tests (tests/functional/acl/cifs/cifs_attr_004_pos).
It currently fails, but will pass when this problem fixed.
Steps to Reproduce:
Run the test program on a ZFS file system.
Expected Results:
Pass
Actual Results:
Fail.
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Approved by: Prakash Surya <prakash.surya@delphix.com>
Author: Gordon Ross <gwr@nexenta.com>
illumos/illumos-gate@403a8da73c403a8da73chttps://www.illumos.org/issues/5220
There are disk devices that have logical sector size larger than 512B, for
example 4KB. That is, their physical sector size is larger than 512B and they
do not provide emulation for 512B sector sizes. For such devices both a data
offset and a data size must be properly aligned. L2ARC should arrange that
because it uses physical I/O.
zio_vdev_io_start() performs a necessary transformation if io_size is not
aligned to vdev_ashift, but that is done only for logical I/O. Something
similar should be done in L2ARC code.
* a temporary write buffer should be allocated if the original buffer is
not going to be compressed and its size is not aligned
* size of a temporary compression buffer should be ashift aligned
* for the reads, if a size of a target buffer is not sufficiently large and
it is not aligned then a temporary read buffer should be allocated
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>
illumos/illumos-gate@a4b8c9aa65a4b8c9aa65https://www.illumos.org/issues/8264
Oddly there is a lzc_clone function, but no lzc_promote function.
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@kebe.com>
Approved by: Dan McDonald <danmcd@kebe.com>
Author: Andrew Stormont <astormont@racktopsystems.com>
illumos/illumos-gate@0255edcc850255edcc85https://www.illumos.org/issues/8056
The send size estimate for a zvol can be too low, if the size of the record
headers (dmu_replay_record_t's) is a significant portion of the size.
This is typically the case when the data is highly compressible, especially
with embedded blocks.
The problem is that dmu_adjust_send_estimate_for_indirects() assumes that
blocks are the size of the "recordsize" property (128KB).
However, for zvols, the blocks are the size of the "volblocksize" property
(8KB). Therefore, we estimate that there will be 16x less record headers than
there really will be.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Paul Dagnelie <pcd@delphix.com>
illumos/illumos-gate@dbfd9f9300dbfd9f9300https://www.illumos.org/issues/8156
dbuf_evict_notify() holds the dbuf_evict_lock while checking if it should do
the eviction itself (because the evict thread is not able to keep up).
This can result in massive lock contention.
It isn't necessary to hold the lock, because if we make the wrong choice
occasionally, nothing bad will happen.
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@5b062782535b06278253https://www.illumos.org/issues/8005
RAID-Z requires that space be allocated in multiples of P+1 sectors,
because this is the minimum size block that can have the required amount
of parity. Thus blocks on RAIDZ1 must be allocated in a multiple of 2
sectors; on RAIDZ2 multiple of 3; and on RAIDZ3 multiple of 4. A sector
is a unit of 2^ashift bytes, typically 512B or 4KB.
To satisfy this constraint, the allocation size is rounded up to the
proper multiple, resulting in up to 3 "pad sectors" at the end of some
blocks. The contents of these pad sectors are not used, so we do not
need to read or write these sectors. However, some storage hardware
performs much worse (around 1/2 as fast) on mostly-contiguous writes
when there are small gaps of non-overwritten data between the writes.
Therefore, ZFS creates "optional" zio's when writing RAID-Z blocks that
include pad sectors. If writing a pad sector will fill the gap between
two (required) writes, we will issue the optional zio, thus doubling
performance. The gap-filling performance improvement was introduced in
July 2009.
Writing the optional zio is done by the io aggregation code in
vdev_queue.c. The problem is that it is also subject to the limit on
the size of aggregate writes, zfs_vdev_aggregation_limit, which is by
default 128KB. For a given block, if the amount of data plus padding
written to a leaf device exceeds zfs_vdev_aggregation_limit, the
optional zio will not be written, resulting in a ~2x performance
degradation.
The problem occurs only for certain values of ashift, compressed block
size, and RAID-Z configuration (number of parity and data disks). It
cannot occur with the default recordsize=128KB. If compression is
enabled, all configurations with recordsize=1MB or larger will be
impacted to some degree.
The problem notably occurs with recordsize=1MB, compression=off, with 10
disks in a RAIDZ2 or RAIDZ3 group (with 512B or 4KB sectors). Therefore
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@adaec86ad2adaec86ad2https://www.illumos.org/issues/8155
When writing pre-compressed buffers, arc_write() requires that the compression
algorithm used to compress the buffer matches the compression algorithm
requested by the zio_prop_t, which is set by dmu_write_policy().
This makes dmu_write_policy() and its callers a bit more complicated.
We can simplify this by making arc_write() trust the caller to supply the type
of pre-compressed buffer that it wants to write, and override the compression
setting in the zio_prop_t.
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@5f10ef697f5f10ef697fhttps://www.illumos.org/issues/6396
LVM = SVM = Solaris Volume Manager
dead code and not using with ZFS based platform.
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Yuri Pankov <yuri.pankov@nexenta.com>
illumos/illumos-gate@c5ee46810fc5ee46810fhttps://www.illumos.org/issues/7578
After some ZIL changes 6 years ago zil_slog_limit got partially broken
due to zl_itx_list_sz not updated when async itx'es upgraded to sync.
Actually because of other changes about that time zl_itx_list_sz is not
really required to implement the functionality, so this patch removes
some unneeded broken code and variables.
Original idea of zil_slog_limit was to reduce chance of SLOG abuse by
single heavy logger, that increased latency for other (more latency critical)
loggers, by pushing heavy log out into the main pool instead of SLOG. Beside
huge latency increase for heavy writers, this implementation caused double
write of all data, since the log records were explicitly prepared for SLOG.
Since we now have I/O scheduler, I've found it can be much more efficient
to reduce priority of heavy logger SLOG writes from ZIO_PRIORITY_SYNC_WRITE
to ZIO_PRIORITY_ASYNC_WRITE, while still leave them on SLOG.
Existing ZIL implementation had problem with space efficiency when it
has to write large chunks of data into log blocks of limited size. In some
cases efficiency stopped to almost as low as 50%. In case of ZIL stored on
spinning rust, that also reduced log write speed in half, since head had to
uselessly fly over allocated but not written areas. This change improves
the situation by offloading problematic operations from z*_log_write() to
zil_lwb_commit(), which knows real situation of log blocks allocation and
can split large requests into pieces much more efficiently. Also as side
effect it removes one of two data copy operations done by ZIL code WR_COPIED
case.
While there, untangle and unify code of z*_log_write() functions.
Also zfs_log_write() alike to zvol_log_write() can now handle writes crossing
block boundary, that may also improve efficiency if ZPL is made to do that.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Steven Hartland <steven.hartland@multiplay.co.uk>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Alexander Motin <mav@FreeBSD.org>
8100 8021 seems to cause random BAD TRAP: type=d (#gp General protection)
illumos/illumos-gate@770499e185770499e185https://www.illumos.org/issues/8021
The ARC buf data project (known simply as "ABD" since its genesis in the ZoL
community) changes the way the ARC allocates `b_pdata` memory from using linear
`void *` buffers to using scatter/gather lists of fixed-size 1KB chunks. This
improves ZFS's performance by helping to defragment the address space occupied
by the ARC, in particular for cases where compressed ARC is enabled. It could
also ease future work to allocate pages directly from `segkpm` for minimal-
overhead memory allocations, bypassing the `kmem` subsystem.
This is essentially the same change as the one which recently landed in ZFS on
Linux, although they made some platform-specific changes while adapting this
work to their codebase:
1. Implemented the equivalent of the `segkpm` suggestion for future work
mentioned above to bypass issues that they've had with the Linux kernel memory
allocator.
2. Changed the internal representation of the ABD's scatter/gather list so it
could be used to pass I/O directly into Linux block device drivers. (This
feature is not available in the illumos block device interface yet.)
https://www.illumos.org/issues/8100
My supermicro system is getting random BAD TRAP: type=d (#gp General
protection) at about the stage where ZFS filesystems are mounted - usually
console login prompt is already present but the services are still starting.
After backing out 8021, the boot is completed and no panics do occur.
Machine does dump, however savecore fails:
savecore: bad magic number baddcafe
I can get more data out with boot -k, if needed.
# psrinfo -vp
The physical processor has 4 cores and 8 virtual processors (0-7)
The core has 2 virtual processors (0 4)
The core has 2 virtual processors (1 5)
The core has 2 virtual processors (2 6)
The core has 2 virtual processors (3 7)
x86 (GenuineIntel 306C3 family 6 model 60 step 3 clock 3500 MHz)
Intel(r) Xeon(r) CPU E3-1246 v3 @ 3.50GHz
# prtconf -m
32657
$ zpool status
pool: rpool
state: ONLINE
scan: none requested
config:
NAME STATE READ WRITE CKSUM
rpool ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
c3t0d0 ONLINE 0 0 0
c3t1d0 ONLINE 0 0 0
Reviewed by: Matthew Ahrens mahrens@delphix.com
Reviewed by: George Wilson george.wilson@delphix.com
Reviewed by: Paul Dagnelie pcd@delphix.com
Reviewed by: John Kennedy john.kennedy@delphix.com
Reviewed by: Prakash Surya prakash.surya@delphix.com
Reviewed by: Prashanth Sreenivasa pks@delphix.com
Reviewed by: Pavel Zakharov pavel.zakharov@delphix.com
Reviewed by: Chris Williamson chris.williamson@delphix.com
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Dan Kimmel <dan.kimmel@delphix.com>
illumos/illumos-gate@bc83969fdbbc83969fdbhttps://www.illumos.org/issues/8265
Reserve bit 23 in the zfs send stream flags for the large
dnode feature which has been implemented for Linux.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Brian Behlendorf <behlendorf1@llnl.gov>
illumos/illumos-gate@2d2f193a212d2f193a21https://www.illumos.org/issues/8166
If we do a scrub while a leaf device is offline (via "zpool offline"),
we will inadvertently clear the DTL (dirty time log) of the offline
device, even though it is still damaged. When the device comes back
online, we will incompletely resilver it, thinking that the scrub
repaired blocks written before the scrub was started. The incomplete
resilver can lead to data loss if there is a subsequent failure of a
different leaf device.
The fix is to never clear the DTL of offline devices. Note that if a
device is onlined while a scrub is in progress, the scrub will be
restarted.
The problem can be worked around by running "zpool scrub" after
"zpool online".
See also https://github.com/zfsonlinux/zfs/issues/5806
Reviewed by: George Wilson george.wilson@delphix.com
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@7855d95b307855d95b30https://www.illumos.org/issues/7446
Since we support whole-disk configuration for boot pool, we also will need
whole disk support with UEFI boot and for this, zpool create should create efi-
system partition.
I have borrowed the idea from oracle solaris, and introducing zpool create -
B switch to provide an way to specify that boot partition should be created.
However, there is still an question, how big should the system partition be.
For time being, I have set default size 256MB (thats minimum size for FAT32
with 4k blocks). To support custom size, the set on creation "bootsize"
property is created and so the custom size can be set as: zpool create B -
o bootsize=34MB rpool c0t0d0
After pool is created, the "bootsize" property is read only. When -B switch is
not used, the bootsize defaults to 0 and is shown in zpool get output with
value ''. Older zfs/zpool implementations are ignoring this property.
https://www.illumos.org/rb/r/219/
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Approved by: Dan McDonald <danmcd@kebe.com>
Author: Toomas Soome <tsoome@me.com>
illumos/illumos-gate@40713f2b2440713f2b24https://www.illumos.org/issues/8070
Add some ZFS comments left by various developers at different times
Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Alan Somers <asomers@gmail.com>
illumos/illumos-gate@ade42b557aade42b557ahttps://www.illumos.org/issues/8064
It's currently nearly impossible to trace what process places a hold on
a vnode, as the only ways holds are place is via the `VN_HOLD()` and
`VN_HOLD_CALLER()` macros, which inline the bumping of `v_count`. Adding
static DTrace probes to these macros would enable tracing of where
specific vnode references come from.
For completeness and symmetry, a similar static probe should be added to
`vn_rele()` and `vn_rele_dnlc()`.
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Sebastien Roy <seb@delphix.com>
illumos/illumos-gate@b7b2590dd9b7b2590dd9https://www.illumos.org/issues/8063
A standard practice in ZFS is to keep track of "per-txg" state. Any of
the 3 active TXG's (open, quiescing, syncing) can have different values
for this state. We should assert that we do not attempt to modify other
(inactive) TXG's.
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@5f368aef865f368aef86https://www.illumos.org/issues/7786
Currently, vdev_online() will only post sysevent if previous state was
"offline". It should also post the event when the state changes from "removed"
or "faulted" to "healthy" or "degraded".
This will fix the following scenario:
- pull disk from slot A
- check that hotspare has taken its place (if available)
- insert disk into slot B
- check that hotspare moved back to "avail" state (if spare was used)
The problem here is that we don't get any ESC_ZFS_VDEV_* notification and fail
to update the vdev FRU.
Reviewed by: Matthew Ahrens mahrens@delphix.com
Reviewed by: George Wilson george.wilson@delphix.com
Approved by: Albert Lee <trisk@forkgnu.org>
Author: Yuri Pankov <yuri.pankov@nexenta.com>
illumos/illumos-gate@def4fac588def4fac588https://www.illumos.org/issues/8025
dbuf_read() creates a zio_root() to track and wait for all the zio's
that may happen as part of this call. However, if the blkptr_t for
this buffer is NULL or a hole, we will not create any more zio's, so
this zio_root() is unnecessary. This is always the case when calling
dbuf_read() on a bonus buffer, because it has no blkptr (it's part of
the containing dnode). For workloads that read a lot of bonus buffers
(e.g. file creation and removal), creating and destroying these
unnecessary zio's can decrease performance by around 3%.
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@b67dde11a7b67dde11a7https://www.illumos.org/issues/5814
Lets pull in this patch from freebsd:
http://svnweb.freebsd.org/base?view=revision&revision=271781
bpobj_iterate_impl(): Close a refcount leak iterating on a sublist.
If bpobj_space() returned non-zero here, the sublist would have been
left open, along with the bonus buffer hold it requires. This call
does not invoke any calls to bpobj_close() itself.
This bug doesn't have any known vector, but was found on inspection.
MFC after: 1 week
Sponsored by: Spectra Logic
Affects: All ZFS versions starting 21 May 2010 (illumos cde58dbc)
MFSpectraBSD: r1050998 on 2014/03/26
Fix bpobj_iterate_impl() to properly call bpobj_close() if bpobj_space()
returns an error.
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Will Andrews <will@freebsd.org>
illumos/illumos-gate@af868f46a5af868f46a5https://www.illumos.org/issues/6914
This change allows the kernel to use more virtual address space. This will
allow us to devote 1.5x physmem for the zio arena, and an additional 1.5x
physmem for the kernel heap.
We saw a hang when unable to find any 128K contiguous memory segments. Looking
at the core file we see many threads in stacks similar to this:
> ffffff68c9c87c00::findstack -v
stack pointer for thread ffffff68c9c87c00: ffffff02cd63d8b0
[ ffffff02cd63d8b0 _resume_from_idle+0xf4() ]
ffffff02cd63d8e0 swtch+0x141()
ffffff02cd63d920 cv_wait+0x70(ffffff6009b1b01e, ffffff6009b1b020)
ffffff02cd63da50 vmem_xalloc+0x640(ffffff6009b1b000, 20000, 1000, 0, 0, 0, 0, ffffff0200000004)
ffffff02cd63dac0 vmem_alloc+0x135(ffffff6009b1b000, 20000, 4)
ffffff02cd63db60 segkmem_xalloc+0x171(ffffff6009b1b000, 0, 20000, 4, 0, fffffffffb885fe0, fffffffffbcefa10)
ffffff02cd63dbc0 segkmem_alloc_vn+0x4a(ffffff6009b1b000, 20000, 4, fffffffffbcefa10)
ffffff02cd63dbf0 segkmem_zio_alloc+0x20(ffffff6009b1b000, 20000, 4)
ffffff02cd63dd20 vmem_xalloc+0x5b1(ffffff6009b1c000, 20000, 1000, 0, 0, 0, 0, 4)
ffffff02cd63dd90 vmem_alloc+0x135(ffffff6009b1c000, 20000, 4)
ffffff02cd63de20 kmem_slab_create+0x8d(ffffff605fd37008, 4)
ffffff02cd63de80 kmem_slab_alloc+0x11e(ffffff605fd37008, 4)
ffffff02cd63dee0 kmem_cache_alloc+0x233(ffffff605fd37008, 4)
ffffff02cd63df10 zio_data_buf_alloc+0x5b(20000)
ffffff02cd63df70 arc_get_data_buf+0x92(ffffff6265a70588, 20000, ffffff901fd796f8)
ffffff02cd63dfb0 arc_buf_alloc_impl+0x9c(ffffff6265a70588, ffffff6d233ab0b8)
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@0c94e1af670c94e1af67https://www.illumos.org/issues/7256
error = dmu_sync(zio, lr->lr_common.lrc_txg,
zfs_get_done, zgd);
ASSERT(error || lr->lr_length <= zp->z_blksz);
It's possible, although extremely rare, that the zfs_get_done() callback is
executed before dmu_sync() returns.
In that case the znode's range lock is dropped and the znode is unreferenced.
Thus, the assertion can access some invalid or wrong data via the zp pointer.
size variable caches the correct value of z_blksz and can be safely used here.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Andriy Gapon <andriy.gapon@clusterhq.com>
illumos/illumos-gate@80e10fd0d280e10fd0d2https://www.illumos.org/issues/5379
The following is based on a review of the illumos code and on a similar problem
reported for FreeBSD where the relevant code is different.
Looking at this block of code http://src.illumos.org/source/xref/illumos-gate/
usr/src/uts/common/fs/zfs/zfs_vnops.c#4187 I see code to set up an
sa_bulk_attr_t object, I see code to set up mtime and ctime values, but I do
not see code to actually apply the attributes...
I would expect there to be a call to sa_bulk_update(), there is such a call in
zfs_write() for instance.
mmap_write.c [Magnifier] - demo (1.42 KB) Andriy Gapon, 2015-11-11 01:53 PM
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Author: Andriy Gapon <andriy.gapon@clusterhq.com>
illumos/illumos-gate@b127fe3c05b127fe3c05https://www.illumos.org/issues/6101
lzc_create(), or more correctly, zfs_ioc_create() does not reject an attempt to
create a filesystem as a child of a volume, instead it proceeds to a crash.
A crash stack obtained on FreeBSD:
page fault while in kernel mode
zap_leaf_lookup()
fzap_lookup()
zap_lookup_norm()
zap_lookup()
zfs_get_zplprop()
zfs_fill_zplprops_impl()
zfs_ioc_create()
zfsdev_ioctl()
devfs_ioctl_f()
kern_ioctl()
sys_ioctl()
This crash happened with a kernel without debugging assertions.
The immediate cause of crash appears to an attempt to interpret a zvol object
as a zap object.
For filesystems:
#define MASTER_NODE_OBJ 1
For zvols:
#define ZVOL_OBJ 1ULL
#define ZVOL_ZAP_OBJ 2ULL
So, I see two problems here:
1. an attempt to create a filesystem under a zvol should be rejected as
early as possible, maybe in zfs_fill_zplprops()
2. maybe zap_lookup / zap_lockdir should reject objects that are not of one
of the zap object types
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Andriy Gapon <avg@FreeBSD.org>
illumos/illumos-gate@7f0bdb42577f0bdb4257https://www.illumos.org/issues/8061
sa_find_idx_tab() is declared as taking and returning "void *" parameters.
These can be declared to be the specific types.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@6b036259816b03625981https://www.illumos.org/issues/8026
zfs_throttle_delay and zfs_throttle_resolution became disused since the new
write throttling mechanism was introduced.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Andriy Gapon <avg@FreeBSD.org>
illumos/illumos-gate@313ae1e182313ae1e182https://www.illumos.org/issues/8027
dsl_pool_dirty_delta() should not wake up waiters when dp->dp_dirty_total ==
zfs_dirty_data_max, because they wait for dp_dirty_total to fall strictly below
the threshold.
It's probably very rare for that condition to occur, but it's better to have
more accurate code.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Andriy Gapon <avg@FreeBSD.org>
illumos/illumos-gate@c040c10cddc040c10cddhttps://www.illumos.org/issues/7885
When a member of a RAIDZ has been replaced with a device smaller than the
original, then the top level vdev can report its expand size as 16.0E.
The reduced child asize causes the RAIDZ to have a vdev_asize lower than its
vdev_max_asize which then results in an underflow during the calculation of the
parents expand size.
Also for RAIDZ vdevs the sum of their child vdev_min_asize could be smaller
than the parents vdev_min_size.
Fixed by: https://github.com/openzfs/openzfs/pull/296
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Author: Steven Hartland <steven.hartland@multiplay.co.uk>
illumos/illumos-gate@94c2d0eb2294c2d0eb22https://www.illumos.org/issues/7968
spa_sync() iterates over all the dirty dnodes and processes each of them by
calling dnode_sync(). If there are many dirty dnodes (e.g. because we created
or removed a lot of files), the single thread of spa_sync() calling dnode_sync
() can become a bottleneck. Additionally, if many dnodes are dirtied
concurrently in open context (e.g. due to concurrent file creation), the
os_lock will experience lock contention via dnode_setdirty().
The solution is to track dirty dnodes on a multilist_t, and for spa_sync() to
use separate threads to process each of the sublists in the multilist.
On the concurrent file creation microbenchmark, the performance improvement
from dnode_setdirty() is up to 7%. Additionally, the wall clock time spent in
spa_sync() is reduced to 15%-40% of the single-threaded case. In terms of cost/
reward, once the other bottlenecks are addressed, fixing this bug will provide
a medium-large performance gain and require a medium amount of effort to
implement.
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@10fbdecb0510fbdecb05https://www.illumos.org/issues/7970
The global tunable zfs_arc_num_sublists_per_state is used by the ARC and
the dbuf cache, and other users are planned. We should change this
tunable to be common to all multilists.
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@411be58a6e411be58a6ehttps://www.illumos.org/issues/7801
Add *_by_dnode() routines for accessing objects given their
dnode_t *, this is more efficient than accessing the object by
(objset_t *, uint64_t object). This change converts some but
not all of the existing consumers. As performance-sensitive
code paths are discovered they should be converted to use
these routines.
Ported from: https://github.com/zfsonlinux/zfs/commit/
0eef1bde31d67091d3deed23fe2394f5a8bf2276
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@b0c42cd470b0c42cd470https://www.illumos.org/issues/7801
Add *_by_dnode() routines for accessing objects given their
dnode_t *, this is more efficient than accessing the object by
(objset_t *, uint64_t object). This change converts some but
not all of the existing consumers. As performance-sensitive
code paths are discovered they should be converted to use
these routines.
Ported from: https://github.com/zfsonlinux/zfs/commit/
0eef1bde31d67091d3deed23fe2394f5a8bf2276
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: bzzz77 <bzzz.tomas@gmail.com>
illumos/illumos-gate@a3905a4592a3905a4592https://www.illumos.org/issues/7869
The issue fixed by this patch is a race condition in the deadlist code.
A thread executing an administrative command that uses
`dsl_deadlist_space_range()` holds the lock of the whole `deadlist_t` to
protect the access of all its entries that the deadlist contains in an
avl tree.
Sync threads trying to insert a new entry in the deadlist
(through `dsl_deadlist_insert()` -> `dle_enqueue()`) do not hold the
deadlist lock at that moment. If the `dle_bpobj` is the empty bpobj (our
sentinel value), we close and reopen it. Between these two operations,
it is possible for the `dsl_deadlist_space_range()` thread to dereference
that bpobj which is `NULL` during that window.
Threads should hold the a deadlist's `dl_lock` when they manipulate its
internal data so scenarios like the one above are avoided. In addition,
threads should also hold the bpobj lock whenever they are allocating the
subobj list of a bpobj, and not just when they actually insert the subobj
to the list. This way we can avoid potential memory leaks.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
illumos/illumos-gate@61e255ce7261e255ce72https://www.illumos.org/issues/7793
Background information: This assertion about tx_space_* verifies that we
are not dirtying more stuff than we thought we would. We “need” to know
how much we will dirty so that we can check if we should fail this
transaction with ENOSPC/EDQUOT, in dmu_tx_assign(). While the
transaction is open (i.e. between dmu_tx_assign() and dmu_tx_commit() —
typically less than a millisecond), we call dbuf_dirty() on the exact
blocks that will be modified. Once this happens, the temporary
accounting in tx_space_* is unnecessary, because we know exactly what
blocks are newly dirtied; we call dnode_willuse_space() to track this
more exact accounting.
The fundamental problem causing this bug is that dmu_tx_hold_*() relies
on the current state in the DMU (e.g. dn_nlevels) to predict how much
will be dirtied by this transaction, but this state can change before we
actually perform the transaction (i.e. call dbuf_dirty()).
This bug will be fixed by removing the assertion that the tx_space_*
accounting is perfectly accurate (i.e. we never dirty more than was
predicted by dmu_tx_hold_*()). By removing the requirement that this
accounting be perfectly accurate, we can also vastly simplify it, e.g.
removing most of the logic in dmu_tx_count_*().
The new tx space accounting will be very approximate, and may be more or
less than what is actually dirtied. It will still be used to determine
if this transaction will put us over quota. Transactions that are marked
by dmu_tx_mark_netfree() will be excepted from this check. We won’t make
an attempt to determine how much space will be freed by the transaction
— this was rarely accurate enough to determine if a transaction should
be permitted when we are over quota, which is why dmu_tx_mark_netfree()
was introduced in 2014.
We also won’t attempt to give “credit” when overwriting existing blocks,
if those blocks may be freed. This allows us to remove the
do_free_accounting logic in dbuf_dirty(), and associated routines. This
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@48bbca816848bbca8168https://www.illumos.org/issues/7812
This change removes all gendered language that did not refer specifically
to an individual person or pet. The convention taken was to use
variations on "they" when referring to users and/or human beings, while
using "it" when referring to code, functions, and/or libraries.
Additionally, we took the liberty to fix up any whitespace issues that
were found in any files that were already being modified.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Daniel Hoffman <dj.hoffman@delphix.com>