illumos/illumos-gate@5e2a074725
https://www.illumos.org/issues/8416
A C++ compiler fails to compile abd_is_linear(), which is an inline function
defined in abd.h, with the following error:
error: cannot initialize return object of type 'boolean_t' with an
rvalue of type 'bool'
That happens because C++ does not implicitly convert a bool to an enum type.
This is a problem because abd.h can be pulled in through other header files that
a C++ program working with ZFS may include.
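The failure and a conversion-free alternative can be sketched as follows. This is a minimal illustration, not the actual abd.h code: the `abd_sketch_t` type, the flag name, and the function name are hypothetical stand-ins, and selecting `B_TRUE`/`B_FALSE` explicitly is just one way to keep such a header valid in both C and C++.

```c
#include <assert.h>
#include <stddef.h>

/* boolean_t is an enum on illumos; redefined here so the sketch stands alone. */
typedef enum { B_FALSE = 0, B_TRUE = 1 } boolean_t;

/* Hypothetical stand-in for the real abd_t; only the flags field matters. */
#define	ABD_SKETCH_FLAG_LINEAR	(1u << 0)
typedef struct abd_sketch {
	unsigned int abd_flags;
} abd_sketch_t;

/*
 * Returning the comparison result directly is accepted by a C compiler,
 * but in C++ the comparison has type bool, which does not convert
 * implicitly to an enum.  Selecting an enumerator explicitly compiles
 * in both languages.
 */
static inline boolean_t
abd_sketch_is_linear(const abd_sketch_t *abd)
{
	assert(abd != NULL);
	return ((abd->abd_flags & ABD_SKETCH_FLAG_LINEAR) != 0 ?
	    B_TRUE : B_FALSE);
}
```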
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>
illumos/illumos-gate@2889ec41c0
https://www.illumos.org/issues/8311
Description:
There was a misunderstanding about the enforcement details of the "Read-only"
flag introduced for SMB/CIFS compatibility, way back in 2007 in the Sun PSARC
2007/315 case.
The original authors thought enforcement of the READONLY flag should work
similarly to the IMMUTABLE flag. Unfortunately, that enforcement is
incompatible with the expectations of Windows applications using this feature
through the SMB service. Applications assume (and the Microsoft File System
Algorithms specification, MS-FSA, confirms they should) that an SMB client can:
(a) Open an SMB handle on a file with read/write access,
(b) Set the DOS attributes to include the READONLY flag,
(c) Continue to have write access via that handle.
This access model is essentially the same as a Unix/POSIX application that
creates a file (with read/write access), uses fchmod() to change the file mode
to something not granting write access (i.e. 0444), and then continues to write
that file using the open handle it got before the mode change.
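The POSIX analogy can be sketched as a small C routine. This is an illustration only, not the cifs_attr_004_pos test itself; the function name and the scratch-file template passed in by the caller are hypothetical.

```c
#define	_POSIX_C_SOURCE	200809L

#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Returns 0 if a write through an already-open descriptor still succeeds
 * after the file mode has been changed to read-only, -1 otherwise.
 * tmpl must be a writable mkstemp() template ending in "XXXXXX".
 */
int
write_after_fchmod(char *tmpl)
{
	int fd, rv = -1;

	assert(tmpl != NULL);
	fd = mkstemp(tmpl);		/* creates and opens the file read/write */
	if (fd == -1)
		return (-1);

	if (write(fd, "a", 1) == 1 &&	/* write while the mode allows it */
	    fchmod(fd, 0444) == 0 &&	/* remove all write permission */
	    write(fd, "b", 1) == 1)	/* the open handle keeps its access */
		rv = 0;

	(void) close(fd);
	(void) unlink(tmpl);
	return (rv);
}
```

Access is checked when the file is opened, so the second write is expected to succeed; the READONLY attribute is meant to behave the same way for an SMB handle.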
Currently, the SMB server works around this problem in a way that will become
difficult to maintain as we implement support for SMB3 persistent handles, so
SMB depends on this fix.
I've written a test program that can be used to demonstrate this problem, and
added it to zfs-tests (tests/functional/acl/cifs/cifs_attr_004_pos).
It currently fails, but will pass once this problem is fixed.
Steps to Reproduce:
Run the test program on a ZFS file system.
Expected Results:
Pass
Actual Results:
Fail.
Reviewed by: Sanjay Nadkarni <sanjay.nadkarni@nexenta.com>
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Approved by: Prakash Surya <prakash.surya@delphix.com>
Author: Gordon Ross <gwr@nexenta.com>
illumos/illumos-gate@403a8da73c
https://www.illumos.org/issues/5220
There are disk devices that have logical sector size larger than 512B, for
example 4KB. That is, their physical sector size is larger than 512B and they
do not provide emulation for 512B sector sizes. For such devices both a data
offset and a data size must be properly aligned. L2ARC should arrange that
because it uses physical I/O.
zio_vdev_io_start() performs a necessary transformation if io_size is not
aligned to vdev_ashift, but that is done only for logical I/O. Something
similar should be done in L2ARC code.
* a temporary write buffer should be allocated if the original buffer is
not going to be compressed and its size is not aligned
* size of a temporary compression buffer should be ashift aligned
* for the reads, if a size of a target buffer is not sufficiently large and
it is not aligned then a temporary read buffer should be allocated
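The alignment arithmetic behind these three points can be sketched as below. The helper names are hypothetical and this is not the L2ARC code itself; it only shows how a size is checked against, and rounded up to, a multiple of 2^ashift bytes.

```c
#include <assert.h>
#include <stdint.h>

/* Round size up to the next multiple of (1 << ashift) bytes. */
static inline uint64_t
ashift_roundup(uint64_t size, unsigned int ashift)
{
	uint64_t align;

	assert(ashift < 64);
	align = 1ULL << ashift;
	return ((size + align - 1) & ~(align - 1));
}

/* Nonzero if a buffer of this size can be issued to the device as is. */
static inline int
ashift_is_aligned(uint64_t size, unsigned int ashift)
{
	assert(ashift < 64);
	return ((size & ((1ULL << ashift) - 1)) == 0);
}
```

For example, with ashift=12 (4KB sectors) a 5000-byte buffer is not aligned and would need a temporary buffer of 8192 bytes.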
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Approved by: Dan McDonald <danmcd@joyent.com>
Author: Andriy Gapon <avg@FreeBSD.org>
illumos/illumos-gate@a4b8c9aa65
https://www.illumos.org/issues/8264
Oddly, there is an lzc_clone() function but no lzc_promote() function.
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Dan McDonald <danmcd@kebe.com>
Approved by: Dan McDonald <danmcd@kebe.com>
Author: Andrew Stormont <astormont@racktopsystems.com>
illumos/illumos-gate@0255edcc85
https://www.illumos.org/issues/8056
The send size estimate for a zvol can be too low, if the size of the record
headers (dmu_replay_record_t's) is a significant portion of the size.
This is typically the case when the data is highly compressible, especially
with embedded blocks.
The problem is that dmu_adjust_send_estimate_for_indirects() assumes that
blocks are the size of the "recordsize" property (128KB).
However, for zvols, the blocks are the size of the "volblocksize" property
(8KB). Therefore, we estimate that there will be 16x fewer record headers than
there really will be.
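The 16x figure follows directly from the two block sizes. The sketch below is a simplification with a hypothetical helper name: it assumes one record header per data block and ignores everything else that the real estimate accounts for.

```c
#include <assert.h>
#include <stdint.h>

/* Number of per-block record headers needed to send data_bytes of data. */
static inline uint64_t
record_header_count(uint64_t data_bytes, uint64_t block_size)
{
	assert(block_size != 0);
	return ((data_bytes + block_size - 1) / block_size);
}
```

For 1GB of data, a 128KB block size gives 8192 headers while an 8KB block size gives 131072, which is 16 times as many.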
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Paul Dagnelie <pcd@delphix.com>
illumos/illumos-gate@dbfd9f9300
https://www.illumos.org/issues/8156
dbuf_evict_notify() holds the dbuf_evict_lock while checking if it should do
the eviction itself (because the evict thread is not able to keep up).
This can result in massive lock contention.
It isn't necessary to hold the lock, because if we make the wrong choice
occasionally, nothing bad will happen.
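A lock-free check of this kind might look like the following sketch. The structure, field names, and the use of C11 atomics are assumptions made for illustration; the real dbuf cache uses its own counters and kernel primitives.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cache bookkeeping; not the real dbuf cache structures. */
typedef struct cache_sketch {
	_Atomic uint64_t cs_size;	/* bytes currently cached */
	uint64_t cs_high_water;		/* evict inline above this size */
} cache_sketch_t;

/*
 * Decide, without taking any lock, whether the caller should evict
 * directly.  A stale read can only lead to one eviction too many or
 * too few, which a later call corrects.
 */
static inline int
cache_should_evict_inline(cache_sketch_t *cs)
{
	assert(cs != NULL);
	return (atomic_load_explicit(&cs->cs_size, memory_order_relaxed) >
	    cs->cs_high_water);
}
```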
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@5b06278253
https://www.illumos.org/issues/8005
RAID-Z requires that space be allocated in multiples of P+1 sectors,
because this is the minimum size block that can have the required amount
of parity. Thus blocks on RAIDZ1 must be allocated in a multiple of 2
sectors; on RAIDZ2, a multiple of 3; and on RAIDZ3, a multiple of 4. A sector
is a unit of 2^ashift bytes, typically 512B or 4KB.
To satisfy this constraint, the allocation size is rounded up to the
proper multiple, resulting in up to 3 "pad sectors" at the end of some
blocks. The contents of these pad sectors are not used, so we do not
need to read or write these sectors. However, some storage hardware
performs much worse (around 1/2 as fast) on mostly-contiguous writes
when there are small gaps of non-overwritten data between the writes.
Therefore, ZFS creates "optional" zio's when writing RAID-Z blocks that
include pad sectors. If writing a pad sector will fill the gap between
two (required) writes, we will issue the optional zio, thus doubling
performance. The gap-filling performance improvement was introduced in
July 2009.
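The round-up to a multiple of P+1 sectors described above, and the pad sectors it produces, can be sketched as follows. The helper names are hypothetical, and `sectors` is assumed to already include both data and parity sectors.

```c
#include <assert.h>
#include <stdint.h>

/* Round a sector count up to a multiple of (nparity + 1). */
static inline uint64_t
raidz_alloc_sectors(uint64_t sectors, unsigned int nparity)
{
	uint64_t mult = (uint64_t)nparity + 1;

	assert(nparity >= 1 && nparity <= 3);
	return (((sectors + mult - 1) / mult) * mult);
}

/* Number of unused pad sectors added by the round-up (at most nparity). */
static inline uint64_t
raidz_pad_sectors(uint64_t sectors, unsigned int nparity)
{
	return (raidz_alloc_sectors(sectors, nparity) - sectors);
}
```

With RAIDZ3, for instance, 5 sectors are rounded up to 8, leaving the maximum of 3 pad sectors.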
Writing the optional zio is done by the io aggregation code in
vdev_queue.c. The problem is that it is also subject to the limit on
the size of aggregate writes, zfs_vdev_aggregation_limit, which is by
default 128KB. For a given block, if the amount of data plus padding
written to a leaf device exceeds zfs_vdev_aggregation_limit, the
optional zio will not be written, resulting in a ~2x performance
degradation.
The problem occurs only for certain values of ashift, compressed block
size, and RAID-Z configuration (number of parity and data disks). It
cannot occur with the default recordsize=128KB. If compression is
enabled, all configurations with recordsize=1MB or larger will be
impacted to some degree.
The problem notably occurs with recordsize=1MB, compression=off, with 10
disks in a RAIDZ2 or RAIDZ3 group (with 512B or 4KB sectors). Therefore
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@adaec86ad2
https://www.illumos.org/issues/8155
When writing pre-compressed buffers, arc_write() requires that the compression
algorithm used to compress the buffer matches the compression algorithm
requested by the zio_prop_t, which is set by dmu_write_policy().
This makes dmu_write_policy() and its callers a bit more complicated.
We can simplify this by making arc_write() trust the caller to supply the type
of pre-compressed buffer that it wants to write, and override the compression
setting in the zio_prop_t.
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@5f10ef697f
https://www.illumos.org/issues/6396
LVM = SVM = Solaris Volume Manager.
It is dead code and is not used on ZFS-based platforms.
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Toomas Soome <tsoome@me.com>
Approved by: Hans Rosenfeld <rosenfeld@grumpf.hope-2000.org>
Author: Yuri Pankov <yuri.pankov@nexenta.com>
illumos/illumos-gate@c5ee46810f
https://www.illumos.org/issues/7578
After some ZIL changes 6 years ago, zil_slog_limit became partially broken
because zl_itx_list_sz was not updated when async itx'es were upgraded to sync.
Because of other changes made around that time, zl_itx_list_sz is not
actually required to implement the functionality, so this patch removes
some unneeded, broken code and variables.
The original idea of zil_slog_limit was to reduce the chance of SLOG abuse by
a single heavy logger, which increased latency for other (more latency-critical)
loggers, by pushing the heavy log out into the main pool instead of the SLOG.
Besides a huge latency increase for heavy writers, this implementation caused a
double write of all data, since the log records were explicitly prepared for
the SLOG. Since we now have an I/O scheduler, I've found it can be much more
efficient to reduce the priority of heavy-logger SLOG writes from
ZIO_PRIORITY_SYNC_WRITE to ZIO_PRIORITY_ASYNC_WRITE, while still leaving them
on the SLOG.
The existing ZIL implementation had a problem with space efficiency when it
had to write large chunks of data into log blocks of limited size. In some
cases efficiency dropped to almost as low as 50%. For a ZIL stored on
spinning rust, that also cut log write speed in half, since the head had to
uselessly fly over allocated but unwritten areas. This change improves
the situation by offloading problematic operations from z*_log_write() to
zil_lwb_commit(), which knows the real state of log block allocation and
can split large requests into pieces much more efficiently. As a side
effect, it also removes one of the two data copy operations done by the ZIL
code in the WR_COPIED case.
While there, untangle and unify the code of the z*_log_write() functions.
Also, zfs_log_write(), like zvol_log_write(), can now handle writes crossing
a block boundary, which may also improve efficiency if ZPL is made to do that.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Andriy Gapon <avg@FreeBSD.org>
Reviewed by: Steven Hartland <steven.hartland@multiplay.co.uk>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Richard Elling <Richard.Elling@RichardElling.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Alexander Motin <mav@FreeBSD.org>
8100 8021 seems to cause random BAD TRAP: type=d (#gp General protection)
illumos/illumos-gate@770499e185
https://www.illumos.org/issues/8021
The ARC buf data project (known simply as "ABD" since its genesis in the ZoL
community) changes the way the ARC allocates `b_pdata` memory from using linear
`void *` buffers to using scatter/gather lists of fixed-size 1KB chunks. This
improves ZFS's performance by helping to defragment the address space occupied
by the ARC, in particular for cases where compressed ARC is enabled. It could
also ease future work to allocate pages directly from `segkpm` for
minimal-overhead memory allocations, bypassing the `kmem` subsystem.
This is essentially the same change as the one which recently landed in ZFS on
Linux, although they made some platform-specific changes while adapting this
work to their codebase:
1. Implemented the equivalent of the `segkpm` suggestion for future work
mentioned above to bypass issues that they've had with the Linux kernel memory
allocator.
2. Changed the internal representation of the ABD's scatter/gather list so it
could be used to pass I/O directly into Linux block device drivers. (This
feature is not available in the illumos block device interface yet.)
https://www.illumos.org/issues/8100
My Supermicro system is getting random BAD TRAP: type=d (#gp General
protection) at about the stage where ZFS filesystems are mounted - usually
the console login prompt is already present but the services are still starting.
After backing out 8021, the boot completes and no panics occur.
The machine does dump; however, savecore fails:
savecore: bad magic number baddcafe
I can get more data out with boot -k, if needed.
# psrinfo -vp
The physical processor has 4 cores and 8 virtual processors (0-7)
The core has 2 virtual processors (0 4)
The core has 2 virtual processors (1 5)
The core has 2 virtual processors (2 6)
The core has 2 virtual processors (3 7)
x86 (GenuineIntel 306C3 family 6 model 60 step 3 clock 3500 MHz)
Intel(r) Xeon(r) CPU E3-1246 v3 @ 3.50GHz
# prtconf -m
32657
$ zpool status
  pool: rpool
 state: ONLINE
  scan: none requested
config:
        NAME        STATE     READ WRITE CKSUM
        rpool       ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            c3t0d0  ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     0
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Dan Kimmel <dan.kimmel@delphix.com>
illumos/illumos-gate@bc83969fdb
https://www.illumos.org/issues/8265
Reserve bit 23 in the zfs send stream flags for the large
dnode feature which has been implemented for Linux.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Brian Behlendorf <behlendorf1@llnl.gov>
illumos/illumos-gate@2d2f193a21
https://www.illumos.org/issues/8166
If we do a scrub while a leaf device is offline (via "zpool offline"),
we will inadvertently clear the DTL (dirty time log) of the offline
device, even though it is still damaged. When the device comes back
online, we will incompletely resilver it, thinking that the scrub
repaired blocks written before the scrub was started. The incomplete
resilver can lead to data loss if there is a subsequent failure of a
different leaf device.
The fix is to never clear the DTL of offline devices. Note that if a
device is onlined while a scrub is in progress, the scrub will be
restarted.
The problem can be worked around by running "zpool scrub" after
"zpool online".
See also https://github.com/zfsonlinux/zfs/issues/5806
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@7855d95b30
https://www.illumos.org/issues/7446
Since we support whole-disk configuration for the boot pool, we will also need
whole-disk support with UEFI boot, and for this, zpool create should create an
EFI system partition.
I have borrowed the idea from Oracle Solaris and am introducing the zpool create
-B switch to provide a way to specify that the boot partition should be created.
However, there is still the question of how big the system partition should be.
For the time being, I have set the default size to 256MB (that's the minimum
size for FAT32 with 4k blocks). To support a custom size, the set-on-creation
"bootsize" property is created, so the custom size can be set as:
zpool create -B -o bootsize=34MB rpool c0t0d0
After the pool is created, the "bootsize" property is read-only. When the -B
switch is not used, bootsize defaults to 0 and is shown in zpool get output
with the value ''. Older zfs/zpool implementations ignore this property.
https://www.illumos.org/rb/r/219/
Reviewed by: Andrew Stormont <andyjstormont@gmail.com>
Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Approved by: Dan McDonald <danmcd@kebe.com>
Author: Toomas Soome <tsoome@me.com>
illumos/illumos-gate@40713f2b24
https://www.illumos.org/issues/8070
Add some ZFS comments left by various developers at different times
Reviewed by: Yuri Pankov <yuri.pankov@gmail.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Alan Somers <asomers@gmail.com>
illumos/illumos-gate@ade42b557a
https://www.illumos.org/issues/8064
It's currently nearly impossible to trace what process places a hold on
a vnode, as the only way holds are placed is via the `VN_HOLD()` and
`VN_HOLD_CALLER()` macros, which inline the bumping of `v_count`. Adding
static DTrace probes to these macros would enable tracing of where
specific vnode references come from.
For completeness and symmetry, a similar static probe should be added to
`vn_rele()` and `vn_rele_dnlc()`.
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Sebastien Roy <seb@delphix.com>
illumos/illumos-gate@b7b2590dd9
https://www.illumos.org/issues/8063
A standard practice in ZFS is to keep track of "per-txg" state. Any of
the 3 active TXG's (open, quiescing, syncing) can have different values
for this state. We should assert that we do not attempt to modify other
(inactive) TXG's.
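The pattern can be sketched as below: per-txg state lives in a small array indexed by the low bits of the txg number, and an assertion guards against touching a slot that belongs to an inactive txg. The `pertxg_counter_t` type and both helpers are hypothetical, and the activity check is a simplification that takes the open txg as a parameter rather than reading pool state.

```c
#include <assert.h>
#include <stdint.h>

#define	TXG_CONCURRENT_STATES	3	/* open, quiescing, syncing */
#define	TXG_SIZE		4	/* power of two, so masking works */
#define	TXG_MASK		(TXG_SIZE - 1)

/* Hypothetical piece of per-txg state: one counter per array slot. */
typedef struct pertxg_counter {
	uint64_t pc_value[TXG_SIZE];
} pertxg_counter_t;

/* Nonzero if txg is the open txg or one of the two txgs just before it. */
static inline int
txg_is_active(uint64_t txg, uint64_t open_txg)
{
	return (txg <= open_txg && open_txg - txg < TXG_CONCURRENT_STATES);
}

/* Modify the slot for txg, asserting that the txg is still active. */
static inline void
pertxg_add(pertxg_counter_t *pc, uint64_t txg, uint64_t open_txg,
    uint64_t delta)
{
	assert(txg_is_active(txg, open_txg));
	pc->pc_value[txg & TXG_MASK] += delta;
}
```

Without the assertion, a write for a stale txg would silently land in a slot that a newer txg is now using.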
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@5f368aef86
https://www.illumos.org/issues/7786
Currently, vdev_online() posts a sysevent only if the previous state was
"offline". It should also post the event when the state changes from "removed"
or "faulted" to "healthy" or "degraded".
This will fix the following scenario:
- pull disk from slot A
- check that hotspare has taken its place (if available)
- insert disk into slot B
- check that hotspare moved back to "avail" state (if spare was used)
The problem here is that we don't get any ESC_ZFS_VDEV_* notification and fail
to update the vdev FRU.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Albert Lee <trisk@forkgnu.org>
Author: Yuri Pankov <yuri.pankov@nexenta.com>
illumos/illumos-gate@def4fac588
https://www.illumos.org/issues/8025
dbuf_read() creates a zio_root() to track and wait for all the zio's
that may happen as part of this call. However, if the blkptr_t for
this buffer is NULL or a hole, we will not create any more zio's, so
this zio_root() is unnecessary. This is always the case when calling
dbuf_read() on a bonus buffer, because it has no blkptr (it's part of
the containing dnode). For workloads that read a lot of bonus buffers
(e.g. file creation and removal), creating and destroying these
unnecessary zio's can decrease performance by around 3%.
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@b67dde11a7
https://www.illumos.org/issues/5814
Let's pull in this patch from FreeBSD:
http://svnweb.freebsd.org/base?view=revision&revision=271781
bpobj_iterate_impl(): Close a refcount leak iterating on a sublist.
If bpobj_space() returned non-zero here, the sublist would have been
left open, along with the bonus buffer hold it requires. This call
does not invoke any calls to bpobj_close() itself.
This bug doesn't have any known vector, but was found on inspection.
MFC after: 1 week
Sponsored by: Spectra Logic
Affects: All ZFS versions starting 21 May 2010 (illumos cde58dbc)
MFSpectraBSD: r1050998 on 2014/03/26
Fix bpobj_iterate_impl() to properly call bpobj_close() if bpobj_space()
returns an error.
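The shape of the leak and its fix can be sketched with stand-in functions. None of the names below are real bpobj interfaces; `sublist_open()`, `sublist_close()`, and `sublist_space()` are hypothetical placeholders that only track a hold count.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical sublist handle that only counts outstanding holds. */
typedef struct sublist_sketch {
	int ss_holds;
} sublist_sketch_t;

static int
sublist_open(sublist_sketch_t *ss)
{
	ss->ss_holds++;
	return (0);
}

static void
sublist_close(sublist_sketch_t *ss)
{
	assert(ss->ss_holds > 0);
	ss->ss_holds--;
}

/* Stand-in for the space query; fails on request to exercise the error path. */
static int
sublist_space(const sublist_sketch_t *ss, int inject_error)
{
	(void) ss;
	return (inject_error ? EIO : 0);
}

int
sublist_iterate(sublist_sketch_t *ss, int inject_error)
{
	int err = sublist_open(ss);

	if (err != 0)
		return (err);

	err = sublist_space(ss, inject_error);
	if (err != 0) {
		sublist_close(ss);	/* without this, the hold would leak */
		return (err);
	}

	/* ... per-entry work would go here ... */

	sublist_close(ss);
	return (0);
}
```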
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <paul.dagnelie@delphix.com>
Reviewed by: Simon Klinkert <simon.klinkert@gmail.com>
Approved by: Gordon Ross <gwr@nexenta.com>
Author: Will Andrews <will@freebsd.org>
illumos/illumos-gate@af868f46a5
https://www.illumos.org/issues/6914
This change allows the kernel to use more virtual address space. This will
allow us to devote 1.5x physmem for the zio arena, and an additional 1.5x
physmem for the kernel heap.
We saw a hang when unable to find any 128K contiguous memory segments. Looking
at the core file we see many threads in stacks similar to this:
> ffffff68c9c87c00::findstack -v
stack pointer for thread ffffff68c9c87c00: ffffff02cd63d8b0
[ ffffff02cd63d8b0 _resume_from_idle+0xf4() ]
ffffff02cd63d8e0 swtch+0x141()
ffffff02cd63d920 cv_wait+0x70(ffffff6009b1b01e, ffffff6009b1b020)
ffffff02cd63da50 vmem_xalloc+0x640(ffffff6009b1b000, 20000, 1000, 0, 0, 0, 0, ffffff0200000004)
ffffff02cd63dac0 vmem_alloc+0x135(ffffff6009b1b000, 20000, 4)
ffffff02cd63db60 segkmem_xalloc+0x171(ffffff6009b1b000, 0, 20000, 4, 0, fffffffffb885fe0, fffffffffbcefa10)
ffffff02cd63dbc0 segkmem_alloc_vn+0x4a(ffffff6009b1b000, 20000, 4, fffffffffbcefa10)
ffffff02cd63dbf0 segkmem_zio_alloc+0x20(ffffff6009b1b000, 20000, 4)
ffffff02cd63dd20 vmem_xalloc+0x5b1(ffffff6009b1c000, 20000, 1000, 0, 0, 0, 0, 4)
ffffff02cd63dd90 vmem_alloc+0x135(ffffff6009b1c000, 20000, 4)
ffffff02cd63de20 kmem_slab_create+0x8d(ffffff605fd37008, 4)
ffffff02cd63de80 kmem_slab_alloc+0x11e(ffffff605fd37008, 4)
ffffff02cd63dee0 kmem_cache_alloc+0x233(ffffff605fd37008, 4)
ffffff02cd63df10 zio_data_buf_alloc+0x5b(20000)
ffffff02cd63df70 arc_get_data_buf+0x92(ffffff6265a70588, 20000, ffffff901fd796f8)
ffffff02cd63dfb0 arc_buf_alloc_impl+0x9c(ffffff6265a70588, ffffff6d233ab0b8)
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Igor Kozhukhov <ikozhukhov@gmail.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Approved by: Garrett D'Amore <garrett@damore.org>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@0c94e1af67
https://www.illumos.org/issues/7256
error = dmu_sync(zio, lr->lr_common.lrc_txg,
zfs_get_done, zgd);
ASSERT(error || lr->lr_length <= zp->z_blksz);
It's possible, although extremely rare, that the zfs_get_done() callback is
executed before dmu_sync() returns.
In that case the znode's range lock is dropped and the znode is unreferenced.
Thus, the assertion can access some invalid or wrong data via the zp pointer.
The size variable caches the correct value of z_blksz and can be safely used here.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Andriy Gapon <andriy.gapon@clusterhq.com>
illumos/illumos-gate@80e10fd0d2
https://www.illumos.org/issues/5379
The following is based on a review of the illumos code and on a similar problem
reported for FreeBSD where the relevant code is different.
Looking at this block of code
http://src.illumos.org/source/xref/illumos-gate/usr/src/uts/common/fs/zfs/zfs_vnops.c#4187
I see code to set up an sa_bulk_attr_t object, I see code to set up mtime and
ctime values, but I do not see code to actually apply the attributes...
I would expect there to be a call to sa_bulk_update(), there is such a call in
zfs_write() for instance.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Prashanth Sreenivasa <pks@delphix.com>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Author: Andriy Gapon <andriy.gapon@clusterhq.com>
illumos/illumos-gate@b127fe3c05
https://www.illumos.org/issues/6101
lzc_create(), or more correctly, zfs_ioc_create() does not reject an attempt to
create a filesystem as a child of a volume, instead it proceeds to a crash.
A crash stack obtained on FreeBSD:
page fault while in kernel mode
zap_leaf_lookup()
fzap_lookup()
zap_lookup_norm()
zap_lookup()
zfs_get_zplprop()
zfs_fill_zplprops_impl()
zfs_ioc_create()
zfsdev_ioctl()
devfs_ioctl_f()
kern_ioctl()
sys_ioctl()
This crash happened with a kernel without debugging assertions.
The immediate cause of the crash appears to be an attempt to interpret a zvol
object as a zap object.
For filesystems:
#define MASTER_NODE_OBJ 1
For zvols:
#define ZVOL_OBJ 1ULL
#define ZVOL_ZAP_OBJ 2ULL
So, I see two problems here:
1. an attempt to create a filesystem under a zvol should be rejected as
early as possible, maybe in zfs_fill_zplprops()
2. maybe zap_lookup / zap_lockdir should reject objects that are not of one
of the zap object types
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Andriy Gapon <avg@FreeBSD.org>
illumos/illumos-gate@7f0bdb4257
https://www.illumos.org/issues/8061
sa_find_idx_tab() is declared as taking and returning "void *" parameters.
These can be declared to be the specific types.
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@6b03625981
https://www.illumos.org/issues/8026
zfs_throttle_delay and zfs_throttle_resolution have been unused since the new
write throttling mechanism was introduced.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Andriy Gapon <avg@FreeBSD.org>
illumos/illumos-gate@313ae1e182
https://www.illumos.org/issues/8027
dsl_pool_dirty_delta() should not wake up waiters when dp->dp_dirty_total ==
zfs_dirty_data_max, because they wait for dp_dirty_total to fall strictly below
the threshold.
It's probably very rare for that condition to occur, but it's better to have
more accurate code.
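The boundary condition can be sketched as follows. The structure and function are hypothetical stand-ins for the pool's dirty-data accounting, and the return value takes the place of signalling the waiters.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the pool's dirty-data accounting. */
typedef struct dirty_sketch {
	uint64_t ds_dirty_total;	/* plays the role of dp_dirty_total */
	uint64_t ds_dirty_max;		/* plays the role of zfs_dirty_data_max */
} dirty_sketch_t;

/*
 * Apply a signed delta and report whether waiters should be woken.
 * Waiters sleep until the total is strictly below the maximum, so
 * waking them when the two are equal would only make them sleep again.
 */
static inline int
dirty_delta_sketch(dirty_sketch_t *ds, int64_t delta)
{
	if (delta < 0) {
		uint64_t dec = 0 - (uint64_t)delta;

		assert(ds->ds_dirty_total >= dec);
		ds->ds_dirty_total -= dec;
	} else {
		ds->ds_dirty_total += (uint64_t)delta;
	}
	return (ds->ds_dirty_total < ds->ds_dirty_max);
}
```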
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Andriy Gapon <avg@FreeBSD.org>
illumos/illumos-gate@c040c10cdd
https://www.illumos.org/issues/7885
When a member of a RAIDZ has been replaced with a device smaller than the
original, then the top level vdev can report its expand size as 16.0E.
The reduced child asize causes the RAIDZ to have a vdev_asize lower than its
vdev_max_asize, which then results in an underflow during the calculation of the
parent's expand size.
Also, for RAIDZ vdevs the sum of their child vdev_min_asize values could be
smaller than the parent's vdev_min_size.
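The 16.0E value is what an unsigned 64-bit subtraction produces when it wraps, since 2^64 bytes is 16 exbibytes. The sketch below shows the general effect and a guarded alternative; the function and parameter names are hypothetical, and it does not reproduce the actual expand-size calculation or the fix in the pull request.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned subtraction wraps: if current > limit, the result is near 2^64. */
static inline uint64_t
expand_size_naive(uint64_t limit, uint64_t current)
{
	return (limit - current);
}

/* Clamp to zero instead of wrapping when there is no room to expand. */
static inline uint64_t
expand_size_guarded(uint64_t limit, uint64_t current)
{
	uint64_t esize = (limit > current) ? limit - current : 0;

	assert(esize <= limit);
	return (esize);
}
```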
Fixed by: https://github.com/openzfs/openzfs/pull/296
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Author: Steven Hartland <steven.hartland@multiplay.co.uk>
illumos/illumos-gate@94c2d0eb22
https://www.illumos.org/issues/7968
spa_sync() iterates over all the dirty dnodes and processes each of them by
calling dnode_sync(). If there are many dirty dnodes (e.g. because we created
or removed a lot of files), the single thread of spa_sync() calling
dnode_sync() can become a bottleneck. Additionally, if many dnodes are dirtied
concurrently in open context (e.g. due to concurrent file creation), the
os_lock will experience lock contention via dnode_setdirty().
The solution is to track dirty dnodes on a multilist_t, and for spa_sync() to
use separate threads to process each of the sublists in the multilist.
On the concurrent file creation microbenchmark, the performance improvement
from dnode_setdirty() is up to 7%. Additionally, the wall clock time spent in
spa_sync() is reduced to 15%-40% of the single-threaded case. In terms of
cost/reward, once the other bottlenecks are addressed, fixing this bug will provide
a medium-large performance gain and require a medium amount of effort to
implement.
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@10fbdecb05
https://www.illumos.org/issues/7970
The global tunable zfs_arc_num_sublists_per_state is used by the ARC and
the dbuf cache, and other users are planned. We should change this
tunable to be common to all multilists.
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brad Lewis <brad.lewis@delphix.com>
Reviewed by: Saso Kiselkov <saso.kiselkov@nexenta.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@411be58a6e
https://www.illumos.org/issues/7801
Add *_by_dnode() routines for accessing objects given their
dnode_t *. This is more efficient than accessing the object by
(objset_t *, uint64_t object). This change converts some but
not all of the existing consumers. As performance-sensitive
code paths are discovered they should be converted to use
these routines.
Ported from:
https://github.com/zfsonlinux/zfs/commit/0eef1bde31d67091d3deed23fe2394f5a8bf2276
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@b0c42cd470
https://www.illumos.org/issues/7801
Add *_by_dnode() routines for accessing objects given their
dnode_t *. This is more efficient than accessing the object by
(objset_t *, uint64_t object). This change converts some but
not all of the existing consumers. As performance-sensitive
code paths are discovered they should be converted to use
these routines.
Ported from:
https://github.com/zfsonlinux/zfs/commit/0eef1bde31d67091d3deed23fe2394f5a8bf2276
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: bzzz77 <bzzz.tomas@gmail.com>
illumos/illumos-gate@a3905a4592
https://www.illumos.org/issues/7869
The issue fixed by this patch is a race condition in the deadlist code.
A thread executing an administrative command that uses
`dsl_deadlist_space_range()` holds the lock of the whole `deadlist_t` to
protect access to all of the entries that the deadlist keeps in an
AVL tree.
Sync threads trying to insert a new entry in the deadlist
(through `dsl_deadlist_insert()` -> `dle_enqueue()`) do not hold the
deadlist lock at that moment. If the `dle_bpobj` is the empty bpobj (our
sentinel value), we close and reopen it. Between these two operations,
it is possible for the `dsl_deadlist_space_range()` thread to dereference
that bpobj which is `NULL` during that window.
Threads should hold a deadlist's `dl_lock` when they manipulate its
internal data so scenarios like the one above are avoided. In addition,
threads should also hold the bpobj lock whenever they are allocating the
subobj list of a bpobj, and not just when they actually insert the subobj
to the list. This way we can avoid potential memory leaks.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: George Melikov <mail@gmelikov.ru>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Serapheim Dimitropoulos <serapheim@delphix.com>
illumos/illumos-gate@61e255ce72 https://www.illumos.org/issues/7793
Background information: This assertion about tx_space_* verifies that we
are not dirtying more stuff than we thought we would. We “need” to know
how much we will dirty so that we can check if we should fail this
transaction with ENOSPC/EDQUOT, in dmu_tx_assign(). While the
transaction is open (i.e. between dmu_tx_assign() and dmu_tx_commit() —
typically less than a millisecond), we call dbuf_dirty() on the exact
blocks that will be modified. Once this happens, the temporary
accounting in tx_space_* is unnecessary, because we know exactly what
blocks are newly dirtied; we call dnode_willuse_space() to track this
more exact accounting.
The fundamental problem causing this bug is that dmu_tx_hold_*() relies
on the current state in the DMU (e.g. dn_nlevels) to predict how much
will be dirtied by this transaction, but this state can change before we
actually perform the transaction (i.e. call dbuf_dirty()).
This bug will be fixed by removing the assertion that the tx_space_*
accounting is perfectly accurate (i.e. we never dirty more than was
predicted by dmu_tx_hold_*()). By removing the requirement that this
accounting be perfectly accurate, we can also vastly simplify it, e.g.
removing most of the logic in dmu_tx_count_*().
The new tx space accounting will be very approximate, and may be more or
less than what is actually dirtied. It will still be used to determine
if this transaction will put us over quota. Transactions that are marked
by dmu_tx_mark_netfree() will be excepted from this check. We won’t make
an attempt to determine how much space will be freed by the transaction
— this was rarely accurate enough to determine if a transaction should
be permitted when we are over quota, which is why dmu_tx_mark_netfree()
was introduced in 2014.
We also won’t attempt to give “credit” when overwriting existing blocks,
if those blocks may be freed. This allows us to remove the
do_free_accounting logic in dbuf_dirty(), and associated routines. This
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@48bbca8168 https://www.illumos.org/issues/7812
This change removes all gendered language that did not refer specifically
to an individual person or pet. The convention taken was to use
variations on "they" when referring to users and/or human beings, while
using "it" when referring to code, functions, and/or libraries.
Additionally, we took the liberty to fix up any whitespace issues that
were found in any files that were already being modified.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Prakash Surya <prakash.surya@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Chris Williamson <chris.williamson@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Igor Kozhukhov <igor@dilos.org>
Reviewed by: Dan McDonald <danmcd@omniti.com>
Reviewed by: Robert Mustacchi <rm@joyent.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
Author: Daniel Hoffman <dj.hoffman@delphix.com>
illumos/illumos-gate@1c17160ac5 https://www.illumos.org/issues/1300
Problem:
We can create an invisible file in ZFS.
How to reproduce:
0. Prepare a ZFS pool with normalization=formD.
# cat cat /etc/release
cat: cat: No such file or directory
OpenIndiana Development oi_151 X86 (powered by illumos)
Copyright 2011 Oracle and/or its affiliates. All rights reserved.
Use is subject to license terms.
Assembled 28 April 2011
# mkfile 100M 100M
# zpool create -O utf8only=on -O normalization=formD test_pool $( pwd )/100M
# zpool upgrade
This system is currently running ZFS pool version 28.
All pools are formatted using this version.
# zfs get normalization test_pool
NAME PROPERTY VALUE SOURCE
test_pool normalization formD -
# chmod 777 /test_pool
1. Create a NFD file.
$ cd /test_pool/
$ cp /etc/release $( echo "\x75\xcc\x88" )
$ ls -la
total 4
drwxrwxrwx 2 root root 3 2011-07-29 08:53 .
drwxr-xr-x 25 root root 26 2011-07-29 08:53 ..
-r--r--r-- 1 test1 staff 251 2011-07-29 08:53 u?
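The file name used in step 1 is the byte sequence 75 cc 88, which is UTF-8 for "u" followed by U+0308 (combining diaeresis), i.e. the decomposed (NFD) spelling of "ü". The following is a minimal sketch of a decoder that makes this visible; `utf8_decode` is a hypothetical helper written for this illustration (it handles only 1- to 3-byte sequences) and is not part of ZFS.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decode a UTF-8 byte string into code points.  Only 1- to 3-byte
 * sequences are handled.  Returns the number of code points written
 * to out, or -1 on malformed input or if out is too small.
 */
static int
utf8_decode(const unsigned char *s, size_t len, uint32_t *out, size_t cap)
{
	size_t n = 0;
	size_t i = 0;

	while (i < len) {
		uint32_t cp;
		size_t need;

		if (s[i] < 0x80) {
			cp = s[i];
			need = 0;
		} else if ((s[i] & 0xE0) == 0xC0) {
			cp = s[i] & 0x1F;
			need = 1;
		} else if ((s[i] & 0xF0) == 0xE0) {
			cp = s[i] & 0x0F;
			need = 2;
		} else {
			return (-1);
		}
		i++;
		/* Consume the continuation bytes (10xxxxxx). */
		for (; need > 0; need--) {
			if (i >= len || (s[i] & 0xC0) != 0x80)
				return (-1);
			cp = (cp << 6) | (s[i] & 0x3F);
			i++;
		}
		if (n >= cap)
			return (-1);
		out[n++] = cp;
	}
	return ((int)n);
}
```

Decoding 75 cc 88 yields two code points (U+0075, U+0308), whereas the precomposed (NFC) form c3 bc yields the single code point U+00FC. Both spellings name the same character, which is why normalization matters for lookups.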
Reviewed by: Yuri Pankov <yuri.pankov@nexenta.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Kevin Crowe <kevin.crowe@nexenta.com>
illumos/illumos-gate@54207fd2e1 https://www.illumos.org/issues/4242
From Joyent's OS-2557:
So we're basically just doing a check here that after we got a 'rename' event
for our file, the file has actually been moved out of the way.
What I've seen in bh1-stage2 is that this happens as you'd expect for all zones
but many times over the last week it's failed because when we do the
fs.exists() check here, the file that we got the rename event for, still exists.
I've confirmed what's happening using the following dtrace script:
#!/usr/sbin/dtrace -s
#pragma D option quiet
#pragma D option bufsize=256k
syscall::open:entry,
syscall::open64:entry
/copyinstr(arg0) == "/var/svc/provisioning" || (strlen(copyinstr(arg0)) == 69
&& substr(copyinstr(arg0), 48) == "/var/svc/provisioning")/
{
this->watching_open = 1;
printf("%d zone %s process %s(%d) [%s] open(%s)\n",
timestamp,
zonename,
execname,
pid,
curpsinfo->pr_psargs,
copyinstr(arg0));
}
syscall::open:return,
syscall::open64:return
/this->watching_open == 1/
Reviewed by: Robert Mustacchi <rm@joyent.com>
Reviewed by: Marcel Telka <marcel@telka.sk>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Jerry Jelinek <jerry.jelinek@joyent.com>
illumos/illumos-gate@7de35a3ed0 https://www.illumos.org/issues/7740
The problem is that dbuf_findbp will return ENOENT if the block it's
trying to find is beyond the end of the file. If that happens, we assume
there is no birth time, and so we lose that information when we write
out new blkptrs. We should teach dbuf_findbp to look for things that are
beyond the current end, but not beyond the absolute end of the file.
To verify, create a large file, truncate it to a short length, and then
write beyond the end. Check with zdb to make sure that there are no
holes with birth time zero (will appear as gaps).
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Paul Dagnelie <pcd@delphix.com>
illumos/illumos-gate@b7f9f60c8e https://www.illumos.org/issues/7779
zfsctl_ops_shares_dir and ZFSCTL_INO_SHARES are essentially dead code.
While there, fix the index range check in zfsctl_root_inode_cb.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Serapheim Dimitropoulos <serapheim@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Andriy Gapon <andriy.gapon@clusterhq.com>
illumos/illumos-gate@555da5111b https://www.illumos.org/issues/7743
When loading a pool that had been created before the existence of
per-vdev zaps, on a system that knows about per-vdev zaps, the
per-vdev zaps will not be allocated and initialized.
This appears to be because the logic that would have done so, in
spa_sync_config_object(), is not reached under normal operation. It is
only reached if spa_config_dirty_list is non-empty.
The fix is to add another `AVZ_ACTION_` enum that will allow this code
to be reached when we detect that we're loading an old pool, even when
there are no dirty configs.
Reviewed by: Matt Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Don Brady <don.brady@intel.com>
Approved by: Robert Mustacchi <rm@joyent.com>
Author: Paul Dagnelie <pcd@delphix.com>
illumos/illumos-gate@5f14577801 https://www.illumos.org/issues/7613
metaslab_t:ms_freetree[TXG_SIZE] is only used in syncing context. We should
replace it with two trees: the freeing tree (ranges that we are freeing this
syncing txg) and the freed tree (ranges which have been freed this txg).
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Alex Reece <alex@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@4ba5b96163 https://www.illumos.org/issues/7586
The #ifdef __lint in dmu.h is ugly, and it would be nice not to duplicate it if
we add other inline functions into header files in ZFS, especially since it is
difficult to make any other solution work across all compilation targets. We
should switch to disabling the lint flags that are failing instead.
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Dan Kimmel <dan.kimmel@delphix.com>
illumos/illumos-gate@1a01181fdc https://www.illumos.org/issues/7580
We need to prevent any reader whenever we're about to zero out all the
blkptrs. To do this we need to grab the dn_struct_rwlock as writer in
dbuf_write_children_ready and free_children just prior to calling bzero.
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: George Wilson <george.wilson@delphix.com>
illumos/illumos-gate@7588687e6b https://www.illumos.org/issues/7606
When importing a pool with a large number of filesystems within the same
parent filesystem, we see that dmu_objset_find_dp() takes a long time.
It is called from 3 places: spa_check_logs(), spa_ld_claim_log_blocks(),
and spa_load_verify().
There are several ways to improve performance here:
1. We don't really need to do spa_check_logs() or
spa_ld_claim_log_blocks() if the pool was closed cleanly.
2. spa_load_verify() uses dmu_objset_find_dp() to check that no
datasets have names that are too long.
3. dmu_objset_find_dp() is slow because it's doing
zap_value_search() (which is O(N sibling datasets)) to determine
the name of each dsl_dir when it's opened. In this case we
actually know the name when we are opening it, so we can provide
it and avoid the lookup.
This change implements fix #3 from the above list; i.e. make
dmu_objset_find_dp() provide the name of the dataset so that we don't
have to search for it.
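The cost difference behind fix #3 can be sketched with a toy directory. The `toy_*` names and the comparison counter are hypothetical and exist only for this illustration; `toy_value_search()` plays the role of the O(N) reverse lookup, and the second walk shows the traversal handing over the name it already has.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a parent directory mapping names to objects. */
typedef struct toy_child {
	const char *name;
	uint64_t obj;
} toy_child_t;

typedef struct toy_parent {
	const toy_child_t *children;
	size_t nchildren;
	size_t compares;	/* work done by reverse lookups */
} toy_parent_t;

/* Reverse lookup: find the name for an object number, O(N) per call. */
static const char *
toy_value_search(toy_parent_t *p, uint64_t obj)
{
	for (size_t i = 0; i < p->nchildren; i++) {
		p->compares++;
		if (p->children[i].obj == obj)
			return (p->children[i].name);
	}
	return (NULL);
}

/* Old style: every child open searches for its own name, O(N^2) total. */
static size_t
toy_walk_by_search(toy_parent_t *p)
{
	size_t visited = 0;

	for (size_t i = 0; i < p->nchildren; i++) {
		if (toy_value_search(p, p->children[i].obj) != NULL)
			visited++;
	}
	return (visited);
}

/* New style: the walk already knows the name and passes it along. */
static size_t
toy_walk_with_name(toy_parent_t *p)
{
	size_t visited = 0;

	for (size_t i = 0; i < p->nchildren; i++) {
		const char *name = p->children[i].name;

		if (name != NULL)
			visited++;
	}
	return (visited);
}
```

With N sibling datasets the first walk performs on the order of N*N/2 comparisons, while the second performs none, which matches the slowdown reported for parents with many children.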
Reviewed by: Steve Gonczi <steve.gonczi@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Prashanth Sreenivasa <prashksp@gmail.com>
Approved by: Gordon Ross <gordon.w.ross@gmail.com>
Author: Matthew Ahrens <mahrens@delphix.com>
illumos/illumos-gate@5602294fda https://www.illumos.org/issues/7252
This feature includes code to allow a system with compressed ARC enabled to
send data in its compressed form straight out of the ARC, and receive data in
its compressed form directly into the ARC.
https://www.illumos.org/issues/7628
We should have longer, more readable versions of the ZFS send / recv options.
7628 create long versions of ZFS send / receive options
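Long options of this kind are typically wired up with getopt_long(3), mapping each long name onto the existing short flag. The sketch below is illustrative only: `toy_send_flags_t` and `toy_parse_send_opts()` are hypothetical, and the long names in the table are assumptions chosen for the example; consult the zfs manual page for the names actually accepted.

```c
#include <assert.h>
#include <getopt.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag set; only a few send options are modelled. */
typedef struct toy_send_flags {
	bool replicate;
	bool props;
	bool compressed;
	bool verbose;
	bool dryrun;
} toy_send_flags_t;

/*
 * Parse short and long options into *f.  Returns the index of the first
 * non-option argument, or -1 if an unknown option is seen.
 */
static int
toy_parse_send_opts(int argc, char **argv, toy_send_flags_t *f)
{
	/* Long names here are assumptions for illustration. */
	static const struct option long_options[] = {
		{ "replicate",	no_argument,	NULL, 'R' },
		{ "props",	no_argument,	NULL, 'p' },
		{ "compressed",	no_argument,	NULL, 'c' },
		{ "verbose",	no_argument,	NULL, 'v' },
		{ "dryrun",	no_argument,	NULL, 'n' },
		{ NULL,		0,		NULL, 0 }
	};
	int c;

	optind = 1;	/* start scanning at the first argument */
	opterr = 0;	/* report errors via the return value only */
	while ((c = getopt_long(argc, argv, "Rpcvn", long_options,
	    NULL)) != -1) {
		switch (c) {
		case 'R': f->replicate = true; break;
		case 'p': f->props = true; break;
		case 'c': f->compressed = true; break;
		case 'v': f->verbose = true; break;
		case 'n': f->dryrun = true; break;
		default: return (-1);
		}
	}
	return (optind);
}
```

Because each long option returns the same character as its short form, the existing switch statement handles both spellings without duplication.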
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: John Kennedy <john.kennedy@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: Paul Dagnelie <pcd@delphix.com>
Reviewed by: Pavel Zakharov <pavel.zakharov@delphix.com>
Reviewed by: Sebastien Roy <sebastien.roy@delphix.com>
Reviewed by: David Quigley <dpquigl@davequigley.com>
Reviewed by: Thomas Caputi <tcaputi@datto.com>
Approved by: Dan McDonald <danmcd@omniti.com>
Author: Dan Kimmel <dan.kimmel@delphix.com>