The issue with hot spares in ZoL is that it opens all leaf
vdevs exclusively (O_EXCL). On Linux, an exclusive open causes
subsequent exclusive opens to fail with EBUSY.
This could be resolved by not opening any of the devices
exclusively, which is what Illumos does, but the additional
protection offered by exclusive opens is desirable. It cleanly
prevents you from accidentally adding an in-use non-ZFS device
to your pool.
To fix this we very slightly relaxed the usage of O_EXCL in
the following ways.
1) Functions which open the device but only read had the
O_EXCL flag removed and were updated to use O_RDONLY.
2) A common holder was added to the vdev disk code. This
allows the ZFS code to internally open the device multiple
times but non-ZFS callers may not.
3) An exception was added to make_disks() for hot spares when
creating partition tables. For hot spare devices which
are already opened exclusively we skip creating the partition
table because this must already have been done when the disk
was originally added as a hot spare.
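The common holder idea from item 2 can be pictured with a small
user-space model. This is only a sketch under stated assumptions:
toy_bdev, toy_open_excl(), toy_close_excl(), and the holder tags are
hypothetical names for illustration, not the kernel block device API
or the actual vdev disk code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a block device that tracks its exclusive holder. */
struct toy_bdev {
	const void *holder;	/* NULL when nobody holds the device */
	int opens;		/* number of opens by the current holder */
};

/*
 * Model of an exclusive open: succeeds when the device is unclaimed or
 * already claimed by the same holder, otherwise fails with EBUSY.
 */
static int
toy_open_excl(struct toy_bdev *bdev, const void *holder)
{
	assert(bdev != NULL && holder != NULL);

	if (bdev->holder != NULL && bdev->holder != holder)
		return (EBUSY);

	bdev->holder = holder;
	bdev->opens++;
	return (0);
}

/* Drop one open; the claim is released when the last open goes away. */
static void
toy_close_excl(struct toy_bdev *bdev)
{
	assert(bdev != NULL && bdev->opens > 0);

	if (--bdev->opens == 0)
		bdev->holder = NULL;
}
```

In this model every ZFS open passes the same tag and succeeds
repeatedly, while an open with any other tag fails with EBUSY until
the last ZFS open is closed.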
Additional minor changes include fixing check_in_use() to use
a partition instead of a slice suffix, and moving is_spare()
above make_disks() to avoid adding a forward reference.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #250
As described by the comment and enforced by the assertion,
v->vdev_wholedisk will never be -1. The wholedisk handling
is performed by the user space utilities. To prevent confusion
this dead code is being removed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When vdev_disk.c was implemented for Linux we failed to handle the
reopen case. According to the vdev_reopen() comment leaf vdevs should
not be closed or opened when v->vdev_reopening is set. Under Linux
we would always close and open the device.
This issue was only noticed when a 'zpool scrub' command was run while
the leaf vdev device names in /dev/disk/by-vdev were missing. The
scrub command calls vdev_reopen() which caused the vdevs to be closed
but they couldn't be reopened due to the missing links. The result
was that all the vdevs were marked unavailable and the pool was
halted due to failmode=wait.
This patch adds the missing functionality in a similar fashion to
the Illumos code.
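The intended reopen behavior can be sketched with a toy model. The
struct and function names below are hypothetical and the logic is
deliberately simplified; the real vdev_disk.c and Illumos code differ
in detail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified leaf vdev; only the fields the sketch needs. */
struct toy_vdev {
	bool reopening;	/* models v->vdev_reopening */
	bool is_open;	/* device handle currently held */
	int open_calls;	/* how many times the device was really opened */
};

/* Open the device unless this is a reopen of an already open leaf. */
static int
toy_vdev_open(struct toy_vdev *v)
{
	assert(v != NULL);

	if (v->reopening && v->is_open)
		return (0);	/* keep the existing handle */

	v->is_open = true;
	v->open_calls++;
	return (0);
}

/* Close the device unless a reopen is in progress. */
static void
toy_vdev_close(struct toy_vdev *v)
{
	assert(v != NULL);

	if (v->reopening)
		return;

	v->is_open = false;
}
```

With the reopening flag set, a close followed by an open leaves the
original handle in place, so a missing /dev/disk/by-vdev link can no
longer make the leaf unavailable in this model.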
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
To determine whether the kernel is capable of handling empty barrier
BIOs, we check for the presence of the bio_empty_barrier() macro,
which was introduced in 2.6.24. If this macro is defined, then we can
flush disk vdevs; if it isn't, then flushing is disabled.
Unfortunately, the bio_empty_barrier() macro was removed in 2.6.37,
even though the kernel is still capable of handling empty barrier BIOs.
As a result, flushing is effectively disabled on kernels >= 2.6.37,
meaning that starting from this kernel version, zfs doesn't use
barriers to guarantee on-disk data consistency. This is quite bad and
can lead to potential data corruption on power failures.
This patch fixes the issue by removing the configure check for
bio_empty_barrier(), as we don't support kernels <= 2.6.24 anymore.
Thanks to Richard Kojedzinszky for catching this nasty bug.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1318
As of Linux 3.4 the UMH_WAIT_* constants were renumbered. In
particular, the meaning of "1" changed from UMH_WAIT_PROC (wait for
process to complete), to UMH_WAIT_EXEC (wait for the exec, but not the
process). A number of call sites used the number 1 instead of the
constant name, so the behavior was not as expected on kernels with this
change.
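A small sketch shows why a literal 1 is fragile here. Only the
meaning of "1" before and after the change comes from the text above;
the other enum values and all names are illustrative assumptions, so
consult the kernel headers for the real definitions.

```c
#include <assert.h>

/* Illustrative values only; just the meaning of 1 is taken from the text. */
enum toy_umh_old { OLD_UMH_WAIT_EXEC = 0, OLD_UMH_WAIT_PROC = 1 };
enum toy_umh_new { NEW_UMH_WAIT_EXEC = 1, NEW_UMH_WAIT_PROC = 2 };

/*
 * Returns 1 if the given wait value means "wait for the process to
 * complete" under the old (renumbered == 0) or new numbering.
 */
static int
toy_waits_for_exit(int wait, int renumbered)
{
	assert(wait >= 0);

	if (renumbered)
		return (wait == NEW_UMH_WAIT_PROC);

	return (wait == OLD_UMH_WAIT_PROC);
}
```

A call site that passes the number 1 waits for process completion
under the old numbering but not under the new one, whereas a call site
that passes the constant by name keeps its meaning.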
One visible consequence of this change was that processes accessing
automounted snapshots received an ELOOP error because they failed to
wait for zfs.mount to complete.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #816
The current state of udev and device-mapper devices makes it difficult
to construct a mapping of DM partitions and their underlying DM device.
For example, with a /dev directory with the following contents:
$ ls -d /dev/dm-*
/dev/dm-0
/dev/dm-1
/dev/dm-2
/dev/dm-3
it is not immediately apparent if these are completely separate devices,
or partitions and real devices intermixed. In contrast, SCSI devices
would appear as so:
$ ls -d /dev/sd*
/dev/sda
/dev/sda1
/dev/sdb
/dev/sdb1
Here, one can immediately determine that there are two devices (sda and
sdb), each containing a single partition. The lack of a predictable and
consistent mapping from DM devices to DM device partitions makes it
difficult for user space to process these devices the same way it does
SCSI devices.
As a result, the ZFS utilities do not partition DM devices, and instead
set the "vdev_wholedisk" label to 0 and treat them as partitions. This
has the side effect that, even if ZFS has sole ownership of the device,
the IO scheduler will not be modified because it is treated as a
partition.
This change adds an exception for DM devices in vdev_elevator_switch,
allowing the elevator to be modified even though the "vdev_wholedisk"
property is not set.
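The resulting decision can be sketched as follows. This is a
hypothetical helper, not the code in vdev_elevator_switch(), and
matching on a "dm-" name prefix is an assumption made for the example;
the commit text does not say how DM devices are detected.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Decide whether the I/O scheduler may be changed for a device.
 * Whole disks always qualify; DM devices qualify even though their
 * wholedisk flag is not set.
 */
static bool
toy_should_set_elevator(const char *dev_name, bool wholedisk)
{
	assert(dev_name != NULL);

	if (wholedisk)
		return (true);

	/* Assumed detection rule: device-mapper nodes are named dm-N. */
	return (strncmp(dev_name, "dm-", 3) == 0);
}
```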
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1149
The following warning was originally added to provide visibility
into how often a dio gets heavily fragmented into over 16 bios.
This can happen due to constraints imposed by the block device
and may have a negative impact on performance but is otherwise
harmless. To prevent needless confusion and worry the message
has been removed.
kernel: WARNING: Resized bio's/dio to 32
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
As of Linux 2.6.36 an elevator_change() interface was added.
This commit updates vdev_elevator_switch() to use this interface
when available, otherwise it falls back to the usermodehelper
method.
Original-patch-by: foobarz <sysop@xeon.(none)>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #906
Differences between how paging is done on Solaris and Linux can cause
deadlocks if KM_SLEEP is used in any of the following contexts.
* The txg_sync thread
* The zvol write/discard threads
* The zpl_putpage() VFS callback
This is because KM_SLEEP will allow for direct reclaim which may result
in the VM calling back in to the filesystem or block layer to write out
pages. If a lock is held over this operation the potential exists to
deadlock the system. To ensure forward progress all memory allocations
in these contexts must use KM_PUSHPAGE which disables performing any I/O
to accomplish the memory allocation.
Previously, this behavior was achieved by setting PF_MEMALLOC on the
thread. However, that resulted in unexpected side effects such as the
exhaustion of pages in ZONE_DMA. This approach touches more of the zfs
code, but it is more consistent with the right way to handle these cases
under Linux.
This patch lays the groundwork for being able to safely revert the
following commits which used PF_MEMALLOC:
21ade34 Disable direct reclaim for z_wr_* threads
cfc9a5c Fix zpl_writepage() deadlock
eec8164 Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool))
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
Currently, zpool online -e (dynamic vdev expansion) doesn't work on
whole disks because we're invoking ioctl(BLKRRPART) from userspace
while ZFS still has a partition open on the disk, which results in
EBUSY.
This patch moves the BLKRRPART invocation from the zpool utility to the
module. Specifically, this is done just before opening the device in
vdev_disk_open() which is called inside vdev_reopen(). This requires
jumping through some hoops to get to the disk device from the partition
device, and to make sure we can still open the partition after the
BLKRRPART call.
Note that this new code path is triggered on dynamic vdev expansion
only; other actions, like creating a new pool, are unchanged and still
call BLKRRPART from userspace.
This change also depends on API changes which are available in 2.6.37
and later kernels. The build system has been updated to detect this,
but there is no compatibility mode for older kernels. This means that
online expansion will NOT be available in older kernels. However, it
will still be possible to expand the vdev offline.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #808
The WRITE_FLUSH, WRITE_FUA, and WRITE_FLUSH_FUA flags have been
introduced as a replacement for WRITE_BARRIER. This was done
to allow richer semantics to be expressed to the block layer.
It is the block layer's responsibility to choose the correct way
to implement these semantics.
This change simply updates the bio's to use the new kernel API
which should be absolutely safe. However, since ZFS depends
entirely on this working as designed for correctness we do
want to be careful.
Closes #281
Yesterday I ran across a 3TB drive which exposed 4K sectors to
Linux. While I thought I had gotten this support correct it
turns out there were 2 subtle bugs which prevented it from
working.
sudo ./cmd/zpool/zpool create -f large-sector /dev/sda
cannot create 'large-sector': one or more devices is currently unavailable
1) The first issue was that it was possible that bdev_capacity()
would return the number of 512 byte sectors rather than the number
of 4096 byte sectors. Internally, certain Linux functions only operate
with 512 byte sectors so you need to be careful. To avoid any
confusion in the future I've updated bdev_capacity() to simply
return the device (or partition) capacity in bytes. The higher
levels of ZFS want the value in bytes anyway so this is cleaner.
2) When creating a bio the ->bi_sector count must always be
expressed in 512 byte sectors. The existing code would scale
the byte offset by the logical sector size. Until now this was
always 512 so it never caused problems. Trying a 4K sector drive
clearly exposed the issue. The problem has been fixed by
hard coding the 512 byte sector which is exactly what the bio
code does internally.
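Both fixes come down to unit conversions, which can be sketched in a
few lines. The helper names are hypothetical; the only facts used are
that 512 is 2^9 and that ->bi_sector is expressed in 512 byte units as
stated above.

```c
#include <assert.h>
#include <stdint.h>

#define	TOY_SECTOR_SHIFT	9	/* 512 byte units, 512 == 1 << 9 */

/* Capacity in bytes from a count of 512 byte sectors (fix 1). */
static uint64_t
toy_capacity_bytes(uint64_t nr_512_sectors)
{
	return (nr_512_sectors << TOY_SECTOR_SHIFT);
}

/* Byte offset to a 512 byte sector number for a bio (fix 2). */
static uint64_t
toy_offset_to_bi_sector(uint64_t byte_offset)
{
	/* Offsets are expected to be 512 byte aligned in this sketch. */
	assert((byte_offset & ((1ULL << TOY_SECTOR_SHIFT) - 1)) == 0);

	return (byte_offset >> TOY_SECTOR_SHIFT);
}

/* The old, buggy scaling: divide by the logical sector size instead. */
static uint64_t
toy_offset_to_bi_sector_buggy(uint64_t byte_offset, uint64_t logical_size)
{
	assert(logical_size > 0);

	return (byte_offset / logical_size);
}
```

For a byte offset of 4096 the fixed helper yields sector 8. The buggy
scaling also yields 8 on a 512 byte drive, which is why the problem
went unnoticed, but yields 1 on a 4K drive.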
With these changes I'm now able to create ZFS pools using 4K
sector drives. No issues were observed during fairly extensive
testing. This is also a low risk change if you're using 512b
sector devices because none of the logic changes.
Closes #256
This commit adds module options for all existing zfs tunables.
Ideally the average user should never need to modify any of these
values. However, in practice sometimes you do need to tweak these
values for one reason or another. In those cases it's nice not to
have to resort to rebuilding from source. All tunables are visible
to modinfo and the list is as follows:
$ modinfo module/zfs/zfs.ko
filename: module/zfs/zfs.ko
license: CDDL
author: Sun Microsystems/Oracle, Lawrence Livermore National Laboratory
description: ZFS
srcversion: 8EAB1D71DACE05B5AA61567
depends: spl,znvpair,zcommon,zunicode,zavl
vermagic: 2.6.32-131.0.5.el6.x86_64 SMP mod_unload modversions
parm: zvol_major:Major number for zvol device (uint)
parm: zvol_threads:Number of threads for zvol device (uint)
parm: zio_injection_enabled:Enable fault injection (int)
parm: zio_bulk_flags:Additional flags to pass to bulk buffers (int)
parm: zio_delay_max:Max zio millisec delay before posting event (int)
parm: zio_requeue_io_start_cut_in_line:Prioritize requeued I/O (bool)
parm: zil_replay_disable:Disable intent logging replay (int)
parm: zfs_nocacheflush:Disable cache flushes (bool)
parm: zfs_read_chunk_size:Bytes to read per chunk (long)
parm: zfs_vdev_max_pending:Max pending per-vdev I/Os (int)
parm: zfs_vdev_min_pending:Min pending per-vdev I/Os (int)
parm: zfs_vdev_aggregation_limit:Max vdev I/O aggregation size (int)
parm: zfs_vdev_time_shift:Deadline time shift for vdev I/O (int)
parm: zfs_vdev_ramp_rate:Exponential I/O issue ramp-up rate (int)
parm: zfs_vdev_read_gap_limit:Aggregate read I/O over gap (int)
parm: zfs_vdev_write_gap_limit:Aggregate write I/O over gap (int)
parm: zfs_vdev_scheduler:I/O scheduler (charp)
parm: zfs_vdev_cache_max:Inflate reads small than max (int)
parm: zfs_vdev_cache_size:Total size of the per-disk cache (int)
parm: zfs_vdev_cache_bshift:Shift size to inflate reads too (int)
parm: zfs_scrub_limit:Max scrub/resilver I/O per leaf vdev (int)
parm: zfs_recover:Set to attempt to recover from fatal errors (int)
parm: spa_config_path:SPA config file (/etc/zfs/zpool.cache) (charp)
parm: zfs_zevent_len_max:Max event queue length (int)
parm: zfs_zevent_cols:Max event column width (int)
parm: zfs_zevent_console:Log events to the console (int)
parm: zfs_top_maxinflight:Max I/Os per top-level (int)
parm: zfs_resilver_delay:Number of ticks to delay resilver (int)
parm: zfs_scrub_delay:Number of ticks to delay scrub (int)
parm: zfs_scan_idle:Idle window in clock ticks (int)
parm: zfs_scan_min_time_ms:Min millisecs to scrub per txg (int)
parm: zfs_free_min_time_ms:Min millisecs to free per txg (int)
parm: zfs_resilver_min_time_ms:Min millisecs to resilver per txg (int)
parm: zfs_no_scrub_io:Set to disable scrub I/O (bool)
parm: zfs_no_scrub_prefetch:Set to disable scrub prefetching (bool)
parm: zfs_txg_timeout:Max seconds worth of delta per txg (int)
parm: zfs_no_write_throttle:Disable write throttling (int)
parm: zfs_write_limit_shift:log2(fraction of memory) per txg (int)
parm: zfs_txg_synctime_ms:Target milliseconds between tgx sync (int)
parm: zfs_write_limit_min:Min tgx write limit (ulong)
parm: zfs_write_limit_max:Max tgx write limit (ulong)
parm: zfs_write_limit_inflated:Inflated tgx write limit (ulong)
parm: zfs_write_limit_override:Override tgx write limit (ulong)
parm: zfs_prefetch_disable:Disable all ZFS prefetching (int)
parm: zfetch_max_streams:Max number of streams per zfetch (uint)
parm: zfetch_min_sec_reap:Min time before stream reclaim (uint)
parm: zfetch_block_cap:Max number of blocks to fetch at a time (uint)
parm: zfetch_array_rd_sz:Number of bytes in a array_read (ulong)
parm: zfs_pd_blks_max:Max number of blocks to prefetch (int)
parm: zfs_dedup_prefetch:Enable prefetching dedup-ed blks (int)
parm: zfs_arc_min:Min arc size (ulong)
parm: zfs_arc_max:Max arc size (ulong)
parm: zfs_arc_meta_limit:Meta limit for arc size (ulong)
parm: zfs_arc_reduce_dnlc_percent:Meta reclaim percentage (int)
parm: zfs_arc_grow_retry:Seconds before growing arc size (int)
parm: zfs_arc_shrink_shift:log2(fraction of arc to reclaim) (int)
parm: zfs_arc_p_min_shift:arc_c shift to calc min/max arc_p (int)
Occasionally we would see an -EFAULT returned when setting the
I/O scheduler on a vdev. This was caused by an improperly formatted
user mode helper command.
This commit restructures the command to something simpler, allocates
space for it dynamically to save stack, and removes the retry logic
which is no longer needed.
Closes #169
ZFS should only change the i/o scheduler for a disk when it has
ownership of the whole disk. This is basically the same logic as
adjusting the write cache behavior on a disk. This change updates
the vdev disk code to skip partitions when setting the i/o scheduler.
Closes #152
There were two cases when attempting to set the vdev block device
scheduler which would cause console warnings.
The first case was when the vdev used a loop, ram, dm, or other
such device which doesn't support a configurable scheduler. In
these cases attempting to set a scheduler is pointless and can
be safely skipped.
The second case is slightly more troubling. We were seeing
transient cases where setting the elevator would return -EFAULT.
On retry everything is fine so there appears to be a small window
where this is possible. To handle that case we silently retry
up to three times before reporting the warning.
In all of the above cases the warning is harmless and at worst you
may see slightly different performance characteristics from one
or more of your vdevs.
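The silent retry for the transient -EFAULT case can be sketched like
this. All names are hypothetical, the setter is a stand-in for
whatever actually changes the elevator, and reading "up to three
times" as at most three attempts is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define	TOY_MAX_TRIES	3	/* assumed meaning of "up to three times" */

typedef int (*toy_set_fn)(void *arg);

/* Hypothetical flaky setter: fails with -EFAULT while *arg > 0. */
static int
toy_flaky_set(void *arg)
{
	int *failures_left = arg;

	if (*failures_left > 0) {
		(*failures_left)--;
		return (-EFAULT);
	}

	return (0);
}

/* Call fn until it stops returning -EFAULT or the tries run out. */
static int
toy_set_with_retry(toy_set_fn fn, void *arg)
{
	int error = 0;
	int tries;

	assert(fn != NULL);

	for (tries = 0; tries < TOY_MAX_TRIES; tries++) {
		error = fn(arg);
		if (error != -EFAULT)
			break;
	}

	/* A caller would report the warning only if error is still set. */
	return (error);
}
```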
Initial testing has shown that the right IO scheduler to use under Linux
is noop. This strikes the ideal balance by allowing the zfs elevator
to do all request ordering and prioritization, while allowing the
Linux elevator to do the maximum front/back merging allowed by the
physical device. This yields the largest possible requests for the
device with the lowest total overhead.
While 'noop' should be right for your system you can choose a different
IO scheduler with the 'zfs_vdev_scheduler' option. You may set this
value to any of the standard Linux schedulers: noop, cfq, deadline,
anticipatory. In addition, if you choose 'none' zfs will not attempt
to change the IO scheduler for the block device.
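How the option value might be interpreted can be sketched as below.
The enum and function are hypothetical; the scheduler names and the
special 'none' value come from the text above, while treating any
other string as invalid is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum toy_sched_action {
	TOY_SCHED_SKIP,		/* 'none': leave the scheduler alone */
	TOY_SCHED_SET,		/* a standard scheduler name */
	TOY_SCHED_INVALID	/* anything else (assumed handling) */
};

/* Classify a zfs_vdev_scheduler style string. */
static enum toy_sched_action
toy_classify_scheduler(const char *name)
{
	static const char *const known[] = {
		"noop", "cfq", "deadline", "anticipatory"
	};
	size_t i;

	assert(name != NULL);

	if (strcmp(name, "none") == 0)
		return (TOY_SCHED_SKIP);

	for (i = 0; i < sizeof (known) / sizeof (known[0]); i++) {
		if (strcmp(name, known[i]) == 0)
			return (TOY_SCHED_SET);
	}

	return (TOY_SCHED_INVALID);
}
```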
This commit fixes a sign extension bug affecting l2arc devices. Extremely
large offsets may be passed down to the low level block device driver on
reads, generating errors similar to
attempt to access beyond end of device
sdbi1: rw=14, want=36028797014862705, limit=125026959
The unwanted sign extension occurs because the function arc_read_nolock()
stores the offset as a daddr_t, a 32-bit signed int type in the Linux kernel.
This offset is then passed to zio_read_phys() as a uint64_t argument, causing
sign extension for values of 0x80000000 or greater. To avoid this, we store
the offset in a uint64_t.
This change also changes a few daddr_t struct members to uint64_t in the libspl
headers to avoid similar bugs cropping up in the future. We also add an ASSERT
to __vdev_disk_physio() to check for invalid offsets.
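The sign extension is easy to reproduce in isolation. In this sketch
toy_daddr_t stands in for the 32-bit signed daddr_t, and the helper
names and the range check are hypothetical; the exact condition of the
ASSERT added to __vdev_disk_physio() is not given here.

```c
#include <assert.h>
#include <stdint.h>

typedef int32_t toy_daddr_t;	/* stand-in for the 32-bit signed daddr_t */

/* Buggy path: the offset passes through a signed 32-bit type first. */
static uint64_t
toy_offset_via_daddr(toy_daddr_t stored)
{
	return ((uint64_t)stored);	/* sign-extends negative values */
}

/* Fixed path: the offset is kept in a uint64_t from the start. */
static uint64_t
toy_offset_direct(uint64_t stored)
{
	return (stored);
}

/* Sketch of a sanity check in the spirit of the added ASSERT. */
static int
toy_offset_in_range(uint64_t offset, uint64_t dev_bytes)
{
	assert(dev_bytes > 0);

	return (offset < dev_bytes);
}
```

The bit pattern 0x80000000 held in a signed 32-bit variable is
INT32_MIN, and widening it produces 0xFFFFFFFF80000000, far beyond the
end of any device. Kept in a uint64_t the same offset stays
0x80000000.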
Closes #66
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The name of the flag used to mark a bio as synchronous has changed
again in the 2.6.36 kernel due to the unification of the BIO_RW_*
and REQ_* flags. The new flag is called REQ_SYNC. To simplify
checking this flag I have introduced the vdev_disk_dio_is_sync()
helper function. Based on the results of several new autoconf
tests it uses the correct mask to check for a synchronous bio.
Preferred interface for flagging a synchronous bio:
2.6.12-2.6.29: BIO_RW_SYNC
2.6.30-2.6.35: BIO_RW_SYNCIO
2.6.36-2.6.xx: REQ_SYNC
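The table can be restated as a lookup. Note that this sketch keys on
the 2.6.x sublevel purely for illustration, whereas the commit itself
relies on autoconf tests rather than version numbers; the enum and
function names are hypothetical.

```c
#include <assert.h>

/* Symbolic names for the three interfaces listed in the table. */
enum toy_sync_flag {
	TOY_BIO_RW_SYNC,
	TOY_BIO_RW_SYNCIO,
	TOY_REQ_SYNC
};

/* Preferred synchronous bio flag for kernel 2.6.<sublevel>. */
static enum toy_sync_flag
toy_sync_flag_for(int sublevel)
{
	assert(sublevel >= 12);	/* the table starts at 2.6.12 */

	if (sublevel <= 29)
		return (TOY_BIO_RW_SYNC);
	if (sublevel <= 35)
		return (TOY_BIO_RW_SYNCIO);

	return (TOY_REQ_SYNC);
}
```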
While there is no right maximum timeout for a disk IO we can start
laying the ground work to measure how long they do take in practice.
This change simply measures the IO time and if it exceeds 30s an
event is posted for 'zpool events'.
This value was carefully selected because for sd devices it implies
that at least one timeout (SD_TIMEOUT) has occurred. Unfortunately,
even with FAILFAST set we may retry a request and not get an
error. This behavior is strongly dependent on the device driver
and how it is hooked in to the scsi error handling stack. However
by setting the limit at 30s we can log the event even if no error
was returned.
Slightly longer term we can start recording these delays perhaps
as a simple power-of-two histogram. This histogram can then be
reported as part of the 'zpool status' command when given a command
line option.
None of this code changes the internal behavior of ZFS. Currently
it is simply for reporting excessively long delays.
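The check itself amounts to comparing an elapsed time against the
limit. A minimal sketch, assuming millisecond timestamps and
hypothetical names (the real code and its clock source are not shown
here):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define	TOY_DELAY_MAX_MS	30000	/* the 30s limit from the text */

/* Returns true when an I/O took longer than the limit. */
static bool
toy_io_is_slow(uint64_t start_ms, uint64_t end_ms)
{
	assert(end_ms >= start_ms);

	return ((end_ms - start_ms) > TOY_DELAY_MAX_MS);
}
```

A caller would post the event only when this returns true, leaving
the I/O result itself untouched.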
ZFS works best when it is notified as soon as possible when a device
failure occurs. This allows it to immediately start any recovery
actions which may be needed. In theory Linux supports a flag which
can be set on bio's called FAILFAST which provides this quick
notification by disabling the retry logic in the lower scsi layers.
That's the theory at least. In practice it turns out that while the
flag exists you oddly have to set it with the BIO_RW_AHEAD flag.
And even when it's set you may get retries if the low level
driver decides that's the right behavior, or if you don't get the
right error codes reported to the scsi midlayer.
Unfortunately, without additional kernel patches there's not much
which can be done to improve this. Basically, this just means that
it may take 2-3 minutes before ZFS is notified properly that a
device has failed. This can be improved and I suspect I'll be
submitting patches upstream to handle this.
All the upper layers of zfs expect zio->io_error to be positive. I was
careful but I missed one instance in vdev_disk_physio_completion() which
could return a negative error. To ensure all cases are always caught I
have additionally added an ASSERT() to check this before zio_interpret().
Finally, as a debugging aid when zfs is built with --enable-debug all
errors from the backing block devices will be reported to the console
with an error message like this:
ZFS: zio error=5 type=1 offset=4217856 size=8192 flags=60440
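The sign convention can be illustrated with a tiny helper. This is a
hypothetical sketch of normalizing a Linux style negative error into
the positive form the upper layers expect, not the actual change made
in vdev_disk_physio_completion().

```c
#include <assert.h>
#include <limits.h>

/* Map a negative (Linux style) error to its positive equivalent. */
static int
toy_positive_error(int error)
{
	assert(error != INT_MIN);	/* -INT_MIN would overflow */

	return (error < 0 ? -error : error);
}
```

With this, both -5 and 5 become 5, matching the error=5 shown in the
sample message above, and 0 stays 0.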
This commit fixes a bug in vdev_disk_open() in which the whole_disk property
was getting set to 0 for disk devices, even when it was stored as a 1 when the
zpool was created. The whole_disk property lets us detect when the partition
suffix should be stripped from the device name in CLI output. It is also used
to determine how writeback cache should be set for a device.
When an existing zpool is imported its configuration is read from the vdev
label by user space in zpool_read_label(). The whole_disk property is saved in
the nvlist which gets passed into the kernel, where it in turn gets saved in
the vdev struct in vdev_alloc(). Therefore, this value is available in
vdev_disk_open() and should not be overridden by checking the provided device
path, since that path will likely point to a partition and the check will
return the wrong result.
We also add an ASSERT that the whole_disk property is set. We are not aware of
any cases where vdev_disk_open() should be called with a config that doesn't
have this property set. The ASSERT is there so that when debugging is enabled
we can identify any legitimate cases that we are missing. If we never hit the
ASSERT, we can at some point remove it along with the conditional whole_disk
check.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>