Otherwise it will cause zpl_shares_lookup() to return a invalid
pointer when an error occurs.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Closes#626#885#947#977
As of Linux commit ebfc3b49a7ac25920cb5be5445f602e51d2ea559 the
struct nameidata is no longer passed to iops->create. Instead
only the result of (inamedata->flags & LOOKUP_EXCL) is passed.
ZFS like almost all Linux fileystems never made use of this so
only the prototype needs to be wrapped for compatibility.
Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #873
As of Linux commit 00cd8dd3bf95f2cc8435b4cac01d9995635c6d0b the
struct nameidata is no longer passed to iops->lookup. Instead
only the inamedata->flags are passed.
ZFS like almost all Linux fileystems never made use of this so
only the prototype needs to be wrapped for compatibility.
Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #873
As of Linux commit 9249e17fe094d853d1ef7475dd559a2cc7e23d42 the
mount flags are now passed to sget() so they can be used when
initializing a new superblock.
ZFS never uses sget() in this fashion so we can simply pass a
zero and add a zpl_sget() compatibility wrapper.
Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #873
The .write_super callback was removed the the super_operations
structure by Linux commit f0cd2dbb6cf387c11f87265462e370bb5469299e.
All file systems are now expected to self manage writing any dirty
state assoicated with their super block.
ZFS never made use of this callback so it can simply be removed
from the super_operations structure.
Signed-off-by: Yuxuan Shui <yshuiv7@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #873
Currently, the size of read and write requests on vdevs is aligned
according to the vdev's ashift, allocating a new ZIO buffer and padding
if need be.
This makes sense for write requests to prevent read/modify/write if the
write happens to be smaller than the device's internal block size.
For reads however, the rationale is less clear. It seems that the
original code aligns reads because, on Solaris, device drivers will
outright refuse unaligned requests.
We don't have that issue on Linux. Indeed, Linux block devices are able
to accept requests of any size, and take care of alignment issues
themselves.
As a result, there's no point in enforcing alignment for read requests
on Linux. This is a nice optimization opportunity for two reasons:
- We remove a memory allocation in a heavily-used code path;
- The request gets aligned in the lowest layer possible, which shrinks
the path that the additional, useless padding data has to travel.
For example, when using 4k-sector drives that lie about their sector
size, using 512b read requests instead of 4k means that there will
be less data traveling down the ATA/SCSI interface, even though the
drive actually reads 4k from the platter.
The only exception is raidz, because raidz needs to read the whole
allocated block for parity.
This patch removes alignment enforcement for read requests, except on
raidz. Note that we also remove an assertion that checks that we're
aligning a top-level vdev I/O, because that's not the case anymore for
repair writes that results from failed reads.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1022
There are currently three vmem_size() consumers all of which are
part of the ARC implemention. However, since the expected behavior
of the Linux and Solaris virtual memory subsystems are so different
the behavior in each of these instances needs to be reevaluated.
* arc_evict_needed() - This is actually dead code. Arena support
was never added to the SPL and zio_arena is always NULL. This
support isn't needed so we simply remove this dead code.
* arc_memory_throttle() - On Solaris where virtual memory constitutes
almost all of the address space we can reasonably expect there to be
a fairly large amount free. However, on Linux by default we only
have about 100MB total and that's heavily used by the ARC. So the
expectation on Linux is that this will usually be a small value.
Therefore we remove the vmem_size() check for i386 systems because
the expectation is that it will be less than the zfs_write_limit_max.
* arc_init() - Here vmem_size() is used to initially size the ARC.
Since the ARC is currently backed by the virtual address space it
makes sense to use this as a limit on the ARC for 32-bit systems.
This code can be removed when the ARC is backed by the page cache.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#831
Allow the zfs_txg_timeout variable to be dynamically tuned at run
time. By pulling it down out of the variable declaration it will
be evaluted each time through the loop.
The zfs_txg_timeout variable is now declared extern in a the common
sys/txg.h header rather than locally in dsl_scan.c. This prevents
potential type mismatches if the global variable needs to be used
elsewhere.
Move the module_param() code in to the same source file where
zfs_txg_timeout is declared. This is the most logical location.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Commit c409e4647f introduced a
number of module parameters. This required several types to be
changed to accomidate the required module parameters Linux macros.
Unfortunately, arc.c contained its own extern definition of the
zfs_write_limit_max variable and its type was not updated to be
consistent with its dsl_pool.c counterpart. If the variable had
been properly marked extern in a common header, then gcc would
have generated a warning and this would not have slipped through.
The result of this was that the ARC unconditionally expected
zfs_write_limit_max to be 64-bit. Unfortunately, the largest size
integer module parameter that Linux supports is unsigned long, which
varies in size depending on the host system's native word size. The
effect was that on 32-bit systems, ARC incorrectly performed 64-bit
operations on a 32-bit value by reading the neighboring 32 bits as
the upper 32 bits of the 64-bit value.
We correct that by changing the extern declaration to use the unsigned
long type and move these extern definitions in to the common arc.h
header. This should make ARC correctly treat zfs_write_limit_max as a
32-bit value on 32-bit systems.
Reported-by: Jorgen Lundman <lundman@lundman.net>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#749
zfs_immediate_write_sz variable is a tunable, but lacks proper
module_param() instrumentation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1032
Term 'transaction group' is commonly abbreviated as txg in ZFS sources.
There are some places (Linux specific MODULE_PARAM_DESC() macros)
where it is incorrectly spelled as 'tgx'.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1030
It doesn't make sense for a zvol to use the default system I/O
scheduler because it is a virtual device. Therefore, we change
the default scheduler to 'noop' for zvols provided that the
elevator_change() function is available. This interface has
been available since Linux 2.6.36 and appears in the RHEL 6.x
kernels.
We deliberately do not implement the method for older kernels
because it was racy and could result in system crashes. It's
better to simply manually tune the scheduler for these kernels.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1017
Currently, when processing DISCARD requests, zvol_discard() calls
dmu_free_long_range() with the precise offset and size of the request.
Unfortunately, this is not optimal for requests that are not aligned to
the zvol block boundaries. Indeed, in the case of an unaligned range,
dnode_free_range() will zero out the unaligned parts. Not only is this
useless since we are not freeing any space by doing so, it is also slow
because it translates to a read-modify-write operation.
This patch fixes the issue by rounding up the discard start offset to
the next volume block boundary, and rounding down the discard end
offset to the previous volume block boundary.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1010
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim. See commit b8d06fc for additional details.
SPL: Fixing allocation for task txg_sync (6093) which
used GFP flags 0x297bda7c with PF_NOFS set
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#1002
illumos/illumos-gate@2e2c135528
Illumos changeset: 13780:6da32a929222
3100 zvol rename fails with EBUSY when dirty
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Reviewed by: Adam H. Leventhal <ahl@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Garrett D'Amore <garrett@damore.org>
Approved by: Eric Schrock <eric.schrock@delphix.com>
Ported-by: Etienne Dechamps <etienne.dechamps@ovh.net>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#995
As of Linux 2.6.36 an elevator_change() interface was added.
This commit updates vdev_elevator_switch() to use this interface
when available, otherwise it falls back to the usermodehelper
method.
Original-patch-by: foobarz <sysop@xeon.(none)>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#906
In order to implement synchronous NFS metadata semantics ZFS
needs to provide the .commit_metadata hook. All it takes there
is to make sure changes are committed to ZIL. Fortunately
zfs_fsync() does just that, so simply calling it from
zpl_commit_metadata() does the trick.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#969
Previously we returned ERR_PTR(-ENOENT) which the rest of the kernel
doesn't expect and as such we can oops.
Signed-off-by: Chris Wedgwood <cw@f00f.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#949Closes#931Closes#789Closes#743Closes#730
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim. See commit b8d06fca08
for additional details.
SPL: Fixing allocation for task txg_sync (6093) which
used GFP flags 0x297bda7c with PF_NOFS set
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #973
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim. See commit b8d06fca08
for additional details.
SPL: Fixing allocation for task txg_sync (6093) which
used GFP flags 0x297bda7c with PF_NOFS set
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
When zfs_replay_write() replays TX_WRITE records from ZIL
it calls zpl_write_common() to perform the actual write.
zpl_write_common() returns the number of bytes written
(similar to write() system call) or an (negative) error.
However, the code expects the positive return value to be
a residual counter. Thus when zpl_write_common() successfully
completes it is mistakenly considered to be a partial write and
the error code delivered further. At this point the ZIL processing
is aborted with famous "ZFS replay transaction error 5" error
message given to the message buffer.
The fix is to compare the zpl_write_commmon() return value with
the buffer size and flag error only when they disagree.
Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#933
Commit 2b2861362f accidentally
introduced this issue by only conditionally registering the
commit callback in the async case.
The error handing code for the dmu_tx_assign() failure case
relied on there always being a registered commit callback to
clear the PG_writeback bit. Since that is no longer strictly
true for the synchronous case we must explicitly invoke the
callback.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#961
When replaying an unlink/remove operation via zfs_rmdir() the object
being removed will be instantiated by a call to zfs_dirent_lock().
This means that there is a single reference protecting the object.
Right before the call to zfs_inode_update() this reference is dropped
which may cause the object to be destroyed. This will result in a
NULL dereference as shown by the stack trace is issue #782.
This likely isn't an issue during normal operation because there is
always an additional reference held on the object by the VFS.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#782
The 'zfs destroy' changes in 330d06f disrupted how zvol devices
get removed on ZoL. However, it basically boils down to the
fact that we are no longer reliably calling zvol_remove_minor()
via zfs_ioc_destroy_snaps().
Therefore we add the missing call and handle things similarly
to the existing zfs_unmount_snap() case. Ideally we would check
if this is of type DMU_OST_ZFS or DMU_OST_ZVOL and just do the
right thing as in zfs_ioc_destroy(). However, it looks like
it would be fairly expensive to get the type, and it's harmless
to simply attempt the umount and minor removal.
This is also an issue in the latest FreeBSD and Illumos code.
It was being tracked under the following issue, and we may want
to refresh our code when they settle on what they want to do
about it upstream.
https://www.illumos.org/issues/3170
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #903
Use ZFS dataset fsid guid as a unique file system id, similar to what is
done on Illumos/OpenSolaris.
Signed-off-by: Cyril Plisko <cyril.plisko@mountall.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#888
Buffers for the ARC are normally backed by the SPL virtual slab.
However, if memory is low, AND no slab objects are available,
AND a new slab cannot be quickly constructed a new emergency
object will be directly allocated.
These objects can be as large as order 5 on a system with 4k
pages. And because they are allocated with KM_PUSHPAGE, to
avoid a potential deadlock, they are not allowed to initiate I/O
to satisfy the allocation. This can result in the occasional
allocation failure.
However, since these allocations are allowed to block and
perform operations such as memory compaction they will eventually
succeed. Since this is not unexpected (just unlikely) behavior
this patch disables the warning for the allocation failure.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #465
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim. See commit b8d06fca08
for additional details.
SPL: Fixing allocation for task txg_sync (6093) which
used GFP flags 0x297bda7c with PF_NOFS set
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim. See commit b8d06fca08
for additional details.
SPL: Fixing allocation for task txg_sync (6093) which
used GFP flags 0x297bda7c with PF_NOFS set
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim. See commit b8d06fca08
for additional details.
SPL: Fixing allocation for task txg_sync (6093) which
used GFP flags 0x297bda7c with PF_NOFS set
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim. See commit b8d06fca08
for additional details.
SPL: Fixing allocation for task txg_sync (6093) which
used GFP flags 0x297bda7c with PF_NOFS set
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim. See commit b8d06fca08
for additional details.
SPL: Fixing allocation for task txg_sync (6093) which
used GFP flags 0x297bda7c with PF_NOFS set
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
When writing via ->writepage() the writeback bit was always cleared
as part of the txg commit callback. However, when the I/O is also
being written synchronsously to the zil we can immediately clear this
bit. There is no need to wait for the subsequent TXG sync since the
data is already safe on stable storage.
This has been observed to reduce the msync(2) delay from up to 5
seconds down 10s of miliseconds. One workload which is expected
to benefit from this are the intermittent samba hands described
in issue #700.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#700Closes#907
Differences between how paging is done on Solaris and Linux can cause
deadlocks if KM_SLEEP is used in any the following contexts.
* The txg_sync thread
* The zvol write/discard threads
* The zpl_putpage() VFS callback
This is because KM_SLEEP will allow for direct reclaim which may result
in the VM calling back in to the filesystem or block layer to write out
pages. If a lock is held over this operation the potential exists to
deadlock the system. To ensure forward progress all memory allocations
in these contexts must us KM_PUSHPAGE which disables performing any I/O
to accomplish the memory allocation.
Previously, this behavior was acheived by setting PF_MEMALLOC on the
thread. However, that resulted in unexpected side effects such as the
exhaustion of pages in ZONE_DMA. This approach touchs more of the zfs
code, but it is more consistent with the right way to handle these cases
under Linux.
This is patch lays the ground work for being able to safely revert the
following commits which used PF_MEMALLOC:
21ade34 Disable direct reclaim for z_wr_* threads
cfc9a5c Fix zpl_writepage() deadlock
eec8164 Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool))
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
These allocations in mzap_update() used to be kmem_alloc() but
were changed to vmem_alloc() due to the size of the allocation.
However, since it turns out this function may be called in the
context of the txg_sync thread they must be changed back to use
a kmem_alloc() to ensure the KM_PUSHPAGE flag is honored.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The txg_sync(), zfs_putpage(), zvol_write(), and zvol_discard()
call paths must only use KM_PUSHPAGE to avoid potential deadlocks
during direct reclaim.
This patch annotates these call paths so any accidental use of
KM_SLEEP will be quickly detected. In the interest of stability
if debugging is disabled the offending allocation will have its
GFP flags automatically corrected. When debugging is enabled
any misuse will be treated as a fatal error.
This patch is entirely for debugging. We should be careful to
NOT become dependant on it fixing up the incorrect allocations.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The vdev queue layer may require a small number of buffers
when attempting to create aggregate I/O requests. Rather than
attempting to allocate them from the global zio buffers, which
is slow under memory pressure, it makes sense to pre-allocate
them because...
1) These buffers are short lived. They are only required for
the life of a single I/O at which point they can be used by
the next I/O.
2) The maximum number of concurrent buffers needed by a vdev is
small. It's roughly limited by the zfs_vdev_max_pending tunable
which defaults to 10.
By keeping a small list of these buffer per-vdev we can ensure
one is always available when we need it. This significantly
reduces contention on the vq->vq_lock, because we no longer
need to perform a slow allocation under this lock. This is
particularly important when memory is already low on the system.
It would probably be wise to extend the use of these buffers beyond
aggregate I/O and in to the raidz implementation. The inability
to quickly allocate buffer for the parity stripes could result in
similiar problems.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This commit used PF_MEMALLOC to prevent a memory reclaim deadlock.
However, commit 49be0ccf1f eliminated
the invocation of __cv_init(), which was the cause of the deadlock.
PF_MEMALLOC has the side effect of permitting pages from ZONE_DMA
to be allocated. The use of PF_MEMALLOC was found to cause stability
problems when doing swap on zvols. Since this technique is known to
cause problems and no longer fixes anything, we revert it.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
The commit, cfc9a5c88f, to fix deadlocks
in zpl_writepage() relied on PF_MEMALLOC. That had the effect of
disabling the direct reclaim path on all allocations originating from
calls to this function, but it failed to address the actual cause of
those deadlocks. This led to the same deadlocks being observed with
swap on zvols, but not with swap on the loop device, which exercises
this code.
The use of PF_MEMALLOC also had the side effect of permitting
allocations to be made from ZONE_DMA in instances that did not require
it. This contributes to the possibility of panics caused by depletion
of pages from ZONE_DMA.
As such, we revert this patch in favor of a proper fix for both issues.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
Commit eec8164771 worked around an issue
involving direct reclaim through the use of PF_MEMALLOC. Since we
are reworking thing to use KM_PUSHPAGE so that swap works, we revert
this patch in favor of the use of KM_PUSHPAGE in the affected areas.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
Under Solaris the behavior for rmdir(2) is to return EEXIST when
a directory still contains entries. However, on Linux ENOTEMPTY
is the expected return value with EEXIST being technically allowed.
According to rmdir(2):
ENOTEMPTY
pathname contains entries other than . and .. ; or, pathname has
.. as its final component. POSIX.1-2001 also allows EEXIST for
this condition.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#895
When calling sa_update() and friends it is possible that a spill
buffer will be needed to accomidate the update. When this happens
a hold is taken on the new dbuf and that hold must be released
before calling dmu_tx_commit(). Failing to release the hold will
cause a copy of the dbuf to be made in dbuf_sync_leaf(). This is
done to ensure further updates to the dbuf never sneak in to the
syncing txg.
This could be left to the sa_update() caller. But then the caller
would need to be aware of this internal SA implementation detail.
It is therefore preferable to handle this all internally in the
SA implementation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#503Closes#513
This reverts commit ec2626ad3f which
caused consistency problems between the shared and private handles.
Reverting this change should resolve issues #709 and #727. It
will also reintroduce an arc_anon memory leak which is addressed
by the next commit.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#709Closes#727
After surveying the code, the few places where smp_processor_id is used
were deemed to be safe to use with a preempt enabled kernel. As such, no
core logic had to be changed. These smp_processor_id call sites are simply
are wrapped in kpreempt_disable and kpreempt_enabled to prevent the
Linux kernel from emitting scary warnings.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Issue #83
While I'd like to remove the various pragmas in module/zfs/dbuf.c.
There are consumers such as Lustre which still depend on dmu_buf_*
versions of the symbols. Until all consumers can be converted to
use only the dbuf_* names leave this symbol exported.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>