Differences between how paging is done on Solaris and Linux can cause
deadlocks if KM_SLEEP is used in any the following contexts.
* The txg_sync thread
* The zvol write/discard threads
* The zpl_putpage() VFS callback
This is because KM_SLEEP will allow for direct reclaim which may result
in the VM calling back in to the filesystem or block layer to write out
pages. If a lock is held over this operation the potential exists to
deadlock the system. To ensure forward progress all memory allocations
in these contexts must us KM_PUSHPAGE which disables performing any I/O
to accomplish the memory allocation.
Previously, this behavior was acheived by setting PF_MEMALLOC on the
thread. However, that resulted in unexpected side effects such as the
exhaustion of pages in ZONE_DMA. This approach touchs more of the zfs
code, but it is more consistent with the right way to handle these cases
under Linux.
This is patch lays the ground work for being able to safely revert the
following commits which used PF_MEMALLOC:
21ade34 Disable direct reclaim for z_wr_* threads
cfc9a5c Fix zpl_writepage() deadlock
eec8164 Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool))
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
These allocations in mzap_update() used to be kmem_alloc() but
were changed to vmem_alloc() due to the size of the allocation.
However, since it turns out this function may be called in the
context of the txg_sync thread they must be changed back to use
a kmem_alloc() to ensure the KM_PUSHPAGE flag is honored.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The txg_sync(), zfs_putpage(), zvol_write(), and zvol_discard()
call paths must only use KM_PUSHPAGE to avoid potential deadlocks
during direct reclaim.
This patch annotates these call paths so any accidental use of
KM_SLEEP will be quickly detected. In the interest of stability
if debugging is disabled the offending allocation will have its
GFP flags automatically corrected. When debugging is enabled
any misuse will be treated as a fatal error.
This patch is entirely for debugging. We should be careful to
NOT become dependant on it fixing up the incorrect allocations.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
The vdev queue layer may require a small number of buffers
when attempting to create aggregate I/O requests. Rather than
attempting to allocate them from the global zio buffers, which
is slow under memory pressure, it makes sense to pre-allocate
them because...
1) These buffers are short lived. They are only required for
the life of a single I/O at which point they can be used by
the next I/O.
2) The maximum number of concurrent buffers needed by a vdev is
small. It's roughly limited by the zfs_vdev_max_pending tunable
which defaults to 10.
By keeping a small list of these buffer per-vdev we can ensure
one is always available when we need it. This significantly
reduces contention on the vq->vq_lock, because we no longer
need to perform a slow allocation under this lock. This is
particularly important when memory is already low on the system.
It would probably be wise to extend the use of these buffers beyond
aggregate I/O and in to the raidz implementation. The inability
to quickly allocate buffer for the parity stripes could result in
similiar problems.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This commit used PF_MEMALLOC to prevent a memory reclaim deadlock.
However, commit 49be0ccf1f eliminated
the invocation of __cv_init(), which was the cause of the deadlock.
PF_MEMALLOC has the side effect of permitting pages from ZONE_DMA
to be allocated. The use of PF_MEMALLOC was found to cause stability
problems when doing swap on zvols. Since this technique is known to
cause problems and no longer fixes anything, we revert it.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
The commit, cfc9a5c88f, to fix deadlocks
in zpl_writepage() relied on PF_MEMALLOC. That had the effect of
disabling the direct reclaim path on all allocations originating from
calls to this function, but it failed to address the actual cause of
those deadlocks. This led to the same deadlocks being observed with
swap on zvols, but not with swap on the loop device, which exercises
this code.
The use of PF_MEMALLOC also had the side effect of permitting
allocations to be made from ZONE_DMA in instances that did not require
it. This contributes to the possibility of panics caused by depletion
of pages from ZONE_DMA.
As such, we revert this patch in favor of a proper fix for both issues.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
Commit eec8164771 worked around an issue
involving direct reclaim through the use of PF_MEMALLOC. Since we
are reworking thing to use KM_PUSHPAGE so that swap works, we revert
this patch in favor of the use of KM_PUSHPAGE in the affected areas.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
The autoconf macro which failed if CONFIG_PREEMPT was set in the kernel
config was removed. With the inclusion of a few previous patches
targeting support for preempt enabled kernels, it is now safe to run
with this kernel config option enabled.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#83
Remove all of the generated autotools products from the repository
and update the .gitignore files accordingly.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#718
Under Solaris the behavior for rmdir(2) is to return EEXIST when
a directory still contains entries. However, on Linux ENOTEMPTY
is the expected return value with EEXIST being technically allowed.
According to rmdir(2):
ENOTEMPTY
pathname contains entries other than . and .. ; or, pathname has
.. as its final component. POSIX.1-2001 also allows EEXIST for
this condition.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#895
Make name in Linux menuconfig consistent with those of other filesystems
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#897
ZFS fails to build when SPL is built into the kernel on unless
--with-spl=/path/to/kernel/sources is specified. We fallback to the
kernel sources directory when SPL is not found elsewhere to resolve
that.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closed#896
When calling sa_update() and friends it is possible that a spill
buffer will be needed to accomidate the update. When this happens
a hold is taken on the new dbuf and that hold must be released
before calling dmu_tx_commit(). Failing to release the hold will
cause a copy of the dbuf to be made in dbuf_sync_leaf(). This is
done to ensure further updates to the dbuf never sneak in to the
syncing txg.
This could be left to the sa_update() caller. But then the caller
would need to be aware of this internal SA implementation detail.
It is therefore preferable to handle this all internally in the
SA implementation.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#503Closes#513
This reverts commit ec2626ad3f which
caused consistency problems between the shared and private handles.
Reverting this change should resolve issues #709 and #727. It
will also reintroduce an arc_anon memory leak which is addressed
by the next commit.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#709Closes#727
After surveying the code, the few places where smp_processor_id is used
were deemed to be safe to use with a preempt enabled kernel. As such, no
core logic had to be changed. These smp_processor_id call sites are simply
are wrapped in kpreempt_disable and kpreempt_enabled to prevent the
Linux kernel from emitting scary warnings.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Issue #83
./configure erroneously detects absence of dops->d_automount
when built against a GrSecurity patched kernel.
Summerized error message found in config.log:
checking whether dops->d_automount() exists
...
In function 'main': ... error: constified variable 'dops'
cannot be local
The "dops" variable cannot be a local variable, so it's
moved to the global scope.
This test also fails if the prototype of the dops->d_automount
function pointer is changed.
Signed-off-by: Massimo Maggi <massimo@mmmm.it>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Closes#884
1796 "ZFS HOLD" should not be used when doing "ZFS SEND" from a read-only pool
2871 support for __ZFS_POOL_RESTRICT used by ZFS test suite
2903 zfs destroy -d does not work
2957 zfs destroy -R/r sometimes fails when removing defer-destroyed snapshot
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Approved by: Eric Schrock <Eric.Schrock@delphix.com>
References:
https://www.illumos.org/issues/1796https://www.illumos.org/issues/2871https://www.illumos.org/issues/2903https://www.illumos.org/issues/2957
Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: Matt Ahrens <matt@delphix.com>
Reviewed by: George Wilson <George.Wilson@delphix.com>
Reviewed by: Bill Pijewski <wdp@joyent.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/2635
Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#717
This regression was accidentally introduced by commit
330d06f90d due to ZoL
specific code. The fix is to simply ensure the passed
nvlist is initialized and freed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#876
While I'd like to remove the various pragmas in module/zfs/dbuf.c.
There are consumers such as Lustre which still depend on dmu_buf_*
versions of the symbols. Until all consumers can be converted to
use only the dbuf_* names leave this symbol exported.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Add the /usr/src/zfs-<version>-<release>/<kernel> directory to
the zfs-modules-devel package. This ensures that this directory
will be removed when the package is removed.
We do not include the higher level /usr/src/zfs-<version>-<release>
directory since there may be builds for multiple kernels. Instead,
a %postun rmdir is added which attempts to remove this directory.
It will only succeed when the last zfs-modules-devel-* package
for this specific release is removed.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
When mutex debugging is enabled in your kernel the increased
size of the mutex structures can push the zfs_sb_t type beyond
the 8k warning threshold. This isn't harmful so we suppress
the warning for this case.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#628
RPM versions 4.8 and 4.9 differ in the definition of macro %_mandir:
$ rpm --version ; rpm --showrc | grep ^-14:._mandir
RPM version 4.9.0
-14: _mandir %{_prefix}/share/man
$ rpm --version ; rpm --showrc | grep ^-14:._mandir
RPM version 4.8.0
-14: _mandir /usr/share/man
zfs.spec.in defines %_prefix as /, so man pages end up getting
installed in /share/man on RPM 4.9 systems. To fix this, define
%_mandir relative to %_datadir in the spec file.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#353
Export these symbols so they may be used by other ZFS consumers
besides the ZPL.
Remove three stale prototype definites from dbuf.h. The actual
implementations of these functions were removed/renamed long ago.
It would be good in the long term to remove the existing pragmas
we inherited from Solaris and simply use the dbuf_* names.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
This commit adds support for building a zfs-modules-dkms sub package
built around Dynamic Kernel Module Support. This is to allow building
packages using the DKMS infrastructure which is intended to ease the
burden of kernel version changes, upgrades, etc.
By default zfs-modules-dkms-* sub package will be built as part of
the 'make rpm' target. Alternately, you can build only the DKMS
module package using the 'make rpm-dkms' target.
Examples:
# To build packaged binaries as well as a dkms packages
$ ./configure && make rpm
# To build only the packaged binary utilities and dkms packages
$ ./configure && make rpm-utils rpm-dkms
Note: Only the RHEL 5/6, CHAOS 5, and Fedora distributions are
supported for building the dkms sub package.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #535
When checking for the SPL Module.symvers file, a timeout can now be
passed in which will pause the configure step while it waits for this
file to be generated. By default, the configure behavior is unchanged as
a timeout of 0 is used. If a positive number of seconds is passed,
configure will wait that number of seconds for the Module.symvers file
before moving on.
The main motivation for this change was to support parallel execution of
'./configure && make' for the SPL and ZFS packages in preparation of
supporting DKMS based packages.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Richard Lowe <richlowe@richlowe.net>
References:
https://www.illumos.org/issues/1693
Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#678
Currently, zvols have a discard granularity set to 0, which suggests to
the upper layer that discard requests of arbirarily small size and
alignment can be made efficiently.
In practice however, ZFS does not handle unaligned discard requests
efficiently: indeed, it is unable to free a part of a block. It will
write zeros to the specified range instead, which is both useless and
inefficient (see dnode_free_range).
With this patch, zvol block devices expose volblocksize as their discard
granularity, so the upper layer is aware that it's not supposed to send
discard requests smaller than volblocksize.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#862
ZFS depends on EFI_PARTITION, ZLIB_DEFLATE and ZLIB_INFLATE, but when
ZFS is integrated with the kernel source tree, menuconfig does not
enforce these dependencies. This can cause build failures in the case of
ZLIB_DEFLATE and ZLIB_INFLATE where symbols are not found. This can also
cause runtime failures in the case of EFI_PARTITION, where the kernel
will not understand GPT partitions when creating pools from raw disks.
We solve this by making menuconfig aware of these dependencies.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#854
The number of blocks that can be discarded in one BLKDISCARD ioctl on a
zvol is currently unlimited. Some applications, such as mkfs, discard
the whole volume at once and they use the maximum possible discard size
to do that. As a result, several gigabytes discard requests are not
uncommon.
Unfortunately, if a large amount of data is allocated in the zvol, ZFS
can be quite slow to process discard requests. This is especially true
if the volblocksize is low (e.g. the 8K default). As a result, very
large discard requests can take a very long time (seconds to minutes
under heavy load) to complete. This can cause a number of problems, most
notably if the zvol is accessed remotely (e.g. via iSCSI), in which case
the client has a high probability of timing out on the request.
This patch solves the issue by adding a new tunable module parameter:
zvol_max_discard_blocks. This indicates the maximum possible range, in
zvol blocks, of one discard operation. It is set by default to 16384
blocks, which appears to be a good tradeoff. Using the default
volblocksize of 8K this is equivalent to 128 MB. When using the maximum
volblocksize of 128K this is equivalent to 2 GB.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#858
1644 add ZFS "clones" property
1645 add ZFS "written" and "written@..." properties
1646 "zfs send" should estimate size of stream
1647 "zfs destroy" should determine space reclaimed by
destroying multiple snapshots
1708 adjust size of zpool history data
References:
https://www.illumos.org/issues/1644https://www.illumos.org/issues/1645https://www.illumos.org/issues/1646https://www.illumos.org/issues/1647https://www.illumos.org/issues/1708
This commit modifies the user to kernel space ioctl ABI. Extra
care should be taken when updating to ensure both the kernel
modules and utilities are updated. This change has reordered
all of the new ioctl()s to the end of the list. This should
help minimize this issue in the future.
Reviewed by: Richard Lowe <richlowe@richlowe.net>
Reviewed by: George Wilson <gwilson@zfsmail.com>
Reviewed by: Albert Lee <trisk@opensolaris.org>
Approved by: Garrett D'Amore <garret@nexenta.com>
Ported by: Martin Matuska <martin@matuska.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#826Closes#664
Added simply for convenience until this, or an equivilant, change
is merged in the upstream grub2 source.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #847
This is done for compatibility with existing Linux infrastructure.
In particular, when using zfs as a root filesystem there are init
scripts which as part of shutdown remount root read-only. Also,
the new systemd infrastructure being used by Fedora expects to be
able to remount a file system read-write.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #847
Support in-tree builtin module building.
These commits add support for compiling the ZFS module as a built-in
kernel module by copying the module code into the kernel source tree.
Here's the procedure:
- Create your kernel configuration (`.config` file) as usual. This
has to be done first so that ZFS's configure script is able to
detect kernel features correctly.
- Run `make prepare scripts` inside the kernel source tree.
- Run `./configure --enable-linux-builtin --with-linux=/usr/src/linux-...`
inside the ZFS directory.
- Run `./copy-builtin /usr/src/linux-...` inside the ZFS directory.
- In the kernel source tree, enable the `CONFIG_ZFS` option (e.g. using
`make menuconfig`). Note that this option depends on `CONFIG_SPL`
(see zfsonlinux/spl@744038069d).
- Build the kernel as usual.
ZFS module parameters can be set at boot time using the following syntax
on the kernel command line: `zfs.parameter_name=parameter_value`.
Note that you also need to rebuild the userspace tools (see
zfsonlinux/zfs@f09398cec6).
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #851
When libzfs checks if the module is loaded or not, it currently reads
/proc/modules and searches for a line matching the module name.
Unfortunately, if the module is included in the kernel itself (built-in
module), then /proc/modules won't list it, so libzfs will wrongly conclude
that the module is not loaded, thus making all ZFS userspace tools unusable.
Fortunately, all loaded modules appear as directories in /sys/module, even
built-in ones. Thus we can use /sys/module in lieu of /proc/modules to fix
the issue.
As a bonus, the code for checking becomes much simpler.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #851
This commit introduces a "copy-builtin" script designed to prepare a
kernel source tree for building ZFS as a builtin module. The script
makes a full copy of all needed files, thus making the kernel source
tree fully independent of the zfs source package.
To achieve that, some compilation flags (-include, -I) have been moved
to module/Makefile. This Makefile is only used when compiling external
modules; when compiling builtin modules, a Kbuild file generated by the
configure-builtin script is used instead. This makes sure Makefiles
inside the kernel source tree does not contain references to the zfs
source package.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #851
This patch adds a new autoconf function: ZFS_LINUX_TRY_COMPILE_SYMBOL.
This new function does the following:
- Call LINUX_TRY_COMPILE with the specified parameters.
- If unsuccessful, return false.
- If successful and we're configuring with --enable-linux-builtin,
return true.
- Else, call CHECK_SYMBOL_EXPORT with the specified parameters and
return the result.
All calls to CHECK_SYMBOL_EXPORT are converted to
LINUX_TRY_COMPILE_SYMBOL so that the tests work even when configuring
for builtin on a kernel which doesn't have loadable module support, or
hasn't been built yet.
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #851
Currently, when building a test case, we're compiling an entire Linux
module from beginning to end. This includes the MODPOST stage, which
generates a "conftest.mod.c" file with some boilerplate module
declaration code.
This poses a problem when configuring for built-in on kernels which have
loadable module support disabled. In this case conftest.mod.c is
referencing disabled code, resulting in a compilation failure, thus
breaking the tests.
This patch fixes the issue by faking the modpost stage when the
--enable-linux-builtin option is provided. It does so by forcing the
modpost command to be /bin/true, and using an empty conftest.mod.c file.
The test module still compiles fine, although the result isn't loadable,
but we don't really care at this point.
Note it is important to preserve the modpost stage when building out of
tree. The ZFS_AC_KERNEL_BLK_END_REQUEST, ZFS_AC_KERNEL_BLK_QUEUE_FLUSH,
and ZFS_AC_KERNEL_BLK_RQ_BYTES configure checks all depend on it to
identify GPL-only symbols.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #851
This patch adds a new option to configure: --enable-linux-builtin. When
this option is used, the following happens:
- Compilation of kernel modules is disabled.
- A failure to find UTS_RELEASE is followed by a suggestion to run
"make prepare" on the kernel source tree.
This patch also adds a new test which tries to compile an empty module
as a basic toolchain sanity test. If it fails and the option was
specified, the error is followed by a suggestion to run "make scripts"
on the kernel source tree.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #851
Currently, when configure --with-config is used, selective compilation
is only effective for the simple "make" case. Package builders (e.g.
make rpm) still build everything (utils and modules). This patch fixes
that.
Signed-off-by: Prakash Surya <surya1@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #851
The end_writeback() function was changed by moving the call to
inode_sync_wait() earlier in to evict(). This effecitvely changes
the ordering of the sync but it does not impact the details of
the zfs implementation.
However, as part of this change end_writeback() was renamed to
clear_inode() to reflect the new semantics. This change does
impact us and clear_inode() now maps to end_writeback() for
kernels prior to 3.5.
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes#784