Commit Graph

73 Commits

Author SHA1 Message Date
Tim Chase
5074bfe8ad Allocate zfs_znode_cache on the Linux slab
The Linux slab, in general, performs better than the SPl slab in cases
where a lot of objects are allocated and fragmentation is likely present.

This patch fixes pathologically bad behavior in cases where the ARC is
filled with mostly metadata and a user program needs to allocate and
dirty enough memory which would require an insignificant amount of the
ARC to be reclaimed.

If zfs_znode_cache is on the SPL slab, the system may spin for a very
long time trying to reclaim sufficient memory.  If it is on the Linux
slab, the behavior has been observed to be much more predictible; the
memory is reclaimed more efficiently.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #3283
2015-04-14 12:19:22 -07:00
Tim Chase
40d06e3c78 Mark all ZPL and ioctl functions as PF_FSTRANS
Prevent deadlocks by disabling direct reclaim during all ZPL and ioctl
calls as well as the l2arc and adapt ARC threads.

This obviates the need for MUTEX_FSTRANS so its previous uses and
definition have been eliminated.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3225
2015-04-03 11:38:59 -07:00
Chris Dunlop
d07b7c7f21 Reduce size of zfs_sb_t: allocate z_hold_mtx separately
zfs_sb_t has grown to the point where using kmem_zalloc() for allocations
is triggering the 32k warning threshold.

We can't safely convert this entire allocation to use vmem_alloc() instead
of kmem_alloc() because the backing_dev_info structure is embedded here.
It depends on the bit_waitqueue() function which won't behave properly
when given a virtual address.

Instead, use vmem_alloc() to allocate the z_hold_mtx array separately.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Chris Dunlop <chris@onthe.net.au>
Closes #3178
2015-03-24 13:17:44 -07:00
Brian Behlendorf
79c76d5b65 Change KM_PUSHPAGE -> KM_SLEEP
By marking DMU transaction processing contexts with PF_FSTRANS
we can revert the KM_PUSHPAGE -> KM_SLEEP changes.  This brings
us back in line with upstream.  In some cases this means simply
swapping the flags back.  For others fnvlist_alloc() was replaced
by nvlist_alloc(..., KM_PUSHPAGE) and must be reverted back to
fnvlist_alloc() which assumes KM_SLEEP.

The one place KM_PUSHPAGE is kept is when allocating ARC buffers
which allows us to dip in to reserved memory.  This is again the
same as upstream.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 14:41:26 -08:00
Brian Behlendorf
efcd79a883 Retire KM_NODEBUG
Callers of kmem_alloc() which passed the KM_NODEBUG flag to suppress
the large allocation warning have been replaced by vmem_alloc() as
appropriate.  The updated vmem_alloc() call will not print a warning
regardless of the size of the allocation.

A careful reader will notice that not all callers have been changed
to vmem_alloc().  Some have only had the KM_NODEBUG flag removed.
This was possible because the default warning threshold has been
increased to 32k.  This is desirable because it minimizes the need
for Linux specific code changes.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 14:40:32 -08:00
Ned Bass
bc151f7b31 Remove checks for mandatory locks
The Linux VFS handles mandatory locks generically so we shouldn't
need to check for conflicting locks in zfs_read(), zfs_write(), or
zfs_freesp().  Linux 3.18 removed the lock_may_read() and
lock_may_write() interfaces which we were relying on for this
purpose.  Rather than emulating those interfaces we remove the
redundant checks.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2804
2014-10-22 11:06:53 -07:00
Tim Chase
cb08f06307 Perform whole-page page truncation for hole-punching under a range lock
As an attempt to perform the page truncation more optimally, the
hole-punching support added in 223df0161f
truncated performed the operation in two steps: first, sub-page "stubs"
were zeroed under the range lock in zfs_free_range() using the new
zfs_zero_partial_page() function and then the whole pages were truncated
within zfs_freesp().  This left a window of opportunity during which
the full pages could be touched.

This patch closes the window by moving the whole-page truncation into
zfs_free_range() under the range lock.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2733
2014-09-29 09:22:03 -07:00
Brian Behlendorf
2d50158343 Fix z_teardown_inactive_lock deadlock
When rolling back a mounted filesystem zfs_suspend() is called
which acquires the z_teardown_inactive_lock.  This lock can not
be dropped until the filesystem has been rolled back and resumed
in zfs_resume_fs().

Therefore, we must not call iput() under this lock because it
may result in the inode->evict() handler being called which also
takes this lock.  Instead use zfs_iput_async() to ensure dropping
the last reference is deferred and runs in a safe context.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2670
2014-09-11 09:11:45 -07:00
Tim Chase
223df0161f Implement fallocate FALLOC_FL_PUNCH_HOLE
Add support for the FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE mode of
fallocate(2).  Mimic the behavior of other native file systems such as
ext4 in cases where the file might be extended. If the offset is beyond
the end of the file, return success without changing the file. If the
extent of the punched hole would extend the file, only the existing tail
of the file is punched.

Add the zfs_zero_partial_page() function, modeled after update_page(),
to handle zeroing partial pages in a hole-punching operation.  It must
be used under a range lock for the requested region in order that the
ARC and page cache stay in sync.

Move the existing page cache truncation via truncate_setsize() into
zfs_freesp() for better source structure compatibility with upstream code.

Add page cache truncation to zfs_freesp() and zfs_free_range() to handle
hole punching.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #2619
2014-09-08 13:52:25 -07:00
Brian Behlendorf
0a50679ce9 Add zfs_iput_async() interface
Handle all iputs in zfs_purgedir() and zfs_inode_destroy()
asynchronously to prevent deadlocks.  When the iputs are allowed
to run synchronously in the destroy call path deadlocks between
xattr directory inodes and their parent file inodes are possible.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #457
2014-08-11 16:11:43 -07:00
Max Grossman
b0bc7a84d9 Illumos 4370, 4371
4370 avoid transmitting holes during zfs send
4371 DMU code clean up

Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <jeffpc@josefsipek.net>
Approved by: Garrett D'Amore <garrett@damore.org>a

References:
  https://www.illumos.org/issues/4370
  https://www.illumos.org/issues/4371
  https://github.com/illumos/illumos-gate/commit/43466aa

Ported by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2529
2014-07-28 14:29:58 -07:00
Brian Behlendorf
7a8f0e80ea zfs_trunc() should use dmu_tx_assign(tx, TXG_WAIT)
As part of the write throttle & i/o schedule performance work the
zfs_trunc() function should have been updated to use TXG_WAIT.
Using TXG_WAIT ensures that the tx will be part of the next txg.
If TXG_NOWAIT is used and retried for ERESTART errors then the
tx can suffer from starvation.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #2488
2014-07-22 09:41:38 -07:00
Brian Behlendorf
1e8db77102 Fix zil_commit() NULL dereference
Update the current code to ensure inodes are never dirtied if they are
part of a read-only file system or snapshot.  If they do somehow get
dirtied an attempt will make made to write them to disk.  In the case
of snapshots, which don't have a ZIL, this will result in a NULL
dereference in zil_commit().

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2405
2014-07-17 15:15:07 -07:00
Richard Yao
9d317793aa Implement File Attribute Support
We add support for lsattr and chattr to resolve a regression caused
by 88c283952f that broke Python's
xattr.list(). That changet broke Gentoo Portage's FEATURES=xattr,
which depended on Python's xattr.list().

Only attributes common to both Solaris and Linux are supported. These
are 'a', 'd' and 'i' in Linux's lsattr and chattr commands. File
attributes exclusive to Solaris are present in the ZFS code, but cannot
be accessed or modified through this method.  That was the case prior to
this patch. The resolution of issue zfsonlinux/zfs#229 should implement
some method to permit access and modification of Solaris-specific
attributes.

References:
  https://bugs.gentoo.org/show_bug.cgi?id=483516

Original-patch-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1691
2014-05-01 10:11:18 -07:00
Richard Yao
f3ad9cd67a Fix locking order in zfs_zget()
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-04-04 09:12:47 -07:00
Richard Yao
6f9548c487 Fix deadlock in zfs_zget()
zfsonlinux/zfs#180 occurred because of a race between inode eviction and
zfs_zget(). zfsonlinux/zfs@36df284 tried to address it by making a call
to the VFS to learn whether an inode is being evicted.  If it was being
evicted the operation was retried after dropping and reacquiring the
relevant resources.  Unfortunately, this introduced another deadlock.

  INFO: task kworker/u24:6:891 blocked for more than 120 seconds.
        Tainted: P           O 3.13.6 #1
  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
  kworker/u24:6   D ffff88107fcd2e80     0   891      2 0x00000000
  Workqueue: writeback bdi_writeback_workfn (flush-zfs-5)
   ffff8810370ff950 0000000000000002 ffff88103853d940 0000000000012e80
   ffff8810370fffd8 0000000000012e80 ffff88103853d940 ffff880f5c8be098
   ffff88107ffb6950 ffff8810370ff980 ffff88103a9a5b78 0000000000000000
  Call Trace:
   [<ffffffff813dd1d4>] schedule+0x24/0x70
   [<ffffffff8115fc09>] __wait_on_freeing_inode+0x99/0xc0
   [<ffffffff8115fdd8>] find_inode_fast+0x78/0xb0
   [<ffffffff811608c5>] ilookup+0x65/0xd0
   [<ffffffffa035c5ab>] zfs_zget+0xdb/0x260 [zfs]
   [<ffffffffa03589d6>] zfs_get_data+0x46/0x340 [zfs]
   [<ffffffffa035fee1>] zil_add_block+0xa31/0xc00 [zfs]
   [<ffffffffa0360642>] zil_commit+0x12/0x20 [zfs]
   [<ffffffffa036a6e4>] zpl_putpage+0x174/0x840 [zfs]
   [<ffffffff811071ec>] do_writepages+0x1c/0x40
   [<ffffffff8116df2b>] __writeback_single_inode+0x3b/0x2b0
   [<ffffffff8116ecf7>] writeback_sb_inodes+0x247/0x420
   [<ffffffff8116f5f3>] wb_writeback+0xe3/0x320
   [<ffffffff81170b8e>] bdi_writeback_workfn+0xfe/0x490
   [<ffffffff8106072c>] process_one_work+0x16c/0x490
   [<ffffffff810613f3>] worker_thread+0x113/0x390
   [<ffffffff81066edf>] kthread+0xdf/0x100

This patch implements the original fix in a slightly different manner in
order to avoid both deadlocks.  Instead of relying on a call to ilookup()
which can block in __wait_on_freeing_inode() the return value from igrab()
is used.  This gives us the information that ilookup() provided without
the risk of a deadlock.

Alternately, this race could be closed by registering an sops->drop_inode()
callback.  The callback would need to detect the active SA hold thereby
informing the VFS that this inode should not be evicted.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #180
2014-04-04 09:11:54 -07:00
Brian Behlendorf
8ac67298b1 Revert "Fixed a use-after-free bug in zfs_zget()."
This reverts commit 36df284366.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2014-04-03 16:23:28 -07:00
Tim Chase
6d111134c0 Implement relatime.
Add the "relatime" property.  When set to "on", a file's atime will only
be updated if the existing atime at least a day old or if the existing
ctime or mtime has been updated since the last access.  This behavior
is compatible with the Linux "relatime" mount option.

Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2064
Closes #1917
2014-01-29 15:50:44 -08:00
Brian Behlendorf
7f89ae6ba0 Use local variable to read zp->z_mode
When accessing the zp->z_mode through the SA bulk interface we
expect that 64-bits are available to hold the result.  However,
on 32-bit platforms mode_t will only be 32-bits so we cannot
pass it to SA_ADD_BULK_ATTR().  Instead a local uint64_t variable
must be used and the result assigned to zp->z_mode.

This went unnoticed on 32-bit little endian platforms because
the bytes happen to end up in the correct 32-bits.  But on big
endian platforms like Sparc the zp->z_mode will always end up
set to zero.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: marku89 <mar42@kola.li>
Issue #1700
2014-01-09 15:50:11 -08:00
Michael Kjorling
d1d7e2689d cstyle: Resolve C style issues
The vast majority of these changes are in Linux specific code.
They are the result of not having an automated style checker to
validate the code when it was originally written.  Others were
caused when the common code was slightly adjusted for Linux.

This patch contains no functional changes.  It only refreshes
the code to conform to style guide.

Everyone submitting patches for inclusion upstream should now
run 'make checkstyle' and resolve any warning prior to opening
a pull request.  The automated builders have been updated to
fail a build if when 'make checkstyle' detects an issue.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1821
2013-12-18 16:46:35 -08:00
Tim Chase
84b0aac5fd Fix atime handling.
Previously, the atime-modifying vnops called ZFS_ACCESSTIME_STAMP()
followed by zfs_inode_update() to update the atime.  However, since atimes
are cached in the znode for delayed writing, the zfs_inode_update()
function would effectively ignore the cached atime by reading it from
the SA.

This commit moves the updating of the atime in the inode into
zfs_tstamp_update_setup() which is called by the ZFS_ACCESSTIME_STAMP()
macro and eliminates the call to zfs_inode_update() in the atime-modifying
vnops.

It's possible the same thing could have been done directly in
zfs_inode_update() but I wasn't sure that it was safe in all cases where
it is called.

The effect is that atime handling is as if "strictatime" were selected;
even if the filesystem is mounted with "relatime".

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1949
2013-12-12 10:23:58 -08:00
Matthew Ahrens
384f8a09f8 Illumos #4347 ZPL can use dmu_tx_assign(TXG_WAIT)
Fix a lock contention issue by allowing threads not holding
ZPL locks to block when waiting to assign a transaction.

Porting Notes:

zfs_putpage() still uses TXG_NOWAIT, unlike the upstream version.  This
case may be a contention point just like zfs_write(), however it is not
safe to block here since it may be called during memory reclaim.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Boris Protopopov <boris.protopopov@nexenta.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/4347
  illumos/illumos-gate@e722410c49

Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-12-06 09:30:51 -08:00
Will Andrews
d3cc8b152e Illumos #3742
3742 zfs comments need cleaner, more consistent style
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://www.illumos.org/issues/3742
  illumos/illumos-gate@f717074149

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Porting notes:

1. The change to zfs_vfsops.c was dropped because it involves
   zfs_mount_label_policy, which does not exist in the Linux port.
2013-11-04 10:55:25 -08:00
Matthew Ahrens
2e528b49f8 Illumos #3598
3598 want to dtrace when errors are generated in zfs
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/3598
  illumos/illumos-gate@be6fd75a69

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Porting notes:

1. include/sys/zfs_context.h has been modified to render some new
   macros inert until dtrace is available on Linux.

2. Linux-specific changes have been adapted to use SET_ERROR().

3. I'm NOT happy about this change.  It does nothing but ugly
   up the code under Linux.  Unfortunately we need to take it to
   avoid more merge conflicts in the future.  -Brian
2013-10-31 14:58:04 -07:00
Brian Behlendorf
17897ce2c8 Fix uninitialized variables
When compiling on an ARM device using gcc 4.7.3 several variables
in the zfs_obj_to_path_impl() function were flagged as uninitialized.
To resolve the warnings explicitly initialize them to zero.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1716
2013-09-13 13:31:24 -07:00
Richard Yao
570d6edf1d Linux 3.8 compat: Support CONFIG_UIDGID_STRICT_TYPE_CHECKS
When CONFIG_UIDGID_STRICT_TYPE_CHECKS is enabled uid_t/git_t are
replaced by kuid_t/kgid_t, which are structures instead of integral
types. This causes any code that uses an integral type to fail to build.
The User Namespace functionality introduced in Linux 3.8 requires
CONFIG_UIDGID_STRICT_TYPE_CHECKS, so we could not build against any
kernel that supported it.

We resolve this by converting between the new kuid_t/kgid_t structures
and the original uid_t/gid_t types.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1589
2013-08-09 15:31:52 -07:00
Madhav Suresh
c99c90015e Illumos #3006
3006 VERIFY[S,U,P] and ASSERT[S,U,P] frequently check if first
     argument is zero

Reviewed by Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by George Wilson <george.wilson@delphix.com>
Approved by Eric Schrock <eric.schrock@delphix.com>

References:
  illumos/illumos-gate@fb09f5aad4
  https://illumos.org/issues/3006

Requires:
  zfsonlinux/spl@1c6d149feb

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1509
2013-06-19 15:14:10 -07:00
Brian Behlendorf
7b3e34ba5a Fix 'zfs rollback' on mounted file systems
Rolling back a mounted filesystem with open file handles and
cached dentries+inodes never worked properly in ZoL.  The
major issue was that Linux provides no easy mechanism for
modules to invalidate the inode cache for a file system.

Because of this it was possible that an inode from the previous
filesystem would not get properly dropped from the cache during
rolling back.  Then a new inode with the same inode number would
be create and collide with the existing cached inode.  Ideally
this would trigger an VERIFY() but in practice the error wasn't
handled and it would just NULL reference.

Luckily, this issue can be resolved by sprucing up the existing
Solaris zfs_rezget() functionality for the Linux VFS.

The way it works now is that when a file system is rolled back
all the cached inodes will be traversed and refetched from disk.
If a version of the cached inode exists on disk the in-core
copy will be updated accordingly.  If there is no match for that
object on disk it will be unhashed from the inode cache and
marked as stale.

This will effectively make the inode unfindable for lookups
allowing the inode number to be immediately recycled.  The inode
will then only be accessible from the cached dentries.  Subsequent
dentry lookups which reference a stale inode will result in the
dentry being invalidated.  Once invalidated the dentry will drop
its reference on the inode allowing it to be safely pruned from
the cache.

Special care is taken for negative dentries since they do not
reference any inode.  These dentires will be invalidate based
on when they were added to the dentry cache.  Entries added
before the last rollback will be invalidate to prevent them
from masking real files in the dataset.

Two nice side effects of this fix are:

* Removes the dependency on spl_invalidate_inodes(), it can now
  be safely removed from the SPL when we choose to do so.

* zfs_znode_alloc() no longer requires a dentry to be passed.
  This effectively reverts this portition of the code to its
  upstream counterpart.  The dentry is not instantiated more
  correctly in the Linux ZPL layer.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Closes #795
2013-01-17 09:51:20 -08:00
Brian Behlendorf
e89260a1c8 Directory xattr znodes hold a reference on their parent
Unlike normal file or directory znodes, an xattr znode is
guaranteed to only have a single parent.  Therefore, we can
take a refernce on that parent if it is provided at create
time and cache it.  Additionally, we take care to cache it
on any subsequent zfs_zaccess() where the parent is provided
as an optimization.

This allows us to avoid needing to do a zfs_zget() when
setting up the SELinux security xattr in the create path.
This is critical because a hash lookup on the directory
will deadlock since it is locked.

The zpl_xattr_security_init() call has also been moved up
to the zpl layer to ensure TXs to create the required
xattrs are performed after the create TX.  Otherwise we
run the risk of deadlocking on the open create TX.

Ideally the security xattr should be fully constructed
before the new inode is unlocked.  However, doing so would
require far more extensive changes to ZFS.

This change may also have the benefitial side effect of
ensuring xattr directory znodes are evicted from the cache
before normal file or directory znodes due to the extra
reference.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #671
2012-12-03 12:10:46 -08:00
Brian Behlendorf
004324ecc6 Disable page allocation warnings for super block
Due to the slightly increased size of the ZFS super block
caused by 30315d2 there are now allocation warnings.  The
allocation size is still small (just over 8k) and super
blocks are rarely allocated so we suppress the warning.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1101
2012-11-30 11:04:44 -08:00
Richard Yao
b8d06fca08 Switch KM_SLEEP to KM_PUSHPAGE
Differences between how paging is done on Solaris and Linux can cause
deadlocks if KM_SLEEP is used in any the following contexts.

  * The txg_sync thread
  * The zvol write/discard threads
  * The zpl_putpage() VFS callback

This is because KM_SLEEP will allow for direct reclaim which may result
in the VM calling back in to the filesystem or block layer to write out
pages.  If a lock is held over this operation the potential exists to
deadlock the system.  To ensure forward progress all memory allocations
in these contexts must us KM_PUSHPAGE which disables performing any I/O
to accomplish the memory allocation.

Previously, this behavior was acheived by setting PF_MEMALLOC on the
thread.  However, that resulted in unexpected side effects such as the
exhaustion of pages in ZONE_DMA.  This approach touchs more of the zfs
code, but it is more consistent with the right way to handle these cases
under Linux.

This is patch lays the ground work for being able to safely revert the
following commits which used PF_MEMALLOC:

  21ade34 Disable direct reclaim for z_wr_* threads
  cfc9a5c Fix zpl_writepage() deadlock
  eec8164 Fix ASSERTION(!dsl_pool_sync_context(tx->tx_pool))

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #726
2012-08-27 12:01:37 -07:00
Brian Behlendorf
ebe7e575ea Add .zfs control directory
Add support for the .zfs control directory.  This was accomplished
by leveraging as much of the existing ZFS infrastructure as posible
and updating it for Linux as required.  The bulk of the core
functionality is now all there with the following limitations.

*) The .zfs/snapshot directory automount support requires a 2.6.37
   or newer kernel.  The exception is RHEL6.2 which has backported
   the d_automount patches.

*) Creating/destroying/renaming snapshots with mkdir/rmdir/mv
   in the .zfs/snapshot directory works as expected.  However,
   this functionality is only available to root until zfs
   delegations are finished.

      * mkdir - create a snapshot
      * rmdir - destroy a snapshot
      * mv    - rename a snapshot

The following issues are known defeciences, but we expect them to
be addressed by future commits.

*) Add automount support for kernels older the 2.6.37.  This should
   be possible using follow_link() which is what Linux did before.

*) Accessing the .zfs/snapshot directory via NFS is not yet possible.
   The majority of the ground work for this is complete.  However,
   finishing this work will require resolving some lingering
   integration issues with the Linux NFS kernel server.

*) The .zfs/shares directory exists but no futher smb functionality
   has yet been implemented.

Contributions-by: Rohan Puri <rohan.puri15@gmail.com>
Contributiobs-by: Andrew Barnes <barnes333@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #173
2012-03-22 13:03:47 -07:00
Brian Behlendorf
ab26409db7 Linux 3.1 compat, super_block->s_shrink
The Linux 3.1 kernel has introduced the concept of per-filesystem
shrinkers which are directly assoicated with a super block.  Prior
to this change there was one shared global shrinker.

The zfs code relied on being able to call the global shrinker when
the arc_meta_limit was exceeded.  This would cause the VFS to drop
references on a fraction of the dentries in the dcache.  The ARC
could then safely reclaim the memory used by these entries and
honor the arc_meta_limit.  Unfortunately, when per-filesystem
shrinkers were added the old interfaces were made unavailable.

This change adds support to use the new per-filesystem shrinker
interface so we can continue to honor the arc_meta_limit.  The
major benefit of the new interface is that we can now target
only the zfs filesystem for dentry and inode pruning.  Thus we
can minimize any impact on the caching of other filesystems.

In the context of making this change several other important
issues related to managing the ARC were addressed, they include:

* The dnlc_reduce_cache() function which was called by the ARC
to drop dentries for the Posix layer was replaced with a generic
zfs_prune_t callback.  The ZPL layer now registers a callback to
drop these dentries removing a layering violation which dates
back to the Solaris code.  This callback can also be used by
other ARC consumers such as Lustre.

  arc_add_prune_callback()
  arc_remove_prune_callback()

* The arc_reduce_dnlc_percent module option has been changed to
arc_meta_prune for clarity.  The dnlc functions are specific to
Solaris's VFS and have already been largely eliminated already.
The replacement tunable now represents the number of bytes the
prune callback will request when invoked.

* Less aggressively invoke the prune callback.  We used to call
this whenever we exceeded the arc_meta_limit however that's not
strictly correct since it results in over zeleous reclaim of
dentries and inodes.  It is now only called once the arc_meta_limit
is exceeded and every effort has been made to evict other data from
the ARC cache.

* More promptly manage exceeding the arc_meta_limit.  When reading
meta data in to the cache if a buffer was unable to be recycled
notify the arc_reclaim thread to invoke the required prune.

* Added arcstat_prune kstat which is incremented when the ARC
is forced to request that a consumer prune its cache.  Remember
this will only occur when the ARC has no other choice.  If it
can evict buffers safely without invoking the prune callback
it will.

* This change is also expected to resolve the unexpect collapses
of the ARC cache.  This would occur because when exceeded just the
arc_meta_limit reclaim presure would be excerted on the arc_c
value via arc_shrink().  This effectively shrunk the entire cache
when really we just needed to reclaim meta data.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #466
Closes #292
2012-01-11 11:46:02 -08:00
Darik Horn
28eb9213d8 Linux 3.2 compat: set_nlink()
Directly changing inode->i_nlink is deprecated in Linux 3.2 by commit

  SHA: bfe8684869601dacfcb2cd69ef8cfd9045f62170

Use the new set_nlink() kernel function instead.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes: #462
2011-12-16 20:02:52 -08:00
Brian Behlendorf
82a37189aa Implement SA based xattrs
The current ZFS implementation stores xattrs on disk using a hidden
directory.  In this directory a file name represents the xattr name
and the file contexts are the xattr binary data.  This approach is
very flexible and allows for arbitrarily large xattrs.  However,
it also suffers from a significant performance penalty.  Accessing
a single xattr can requires up to three disk seeks.

  1) Lookup the dnode object.
  2) Lookup the dnodes's xattr directory object.
  3) Lookup the xattr object in the directory.

To avoid this performance penalty Linux filesystems such as ext3
and xfs try to store the xattr as part of the inode on disk.  When
the xattr is to large to store in the inode then a single external
block is allocated for them.  In practice most xattrs are small
and this approach works well.

The addition of System Attributes (SA) to zfs provides us a clean
way to make this optimization.  When the dataset property 'xattr=sa'
is set then xattrs will be preferentially stored as System Attributes.
This allows tiny xattrs (~100 bytes) to be stored with the dnode and
up to 64k of xattrs to be stored in the spill block.  If additional
xattr space is required, which is unlikely under Linux, they will be
stored using the traditional directory approach.

This optimization results in roughly a 3x performance improvement
when accessing xattrs which brings zfs roughly to parity with ext4
and xfs (see table below).  When multiple xattrs are stored per-file
the performance improvements are even greater because all of the
xattrs stored in the spill block will be cached.

However, by default SA based xattrs are disabled in the Linux port
to maximize compatibility with other implementations.  If you do
enable SA based xattrs then they will not be visible on platforms
which do not support this feature.

----------------------------------------------------------------------
   Time in seconds to get/set one xattr of N bytes on 100,000 files
------+--------------------------------+------------------------------
      |            setxattr            |            getxattr
bytes |  ext4     xfs zfs-dir  zfs-sa  |  ext4     xfs zfs-dir  zfs-sa
------+--------------------------------+------------------------------
1     |  2.33   31.88   21.50    4.57  |  2.35    2.64    6.29    2.43
32    |  2.79   30.68   21.98    4.60  |  2.44    2.59    6.78    2.48
256   |  3.25   31.99   21.36    5.92  |  2.32    2.71    6.22    3.14
1024  |  3.30   32.61   22.83    8.45  |  2.40    2.79    6.24    3.27
4096  |  3.57  317.46   22.52   10.73  |  2.78   28.62    6.90    3.94
16384 |   n/a 2342.39   34.30   19.20  |   n/a   45.44  145.90    7.55
65536 |   n/a 2941.39  128.15  131.32* |   n/a  141.92  256.85  262.12*

Legend:
* ext4      - Stock RHEL6.1 ext4 mounted with '-o user_xattr'.
* xfs       - Stock RHEL6.1 xfs mounted with default options.
* zfs-dir   - Directory based xattrs only.
* zfs-sa    - Prefer SAs but spill in to directories as needed, a
              trailing * indicates overflow in to directories occured.

NOTE: Ext4 supports 4096 bytes of xattr name/value pairs per file.
NOTE: XFS and ZFS have no limit on xattr name/value pairs per file.
NOTE: Linux limits individual name/value pairs to 65536 bytes.
NOTE: All setattr/getattr's were done after dropping the cache.
NOTE: All tests were run against a single hard drive.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #443
2011-11-28 15:45:51 -08:00
Brian Behlendorf
2708f716c0 Fix usage of zsb after free
Caught by code inspection, the variable zsb was referenced after
being freed.  Move the kmem_free() to the end of the function.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-09-09 10:29:48 -07:00
Tim Haley
7b8518cb8d Illumos #xxx: zdb -vvv broken after zfs diff integration
References to Illumos issue and patch:
- https://github.com/illumos/illumos-gate/commit/163eb7ff

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #340
2011-08-01 12:09:02 -07:00
Brian Behlendorf
22872ff5da Use zfs_mknode() to create dataset root
Long, long, long ago when the effort to port ZFS was begun
the zfs_create_fs() function was heavily modified to remove
all of its VFS dependencies.  This allowed Lustre to use
the dataset without us having to spend the time porting all
the required VFS code.

Fast-forward several years and we now have all the VFS code
in place but are still relying on the modified zfs_create_fs().
This isn't required anymore and we can now use zfs_mknode()
to create the root znode for the filesystem.

This commit reverts the contents of zfs_create_fs() to largely
match the upstream OpenSolaris code.  There have been minor
modifications to accomidate the Linux VFS but that is all.

This code fixes issue #116 by bootstraping enough of the VFS
data structures so we can rely on zfs_mknode() to create the
root directory.  This ensures it is created properly with
support for system attributes.  Previously it wasn't which
is why it behaved differently that all other directories
when modified.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #116
2011-07-20 19:52:26 -07:00
Gunnar Beutner
3c9609b322 Renamed HAVE_SHARE ifdefs to HAVE_SMB_SHARE.
The remaining code that is guarded by HAVE_SHARE ifdefs is related to the
.zfs/shares functionality which is currently not available on Linux.

On Solaris the .zfs/shares directory can be used to set permissions for
SMB shares.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-07-06 09:20:28 -07:00
Rohan Puri
a89c3e0bd5 Support mandatory locks (nbmand)
The Linux kernel already has support for mandatory locking.  This
change just replaces the Solaris mandatory locking calls with the
Linux equivilants.  In fact, it looks like this code could be
removed entirely because this checking is already done generically
in the Linux VFS.  However, for now we'll leave it in place even
if it is redundant just in case we missed something.

The original patch to update the code to support mandatory locking
was done by Rohan Puri.  This patch is an updated version which is
compatible with the previous mount option handling changes.

Original-Patch-by: Rohan Puri <rohan.puri15@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #222
Closes #253
2011-07-01 13:39:40 -07:00
Brian Behlendorf
5c03efc379 Linux compat 2.6.39: security_inode_init_security()
The security_inode_init_security() function now takes an additional
qstr argument which must be passed in from the dentry if available.
Passing a NULL is safe when no qstr is available the relevant
security checks will just be skipped.

Closes #246
Closes #217
Closes #187
2011-07-01 12:40:08 -07:00
Brian Behlendorf
f0b2486034 Remove unused MMAP functions
The following functions were required for the OpenSolaris mmap
implementation.  Because the Linux VFS does most the most heavy
lifting for us they are not required and are being removed to
keep the code clean and easy to understand.

  * zfs_null_putapage()
  * zfs_frlock()
  * zfs_no_putpage()

Signed-off-by: Brian Behlendorf <behlendorf@llnl.gov>
2011-07-01 12:22:57 -07:00
Ned A. Bass
aa6d8c1086 Don't store rdev in SA for FIFOs and sockets
Update the handling of named pipes and sockets to be consistent with
other platforms with regard to the rdev attribute.  While all ZFS
ipmlementations store the rdev for device files in a system attribute
(SA), this is not the case for FIFOs and sockets.  Indeed, Linux always
passes rdev=0 to mknod() for FIFOs and sockets, so the value is not
needed.  Add an ASSERT that rdev==0 for FIFOs and sockets to detect if
the expected behavior ever changes.

Closes #216
2011-05-09 13:35:07 -07:00
Brian Behlendorf
5f35b19007 Fully update inode when created
When a new znode/inode pair is created both the znode and the inode
should be immediately updated to the correct values.  This was done
for the znode and for most of the values in the inode, but not all
of them.  This normally wasn't a problem because most subsequent
operations would cause the inode to be immediately updated.  This
change ensures the inode is now fully updated before it is inserted
in to the inode hash.

Closes #116
Closes #146
Closes #164
2011-05-02 14:04:19 -07:00
Gunnar Beutner
36df284366 Fixed a use-after-free bug in zfs_zget().
Fixed a bug where zfs_zget could access a stale znode pointer when
the inode had already been removed from the inode cache via iput ->
iput_final -> ... -> zfs_zinactive but the corresponding SA handle
was still alive.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #180
2011-04-21 13:48:01 -07:00
Brian Behlendorf
c85b224faf Call d_instantiate before unlocking inode
Under Linux a dentry referencing an inode must be instantiated before
the inode is unlocked.  To accomplish this without overly modifing
the core ZFS code the dentry it passed via the vattr_t.  There are
cases such as replay when a dentry is not available.  In which case
it is obviously not initialized at inode creation time, if a dentry
is needed it will be spliced as when required via d_lookup().
2011-04-07 09:51:57 -07:00
Brian Behlendorf
d6bd8eaae4 Fix evict() deadlock
Now that KM_SLEEP is not defined as GFP_NOFS there is the possibility
of synchronous reclaim deadlocks.  These deadlocks never existed in the
original OpenSolaris code because all memory reclaim on Solaris is done
asyncronously.  Linux does both synchronous (direct) and asynchronous
(indirect) reclaim.

This commit addresses a deadlock caused by inode eviction.  A KM_SLEEP
allocation may trigger direct memory reclaim and shrink the inode cache.
This can occur while a mutex in the array of ZFS_OBJ_HOLD mutexes is
held.  Through the ->shrink_icache_memory()->evict()->zfs_inactive()->
zfs_zinactive() call path the same mutex may be reacquired resulting
in a deadlock.  To avoid this deadlock the process must not reacquire
the mutex when it is already holding it.

This is a reasonable fix for now but longer term the ZFS_OBJ_HOLD
mutex locking should be reevaluated.  This infrastructure already
prevents us from ever using the Linux lock dependency analysis tools,
and it may limit scalability.
2011-03-22 12:14:55 -07:00
Brian Behlendorf
691f6ac4c2 Use KM_PUSHPAGE instead of KM_SLEEP
It used to be the case that all KM_SLEEP allocations were GFS_NOFS.
Unfortunately this often resulted in the kernel being unable to
reclaim the ARC, inode, and dentry caches in a timely manor.
The fix was to make KM_SLEEP a GFP_KERNEL allocation in the SPL.

However, this increases the posibility of deadlocking the system
on a zfs write thread.  If a zfs write thread attempts to perform
an allocation it may trigger synchronous reclaim.  This reclaim
may attempt to flush dirty data/inode to disk to free memory.
Unforunately, this write cannot finish because the write thread
which would handle it is holding the previous transaction open.
Deadlock.

To avoid this all allocations in the zfs write thread path must
use KM_PUSHPAGE which prohibits synchronous reclaim for that
thread.  In this way forward progress in ensured.  The risk
with this change is I missed updating an allocation for the
write threads leaving an increased posibility of deadlock.  If
any deadlocks remain they will be unlikely but we'll have to
make sure they all get fixed.
2011-03-22 12:14:55 -07:00
Brian Behlendorf
adf2e8778e Fix O_APPEND Corruption
Due to an uninitialized variable files opened with O_APPEND may
overwrite the start of the file rather than append to it.  This
was introduced accidentally when I removed the Solaris vnodes.

The zfs_range_lock_writer() function used to key off zf->z_vnode
to determine if a znode_t was for a zvol of zpl object.  With
the removal of vnodes this was replaced by the flag zp->z_is_zvol.
This flag was used to control the append behavior for range locks.

Unfortunately, this value was never properly initialized after
the vnode removal.  However, because most of memory is usually
zeros it happened to be set correctly most of the time making
the bug appear racy.  Properly initializing zp->z_is_zvol to
zero completely resolves the problem with O_APPEND.

Closes #126
2011-03-09 13:31:00 -08:00
Brian Behlendorf
5484965ab6 Drop HAVE_XVATTR macros
When I began work on the Posix layer it immediately became clear to
me that to integrate cleanly with the Linux VFS certain Solaris
specific things would have to go.  One of these things was to elimate
as many Solaris specific types from the ZPL layer as possible.  They
would be replaced with their Linux equivalents.  This would not only
be good for performance, but for the general readability and health of
the code.  The Solaris and Linux VFS are different beasts and should
be treated as such.  Most of the code remains common for constructing
transactions and such, but there are subtle and important differenced
which need to be repsected.

This policy went quite for for certain types such as the vnode_t,
and it initially seemed to be working out well for the vattr_t.  There
was a relatively small amount of related xvattr_t code I was forced to
comment out with HAVE_XVATTR.  But it didn't look that hard to come
back soon and replace it all with a native Linux type.

However, after going doing this path with xvattr some distance it
clear that this code was woven in the ZPL more deeply than I thought.
In particular its hooks went very deep in to the ZPL replay code
and replacing it would not be as easy as I originally thought.

Rather than continue persuing replacing and removing this code I've
taken a step back and reevaluted things.  This commit reverts many of
my previous commits which removed xvattr related code.  It restores
much of the code to its original upstream state and now relies on
improved xvattr_t support in the zfs package itself.

The result of this is that much of the code which I had commented
out, which accidentally broke things like replay, is now back in
place and working.  However, there may be a small performance
impact for getattr/setattr operations because they now require
a translation from native Linux to Solaris types.  For now that's
a price I'm willing to pay.  Once everything is completely functional
we can revisting the issue of removing the vattr_t/xvattr_t types.

Closes #111
2011-03-02 11:44:34 -08:00