Commit Graph

82 Commits

Author SHA1 Message Date
Matthew Ahrens
f1512ee61e Illumos 5027 - zfs large block support
5027 zfs large block support
Reviewed by: Alek Pinchuk <pinchuk.alek@gmail.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Josef 'Jeff' Sipek <josef.sipek@nexenta.com>
Reviewed by: Richard Elling <richard.elling@richardelling.com>
Reviewed by: Saso Kiselkov <skiselkov.ml@gmail.com>
Reviewed by: Brian Behlendorf <behlendorf1@llnl.gov>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5027
  https://github.com/illumos/illumos-gate/commit/b515258

Porting Notes:

* Included in this patch is a tiny ISP2() cleanup in zio_init() from
Illumos 5255.

* Unlike the upstream Illumos commit this patch does not impose an
arbitrary 128K block size limit on volumes.  Volumes, like filesystems,
are limited by the zfs_max_recordsize=1M module option.

* By default the maximum record size is limited to 1M by the module
option zfs_max_recordsize.  This value may be safely increased up to
16M which is the largest block size supported by the on-disk format.
At the moment, 1M blocks clearly offer a significant performance
improvement but the benefits of going beyond this for the majority
of workloads are less clear.

* The illumos version of this patch increased DMU_MAX_ACCESS to 32M.
This was determined not to be large enough when using 16M blocks
because the zfs_make_xattrdir() function will fail (EFBIG) when
assigning a TX.  This was immediately observed under Linux because
all newly created files must have a security xattr created and
that was failing.  Therefore, we've set DMU_MAX_ACCESS to 64M.

* On 32-bit platforms a hard limit of 1M is set for blocks due
to the limited virtual address space.  We should be able to relax
this one the ABD patches are merged.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #354
2015-05-11 12:23:16 -07:00
Chunwei Chen
07012da668 Fix kernel panic due to tsd_exit in ZFS_EXIT(zsb)
The following panic would occur under certain heavy load:
[ 4692.202686] Kernel panic - not syncing: thread ffff8800c4f5dd60 terminating with rrw lock ffff8800da1b9c40 held
[ 4692.228053] CPU: 1 PID: 6250 Comm: mmap_deadlock Tainted: P           OE  3.18.10 #7

The culprit is that ZFS_EXIT(zsb) would call tsd_exit() every time, which
would purge all tsd data for the thread. However, ZFS_ENTER is designed to be
reentrant, so we cannot allow ZFS_EXIT to blindly purge tsd data.

Instead, we rely on the new behavior of tsd_set. When NULL is passed as the
new value to tsd_set, it will automatically remove the tsd entry specified the
the key for the current thread.

rrw_tsd_key and zfs_allow_log_key already calls tsd_set(key, NULL) when
they're done. The zfs_fsyncer_key relied on ZFS_EXIT(zsb) to call tsd_exit() to
do clean up. Now we explicitly call tsd_set(key, NULL) on them.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #3247
2015-04-24 14:57:54 -07:00
Brian Behlendorf
79c76d5b65 Change KM_PUSHPAGE -> KM_SLEEP
By marking DMU transaction processing contexts with PF_FSTRANS
we can revert the KM_PUSHPAGE -> KM_SLEEP changes.  This brings
us back in line with upstream.  In some cases this means simply
swapping the flags back.  For others fnvlist_alloc() was replaced
by nvlist_alloc(..., KM_PUSHPAGE) and must be reverted back to
fnvlist_alloc() which assumes KM_SLEEP.

The one place KM_PUSHPAGE is kept is when allocating ARC buffers
which allows us to dip in to reserved memory.  This is again the
same as upstream.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2015-01-16 14:41:26 -08:00
Brian Behlendorf
d958324f97 Fix zfs_putpage() lock inversion (again)
This is a follow up commit to 74328ee which correctly resolved a lock
inversion between zfs_putpage() and zfs_free_range().  Unfortunately,
in the process it accidentally introduced another inversion between
zfs_putpage() and zfs_read().  The page must be unlocked before taking
the range lock.  This patch corrects that issue.

In addition, because the locking rules here are subtle a block comment
has been added clearly explaining why the ordering here is critical.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Issue #2976
2015-01-08 16:09:41 -08:00
Brian Behlendorf
74328ee18f Fix zfs_putpage() lock inversion
There exists a lock inversions involving the zfs range lock and the
individual page writeback bits which can result in a deadlock.  To
prevent this we must always manipulate the writeback bit while
holding the range lock.  The exact deadlock is as follows:

------ Process A ------        ------ Process B ------
zpl_writepages                 zpl_fallocate
write_cache_pages              zpl_fallocate_common
zpl_putpage                    zfs_space
zfs_putpage (set bit)          zfs_freesp
zfs_range_lock (wait on lock)  zfs_free_range (take lock)
[has not yet initiated I/O,    truncate_inode_pages_range
the bit will not be cleared]   wait_on_page_writeback (wait on bit)

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Richard Yao <richard.yao@clusterhq.com>
Issue #2976
2014-12-22 09:31:56 -08:00
Brian Behlendorf
c944be5d7e Fix snapshots with dirty inodes
Filesystems which are mounted read-only or are immutable because
they are snapshots must not be allowed to dirty and inode.  This
will result in a write which will correctly cause a kernel panic
because these filesystem are (and must be) immutable.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2812
2014-11-20 10:38:16 -08:00
Ned Bass
bc151f7b31 Remove checks for mandatory locks
The Linux VFS handles mandatory locks generically so we shouldn't
need to check for conflicting locks in zfs_read(), zfs_write(), or
zfs_freesp().  Linux 3.18 removed the lock_may_read() and
lock_may_write() interfaces which we were relying on for this
purpose.  Rather than emulating those interfaces we remove the
redundant checks.

Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2804
2014-10-22 11:06:53 -07:00
Matthew Ahrens
d97aa48f7c Illumos 5139 - SEEK_HOLE failed to report a hole at end of file
5139 SEEK_HOLE failed to report a hole at end of file
Reviewed by: Adam Leventhal <adam.leventhal@delphix.com>
Reviewed by: Alex Reece <alex.reece@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Max Grossman <max.grossman@delphix.com>
Reviewed by: Peng Dai <peng.dai@delphix.com>
Reviewed by: Richard Elling <richard.elling@gmail.com>
Approved by: Dan McDonald <danmcd@omniti.com>

References:
  https://www.illumos.org/issues/5139
  https://github.com/illumos/illumos-gate/commit/0fbc0cd

Ported by: Turbo Fredriksson <turbo@bayour.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2714
2014-09-23 10:38:45 -07:00
Tim Chase
223df0161f Implement fallocate FALLOC_FL_PUNCH_HOLE
Add support for the FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE mode of
fallocate(2).  Mimic the behavior of other native file systems such as
ext4 in cases where the file might be extended. If the offset is beyond
the end of the file, return success without changing the file. If the
extent of the punched hole would extend the file, only the existing tail
of the file is punched.

Add the zfs_zero_partial_page() function, modeled after update_page(),
to handle zeroing partial pages in a hole-punching operation.  It must
be used under a range lock for the requested region in order that the
ARC and page cache stay in sync.

Move the existing page cache truncation via truncate_setsize() into
zfs_freesp() for better source structure compatibility with upstream code.

Add page cache truncation to zfs_freesp() and zfs_free_range() to handle
hole punching.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Tim Chase <tim@chase2k.com>
Closes #2619
2014-09-08 13:52:25 -07:00
Brian Behlendorf
0a50679ce9 Add zfs_iput_async() interface
Handle all iputs in zfs_purgedir() and zfs_inode_destroy()
asynchronously to prevent deadlocks.  When the iputs are allowed
to run synchronously in the destroy call path deadlocks between
xattr directory inodes and their parent file inodes are possible.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Closes #457
2014-08-11 16:11:43 -07:00
Brian Behlendorf
1e8db77102 Fix zil_commit() NULL dereference
Update the current code to ensure inodes are never dirtied if they are
part of a read-only file system or snapshot.  If they do somehow get
dirtied an attempt will make made to write them to disk.  In the case
of snapshots, which don't have a ZIL, this will result in a NULL
dereference in zil_commit().

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #2405
2014-07-17 15:15:07 -07:00
Brian Behlendorf
b585bc4afa Fix zfs_getattr_fast types
On Sparc sp->blksize will be a 64-bit value which is then cast
incorrectly to a 32-bit value.  For big endian systems this
results in an incorrect value for sp->blksize.  To resolve the
problem local variables of the correct size are used and then
assigned to sp->blksize.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: marku89 <mar42@kola.li>
Issue #1700
2014-01-09 15:50:23 -08:00
Michael Kjorling
d1d7e2689d cstyle: Resolve C style issues
The vast majority of these changes are in Linux specific code.
They are the result of not having an automated style checker to
validate the code when it was originally written.  Others were
caused when the common code was slightly adjusted for Linux.

This patch contains no functional changes.  It only refreshes
the code to conform to style guide.

Everyone submitting patches for inclusion upstream should now
run 'make checkstyle' and resolve any warning prior to opening
a pull request.  The automated builders have been updated to
fail a build if when 'make checkstyle' detects an issue.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1821
2013-12-18 16:46:35 -08:00
Chunwei Chen
7dc71949f2 Fix z_sync_cnt decrement in zfs_close
The comment in zfs_close states that "Under Linux the zfs_close() hook
is not symmetric with zfs_open()". This is not true. zfs_open/zfs_close
is associated with every successful struct file creation/deletion, which
should always be balanced.

Here is an example of what's wrong:

Process A		B
	open(O_SYNC)
	z_sync_cnt = 1
			open(O_SYNC)
			z_sync_cnt = 2
	close()
	z_sync_cnt = 0

So z_sync_cnt is 0 even if B still has the file with O_SYNC.

Also moves the generic_file_open call before zfs_open to ensure that in
the case generic_file_open fails z_sync_cnt is not incremented.  This
is safe because generic_file_open has no side effects.

Signed-off-by: Chunwei Chen <tuxoko@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1962
2013-12-17 10:28:27 -08:00
Tim Chase
84b0aac5fd Fix atime handling.
Previously, the atime-modifying vnops called ZFS_ACCESSTIME_STAMP()
followed by zfs_inode_update() to update the atime.  However, since atimes
are cached in the znode for delayed writing, the zfs_inode_update()
function would effectively ignore the cached atime by reading it from
the SA.

This commit moves the updating of the atime in the inode into
zfs_tstamp_update_setup() which is called by the ZFS_ACCESSTIME_STAMP()
macro and eliminates the call to zfs_inode_update() in the atime-modifying
vnops.

It's possible the same thing could have been done directly in
zfs_inode_update() but I wasn't sure that it was safe in all cases where
it is called.

The effect is that atime handling is as if "strictatime" were selected;
even if the filesystem is mounted with "relatime".

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1949
2013-12-12 10:23:58 -08:00
Matthew Ahrens
e8b96c6007 Illumos #4045 write throttle & i/o scheduler performance work
4045 zfs write throttle & i/o scheduler performance work

1. The ZFS i/o scheduler (vdev_queue.c) now divides i/os into 5 classes: sync
read, sync write, async read, async write, and scrub/resilver.  The scheduler
issues a number of concurrent i/os from each class to the device.  Once a class
has been selected, an i/o is selected from this class using either an elevator
algorithem (async, scrub classes) or FIFO (sync classes).  The number of
concurrent async write i/os is tuned dynamically based on i/o load, to achieve
good sync i/o latency when there is not a high load of writes, and good write
throughput when there is.  See the block comment in vdev_queue.c (reproduced
below) for more details.

2. The write throttle (dsl_pool_tempreserve_space() and
txg_constrain_throughput()) is rewritten to produce much more consistent delays
when under constant load.  The new write throttle is based on the amount of
dirty data, rather than guesses about future performance of the system.  When
there is a lot of dirty data, each transaction (e.g. write() syscall) will be
delayed by the same small amount.  This eliminates the "brick wall of wait"
that the old write throttle could hit, causing all transactions to wait several
seconds until the next txg opens.  One of the keys to the new write throttle is
decrementing the amount of dirty data as i/o completes, rather than at the end
of spa_sync().  Note that the write throttle is only applied once the i/o
scheduler is issuing the maximum number of outstanding async writes.  See the
block comments in dsl_pool.c and above dmu_tx_delay() (reproduced below) for
more details.

This diff has several other effects, including:

 * the commonly-tuned global variable zfs_vdev_max_pending has been removed;
use per-class zfs_vdev_*_max_active values or zfs_vdev_max_active instead.

 * the size of each txg (meaning the amount of dirty data written, and thus the
time it takes to write out) is now controlled differently.  There is no longer
an explicit time goal; the primary determinant is amount of dirty data.
Systems that are under light or medium load will now often see that a txg is
always syncing, but the impact to performance (e.g. read latency) is minimal.
Tune zfs_dirty_data_max and zfs_dirty_data_sync to control this.

 * zio_taskq_batch_pct = 75 -- Only use 75% of all CPUs for compression,
checksum, etc.  This improves latency by not allowing these CPU-intensive tasks
to consume all CPU (on machines with at least 4 CPU's; the percentage is
rounded up).

--matt

APPENDIX: problems with the current i/o scheduler

The current ZFS i/o scheduler (vdev_queue.c) is deadline based.  The problem
with this is that if there are always i/os pending, then certain classes of
i/os can see very long delays.

For example, if there are always synchronous reads outstanding, then no async
writes will be serviced until they become "past due".  One symptom of this
situation is that each pass of the txg sync takes at least several seconds
(typically 3 seconds).

If many i/os become "past due" (their deadline is in the past), then we must
service all of these overdue i/os before any new i/os.  This happens when we
enqueue a batch of async writes for the txg sync, with deadlines 2.5 seconds in
the future.  If we can't complete all the i/os in 2.5 seconds (e.g. because
there were always reads pending), then these i/os will become past due.  Now we
must service all the "async" writes (which could be hundreds of megabytes)
before we service any reads, introducing considerable latency to synchronous
i/os (reads or ZIL writes).

Notes on porting to ZFS on Linux:

- zio_t gained new members io_physdone and io_phys_children.  Because
  object caches in the Linux port call the constructor only once at
  allocation time, objects may contain residual data when retrieved
  from the cache. Therefore zio_create() was updated to zero out the two
  new fields.

- vdev_mirror_pending() relied on the depth of the per-vdev pending queue
  (vq->vq_pending_tree) to select the least-busy leaf vdev to read from.
  This tree has been replaced by vq->vq_active_tree which is now used
  for the same purpose.

- vdev_queue_init() used the value of zfs_vdev_max_pending to determine
  the number of vdev I/O buffers to pre-allocate.  That global no longer
  exists, so we instead use the sum of the *_max_active values for each of
  the five I/O classes described above.

- The Illumos implementation of dmu_tx_delay() delays a transaction by
  sleeping in condition variable embedded in the thread
  (curthread->t_delay_cv).  We do not have an equivalent CV to use in
  Linux, so this change replaced the delay logic with a wrapper called
  zfs_sleep_until(). This wrapper could be adopted upstream and in other
  downstream ports to abstract away operating system-specific delay logic.

- These tunables are added as module parameters, and descriptions added
  to the zfs-module-parameters.5 man page.

  spa_asize_inflation
  zfs_deadman_synctime_ms
  zfs_vdev_max_active
  zfs_vdev_async_write_active_min_dirty_percent
  zfs_vdev_async_write_active_max_dirty_percent
  zfs_vdev_async_read_max_active
  zfs_vdev_async_read_min_active
  zfs_vdev_async_write_max_active
  zfs_vdev_async_write_min_active
  zfs_vdev_scrub_max_active
  zfs_vdev_scrub_min_active
  zfs_vdev_sync_read_max_active
  zfs_vdev_sync_read_min_active
  zfs_vdev_sync_write_max_active
  zfs_vdev_sync_write_min_active
  zfs_dirty_data_max_percent
  zfs_delay_min_dirty_percent
  zfs_dirty_data_max_max_percent
  zfs_dirty_data_max
  zfs_dirty_data_max_max
  zfs_dirty_data_sync
  zfs_delay_scale

  The latter four have type unsigned long, whereas they are uint64_t in
  Illumos.  This accommodates Linux's module_param() supported types, but
  means they may overflow on 32-bit architectures.

  The values zfs_dirty_data_max and zfs_dirty_data_max_max are the most
  likely to overflow on 32-bit systems, since they express physical RAM
  sizes in bytes.  In fact, Illumos initializes zfs_dirty_data_max_max to
  2^32 which does overflow. To resolve that, this port instead initializes
  it in arc_init() to 25% of physical RAM, and adds the tunable
  zfs_dirty_data_max_max_percent to override that percentage.  While this
  solution doesn't completely avoid the overflow issue, it should be a
  reasonable default for most systems, and the minority of affected
  systems can work around the issue by overriding the defaults.

- Fixed reversed logic in comment above zfs_delay_scale declaration.

- Clarified comments in vdev_queue.c regarding when per-queue minimums take
  effect.

- Replaced dmu_tx_write_limit in the dmu_tx kstat file
  with dmu_tx_dirty_delay and dmu_tx_dirty_over_max.  The first counts
  how many times a transaction has been delayed because the pool dirty
  data has exceeded zfs_delay_min_dirty_percent.  The latter counts how
  many times the pool dirty data has exceeded zfs_dirty_data_max (which
  we expect to never happen).

- The original patch would have regressed the bug fixed in
  zfsonlinux/zfs@c418410, which prevented users from setting the
  zfs_vdev_aggregation_limit tuning larger than SPA_MAXBLOCKSIZE.
  A similar fix is added to vdev_queue_aggregate().

- In vdev_queue_io_to_issue(), dynamically allocate 'zio_t search' on the
  heap instead of the stack.  In Linux we can't afford such large
  structures on the stack.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Reviewed by: Ned Bass <bass6@llnl.gov>
Reviewed by: Brendan Gregg <brendan.gregg@joyent.com>
Approved by: Robert Mustacchi <rm@joyent.com>

References:
  http://www.illumos.org/issues/4045
  illumos/illumos-gate@69962b5647

Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1913
2013-12-06 09:32:43 -08:00
Matthew Ahrens
384f8a09f8 Illumos #4347 ZPL can use dmu_tx_assign(TXG_WAIT)
Fix a lock contention issue by allowing threads not holding
ZPL locks to block when waiting to assign a transaction.

Porting Notes:

zfs_putpage() still uses TXG_NOWAIT, unlike the upstream version.  This
case may be a contention point just like zfs_write(), however it is not
safe to block here since it may be called during memory reclaim.

Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Dan McDonald <danmcd@nexenta.com>
Reviewed by: Boris Protopopov <boris.protopopov@nexenta.com>
Approved by: Dan McDonald <danmcd@nexenta.com>

References:
  https://www.illumos.org/issues/4347
  illumos/illumos-gate@e722410c49

Ported-by: Ned Bass <bass6@llnl.gov>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2013-12-06 09:30:51 -08:00
Etienne Dechamps
119a394ab0 Only commit the ZIL once in zpl_writepages() (msync() case).
Currently, using msync() results in the following code path:

    sys_msync -> zpl_fsync -> filemap_write_and_wait_range -> zpl_writepages -> write_cache_pages -> zpl_putpage

In such a code path, zil_commit() is called as part of zpl_putpage().
This means that for each page, the write is handed to the DMU, the ZIL
is committed, and only then do we move on to the next page. As one might
imagine, this results in atrocious performance where there is a large
number of pages to write: instead of committing a batch of N writes,
we do N commits containing one page each. In some extreme cases this
can result in msync() being ~700 times slower than it should be, as well
as very inefficient use of ZIL resources.

This patch fixes this issue by making sure that the requested writes
are batched and then committed only once. Unfortunately, the
implementation is somewhat non-trivial because there is no way to run
write_cache_pages in SYNC mode (so that we get all pages) without
making it wait on the writeback tag for each page.

The solution implemented here is composed of two parts:

 - I added a new callback system to the ZIL, which allows the caller to
   be notified when its ITX gets written to stable storage. One nice
   thing is that the callback is called not only in zil_commit() but
   in zil_sync() as well, which means that the caller doesn't have to
   care whether the write ended up in the ZIL or the DMU: it will get
   notified as soon as it's safe, period. This is an improvement over
   dmu_tx_callback_register() that was used previously, which only
   supports DMU writes. The rationale for this change is to allow
   zpl_putpage() to be notified when a ZIL commit is completed without
   having to block on zil_commit() itself.

 - zpl_writepages() now calls write_cache_pages in non-SYNC mode, which
   will prevent (1) write_cache_pages from blocking, and (2) zpl_putpage
   from issuing ZIL commits. zpl_writepages() will issue the commit
   itself instead of relying on zpl_putpage() to do it, thus nicely
   batching the writes. Note, however, that we still have to call
   write_cache_pages() again in SYNC mode because there is an edge case
   documented in the implementation of write_cache_pages() whereas it
   will not give us all dirty pages when running in non-SYNC mode. Thus
   we need to run it at least once in SYNC mode to make sure we honor
   persistency guarantees. This only happens when the pages are
   modified at the same time msync() is running, which should be rare.
   In most cases there won't be any additional pages and this second
   call will do nothing.

Note that this change also fixes a bug related to #907 whereas calling
msync() on pages that were already handed over to the DMU in a previous
writepages() call would make msync() block until the next TXG sync
instead of returning as soon as the ZIL commit is complete. The new
callback system fixes that problem.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1849
Closes #907
2013-11-23 15:08:29 -08:00
George Wilson
03c6040bee Illumos #3236
3236 zio nop-write
Reviewed by: Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <chris.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  illumos/illumos-gate@80901aea8e
  https://www.illumos.org/issues/3236

Porting Notes

1. This patch is being merged dispite an increased instance of
   https://www.illumos.org/issues/3113 being triggered by ztest.

Ported-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1489
2013-11-05 12:14:21 -08:00
Will Andrews
d3cc8b152e Illumos #3742
3742 zfs comments need cleaner, more consistent style
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Reviewed by: George Wilson <george.wilson@delphix.com>
Reviewed by: Eric Schrock <eric.schrock@delphix.com>
Approved by: Christopher Siden <christopher.siden@delphix.com>

References:
  https://www.illumos.org/issues/3742
  illumos/illumos-gate@f717074149

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Porting notes:

1. The change to zfs_vfsops.c was dropped because it involves
   zfs_mount_label_policy, which does not exist in the Linux port.
2013-11-04 10:55:25 -08:00
Mark Shellenbaum
c0ebc844c7 6939941 problem with moving files in zfs
References:
  illumos/illumos-gate@d39ee142a9

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Porting notes:

1. This commit was so old that only two lines applied to the modern
   code base.
2013-11-04 10:53:18 -08:00
Matthew Ahrens
2e528b49f8 Illumos #3598
3598 want to dtrace when errors are generated in zfs
Reviewed by: Dan Kimmel <dan.kimmel@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Christopher Siden <christopher.siden@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/3598
  illumos/illumos-gate@be6fd75a69

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775

Porting notes:

1. include/sys/zfs_context.h has been modified to render some new
   macros inert until dtrace is available on Linux.

2. Linux-specific changes have been adapted to use SET_ERROR().

3. I'm NOT happy about this change.  It does nothing but ugly
   up the code under Linux.  Unfortunately we need to take it to
   avoid more merge conflicts in the future.  -Brian
2013-10-31 14:58:04 -07:00
George Wilson
a117a6d66e Illumos #3522
3522 zfs module should not allow uninitialized variables
Reviewed by: Sebastien Roy <seb@delphix.com>
Reviewed by: Adam Leventhal <ahl@delphix.com>
Reviewed by: Matthew Ahrens <mahrens@delphix.com>
Approved by: Garrett D'Amore <garrett@damore.org>

References:
  https://www.illumos.org/issues/3522
  illumos/illumos-gate@d5285cae91

Ported-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Porting notes:

1. ZFSOnLinux had already addressed many of these issues because of
   its use of -Wall. However, the manner in which they were addressed
   differed. The illumos fixes replace the ones previously made in
   ZFSOnLinux to reduce code differences.

2. Part of the upstream patch made a small change to arc.c that might
   address zfsonlinux/zfs#1334.

3. The initialization of aclsize in zfs_log_create() differs because
   vsecp is a NULL pointer on ZFSOnLinux.

4. The changes to zfs_register_callbacks() were dropped because it
   has diverged and needs to be resynced.
2013-10-30 14:51:27 -07:00
Richard Yao
9cac042cfe Reintroduce uio_prefaultpages()
This was accidentally removed by overzealous commenting.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #1775
2013-10-29 15:06:18 -07:00
Massimo Maggi
023699cd62 Posix ACL Support
This change adds support for Posix ACLs by storing them as an xattr
which is common practice for many Linux file systems.  Since the
Posix ACL is stored as an xattr it will not overwrite any existing
ZFS/NFSv4 ACLs which may have been set.  The Posix ACL will also
be non-functional on other platforms although it may be visible
as an xattr if that platform understands SA based xattrs.

By default Posix ACLs are disabled but they may be enabled with
the new 'aclmode=noacl|posixacl' property.  Set the property to
'posixacl' to enable them.  If ZFS/NFSv4 ACL support is ever added
an appropriate acltype will be added.

This change passes the POSIX Test Suite cleanly with the exception
of xacl/00.t test 45 which is incorrect for Linux (Ext4 fails too).

  http://www.tuxera.com/community/posix-test-suite/

Signed-off-by: Massimo Maggi <me@massimo-maggi.eu>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #170
2013-10-29 14:54:26 -07:00
Richard Yao
c12e3a594a Restructure zfs_readdir() to fix regressions
This does the following:

1. It creates a uint8_t type value, which is initialized to DT_DIR on
dot directories and ZFS_DIRENT_TYPE(zap.za_first_integer) otherwise.
This resolves a regression where we return unintialized values as the
directory entry type on dot directories. This was accidentally
introduced by commit 8170d28126.

2. It restructures zfs_readdir() code to use `uint64_t offset` like
Illumos instead of `loff_t *pos`. This resolves a regression where
negative ZAP cursors were treated as if they were dot directories.

3. It restructures the function to more closely match the structure of
zfs_readdir() on Illumos and removes the unused variable outcount, which
was only used on Illumos.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1750
2013-10-29 09:51:59 -07:00
Richard Yao
0f37d0c8be Linux 3.11 compat: fops->iterate()
Commit torvalds/linux@2233f31aad
replaced ->readdir() with ->iterate() in struct file_operations.
All filesystems must now use the new ->iterate method.

To handle this the code was reworked to use the new ->iterate
interface.  Care was taken to keep the majority of changes
confined to the ZPL layer which is already Linux specific.
However, minor changes were required to the common zfs_readdir()
function.

Compatibility with older kernels was accomplished by adding
versions of the trivial dir_emit* helper functions.  Also the
various *_readdir() functions were reworked in to wrappers
which create a dir_context structure to pass to the new
*_iterate() functions.

Unfortunately, the new dir_emit* functions prevent us from
passing a private pointer to the filldir function.  The xattr
directory code leveraged this ability through zfs_readdir()
to generate the list of xattr names.  Since we can no longer
use zfs_readdir() a simplified zpl_xattr_readdir() function
was added to perform the same task.

Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1653
Issue #1591
2013-08-15 16:19:07 -07:00
Richard Yao
8170d28126 Return correct type and offset from zfs_readdir
zfs_readdir() is used by getdents(), which provides a list of all files
in directory, their types and an offset that be used by llseek() to seek
to the next directory entry.

On Solaris, the first two directory entries "." and ".." respectively
have offsets 1 and 2 on ZFS while the other files have rather large
numbers. Currently, ZFSOnLinux is  giving "." offset 0 and all other
entries large numbers. The first entry's next entry offset points to
itself, which causes software that uses llseek() in conjunction with
getdents() for filesystem navigation to enter an infinite loop.  The
offsets used for each directory entry are filesystem specific on all
platforms, so we can fix this by adopting the Solaris behavior.

Also, we currently report each directory entry as having type 0 (???).
This is not wrong, but we can do better. getdents() on Solaris does not
appear to provide this information, but it does on Linux and Mac OS X
do. ZFS provides easy access to type information in zfs_readdir(), so
this patch provides this as well.

Reported-by: Andrey <andrey@kudinov.su>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1624
2013-08-07 16:16:43 -07:00
Li Dongyang
802e7b5feb Add SEEK_DATA/SEEK_HOLE to lseek()/llseek()
The approach taken was the rework zfs_holey() as little as
possible and then just wrap the code as needed to ensure
correct locking and error handling.

Tested with xfstests 285 and 286.  All tests pass except for
7-9 of 285 which try to reserve blocks first via fallocate(2)
and fail because fallocate(2) is not yet supported.

Note that the filp->f_lock spinlock did not exist prior to
Linux 2.6.30, but we avoid the need for autotools check by
virtue of the fact that SEEK_DATA/SEEK_HOLE support was not
added until Linux 3.1.

An autoconf check was added for lseek_execute() which is
currently a private function but the expectation is that it
will be exported perhaps as early as Linux 3.11.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1384
2013-07-02 09:24:43 -07:00
Matthew Ahrens
cf91b2b6b2 Readd zfs_holey() from OpenSolaris
This patch restores the zfs_holey() function from OpenSolaris.
This was removed by commit 3558fd7 because it wasn't clear we
had a use for it in ZoL.  However, this functionality is a
prerequisite for adding SEEK_DATA/SEEK_HOLE support to the ZPL.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Richard Yao <ryao@gentoo.org>
Issue #1384
2013-07-02 09:24:18 -07:00
Madhav Suresh
c99c90015e Illumos #3006
3006 VERIFY[S,U,P] and ASSERT[S,U,P] frequently check if first
     argument is zero

Reviewed by Matt Ahrens <matthew.ahrens@delphix.com>
Reviewed by George Wilson <george.wilson@delphix.com>
Approved by Eric Schrock <eric.schrock@delphix.com>

References:
  illumos/illumos-gate@fb09f5aad4
  https://illumos.org/issues/3006

Requires:
  zfsonlinux/spl@1c6d149feb

Ported-by: Tim Chase <tim@chase2k.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #1509
2013-06-19 15:14:10 -07:00
Brian Behlendorf
d5446cfc52 Revert "Remove TSD zfs_fsyncer_key"
This reverts commit 31f2b5abdf back
to the original code until the fsync(2) performance regression
can be addressed.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-12-20 09:56:28 -08:00
Brian Behlendorf
31f2b5abdf Remove TSD zfs_fsyncer_key
It's my understanding that the zfs_fsyncer_key TSD was added as
a performance omtimization to reduce contention on the zl_lock
from zil_commit().  This issue manifested itself as very long
(100+ms) fsync() system call times for fsync() heavy workloads.

However, under Linux I'm not seeing the same contention that
was originally described.  Therefore, I'm removing this code
in order to ween ourselves off any dependence on TSD.  If the
original performance issue reappears on Linux we can revisit
fixing it without resorting to TSD.

This just leaves one small ZFS TSD consumer.  If it can be
cleanly removed from the code we'll be able to shed the SPL
TSD implementation entirely.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes zfsonlinux/spl#174
2012-12-19 09:08:01 -08:00
Brian Behlendorf
8780c53961 Update SAs when an inode is dirtied
Revert the portion of commit d3aa3ea which always resulted in the
SAs being update when an mmap()'ed file was closed.  That change
accidentally resulted in unexpected ctime updates which upset tools
like git.  That was always a horrible hack and I'm happy it will
never make it in to a tagged release.

The right fix is something I initially resisted doing because I
was worried about the additional overhead.  However, in hindsight
the overhead isn't as bad as I feared.

This patch implemented the sops->dirty_inode() callback which is
unsurprisingly called when an inode is dirtied.  We leverage this
callback to keep the znode SAs strictly in sync with the inode.

However, for now we're going to go slowly to avoid introducing
any new unexpected issues by only updating the atime, mtime, and
ctime.  This will cover the callpath of most concern to us.

  ->filemap_page_mkwrite->file_update_time->update_time->
      mark_inode_dirty_sync->__mark_inode_dirty->dirty_inode

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #764
Closes #1140
2012-12-14 12:18:54 -08:00
Brian Behlendorf
d3aa3ea96e Preserve inode mtime/ctime in .writepage()
When updating a file via mmap()'ed I/O preserve the mtime/ctime
which were updated when the page was made writable by the generic
callback filemap_page_mkwrite().

But more importantly than preserving the exact time add the missing
call to sa_bulk_update().  This ensures that the znode modifications
are written to disk as part of the transaction.  Without this the
inode may mistaken rollback to the previous on-disk znode state.

Additionally, for mmap()'ed znodes explicitly set the atime, mtime,
and ctime on close using the up to date values in the inode.  This
is critical because writepage() may occur after close and on close
we need to ensure the values are correct.

Original-patch-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Richard Yao <ryao@cs.stonybrook.edu>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #764
2012-12-05 13:00:25 -08:00
Brian Behlendorf
4c837f0d93 Fix "allocating allocated segment" panic
Gunnar Beutner did all the hard work on this one by correctly
identifying that this issue is a race between dmu_sync() and
dbuf_dirty().

Now in all cases the caller is responsible for preventing this
race by making sure the zfs_range_lock() is held when dirtying
a buffer which may be referenced in a log record.  The mmap
case which relies on zfs_putpage() was not taking the range
lock.  This code was accidentally dropped when the function
was rewritten for the Linux VFS.

This patch adds the required range locking to zfs_putpage().

It also adds the missing ZFS_ENTER()/ZFS_EXIT() macros which
aren't strictly required due to the VFS holding a reference.
However, this makes the code more consistent with the upsteam
code and there's no harm in being extra careful here.

Original-patch-by: Gunnar Beutner <gunnar@beutner.name>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #541
2012-11-09 19:01:09 -08:00
Brian Behlendorf
8312c6df55 Clear PG_writeback for sync I/O error case
Commit 2b2861362f accidentally
introduced this issue by only conditionally registering the
commit callback in the async case.

The error handing code for the dmu_tx_assign() failure case
relied on there always being a registered commit callback to
clear the PG_writeback bit.  Since that is no longer strictly
true for the synchronous case we must explicitly invoke the
callback.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #961
2012-09-14 15:53:47 -07:00
Brian Behlendorf
5915791096 Move iput() after zfs_inode_update()
When replaying an unlink/remove operation via zfs_rmdir() the object
being removed will be instantiated by a call to zfs_dirent_lock().
This means that there is a single reference protecting the object.
Right before the call to zfs_inode_update() this reference is dropped
which may cause the object to be destroyed.  This will result in a
NULL dereference as shown by the stack trace is issue #782.

This likely isn't an issue during normal operation because there is
always an additional reference held on the object by the VFS.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #782
2012-09-12 14:22:52 -07:00
Brian Behlendorf
0ef0ff546e Switch KM_SLEEP to KM_PUSHPAGE
This warning indicates the incorrect use of KM_SLEEP in a call
path which must use KM_PUSHPAGE to avoid deadlocking in direct
reclaim.  See commit b8d06fca08
for additional details.

  SPL: Fixing allocation for task txg_sync (6093) which
  used GFP flags 0x297bda7c with PF_NOFS set

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #917
2012-09-04 16:00:06 -07:00
Brian Behlendorf
2b2861362f Clear PG_writeback after zil_commit() for sync I/O
When writing via ->writepage() the writeback bit was always cleared
as part of the txg commit callback.  However, when the I/O is also
being written synchronsously to the zil we can immediately clear this
bit.  There is no need to wait for the subsequent TXG sync since the
data is already safe on stable storage.

This has been observed to reduce the msync(2) delay from up to 5
seconds down 10s of miliseconds.  One workload which is expected
to benefit from this are the intermittent samba hands described
in issue #700.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #700
Closes #907
2012-08-30 20:16:28 -07:00
Brian Behlendorf
ebe7e575ea Add .zfs control directory
Add support for the .zfs control directory.  This was accomplished
by leveraging as much of the existing ZFS infrastructure as posible
and updating it for Linux as required.  The bulk of the core
functionality is now all there with the following limitations.

*) The .zfs/snapshot directory automount support requires a 2.6.37
   or newer kernel.  The exception is RHEL6.2 which has backported
   the d_automount patches.

*) Creating/destroying/renaming snapshots with mkdir/rmdir/mv
   in the .zfs/snapshot directory works as expected.  However,
   this functionality is only available to root until zfs
   delegations are finished.

      * mkdir - create a snapshot
      * rmdir - destroy a snapshot
      * mv    - rename a snapshot

The following issues are known defeciences, but we expect them to
be addressed by future commits.

*) Add automount support for kernels older the 2.6.37.  This should
   be possible using follow_link() which is what Linux did before.

*) Accessing the .zfs/snapshot directory via NFS is not yet possible.
   The majority of the ground work for this is complete.  However,
   finishing this work will require resolving some lingering
   integration issues with the Linux NFS kernel server.

*) The .zfs/shares directory exists but no futher smb functionality
   has yet been implemented.

Contributions-by: Rohan Puri <rohan.puri15@gmail.com>
Contributiobs-by: Andrew Barnes <barnes333@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #173
2012-03-22 13:03:47 -07:00
Etienne Dechamps
aec69371a6 Check permissions in zfs_space().
This isn't done on Solaris because on this OS zfs_space() can
only be called with an opened file handle. Since the addition of
zpl_truncate_range() this isn't the case anymore, so we need to
enforce access rights.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #334
2012-02-09 15:20:37 -08:00
Brian Behlendorf
0c5dde492f Allow multiple values per directory entry
When using zfs to back a Lustre filesystem it's advantageous to
to store a fid with the object id in the directory zap.  The only
technical impediment to doing this is that the zpl code expects
a single value in the zap per directory entry.

This change relaxes that requirement such that multiple entries
are allowed provided the first one is the object id.  The zpl
code will just ignore additional entries.  This allows the ZoL
count to mount datasets which are being used as Lustre server
backends.

Once the upstream feature flags support is merged in this change
should be updated to a read-only feature.  Until this occurs
other zfs implementations will not be able to read the zfs
filesystems created by Lustre.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2012-02-02 11:22:08 -08:00
Gunnar Beutner
a7b125e9a5 Fix a race condition in zfs_getattr_fast()
zfs_getattr_fast() was missing a lock on the ZFS superblock which
could result in zfs_znode_dmu_fini() clearing the zp->z_sa_hdl member
while zfs_getattr_fast() was accessing the znode. The result of this
would usually be a panic.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Fixes #431
2011-11-03 10:13:09 -07:00
Brian Behlendorf
3c0e5c0f45 Cleanup mmap(2) writes
While the existing implementation of .writepage()/zpl_putpage() was
functional it was not entirely correct.  In particular, it would move
dirty pages in to a clean state simply after copying them in to the
ARC cache.  This would result in the pages being lost if the system
were to crash enough though the Linux VFS believed them to be safe on
stable storage.

Since at the moment virtually all I/O, except mmap(2), bypasses the
page cache this isn't as bad as it sounds.  However, as hopefully
start using the page cache more getting this right becomes more
important so it's good to improve this now.

This patch takes a big step in that direction by updating the code
to correctly move dirty pages through a writeback phase before they
are marked clean.  When a dirty page is copied in to the ARC it will
now be set in writeback and a completion callback is registered with
the transaction.  The page will stay in writeback until the dmu runs
the completion callback indicating the page is on stable storage.
At this point the page can be safely marked clean.

This process is normally entirely asynchronous and will be repeated
for every dirty page.  This may initially sound inefficient but most
of these pages will end up in a few txgs.  That means when they are
eventually written to disk they should be nicely batched.  However,
there is room for improvement.  It may still be desirable to batch
up the pages in to larger writes for the dmu.  This would reduce
the number of callbacks and small 4k buffer required by the ARC.

Finally, if the caller requires that the I/O be done synchronously
by setting WB_SYNC_ALL or if ZFS_SYNC_ALWAYS is set.  Then the I/O
will trigger a zil_commit() to flush the data to stable storage.
At which point the registered callbacks will be run leaving the
date safe of disk and marked clean before returning from .writepage.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
2011-08-02 10:34:55 -07:00
Brian Behlendorf
a140dc5469 Fix mmap(2)/write(2)/read(2) deadlock
When modifing overlapping regions of a file using mmap(2) and
write(2)/read(2) it is possible to deadlock due to a lock inversion.
The zfs_write() and zfs_read() hooks first take the zfs range lock
and then lock the individual pages.  Conversely, when using mmap'ed
I/O the zpl_writepage() hook is called with the individual page
locks already taken and then zfs_putpage() takes the zfs range lock.

The most straight forward fix is to simply not take the zfs range
lock in the mmap(2) case.  The individual pages will still be locked
thus serializing access.  Updating the same region of a file with
write(2) and mmap(2) has always been a dodgy thing to do.  This change
at a minimum ensures we don't deadlock and is consistent with the
existing Linux semantics enforced by the VFS.

This isn't an issue under Solaris because the only range locking
performed will be with the zfs range locks.  It's up to each filesystem
to perform its own file locking.  Under Linux the VFS provides many
of these services.

It may be possible/desirable at a latter date to entirely dump the
existing zfs range locking and rely on the Linux VFS page locks.
However, for now its safest to perform both layers of locking until
zfs is more tightly integrated with the page cache.

Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Issue #302
2011-07-19 11:55:42 -07:00
Brian Behlendorf
057e8eee35 Improve fstat(2) performance
There is at most a factor of 3x performance improvement to be
had by using the Linux generic_fillattr() helper.  However, to
use it safely we need to ensure the values in a cached inode
are kept rigerously up to date.  Unfortunately, this isn't
the case for the blksize, blocks, and atime fields.  At the
moment the authoritative values are still stored in the znode.

This patch introduces an optimized zfs_getattr_fast() call.
The idea is to use the up to date values from the inode and
the blksize, block, and atime fields from the znode.  At some
latter date we should be able to strictly use the inode values
and further improve performance.

The remaining overhead in the zfs_getattr_fast() call can be
attributed to having to take the znode mutex.  This overhead is
unavoidable until the inode is kept strictly up to date.  The
the careful reader will notice the we do not use the customary
ZFS_ENTER()/ZFS_EXIT() macros.  These macro's are designed to
ensure the filesystem is not torn down in the middle of an
operation.  However, in this case the VFS is holding a
reference on the active inode so we know this is impossible.

=================== Performance Tests ========================

This test calls the fstat(2) system call 10,000,000 times on
an open file description in a tight loop.  The test results
show the zfs stat(2) performance is now only 22% slower than
ext4.  This is a 2.5x improvement and there is a clear long
term plan to get to parity with ext4.

filesystem    | test-1  test-2  test-3  | average | times-ext4
--------------+-------------------------+---------+-----------
ext4          |  7.785s  7.899s  7.284s |  7.656s | 1.000x
zfs-0.6.0-rc4 | 24.052s 22.531s 23.857s | 23.480s | 3.066x
zfs-faststat  |  9.224s  9.398s  9.485s |  9.369s | 1.223x

The second test is to run 'du' of a copy of the /usr tree
which contains 110514 files.  The test is run multiple times
both using both a cold cache (/proc/sys/vm/drop_caches) and
a hot cache.  As expected this change signigicantly improved
the zfs hot cache performance and doesn't quite bring zfs to
parity with ext4.

A little surprisingly the zfs cold cache performance is better
than ext4.  This can probably be attributed to the zfs allocation
policy of co-locating all the meta data on disk which minimizes
seek times.  By default the ext4 allocator will spread the data
over the entire disk only co-locating each directory.

filesystem    | cold    | hot
--------------+---------+--------
ext4          | 13.318s | 1.040s
zfs-0.6.0-rc4 |  4.982s | 1.762s
zfs-faststat  |  4.933s | 1.345s
2011-07-11 09:11:22 -07:00
Brian Behlendorf
285226eff3 Always allow non-user xattrs
Under Linux you may only disable USER xattrs.  The SECURITY,
SYSTEM, and TRUSTED xattr namespaces must always be available
if xattrs are supported by the filesystem.  The enforcement
of USER xattrs is performed in the zpl_xattr_user_* handlers.

Under Solaris there is only a single xattr namespace which
is managed globally.
2011-07-01 13:39:48 -07:00
Rohan Puri
a89c3e0bd5 Support mandatory locks (nbmand)
The Linux kernel already has support for mandatory locking.  This
change just replaces the Solaris mandatory locking calls with the
Linux equivilants.  In fact, it looks like this code could be
removed entirely because this checking is already done generically
in the Linux VFS.  However, for now we'll leave it in place even
if it is redundant just in case we missed something.

The original patch to update the code to support mandatory locking
was done by Rohan Puri.  This patch is an updated version which is
compatible with the previous mount option handling changes.

Original-Patch-by: Rohan Puri <rohan.puri15@gmail.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #222
Closes #253
2011-07-01 13:39:40 -07:00
Brian Behlendorf
2cf7f52bc4 Linux compat 2.6.39: mount_nodev()
The .get_sb callback has been replaced by a .mount callback
in the file_system_type structure.  When using the new
interface the caller must now use the mount_nodev() helper.

Unfortunately, the new interface no longer passes the vfsmount
down to the zfs layers.  This poses a problem for the existing
implementation because we currently save this pointer in the
super block for latter use.  It provides our only entry point
in to the namespace layer for manipulating certain mount options.

This needed to be done originally to allow commands like
'zfs set atime=off tank' to work properly.  It also allowed me
to keep more of the original Solaris code unmodified.  Under
Solaris there is a 1-to-1 mapping between a mount point and a
file system so this is a fairly natural thing to do.  However,
under Linux they many be multiple entries in the namespace
which reference the same filesystem.  Thus keeping a back
reference from the filesystem to the namespace is complicated.

Rather than introduce some ugly hack to get the vfsmount and
continue as before.  I'm leveraging this API change to update
the ZFS code to do things in a more natural way for Linux.
This has the upside that is resolves the compatibility issue
for the long term and fixes several other minor bugs which
have been reported.

This commit updates the code to remove this vfsmount back
reference entirely.  All modifications to filesystem mount
options are now passed in to the kernel via a '-o remount'.
This is the expected Linux mechanism and allows the namespace
to properly handle any options which apply to it before passing
them on to the file system itself.

Aside from fixing the compatibility issue, removing the
vfsmount has had the benefit of simplifying the code.  This
change which fairly involved has turned out nicely.

Closes #246
Closes #217
Closes #187
Closes #248
Closes #231
2011-07-01 13:36:39 -07:00