r209260:
Backout r207970 for now, it can lead to deadlocks.
Reported by: kan
r209261:
Turn off UMA allocations on all archs by default. It isn't stable even
on amd64.
Reported by: many
Approved by: re (kib)
Fix freeing space after deleting large files with holes.
OpenSolaris onnv revision: 9950:78fc41aa9bc5
Reviewed by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6792701)
Approved by: re (kib)
Fix ZIL close when doing zfs rollback or zfs receive on a mounted dataset.
The fix is a partial import and merge of OpenSolaris onnv revisions
8227:f7d7be9b1f56 and 9292:e112194b5b73.
Reviewed by: pjd, delphij (mentor)
Obtained from: OpenSolaris (Bug ID 6798298)
Approved by: re (kib)
r208454,r208455,r208458:
r207920:
Back out r205134. It is not stable.
r207934:
Add missing new line characters to the warnings.
r207936:
Even though r203504 eliminates taste traffic provoked by vdev_geom.c,
ZFS still likes to open all vdevs, close them and open them again,
which in turn provokes taste traffic anyway.
I don't know of any clean way to fix it, so do it the hard way - if we can't
open a provider for writing, just retry 5 times with 0.5 second pauses. This
should eliminate accidental races caused by other classes tasting providers
created on top of our vdevs.
Reported by: James R. Van Artsdalen <james-freebsd-fs2@jrv.org>
Reported by: Yuri Pankov <yuri.pankov@gmail.com>
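The retry described in r207936 amounts to something like the following sketch
(helper name, locking and error handling are simplified; the real code is in
vdev_geom.c):

    /*
     * Sketch only: retry opening the provider for writing a few times,
     * pausing between attempts, to ride out transient tasting by other
     * GEOM classes.  Topology locking is omitted for brevity.
     */
    static int
    vdev_geom_open_retry(struct g_consumer *cp)
    {
        int error, i;

        for (i = 0; i < 5; i++) {
            error = g_access(cp, 1, 1, 1);  /* read, write, exclusive */
            if (error == 0)
                break;
            /* Wait 0.5s (hz/2 ticks) before retrying. */
            pause("vdevopen", hz / 2);
        }
        return (error);
    }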
r207937:
I added the vfs_lowvnodes event, but it was only used for a short while and
is now totally unused. Remove it.
r207970:
When there is no memory or KVA, try to help by reclaiming some vnodes.
This helps with 'kmem_map too small' panics.
No objections from: kib
Tested by: Alexander V. Ribchansky <shurik@zk.informjust.ua>
r208142:
The whole point of having a dedicated worker thread for each leaf VDEV was to
avoid calling zio_interrupt() from the geom_up thread context. It turns out
that when a provider is forcibly removed from the system and we kill the
worker thread, there can still be some ZIOs pending. To complete the pending
ZIOs when there is no worker thread anymore, we still have to call
zio_interrupt() from the geom_up context. To avoid this race, just remove the
use of worker threads altogether.
This should be more or less fine: I originally thought zio_interrupt() did
more work, but it only makes a small UMA allocation with M_WAITOK.
It also saves one context switch per I/O request.
PR: kern/145339
Reported by: Alex Bakhtin <Alex.Bakhtin@gmail.com>
r208147:
Add a task structure to zio and use it instead of allocating one.
This eliminates the only place where we can sleep when calling zio_interrupt().
As a side effect this can actually improve performance a little, as we
allocate one less thing for every I/O.
Prodded by: kib
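The idea behind r208147, sketched in plain FreeBSD taskqueue(9) terms (the
real code goes through the OpenSolaris taskq compatibility shim and the
taskq_dispatch_safe() macro introduced in r208166 below; the structure and
function names here are illustrative):

    #include <sys/param.h>
    #include <sys/taskqueue.h>

    struct zio_sketch {
        /* ... other zio fields ... */
        struct task io_task;    /* pre-allocated, no malloc at I/O time */
    };

    static void
    zio_done_task(void *arg, int pending)
    {
        /* zio_interrupt()-style completion work would run here. */
    }

    static void
    zio_schedule(struct taskqueue *tq, struct zio_sketch *zio)
    {
        /*
         * Using the embedded task avoids the M_WAITOK allocation a
         * malloc'ed task would need, so this path never sleeps.
         */
        TASK_INIT(&zio->io_task, 0, zio_done_task, zio);
        taskqueue_enqueue(tq, &zio->io_task);
    }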
r208148:
Allow configuring UMA usage for ZIO data via the loader and turn it on by
default for amd64. On i386 I saw performance degradation when UMA was used,
but for amd64 it should help.
r208166:
Fix the userland build by making io_task available only to the kernel and by
providing a taskq_dispatch_safe() macro.
r208454:
Remove ZIO_USE_UMA from arc.c as well.
r208455:
ZIO_USE_UMA is no longer used.
r208458:
Create UMA zones unconditionally.
Import OpenSolaris revision 7837:001de5627df3
It includes the following changes:
- parallel reads in traversal code (Bug ID 6333409)
- faster traversal for zfs send (Bug ID 6418042)
- traversal code cleanup (Bug ID 6725675)
- fix for two scrub related bugs (Bug ID 6729696, 6730101)
- fix assertion in dbuf_verify (Bug ID 6752226)
- fix panic during zfs send with i/o errors (Bug ID 6577985)
- replace P2CROSS with P2BOUNDARY (Bug ID 6725680)
List of OpenSolaris Bug IDs:
6333409, 6418042, 6757112, 6725668, 6725675, 6725680,
6725698, 6729696, 6730101, 6752226, 6577985, 6755042
Approved by: pjd, delphij (mentor)
Obtained from: OpenSolaris (multiple Bug IDs)
- Fix broken symlinks on cross-platform zfs send/recv. [1]
- Enable zfs_ace_byteswap() on FreeBSD as it works just fine (tested between
amd64 and sparc64 in both directions by Michael Moll).
PR: 146272
Approved by: mm, pjd
Obtained from: OpenSolaris (onnv rev. 8283:1ca59f393041; Bug ID 6764193) [1]
Partially MFp4 #176265 by pjd@:
- Properly initialize and destroy system_taskq.
- Add a dummy implementation of taskq_create_proc().
Note: We do not currently use system_taskq in ZFS so this is mostly a
no-op at this time. Proper system_taskq initialization is required
by newer ZFS code.
Ok'ed by: pjd
r207068:
Allow modifying a directory's content even if the ZFS_NOUNLINK (SF_NOUNLINK,
sunlnk) flag is set. We only deny the directory's removal or rename.
PR: kern/143343
Reported by: marck
r207334:
Backport fix for 'zfs_znode_dmu_init: existing znode for dbuf' panic from OpenSolaris.
PR: kern/144402
Reported by: Alex Bakhtin <alex.bakhtin@gmail.com>
Tested by: Alex Bakhtin <alex.bakhtin@gmail.com>
Obtained from: OpenSolaris, Bug ID 6895088
r205134,r205231,r205253,r205264,r205346,r206051,r206667,r206792,r206793,
r206794,r206795,r206796,r206797:
r203504:
Open the provider for writing only when we find the right one. Opening too
many providers for writing provokes huge traffic related to taste events sent
by GEOM on close. This can lead to various problems with opening GEOM
providers that are created on top of other GEOM providers.
Reported by: Kurt Touet <ktouet@gmail.com>, mr
Tested by: mr, Baginski Darren <kickbsd@ya.ru>
r204067:
Update comment. We also look for GPT partitions.
r204073:
Add tunable and sysctl to skip hostid check on pool import.
r204101:
Don't set f_bsize to recordsize. It might confuse some software (like squid).
Submitted by: Alexander Zagrebin <alexz@visp.ru>
r204804:
Remove racy assertion.
Reported by: Attila Nagy <bra@fsn.hu>
Obtained from: OpenSolaris, Bug ID 6827260
r205079:
Remove bogus assertion.
Reported by: Johan Ström <johan@stromnet.se>
Obtained from: OpenSolaris, Bug ID 6920880
r205080:
Force commit to correct Bug ID:
Obtained from: OpenSolaris, Bug ID 6920880
r205132:
Don't bottleneck on acquiring the stream locks - this avoids a massive
drop-off in throughput with large numbers of simultaneous reads.
r205133:
Fix compilation under ZIO_USE_UMA.
r205134:
Make UMA the default allocator for ZFS buffers - this avoids
a great deal of contention in kmem_alloc.
r205231:
- reduce contention by breaking up ARC state locks into 16 for data
and 16 for metadata
- export L2ARC tunables as sysctls
- add several kstats to track L2ARC state more precisely
- avoid holding a contended lock when atomically incrementing a
contended counter (no lock protection needed for atomics)
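The last bullet of r205231 describes a pattern like the following (sketch; the
counter name is illustrative):

    #include <machine/atomic.h>

    /*
     * Before: take a (contended) mutex just to bump a counter:
     *     mtx_lock(&arc_mtx);
     *     stat_value++;
     *     mtx_unlock(&arc_mtx);
     *
     * After: the atomic increment needs no lock at all.
     */
    static uint64_t stat_value;

    static void
    stat_bump(void)
    {
        atomic_add_64(&stat_value, 1);
    }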
r205253:
Use CACHE_LINE_SIZE instead of hardcoding 128 for the lock pad.
Pointed out by Marius Nuennerich and jhb@.
r205264:
- cache line align arcs_lock array (h/t Marius Nuennerich)
- fix ARCS_LOCK_PAD to use architecture defined CACHE_LINE_SIZE
- cache line align buf_hash_table ht_locks array
r205346:
The same code is used to import and to create pool.
The order of operations is the following:
1. Try to open vdev by remembered path and guid.
2. If 1 failed, try to find vdev which guid matches and ignore the path.
3. If 2 failed this means either that the vdev we're looking for is gone
or that pool is being created and vdev doesn't contain proper guid yet.
To be able to handle pool creation we open vdev by path anyway.
Because of 3 it is possible that we open wrong vdev on import which can lead to
confusions.
The solution for this is to check spa_load_state. On pool creation it will be
equal to SPA_LOAD_NONE and we can open vdev only by path immediately and if it
is not equal to SPA_LOAD_NONE we first open by path+guid and when that fails,
we open by guid. We no longer open wrong vdev on import.
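The resulting open order from r205346, as a rough sketch (the helper functions
are hypothetical stand-ins for the actual vdev_geom.c code):

    static struct g_consumer *
    vdev_geom_locate(vdev_t *vd)
    {
        struct g_consumer *cp;

        if (vd->vdev_spa->spa_load_state == SPA_LOAD_NONE) {
            /* Pool creation: no guid in the label yet, trust the path. */
            return (vdev_attach_by_path(vd));
        }
        /* Import: prefer an exact path+guid match... */
        cp = vdev_attach_by_path_guid(vd);
        if (cp == NULL) {
            /* ...and fall back to searching by guid alone. */
            cp = vdev_attach_by_guid(vd);
        }
        return (cp);
    }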
r206051:
IOCPARM_MAX defines the maximum size of a structure that can be passed
directly to ioctl(2). Because of how an ioctl command is built using the
_IO*() macros, we have only 13 bits to encode the structure size, so the
structure can be at most 8kB-1.
Currently we define IOCPARM_MAX as PAGE_SIZE.
This is IMHO wrong for three main reasons:
1. It is confusing on archs with a page size larger than 8kB (not really
sure if we support such archs (sparc64?)): even if PAGE_SIZE is
bigger than 8kB, we can't encode anything larger in an ioctl
command.
2. It is a waste. Why should the structure be limited to 4kB on most archs
if we have 13 bits dedicated to the size, not 12?
3. It shouldn't depend on architecture and page size. Why should my ioctl
command work on one arch but not on another?
Increase IOCPARM_MAX to 8kB and make it independent of PAGE_SIZE and the
architecture it is compiled for. This lets all archs use all 13 bits for the
size. Note that this doesn't mean we will copy more on every ioctl(2)
call; we still copyin(9)/copyout(9) only the exact number of bytes encoded in
the ioctl command.
The practical use for this change is ZFS: the zfs_cmd_t structure used for
ZFS ioctls is larger than 4kB.
Silence on: arch@
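For reference, the ioctl command encoding in <sys/ioccom.h> that the 13-bit
limit comes from (slightly simplified, reflecting the post-r206051
IOCPARM_MAX):

    /* 13 bits of the ioctl command encode the parameter structure size. */
    #define IOCPARM_SHIFT   13
    #define IOCPARM_MASK    ((1 << IOCPARM_SHIFT) - 1)  /* up to 8191 bytes */
    #define IOCPARM_MAX     (1 << IOCPARM_SHIFT)        /* 8kB, PAGE_SIZE-independent */
    #define IOCPARM_LEN(x)  (((x) >> 16) & IOCPARM_MASK)

    /* The _IO*() macros pack the size into bits 16..28 of the command word: */
    #define _IOC(inout, group, num, len) \
        ((unsigned long)((inout) | (((len) & IOCPARM_MASK) << 16) | \
        ((group) << 8) | (num)))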
r206667:
Fix 3-way deadlock that can happen because of ZFS and vnode lock
order reversal.
thread0 (vfs_fhtovp)    thread1 (vop_getattr)    thread2 (zfs_recv)
--------------------    ---------------------    ------------------
                        vn_lock
rrw_enter_read
                                                 rrw_enter_write (hangs)
                        rrw_enter_read (hangs)
vn_lock (hangs)
Reported by: Attila Nagy <bra@fsn.hu>
r206792:
Set ARC_L2_WRITING on L2ARC header creation.
Obtained from: OpenSolaris
r206793:
Remove racy assertion.
Obtained from: OpenSolaris
r206794:
Extend locks scope to match OpenSolaris.
r206795:
Add missing list and lock destruction.
r206796:
Style fixes.
r206797:
Restore previous order.
On FreeBSD, time_t is 64-bit for all platforms except i386 and powerpc,
where the type is 32-bit. ZFS can handle 64-bit timestamps internally, but
zfs_setattr() would check whether the time value fits; we change the checking
macros to accept 64-bit timestamps if the platform supports them.
This change has a downside: while you can still import zfs on 32-bit
platforms, the timestamps will overflow there if they are out of range.
This fixes the Y2.038K issue on platforms using 64-bit timestamps.
Reviewed by: pjd
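A sketch of the kind of check involved (macro name and platform test are
illustrative, not necessarily the exact code):

    #include <sys/stdint.h>

    /*
     * Illustrative only: reject timestamps the platform's time_t cannot
     * represent.  On 64-bit time_t platforms the check effectively always
     * passes, which is what makes the full on-disk range usable there.
     */
    #if defined(__i386__) || defined(__powerpc__)       /* 32-bit time_t */
    #define TIMESPEC_OVERFLOW(ts)   \
        ((ts)->tv_sec < INT32_MIN || (ts)->tv_sec > INT32_MAX)
    #else                                               /* 64-bit time_t */
    #define TIMESPEC_OVERFLOW(ts)   \
        ((ts)->tv_sec < INT64_MIN || (ts)->tv_sec > INT64_MAX)
    #endif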
Teach the (gpt)zfsboot and zfsloader raidz code to use its buffers
more efficiently.
Before this patch, in the worst case memory use would increase
exponentially with the number of drives in the raidz vdev.
Submitted by: Matt Reimer <mattjreimer@gmail.com>
Sponsored by: VPOP Technologies, Inc.
Silence from: dfr
Instead of assuming all vdevs are healthy, check the newest vdev label
for each vdev's status. Booting from a degraded vdev should now be
more robust.
Submitted by: Matt Reimer <mattjreimer at gmail.com>
Sponsored by: VPOP Technologies, Inc.
Apply OpenSolaris revision 8021:b8fe9660eb2d, which brings our zpool
to version 14, making it possible for zpools created on OpenSolaris
2009.06 to be used on FreeBSD.
PR: kern/141800
Submitted by: mm
Reviewed by: pjd, trasz
Obtained from: OpenSolaris onnv-gate
---snip---
Prevent paging pressure from draining arc too much
- always drain arc if above arc_c_max
- never drain arc if arc is below arc_c_max
---snip---
Apply fix for Solaris bug 6764159: restore_object() makes a call
that can block while having a tx open but not yet committed
(onnv revision 7994)
Submitted by: mm
Approved by: pjd
Obtained from: OpenSolaris
r200124:
Avoid using an additional variable for storing an error if we are not going
to do anything with it.
r200126:
Fix a deadlock that occurs when ZVOLs are present and we replace a dead
component or run a scrub while the pool is in a degraded state. ZFS will try
to taste ZVOLs, which leads to a deadlock, as the ZVOL tries to acquire the
same locks that replace/scrub is already holding.
We can't simply skip providers based on their GEOM class, because a ZVOL can
have providers built on top of it and we need to skip those as well.
We do it by asking for the ZFS::iszvol attribute. Any ZVOL-based provider
will give a positive answer and we skip those providers.
This removes the possibility of creating ZFS pools on top of ZVOLs, but that
is not very useful anyway.
I believe a deadlock is still possible in some very complex situations, like
when we have an MD provider on top of a UFS file on top of a ZVOL. When we
try to replace a dead component in the pool that ZVOL is based on, there
might be a deadlock when ZFS tries to taste the MD provider. There is no easy
way to detect that, but it isn't very common.
r200125,r200158:
Fix order of looking for providers.
Before r200125 the order of looking for providers was wrong. It was:
1. Find provider by name.
2. Find provider by guid.
3. Find provider by name and guid.
Where it should have been:
1. Find provider by name and guid.
2. Find provider by guid.
3. Find provider by name.
Correct some issues with zfs boot.
- Teach it to read gang blocks. (essentially untested)
If you see "ZFS: gang block detected!", please let
me know, so we can either remove the printf if it
works, or fix it if it doesn't.
- If multiple partitions exist on a disk, probe them all.
We also need to reset dsk->start to 0 to read the right
sector here.
- With GPT, we can have 128 partitions.
- If the bootfs property has ever been set on a pool
it seems that it never goes away. zpool won't allow
you to add to the pool with the bootfs property set.
However, if you clear the property back to default
we end up getting 0 for the object number and read
a bogus block pointer and fail to boot.
- Fix some error printfs. The printf in the loader is
only capable of c,s and u formats.
- Teach printf how to display %llu
This patch addresses an overflow in the zfs boot code and allows
users to boot from zfs raidz volumes. This has been tested by a number
of users and does not impact those who are not booting from zfs raidz
volumes.
Submitted by: Matt Reimer <mattjreimer@gmail.com>
r198703:
- zfs_zaccess() can handle VAPPEND too, so map V_APPEND to VAPPEND and call
zfs_access() instead of vaccess() in this case as well.
- If VADMIN is specified with another V* flag (unlikely), call both
zfs_access() and vaccess() after splitting the V* flags.
This fixes "dirtying snapshot!" panic.
PR: kern/139806
Reported by: Carl Chave <carl@chave.us>
In co-operation with: jh
r199156:
Avoid passing invalid mountpoint to getnewvnode().
Reported by: rwatson
Tested by: rwatson
r199157:
Be careful which vattr fields are set during setattr replay.
Without this fix strange things can appear after an unclean shutdown, like
files with mode set to 07777.
Reported by: des
r197831:
Fix a situation where the Mac OS X NFS client creates a file and, when it
tries to set ownership and mode in the same setattr operation, the mode gets
overwritten by secpolicy_vnode_setattr().
PR: kern/118320
Submitted by: Mark Thompson <info-gentoo@mark.thompson.bz>
r197842:
Fix white-spaces.
r197843:
On FreeBSD it is enough to report provider removal when an orphan event is
received; we don't have to do it on every ENXIO error in the I/O path.
Solaris has no GEOM, so it has to handle this in a less clean way.
r197860:
The file system owner is the one whose uid matches and whose jail matches.
r197861:
Allow file system owner to modify system flags if securelevel permits.
Approved by: re (kib)
Return EOPNOTSUPP instead of EINVAL when doing chflags(2) over an old
format ZFS, as defined in the manual page.
Submitted by: pjd (response of my original patch but bugs are mine)
Approved by: re (kib)
r197512, r197513, r197514, r197515, r197525:
r197287:
Purge the namecache for the file system being rolled back, so it doesn't
point at invalid vnodes after the rollback, which resulted in EIO errors when
trying to access files that were in the namecache.
Reported by: des
r197289:
Purge the file system's namecache when receiving an incremental stream and
rolling back to it.
r197351:
Purge namecache in the same place OpenSolaris does.
r197426:
Restore BSD behaviour: when creating a new directory entry, use the parent
directory's gid to set group ownership, not the process gid.
This was overlooked during the v6 -> v13 switch.
PR: kern/139076
Reported by: Sean Winn <sean@gothic.net.au>
r197458:
Close a race in zfs_zget(). We have to increase the usecount first and then
check for the VI_DOOMED flag. Before this change the vnode could be reclaimed
between checking for the flag and increasing the usecount.
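The ordering change in r197458, as a simplified sketch (the real zfs_zget()
also deals with the vnode interlock and retries the lookup):

    /*
     * Racy (before): the vnode could be reclaimed between the check and
     * the reference:
     *
     *     if ((vp->v_iflag & VI_DOOMED) == 0)
     *         vref(vp);
     *
     * Fixed (after): take the reference first, then look at VI_DOOMED and
     * back out if the vnode is already dead:
     */
        vref(vp);
        if (vp->v_iflag & VI_DOOMED) {
            vrele(vp);
            goto retry;     /* redo the lookup */
        }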
r197459:
Before calling vflush(FORCECLOSE), mark the file system as unmounted so the
following vnops will fail. This is very important, because without this
change a vnode could be reclaimed at any point, even if we increased its
usecount. The only way to ensure that a vnode won't be reclaimed was to lock
it, which would be very hard to do in ZFS without changing a lot of code.
With this change, simply increasing the usecount is enough to be sure the
vnode won't be reclaimed from under us. To be precise, it can still be
reclaimed, but we won't be able to see it, because every attempt to enter ZFS
through VFS will result in EIO.
The only function that cannot return EIO, because it is needed by vflush(),
is zfs_root(). Introduce a ZFS_ENTER_NOERROR() macro that only locks
z_teardown_lock and never returns EIO.
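The macros involved look roughly like this (paraphrased sketch, not the exact
source):

    /* Normal entry point: fail with EIO once the file system is torn down. */
    #define ZFS_ENTER(zfsvfs) do {                                  \
        rrw_enter(&(zfsvfs)->z_teardown_lock, RW_READER, FTAG);     \
        if ((zfsvfs)->z_unmounted) {                                \
            ZFS_EXIT(zfsvfs);                                       \
            return (EIO);                                           \
        }                                                           \
    } while (0)

    /* For zfs_root() only: vflush() still needs it, so never return EIO. */
    #define ZFS_ENTER_NOERROR(zfsvfs)                               \
        rrw_enter(&(zfsvfs)->z_teardown_lock, RW_READER, FTAG)

    #define ZFS_EXIT(zfsvfs)                                        \
        rrw_exit(&(zfsvfs)->z_teardown_lock, FTAG)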
r197497:
Switch to fletcher4 as the default checksum algorithm. Fletcher2 was proven to
be a bit weak and OpenSolaris also switched to fletcher4.
r197498: head/cddl/contrib/opensolaris
Fletcher4 is now the default checksum algorithm.
r197512:
- Don't depend on the value returned by gfs_*_inactive(); it doesn't work
well with forced unmounts when GFS vnodes are referenced.
- Make other preparations to GFS for forced unmounts.
PR: kern/139062
Reported by: trasz
r197513:
Use the traverse() function to find and return the mount point's vnode
instead of the covered vnode when the snapshot is already mounted.
r197514:
On lookup error, VFS expects *vpp to be set to NULL; be sure to do that.
r197515:
Handle cases where virtual (GFS) vnodes are referenced when doing forced
unmount. In that case we cannot depend on the proper order of invalidating
vnodes, so we have to free resources when we have a chance.
PR: kern/139062
Reported by: trasz
r197525:
Ensure that tv_sec is between INT32_MIN and INT32_MAX, so ZFS won't object.
This completes the fix from r185586.
PR: kern/139059
Reported by: Daniel Braniss <danny@cs.huji.ac.il>
Submitted by: Jaakko Heinonen <jh@saunalahti.fi>
Tested by: Daniel Braniss <danny@cs.huji.ac.il>
Approved by: re (kib)
r196943,r196944,r196947,r196950,r196953,r196954,r196965,r196978,r196979,
r196980,r196982,r196985,r196992,r197131,r197133,r197150,r197151,r197152,
r197153,r197167,r197172,r197177,r197200,r197201:
r196456:
- Give minclsyspri and maxclsyspri real values (consulted with kmacy).
- Honour 'pri' argument for thread_create().
r196457:
Set priority of vdev_geom threads and zvol threads to PRIBIO.
r196458:
- Hide ZFS kernel threads under zfskern process.
- Use better (shorter) threads names:
'zvol:worker zvol/tank/vol00' -> 'zvol tank/vol00'
'vdev:worker da0' -> 'vdev da0'
r196662:
Add missing mountpoint vnode locking.
This fixes a panic on an assertion with DEBUG_VFS_LOCKS and vfs.usermount=1
when a regular user tries to mount a dataset he owns.
r196702:
Remove empty directory.
r196703:
Backport the 'dirtying dbuf' panic fix from newer ZFS version.
Reported by: Thomas Backman <serenity@exscape.org>
r196919:
bzero() the on-stack argument, so mutex_init() won't misinterpret garbage on
the stack as meaning that the lock is already initialized.
PR: kern/135480
Reported by: Emil Mikulic <emikulic@gmail.com>
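The pattern behind r196919, as a sketch using the Solaris-compat mutex API
(function and variable names are illustrative):

    static void
    example_init(void)
    {
        kmutex_t mtx;   /* on the stack: contents are indeterminate */

        /*
         * Clear the structure before handing it to mutex_init(); otherwise
         * leftover stack garbage can look like an already-initialized lock.
         */
        bzero(&mtx, sizeof(mtx));
        mutex_init(&mtx, NULL, MUTEX_DEFAULT, NULL);

        /* ... use the mutex ... */

        mutex_destroy(&mtx);
    }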
r196927:
Changing a provider's size is not really supported by GEOM, but doing so
while the provider is closed should be OK.
When the administrator requests a ZVOL size change, do it immediately if the
ZVOL is closed, or defer it to the last ZVOL close.
PR: kern/136942
Requested by: Bernard Buri <bsd@ask-us.at>
r196928:
Teach zdb(8) how to obtain GEOM provider size.
PR: kern/133134
Reported by: Philipp Wuensche <cryx-freebsd@h3q.com>
r196943:
- Avoid holding a mutex around M_WAITOK allocations.
- Add locking for the mnt_opt field.
r196944:
Don't recheck ownership on update mount. This eliminates a LOR between
vfs_busy() and the mount mutex. We check ownership in vfs_domount() anyway.
Noticed by: kib
Reviewed by: kib
r196947:
Defer thread start until we set priority.
Reviewed by: kib
r196950:
Fix detection of a file system being shared. Now the zfs
unshare/destroy/rename commands will properly remove exported file systems.
r196953:
When a snapshot's mount point is busy (for example, we are still in it) we
will fail to unmount it, but it won't be removed from the tree, so in that
case there is no need to reinsert it.
Reported by: trasz
r196954:
If we have to use avl_find(), optimize a bit and use avl_insert() instead of
avl_add() (the latter is actually a wrapper around avl_find() + avl_insert()).
Fix a similar case in code that is currently commented out.
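For reference, the avl(9)-style pattern r196954 refers to looks like this
(sketch):

    #include <sys/avl.h>

    /*
     * avl_add() is essentially avl_find() + avl_insert(); if we already had
     * to call avl_find() ourselves, reuse its "where" cookie and call
     * avl_insert() directly instead of searching the tree a second time.
     */
    static void
    insert_if_missing(avl_tree_t *tree, void *node)
    {
        avl_index_t where;

        if (avl_find(tree, node, &where) == NULL)
            avl_insert(tree, node, where);
    }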
r196965:
Fix a reference count leak in the case where a snapshot's mount point is
updated.
r196978:
Call ZFS_EXIT() after locking the vnode.
r196979:
On FreeBSD we don't have to look for the snapshot's mount point,
because the fhtovp method is already called with the proper mount point.
r196980:
When we automatically mount a snapshot, we want to return the vnode of the
mount point from the lookup, not the covered vnode. This is one of the fixes
for using .zfs/ over NFS.
r196982:
We don't export individual snapshots, so the mnt_export field in a snapshot's
mount point is NULL. That's why, when we try to access snapshots over NFS,
we use the mnt_export field from the parent file system.
r196985:
Only log successful commands! Without this fix we log even unsuccessful
commands executed by unprivileged users. The action is not really taken, but
it is logged to the pool history, which might be confusing.
Reported by: Denis Ahrens <denis@h3q.com>
r196992:
Implement __assert() for the Solaris-specific code. Until now the Solaris
code was using the Solaris prototype for __assert(), but FreeBSD's
implementation. The two take different arguments, so we were either
core-dumping in assert() or printing garbage.
Reported by: avg
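For reference, the prototype mismatch behind r196992 (sketch; the function
name below is made up so it does not clash with libc, the real compat code
provides __assert() itself):

    #include <stdio.h>
    #include <stdlib.h>

    /* FreeBSD libc:  __assert(func, file, line, failedexpr) */
    /* Solaris libc:  __assert(assertion, file, line)        */

    /*
     * Calling the FreeBSD implementation through the Solaris prototype
     * shifts every argument by one, so a Solaris-flavoured implementation
     * is provided for the Solaris-specific code instead:
     */
    void
    solaris_assert_fail(const char *expr, const char *file, int line)
    {
        (void)fprintf(stderr, "Assertion failed: (%s), file %s, line %d.\n",
            expr, file, line);
        abort();
    }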
r197131:
Tighten up the check for the race in zfs_zget() - ZTOV(zp) can not only be
NULL, it can also point to a dead vnode; take that into account.
PR: kern/132068
Reported by: Edward Fisk <7ogcg7g02@sneakemail.com>, kris
Fix based on patch from: Jaakko Heinonen <jh@saunalahti.fi>
r197133:
- Protect reclaim with z_teardown_inactive_lock.
- Be prepared for the dbuf to disappear in zfs_reclaim_complete() and check
whether the z_dbuf field is NULL - this might happen in case of a rollback or
forced unmount between zfs_freebsd_reclaim() and zfs_reclaim_complete().
- On forced unmount, wait for all znodes to be destroyed - destruction can
happen asynchronously via zfs_reclaim_complete().
r197150:
There is a bug where mze_insert() can trigger an assert() by inserting
the same entry twice. This bug is not fixed yet, but it leads to a situation
where the kernel panics when we try to access the corrupted directory.
Until the bug is properly fixed, try to recover from it and log that it
happened.
Reported by: marck
OpenSolaris bug: 6709336
r197151:
Be sure not to overflow struct fid.
r197152:
Extend scope of the z_teardown_lock lock for consistency and "just in case".
r197153:
When zfs.ko is compiled with debug, make sure that znode and vnode point at
each other.
r197167:
Work around the READDIRPLUS problem with the .zfs/ and .zfs/snapshot/
directories by just returning EOPNOTSUPP. This allows the NFS server to fall
back to a regular READDIR.
Note that converting an inode number to a snapshot's vnode is an expensive
operation. Snapshots are stored in an AVL tree, but keyed by their names, not
inode numbers, so to convert an inode to a snapshot vnode we have to iterate
over all snapshots.
This is not a problem in OpenSolaris, because their READDIRPLUS
implementation uses VOP_LOOKUP() on d_name, instead of VFS_VGET() on
d_fileno as we do.
PR: kern/125149
Reported by: Weldon Godfrey <wgodfrey@ena.com>
Analysis by: Jaakko Heinonen <jh@saunalahti.fi>
r197172:
Add missing \n.
Reported by: marck
r197177:
Support both cases: when the snapshot is already mounted and when it is not
yet mounted.
r197200:
Modify mount(8) to skip MNT_IGNORE file systems by default, just like df(1)
does. This is not a POLA violation, because no file system in the base
currently uses MNT_IGNORE, although ZFS snapshots will be mounted with
MNT_IGNORE after the next commit.
Reviewed by: kib
r197201:
- Mount ZFS snapshots with the MNT_IGNORE flag, so they are not visible in
regular df(1) and mount(8) output. This is a bit similar to OpenSolaris and
follows the ZFS approach of not listing snapshots by default with the
'zfs list' command.
- Add UPDATING entry to note that ZFS snapshots are no longer visible in
mount(8) and df(1) output by default.
Reviewed by: kib
Approved by: re (bz)
Our libc doesn't implement the control method for XDR (only the kernel does)
and it always returns failure. Fix this by bringing the userland
implementation of xdrmem_control() back. This allows 'zpool import' to work
again.
Reported by: Thomas Backman <serenity@exscape.org>
Reviewed by: kmacy
Approved by: re (kib)