The order of operations is the following:
1. Try to open vdev by remembered path and guid.
2. If 1 failed, try to find vdev which guid matches and ignore the path.
3. If 2 failed this means either that the vdev we're looking for is gone
or that pool is being created and vdev doesn't contain proper guid yet.
To be able to handle pool creation we open vdev by path anyway.
Because of 3 it is possible that we open wrong vdev on import which can lead to
confusions.
The solution for this is to check spa_load_state. On pool creation it will be
equal to SPA_LOAD_NONE and we can open vdev only by path immediately and if it
is not equal to SPA_LOAD_NONE we first open by path+guid and when that fails,
we open by guid. We no longer open wrong vdev on import.
MFC after: 2 weeks
and 16 for metadata
- export L2ARC tunables as sysctls
- add several kstats to track L2ARC state more precisely
- avoid holding a contended lock when atomically incrementing a
contended counter (no lock protection needed for atomics)
providers for writing provokes huge traffic related to taste events send
by GEOM on close. This can lead to various problems with opening GEOM
providers that are created on top of other GEOM providers.
Reorted by: Kurt Touet <ktouet@gmail.com>, mr
Tested by: mr, Baginski Darren <kickbsd@ya.ru>
MFC after: 2 weeks
where the type is 32-bit. ZFS can handle 64-bit timestamp internally
but zfs_setattr() would check if the time value can fit, we change the
checking macros to match 64-bit timestamp if the platform supports it.
This change has some downsides like, while you can import zfs on 32-bit
platforms, the timestamp would overflow if they are out of the range.
This fixes the Y2.038K issue on platforms using 64-bit timestamps.
Reviewed by: pjd
MFC after: 1 month
revision 200726 and 200727). It looks like that the two revisions
were not applied in the right sequence, I found this when comparing
with the OpenSolaris code.
MFC after: 3 days
Reviewed by: mm@
for each vdev's status. Booting from a degraded vdev should now be
more robust.
Submitted by: Matt Reimer <mattjreimer at gmail.com>
Sponsored by: VPOP Technologies, Inc.
MFC after: 2 weeks
more efficiently.
Before this patch, in the worst case memory use would increase
exponentially on the number of drives in the raidz vdev.
Submitted by: Matt Reimer <mattjreimer@gmail.com>
Sponsored by: VPOP Technologies, Inc.
Silence from: dfr
making it possible for zpools created on OpenSolaris 2009.06 be used
on FreeBSD.
PR: kern/141800
Submitted by: mm
Reviewed by: pjd, trasz
Obtained from: OpenSolaris
MFC after: 2 weeks
both to not panic when fsync(2) is called for fifo on zfs
filedescriptor, and to actually fsync fifo inode to permanent storage.
PR: kern/141177
Reviewed by: pjd
MFC after: 1 week
for attaching when there is no metadata yet.
Before r200125 the order of looking for providers was wrong. It was:
1. Find provider by name.
2. Find provider by guid.
3. Find provider by name and guid.
Where it should have been:
1. Find provider by name and guid.
2. Find provider by guid.
3. Find provider by name.
MFC after: 1 week
calling scrub when pool is in a degraded state. It will try to taste ZVOLs,
which will lead to deadlock, as ZVOL will try to acquire the same locks as
replace/scrub is holding already.
We can't simply skip provider based on their GEOM class, because ZVOL can have
providers build on top of it and we need to skip those as well.
We do it by asking for ZFS::iszvol attribute. Any ZVOL-based provider will give
us positive answer and we have to skip those providers.
This way we remove possibility to create ZFS pools on top of ZVOLs, but it is
not very useful anyway.
I believe deadlock is still possible in some very complex situations like when
we have MD provider on top of UFS file on top of ZVOL. When we try to replace
dead component in the pool mentioned ZVOL is based on, there might be a
deadlock when ZFS will try to taste MD provider. There is no easy way to detect
that, but it isn't very common.
MFC after: 1 week
zfs_access() instead of vaccess() in this case as well.
- If VADMIN is specified with another V* flag (unlikely) call both
zfs_access() and vaccess() after spliting V* flags.
This fixes "dirtying snapshot!" panic.
PR: kern/139806
Reported by: Carl Chave <carl@chave.us>
In co-operation with: jh
MFC after: 3 days
- Teach it to read gang blocks. (essentially untested)
If you see "ZFS: gang block detected!", please let
me know, so we can either remove the printf if it
works, or fix it if it doesn't.
- If multiple partitions exist on a disk, probe them all.
We also need to reset dsk->start to 0 to read the right
sector here.
- With GPT, we can have 128 partitions.
- If the bootfs property has ever been set on a pool
it seems that it never goes away. zpool won't allow
you to add to the pool with the bootfs property set.
However, if you clear the property back to default
we end up getting 0 for the object number and read
a bogus block pointer and fail to boot.
- Fix some error printfs. The printf in the loader is
only capable of c,s and u formats.
- Teach printf how to display %llu
Reviewed by: dfr, jhb
MFC after: 2 weeks
received, we don't have to do it on every ENXIO error in I/O path.
Solaris has no GEOM so they have to handle it in a less clean way.
MFC after: 3 days
to set ownership and mode in the same setattr operation, the mode was
overwritten by secpolicy_vnode_setattr().
PR: kern/118320
Submitted by: Mark Thompson <info-gentoo@mark.thompson.bz>
MFC after: 3 days
unmount. In that case we cannot depend on the proper order of invalidating
vnodes, so we have to free resources when we have a chance.
PR: kern/139062
Reported by: trasz
MFC after: 3 days
well with forced unmounts when GFS vnodes are referenced.
- Make other preparations to GFS for forced unmounts.
PR: kern/139062
Reported by: trasz
MFC after: 3 days
following vnops will fail. This is very important, because without this change
vnode could be reclaimed at any point, even if we increased usecount. The only
way to ensure that vnode won't be reclaimed was to lock it, which would be very
hard to do in ZFS without changing a lot of code. With this change simply
increasing usecount is enough to be sure vnode won't be reclaimed from under
us. To be precise it can still be reclaimed but we won't be able to see it,
because every try to enter ZFS through VFS will result in EIO.
The only function that cannot return EIO, because it is needed for vflush() is
zfs_root(). Introduce ZFS_ENTER_NOERROR() macro that only locks
z_teardown_lock and never returns EIO.
MFC after: 3 days
gid to set group ownership and not process gid.
This was overlooked during v6 -> v13 switch.
PR: kern/139076
Reported by: Sean Winn <sean@gothic.net.au>
MFC after: 3 days
df(1) and mount(8) output. This is a bit smilar to OpenSolaris and follows
ZFS route of not listing snapshots by default with 'zfs list' command.
- Add UPDATING entry to note that ZFS snapshots are no longer visible in
mount(8) and df(1) output by default.
Reviewed by: kib
MFC after: 3 days
by just returning EOPNOTSUPP. This will allow NFS server to fall back to
regular READDIR.
Note that converting inode number to snapshot's vnode is expensive operation.
Snapshots are stored in AVL tree, but based on their names, not inode numbers,
so to convert inode to snapshot vnode we have to interate over all snalshots.
This is not a problem in OpenSolaris, because in their READDIRPLUS
implementation they use VOP_LOOKUP() on d_name, instead of VFS_VGET() on
d_fileno as we do.
PR: kern/125149
Reported by: Weldon Godfrey <wgodfrey@ena.com>
Analysis by: Jaakko Heinonen <jh@saunalahti.fi>
MFC after: 3 days
the same entry twice. This bug is not fixed yet, but leads to situation
where when try to access corrupted directory the kernel will panic.
Until the bug is properly fixed, try to recover from it and log that it
happened.
Reported by: marck
OpenSolaris bug: 6709336
MFC after: 3 days
- Be prepared for dbuf to disappear in zfs_reclaim_complete() and check if
z_dbuf field is NULL - this might happen in case of rollback or forced
unmount between zfs_freebsd_reclaim() and zfs_reclaim_complete().
- On forced unmount wait for all znodes to be destroyed - destruction can be
done asynchronously via zfs_reclaim_complete().
MFC after: 1 week
NULL, but also can point to dead vnode, take that into account.
PR: kern/132068
Reported by: Edward Fisk" <7ogcg7g02@sneakemail.com>, kris
Fix based on patch from: Jaakko Heinonen <jh@saunalahti.fi>
MFC after: 1 week
commands executed by unprivileged users. Action is not really taken, but it is
logged to pool history, which might be confusing.
Reported by: Denis Ahrens <denis@h3q.com>
MFC after: 3 days
Such situation is not supported.
This problem was triggered by something like this:
# zpool create tank da0
# zfs snapshot tank@snap
# cd /tank/.zfs/snapshot/snap (this will mount the snapshot)
# cd
# mount -u nosuid /tank/.zfs/snapshot/snap (refcount leak)
# zpool export tank
cannot export 'tank': pool is busy
MFC after: 1 week
we will fail to unmount it, but it won't be removed from the tree,
so in that case there is no need to reinsert it.
This fixes a panic reproducable in the following steps:
# zfs create tank/foo
# zfs snapshot tank/foo@snap
# cd /tank/foo/.zfs/snapshot/snap
# umount /tank/foo
panic: avl_find() succeeded inside avl_add()
Reported by: trasz
MFC after: 3 days
provider is closed should be ok.
When administrator requests to change ZVOL size do it immediately if ZVOL
is closed or do it on last ZVOL close.
PR: kern/136942
Requested by: Bernard Buri <bsd@ask-us.at>
MFC after: 1 week
will always return failure. Fix this by bringing userland implementation of
xdrmem_control() back. This allow 'zpool import' to work again.
Reported by: Thomas Backman <serenity@exscape.org>
Reviewed by: kmacy
Approved by: re (kib)
(..), calling readdir and looking for previous directory inode. In case of
.zfs/ directory this doesn't work, because .zfs/ is hidden by default, so it
won't be visible in readdir output.
Fix this by implementing VPTOCNP for snapshot directories, so __getcwd()
doesn't fail and getcwd() doesn't have to use readdir method.
This fixes /bin/pwd from within .zfs/snapshot/<name>/.
Suggested by: kib
Approved by: re (rwatson)
replace it with wrappers around our taskqueue(9).
To make it possible implement taskqueue_member() function which returns 1
if the given thread was created by the given taskqueue.
Approved by: re (kib)
initialized. Also destroy /dev/zfs before doing other deinitializations.
- Initialization through taskq is no longer needed and there is a race
where one of the zpool/zfs command loads zfs.ko and tries to do some work
immediately, but /dev/zfs is not there yet.
Reported by: pav
Approved by: re (kib)
panic when in zfs_fuid_create_cred() when userid is negative. It is
converted to unsigned value which makes IS_EPHEMERAL() macro to
incorrectly report that this is ephemeral ID. The most reasonable
solution for now is to always report that the given ID is not ephemeral.
PR: kern/132337
Submitted by: Matthew West <freebsd@r.zeeb.org>
Tested by: Thomas Backman <serenity@exscape.org>, Michael Reifenberger <mike@reifenberger.com>
Approved by: re (kib)
MFC after: 2 weeks
doesn't exist and user doesn't have write access to the file.
Without this fix, it returns bogus value instead of 0. For some
reason this didn't manifest on my kernel compiled with -O0.
PR: kern/136601
Submitted by: Jaakko Heinonen <jh at saunalahti dot fi>
Approved by: re (kib)
this change, ZFS uses SunOS Alternate Data Streams semantics - each
EA has its own permissions, which are set at EA creation time
and - unlike SunOS - invisible to the user and impossible to change.
From the user point of view, it's just broken: sometimes access
is granted when it shouldn't be, sometimes it's denied when
it shouldn't be.
This patch makes it behave just like UFS, i.e. depend on current
file permissions. Also, it fixes returned error codes (ENOATTR
instead of ENOENT) and makes listextattr(2) return 0 instead
of EPERM where there is no EA directory (i.e. the file never had
any EA).
Reviewed by: pjd (idea, not actual code)
Approved by: re (kib)
Currently dtrace_gethrtime uses formula similar to the following for
converting TSC ticks to nanoseconds:
rdtsc() * 10^9 / tsc_freq
The dividend overflows 64-bit type and wraps-around every 2^64/10^9 =
18446744073 ticks which is just a few seconds on modern machines.
Now we instead use precalculated scaling factor of
10^9*2^N/tsc_freq < 2^32 and perform TSC value multiplication separately
for each 32-bit half. This allows to avoid overflow of the dividend
described above.
The idea is taken from OpenSolaris.
This has an added feature of always scaling TSC with invariant value
regardless of TSC frequency changes. Thus the timestamps will not be
accurate if TSC actually changes, but they are always proportional to
TSC ticks and thus monotonic. This should be much better than current
formula which produces wildly different non-monotonic results on when
tsc_freq changes.
Also drop write-only 'cp' variable from amd64 dtrace_gethrtime_init()
to make it identical to the i386 twin.
PR: kern/127441
Tested by: Thomas Backman <serenity@exscape.org>
Reviewed by: jhb
Discussed with: current@, bde, gnn
Silence from: jb
Approved by: re (gnn)
MFC after: 1 week
PCATCH, to indicate that thread shall not be stopped upon receipt of
SIGSTOP until it reaches the kernel->usermode boundary.
Also change thread_single(SINGLE_NO_EXIT) to only stop threads at
the user boundary unconditionally.
Tested by: pho
Reviewed by: jhb
Approved by: re (kensmith)
The programmer was aware that alignment was not guaranteed in the
packed structure and used bzero() to NULL out the pointers.
However, on ia64, the compiler is quite agressive in finding ILP
and calls to bzero() are often replaced by simple assignments (i.e.
stores). Especially when the width or size in question corresponds
with a store instruction (i.e. st1, st2, st4 or st8).
The problem here is not a compiler bug. The address of the memory
to zero-out was given by '&packed->nvl_priv' and given the type of
the 'packed' pointer the compiler could assume proper alignment for
the replacement of bzero() with an 8-byte wide store to be valid.
The problem is with the programmer. The programmer knew that the
address did not have the alignment guarantees needed for a regular
assignment, but failed to inform the compiler of that fact. In
fact, the programmer told the compiler the opposite: alignment is
guaranteed.
The fix is to avoid using a pointer of type "nvlist_t *" and
instead use a "char *" pointer as the basis for calculating the
address. This tells the compiler that only 1-byte alignment can
be assumed and the compiler will either keep the bzero() call
or instead replace it with a sequence of byte-wise stores. Both
are valid.
Approved by: re (kib)
On amd64 KERNBASE/kernbase does not mean start of kernel memory.
This should fix a KASSERT panic in dtrace_copycheck when copyin*()
is used in D program.
Also make checks for user memory a bit stricter.
Reported by: Thomas Backman <serenity@exscape.org>
Submitted by: wxs (kaddr part)
Tested by: Thomas Backman (prototype), wxs
Reviewed by: alc (concept), jhb, current@
Aprroved by: jb (concept)
MFC after: 2 weeks
PR: kern/134408
vn_open_cred invocations shall not audit namei path.
In particular, specify VN_OPEN_NOAUDIT for dotdot lookup performed by
default implementation of vop_vptocnp, and for the open done for core
file. vn_fullpath is called from the audit code, and vn_open there need
to disable audit to avoid infinite recursion. Core file is created on
return to user mode, that, in particular, happens during syscall return.
The creation of the core file is audited by direct calls, and we do not
want to overwrite audit information for syscall.
Reported, reviewed and tested by: rwatson
by prefetched than helped. On i386 systems and systems with less than 4GB,
prefetch is now disabled by default. I've added a prefetch enable tunable, to
enable prefetching for those systems. The prefetch disable tunable will continue
to unconditionally disable prefetching.
The system hostname is now stored in prison0, and the global variable
"hostname" has been removed, as has the hostname_mtx mutex. Jails may
have their own host information, or they may inherit it from the
parent/system. The proper way to read the hostname is via
getcredhostname(), which will copy either the hostname associated with
the passed cred, or the system hostname if you pass NULL. The system
hostname can still be accessed directly (and without locking) at
prison0.pr_host, but that should be avoided where possible.
The "similar information" referred to is domainname, hostid, and
hostuuid, which have also become prison parameters and had their
associated global variables removed.
Approved by: bz (mentor)
Introduce for this operation the reverse NO_ADAPTIVE_SX option.
The flag SX_ADAPTIVESPIN to be passed to sx_init_flags(9) gets suppressed
and the new flag, offering the reversed logic, SX_NOADAPTIVE is added.
Additively implements adaptive spininning for sx held in shared mode.
The spinning limit can be handled through sysctls in order to be tuned
while the code doesn't reach the release, after which time they should
be dropped probabilly.
This change has made been necessary by recent benchmarks where it does
improve concurrency of workloads in presence of high contention
(ie. ZFS).
KPI breakage is documented by __FreeBSD_version bumping, manpage and
UPDATING updates.
Requested by: jeff, kmacy
Reviewed by: jeff
Tested by: pho
- add FreeBSD implementation of xdrmem_control needed by zfs
- have zfs define xdr_ops using FreeBSD's definition
- remove solaris xdr files from zfs compile
adds probes for mutexes, reader/writer and shared/exclusive locks to
gather contention statistics and other locking information for
dtrace scripts, the lockstat(1M) command and other potential
consumers.
Reviewed by: attilio jhb jb
Approved by: gnn (mentor)
about removing a few #ifdefs and providing compatibility wrappers and
VOP implementations to get and set an ACL; ZFS does ACL enforcement all
by itself.
Note that the VOPs are ifdefed out for now, so this change should be
a no-op.
Reviewed by: pjd
the VFS. Now all the VFS_* functions and relating parts don't want the
context as long as it always refers to curthread.
In some points, in particular when dealing with VOPs and functions living
in the same namespace (eg. vflush) which still need to be converted,
pass curthread explicitly in order to retain the old behaviour.
Such loose ends will be fixed ASAP.
While here fix a bug: now, UFS_EXTATTR can be compiled alone without the
UFS_EXTATTR_AUTOSTART option.
VFS KPI is heavilly changed by this commit so thirdy parts modules needs
to be recompiled. Bump __FreeBSD_version in order to signal such
situation.
virtualized instances of hostname and domainname, as well as a new top-level
virtualization struct vimage, which holds pointers to struct vnet and struct
vprocg. Struct vprocg is likely to become replaced in the near future with
a new jail management API import.
As a consequence of this change, change struct ucred to point to a struct
vimage, instead of directly pointing to a vnet.
Merge vnet / vimage / ucred refcounting infrastructure from p4 / vimage
branch.
Permit kldload / kldunload operations to be executed only from the default
vimage context.
This change should have no functional impact on nooptions VIMAGE kernel
builds.
Reviewed by: bz
Approved by: julian (mentor)
This is based on a fix that went in to opensolaris on March 9th. However, it uses a dedicated
thread instead of a Solaris' taskq to avoid doing a blocking memory allocation with the vnode
interlock held.
This fixes a long-time deadlock in ZFS. This is not, strictly speaking, an LOR. The spa_zio
thread releases a vnode, this calls in to vn_reclaim which in turn needs to acquire range locks
to sync dirty data out to disk. The range locks are already held by a user-level process waiting
on a condition variable that it the process is waiting on a spa_zio thread to signal it on. The
process could not be signalled because the spa_zio thread could not proceed.
The nature of this problem was not apparent due to ZFS locks opting out of witness which meant
that DDB did not know about the locks that were held by ZFS.
Reviewed by: pjd
MFC after: 7 days
interface as nmount(2). Three new system calls are added:
* jail_set, to create jails and change the parameters of existing jails.
This replaces jail(2).
* jail_get, to read the parameters of existing jails. This replaces the
security.jail.list sysctl.
* jail_remove to kill off a jail's processes and remove the jail.
Most jail parameters may now be changed after creation, and jails may be
set to exist without any attached processes. The current jail(2) system
call still exists, though it is now a stub to jail_set(2).
Approved by: bz (mentor)
the removal of NQNFS, but was left in in case it was required for NFSv4.
Since our new NFSv4 client and server can't use it for their
requirements, GC the old mechanism, as well as other unused lease-
related code and interfaces.
Due to its impact on kernel programming and binary interfaces, this
change should not be MFC'd.
Proposed by: jeff
Reviewed by: jeff
Discussed with: rmacklem, zach loafman @ isilon
events are:
nfsclient:accesscache:flush:done
nfsclient:accesscache:get:hit
nfsclient:accesscache:get:miss
nfsclient:accesscache:load:done
They pass the vnode, uid, and requested or loaded access mode (if any);
the load event may also report a load error if the RPC fails.
The attribute cache events are:
nfsclient:attrcache:flush:done
nfsclient:attrcache:get:hit
nfsclient:attrcache:get:miss
nfsclient:attrcache:load:done
They pass the vnode, optionally the vattr if one is present (hit or load),
and in the case of a load event, also a possible RPC error.
MFC after: 1 month
Sponsored by: Google, Inc.
provider. The NFS client exposes 'start' and 'done' probes for NFSv2
and NFSv3 RPCs when using the new RPC implementation, passing in the
vnode, mbuf chain, credential, and NFSv2 or NFSv3 procedure number.
For 'done' probes, the error number is also available.
Probes are named in the following way:
...
nfsclient:nfs2:write:start
nfsclient:nfs2:write:done
...
nfsclient:nfs3:access:start
nfsclient:nfs3:access:done
...
Access to the unmarshalled arguments is not easily available at this
point in the stack, but the passed probe arguments are sufficient to
to a lot of interesting things in practice. Technically, these probes
may cover multiple RPC retransmits, and even transactions if the
transaction ID change as a result of authentication failure or a
jukebox error from the server, but usefully capture the intent of a
single NFS request, such as access, getattr, write, etc.
Typical use might involve profiling RPC latency by system call, number
of RPCs, how often a getattr leads to a call to access, when failed
access control checks occur, etc. More detailed RPC information might
best be provided by adding a krpc provider. It would also be useful
to add NFS client probes for events such as the access cache or
attribute cache satisfying requests without an RPC.
Sponsored by: Google, Inc.
MFC after: 1 month