brings in support for an optional intent log which eliminates the need
for background fsck on unclean shutdown.
Sponsored by: iXsystems, Yahoo!, and Juniper.
With help from: McKusick and Peter Holm
Introduce the new flag KNF_NOKQLOCK to allow event callers to be called
without KQ_LOCK mtx held.
- Modify VFS knote calls to always use KNF_NOKQLOCK flag. This is required
for ZFS as its getattr implementation may sleep.
Approved by: re (rwatson)
Reviewed by: kib
MFC after: 2 weeks
the VFS. Now all the VFS_* functions and relating parts don't want the
context as long as it always refers to curthread.
In some points, in particular when dealing with VOPs and functions living
in the same namespace (eg. vflush) which still need to be converted,
pass curthread explicitly in order to retain the old behaviour.
Such loose ends will be fixed ASAP.
While here fix a bug: now, UFS_EXTATTR can be compiled alone without the
UFS_EXTATTR_AUTOSTART option.
VFS KPI is heavilly changed by this commit so thirdy parts modules needs
to be recompiled. Bump __FreeBSD_version in order to signal such
situation.
filesystem supports additional operations using shared vnode locks.
Currently this is used to enable shared locks for open() and close() of
read-only file descriptors.
- When an ISOPEN namei() request is performed with LOCKSHARED, use a
shared vnode lock for the leaf vnode only if the mount point has the
extended shared flag set.
- Set LOCKSHARED in vn_open_cred() for requests that specify O_RDONLY but
not O_CREAT.
- Use a shared vnode lock around VOP_CLOSE() if the file was opened with
O_RDONLY and the mountpoint has the extended shared flag set.
- Adjust md(4) to upgrade the vnode lock on the vnode it gets back from
vn_open() since it now may only have a shared vnode lock.
- Don't enable shared vnode locks on FIFO vnodes in ZFS and UFS since
FIFO's require exclusive vnode locks for their open() and close()
routines. (My recent MPSAFE patches for UDF and cd9660 already included
this change.)
- Enable extended shared operations on UFS, cd9660, and UDF.
Submitted by: ups
Reviewed by: pjd (ZFS bits)
MFC after: 1 month
vfsopt and the vfs_buildopts function public, and add some new fields
to struct vfsopt (pos and seen), and new functions vfs_getopt_pos and
vfs_opterror.
Further extend the interface to allow reading options from the kernel
in addition to sending them to the kernel, with vfs_setopt and related
functions.
While this allows the "name=value" option interface to be used for more
than just FS mounts (planned use is for jails), it retains the current
"vfsopt" name and <sys/mount.h> requirement.
Approved by: bz (mentor)
- threadA runs vfs_rel(mp1)
- threadB does unmount the mp1 fs, sets MNTK_UNMOUNT and drop MNT_ILOCK()
- threadA runs vfs_busy(mp1) and, as long as, MNTK_UNMOUNT is set, sleeps
waiting for threadB to complete the unmount
- threadB, in vfs_mount_destroy(), finds mnt_lock > 0 and sleeps waiting
for the refcount to expire.
Fix the deadlock by adding a flag called MNTK_REFEXPIRE which signals the
unmounter is waiting for mnt_ref to expire.
The vfs_busy contenders got awake, fails, and if they retry the
MNTK_REFEXPIRE won't allow them to sleep again.
2) Simplify significantly the code of vfs_mount_destroy() trimming
unnecessary codes:
- as long as any reference exited, it is no-more possible to have
write-op (primarty and secondary) in progress.
- it is no needed to drop and reacquire the mount lock.
- filling the structures with dummy values is unuseful as long as
it is going to be freed.
Tested by: pho, Andrea Barberio <insomniac at slackware dot it>
Discussed with: kib
to the fs, but before a vnode on the fs is locked, unmount may free fs
structures, causing access to destroyed data and freed memory.
Introduce a vfs_busymp() function that looks up and busies found
fs while mountlist_mtx is held. Use it in nfsrv_fhtovp() and in the
implementation of the handle syscalls.
Two other uses of the vfs_getvfs() in the vfs_subr.c, namely in
sysctl_vfs_ctl and vfs_getnewfsid seems to be ok. In particular,
sysctl_vfs_ctl is protected by Giant by being a non-sleeping sysctl
handler, that prevents Giant-locked unmount code to interfere with it.
Noted by: tegge
Reviewed by: dfr
Tested by: pho
MFC after: 1 month
This bring huge amount of changes, I'll enumerate only user-visible changes:
- Delegated Administration
Allows regular users to perform ZFS operations, like file system
creation, snapshot creation, etc.
- L2ARC
Level 2 cache for ZFS - allows to use additional disks for cache.
Huge performance improvements mostly for random read of mostly
static content.
- slog
Allow to use additional disks for ZFS Intent Log to speed up
operations like fsync(2).
- vfs.zfs.super_owner
Allows regular users to perform privileged operations on files stored
on ZFS file systems owned by him. Very careful with this one.
- chflags(2)
Not all the flags are supported. This still needs work.
- ZFSBoot
Support to boot off of ZFS pool. Not finished, AFAIK.
Submitted by: dfr
- Snapshot properties
- New failure modes
Before if write requested failed, system paniced. Now one
can select from one of three failure modes:
- panic - panic on write error
- wait - wait for disk to reappear
- continue - serve read requests if possible, block write requests
- Refquota, refreservation properties
Just quota and reservation properties, but don't count space consumed
by children file systems, clones and snapshots.
- Sparse volumes
ZVOLs that don't reserve space in the pool.
- External attributes
Compatible with extattr(2).
- NFSv4-ACLs
Not sure about the status, might not be complete yet.
Submitted by: trasz
- Creation-time properties
- Regression tests for zpool(8) command.
Obtained from: OpenSolaris
Really, the concept of holdcnt in the struct mount is rappresented by
the mnt_ref (which prevents the type-stable structure from being
"recycled) handled through vfs_ref() and vfs_rel().
On this optic, switch the holdcnt acquisition into an emulated vfs_ref()
(and subsequent release into vfs_rel()).
Discussed with: kib
Tested by: pho
and server. This replaces the RPC implementation of the NFS client and
server with the newer RPC implementation originally developed
(actually ported from the userland sunrpc code) to support the NFS
Lock Manager. I have tested this code extensively and I believe it is
stable and that performance is at least equal to the legacy RPC
implementation.
The NFS code currently contains support for both the new RPC
implementation and the older legacy implementation inherited from the
original NFS codebase. The default is to use the new implementation -
add the NFS_LEGACYRPC option to fall back to the old code. When I
merge this support back to RELENG_7, I will probably change this so
that users have to 'opt in' to get the new code.
To use RPCSEC_GSS on either client or server, you must build a kernel
which includes the KGSSAPI option and the crypto device. On the
userland side, you must build at least a new libc, mountd, mount_nfs
and gssd. You must install new versions of /etc/rc.d/gssd and
/etc/rc.d/nfsd and add 'gssd_enable=YES' to /etc/rc.conf.
As long as gssd is running, you should be able to mount an NFS
filesystem from a server that requires RPCSEC_GSS authentication. The
mount itself can happen without any kerberos credentials but all
access to the filesystem will be denied unless the accessing user has
a valid ticket file in the standard place (/tmp/krb5cc_<uid>). There
is currently no support for situations where the ticket file is in a
different place, such as when the user logged in via SSH and has
delegated credentials from that login. This restriction is also
present in Solaris and Linux. In theory, we could improve this in
future, possibly using Brooks Davis' implementation of variant
symlinks.
Supporting RPCSEC_GSS on a server is nearly as simple. You must create
service creds for the server in the form 'nfs/<fqdn>@<REALM>' and
install them in /etc/krb5.keytab. The standard heimdal utility ktutil
makes this fairly easy. After the service creds have been created, you
can add a '-sec=krb5' option to /etc/exports and restart both mountd
and nfsd.
The only other difference an administrator should notice is that nfsd
doesn't fork to create service threads any more. In normal operation,
there will be two nfsd processes, one in userland waiting for TCP
connections and one in the kernel handling requests. The latter
process will create as many kthreads as required - these should be
visible via 'top -H'. The code has some support for varying the number
of service threads according to load but initially at least, nfsd uses
a fixed number of threads according to the value supplied to its '-n'
option.
Sponsored by: Isilon Systems
MFC after: 1 month
- Implement real draining for vfs consumers by not relying on the
mnt_lock and using instead a refcount in order to keep track of lock
requesters.
- Due to the change above, remove the mnt_lock lockmgr because it is now
useless.
- Due to the change above, vfs_busy() is no more linked to a lockmgr.
Change so its KPI by removing the interlock argument and defining 2 new
flags for it: MBF_NOWAIT which basically replaces the LK_NOWAIT of the
old version (which was unlinked from the lockmgr alredy) and
MBF_MNTLSTLOCK which provides the ability to drop the mountlist_mtx
once the mnt interlock is held (ability still desired by most consumers).
- The stub used into vfs_mount_destroy(), that allows to override the
mnt_ref if running for more than 3 seconds, make it totally useless.
Remove it as it was thought to work into older versions.
If a problem of "refcount held never going away" should appear, we will
need to fix properly instead than trust on such hackish solution.
- Fix a bug where returning (with an error) from dounmount() was still
leaving the MNTK_MWAIT flag on even if it the waiters were actually
woken up. Just a place in vfs_mount_destroy() is left because it is
going to recycle the structure in any case, so it doesn't matter.
- Remove the markercnt refcount as it is useless.
This patch modifies VFS ABI and breaks KPI for vfs_busy() so manpages and
__FreeBSD_version will be modified accordingly.
Discussed with: kib
Tested by: pho
wait until the current suspension is lifted instead of silently returning
success immediately. The consequences of calling vfs_write() resume when
not owning the suspension are not well-defined at best.
Add the vfs_susp_clean() mount method to be called from
vfs_write_resume(). Set it to process_deferred_inactive() for ffs, and
stop calling it manually.
Add the thread flag TDP_IGNSUSP that allows to bypass the suspension
point in the vn_start_write. It is intended for use by VFS in the
situations where the suspender want to do some i/o requiring calls to
vn_start_write(), and this i/o cannot be done later.
Reviewed by: tegge
In collaboration with: pho
MFC after: 1 month
and nfs requests processing. Lockmgr lock provides the shared locking for
nfs requests, while exclusive mode is used for modifications. The writer
starvation is handled by lockmgr too.
Reported by: kris, pho, many
Based on the submission by: mohan
Tested by: pho
MFC after: 2 weeks
info about all currently mounted file systems. When an address is given
as an argument, prints detailed info about the given mount point.
MFC after: 2 weeks
than rely on the lockmgr support [1]:
* bump the waiters only if the interlock is held
* let brelvp() return the waiters count
* rely on brelvp() instead than BUF_LOCKWAITERS() in order to check
for the waiters number
- Remove a namespace pollution introduced recently with lockmgr.h
including lock.h by including lock.h directly in the consumers and
making it mandatory for using lockmgr.
- Modify flags accepted by lockinit():
* introduce LK_NOPROFILE which disables lock profiling for the
specified lockmgr
* introduce LK_QUIET which disables ktr tracing for the specified
lockmgr [2]
* disallow LK_SLEEPFAIL and LK_NOWAIT to be passed there so that it
can only be used on a per-instance basis
- Remove BUF_LOCKWAITERS() and lockwaiters() as they are no longer
used
This patch breaks KPI so __FreBSD_version will be bumped and manpages
updated by further commits. Additively, 'struct buf' changes results in
a disturbed ABI also.
[2] Really, currently there is no ktr tracing in the lockmgr, but it
will be added soon.
[1] Submitted by: kib
Tested by: pho, Andrea Barberio <insomniac at slackware dot it>
prevents insmntque() from placing reallocated syncer vnode on mount
list, that causes panic in vfs_allocate_syncvnode().
Introduce MNTK_NOINSMNTQ flag, that marks the period when instmntque is
not allowed to success, instead of MNTK_UNMOUNT. The MNTK_NOINSMNTQ is
set and cleared simultaneously with MNTK_UNMOUNT, except on umount error
path, where it is cleaned just before the syncer vnode is going to be
allocated.
Reported by: Peter Jeremy <peterjeremy optushome com au>
Suggested by: tegge
Approved by: re (rwatson)
64-bit counters) to a 4.x statfs structure (with long-sized counters).
- For block counters, we scale up the block size sufficiently large so
that the resulting block counts fit into a the long-sized (long for the
ABI, so 32-bit in freebsd32) counters. In 4.x the NFS client's statfs
VOP did this already. This can lie about the block size to 4.x binaries,
but it presents a more accurate picture of the ratios of free and
available space.
- For non-block counters, fix the freebsd32 stats converter to cap the
values at INT32_MAX rather than losing the upper 32-bits to match the
behavior of the 4.x statfs conversion routine in vfs_syscalls.c
Approved by: re (kensmith)
labels: the mount label (label of the mountpoint) and the fs label (label
of the file system). In practice, policies appear to only ever use one,
and the distinction is not helpful.
Combine mnt_mntlabel and mnt_fslabel into a single mnt_label, and
eliminate extra machinery required to maintain the additional label.
Update policies to reflect removal of extra entry points and label.
Obtained from: TrustedBSD Project
Sponsored by: SPARTA, Inc.
unmount jail-friendly file systems from within a jail.
Precisely it grants PRIV_VFS_MOUNT, PRIV_VFS_UNMOUNT and
PRIV_VFS_MOUNT_NONUSER privileges for a jailed super-user.
It is turned off by default.
A jail-friendly file system is a file system which driver registers
itself with VFCF_JAIL flag via VFS_SET(9) API.
The lsvfs(1) command can be used to see which file systems are
jail-friendly ones.
There currently no jail-friendly file systems, ZFS will be the first one.
In the future we may consider marking file systems like nullfs as
jail-friendly.
Reviewed by: rwatson
This way we may support multiple structures in v_data vnode field within
one file system without using black magic.
Vnode-to-file-handle should be VOP in the first place, but was made VFS
operation to keep interface as compatible as possible with SUN's VFS.
BTW. Now Solaris also implements vnode-to-file-handle as VOP operation.
VFS_VPTOFH() was left for API backward compatibility, but is marked for
removal before 8.0-RELEASE.
Approved by: mckusick
Discussed with: many (on IRC)
Tested with: ufs, msdosfs, cd9660, nullfs and zfs
support enabled.
Add mnt_gjprovider field which keeps gjournal provider's name on which
file system is placed on. This allows to not place file system on gjournal
directly and allows gjournal class to pair gjournal provider with file
system.
Sponsored by: home.pl
unmount when mp structure is reused while waiting for coveredvp lock.
Introduce struct mount generation count, increment it on each reuse and
compare the generations before and after obtaining the coveredvp lock.
Reviewed by: tegge, pjd
Approved by: pjd (mentor)
MFC after: 2 weeks
closing a window where a file system using softupdates could be async
for a short while if both MNT_UPDATE and MNT_ASYNC were passed as flags
to nmount(). Add MNTK_SOFTDEP flag to ensure that softdep_mount()
doesn't increase mnt_noasync multiple times.
sync() and sync_fsync() without losing MNT_ASYNC. Add MNTK_ASYNC flag
which is set only when MNT_ASYNC is set and mnt_noasync is zero, and
check that flag instead of MNT_ASYNC before initiating async io.
adds a FS type specific flag indicating that the FS supports shared
vnode lock lookups, adds some logic in vfs_lookup.c to test this flag
and set lock flags appropriately.
- amd on 6.x is a non-starter (without this change). Using amd under
heavy load results in a deadlock (with cascading vnode locks all the
way to the root) very quickly.
- This change should also fix the more general problem of cascading
vnode deadlocks when an NFS server goes down.
Ideally, we wouldn't need these changes, as enabling shared vnode lock
lookups globally would work. Unfortunately, UFS, for example isn't
ready for shared vnode lock lookups, crashing pretty quickly.
This change is the result of discussions with Stephan Uphoff (ups@).
Reviewed by: ups@
Fix detection of active unlinked files by checking VI_OWEINACT and
VI_DOINGINACT in addition to v_usecount.
Defer inactive handling for unlinked files if the file system is mostly
suspended (secondary writes being blocked).
Perform deferred inactive handling after the file system is resumed.
replacement for vn_write_suspend_wait() to better account for secondary write
processing.
Close race where secondary writes could be started after ffs_sync() returned
but before the file system was marked as suspended.
Detect if secondary writes or softdep processing occurred during vnode sync
loop in ffs_sync() and retry the loop if needed.
has many positive effects including improved smp locking, reducing
interdependencies between mounts that can lead to deadlocks, etc.
- Add the softdep worklist and various counters to the ufsmnt structure.
- Add a mount pointer to the workitem and remove mount pointers from the
various structures derived from the workitem as they are now redundant.
- Remove the poor-man's semaphore protecting softdep_process_worklist and
softdep_flushworklist. Several threads may now process the list
simultaneously.
- Add softdep_waitidle() to block the thread until all pending
dependencies being operated on by other threads have been flushed.
- Use softdep_waitidle() in unmount and snapshots to block either
operation until the fs is stable.
- Remove softdep worklist processing from the syncer and move it into the
softdep_flush() thread. This thread processes all softdep mounts
once each second and when it is called via the new softdep_speedup()
when there is a resource shortage. This removes the softdep hook
from the kernel and various hacks in header files to support it.
Reviewed by/Discussed with: tegge, truckman, mckusick
Tested by: kris
vfs_mount_destroy waiting for this ref to hit 0. We don't print an
error if we are rebooting as the root mount always retains some refernces
by init proc.
- Acquire a mnt ref for every vnode allocated to a mount point. Drop this
ref only once vdestroy() has been called and the mount has been freed.
- No longer NULL the v_mount pointer in delmntque() so that we may release
the ref after vgone() has been called. This allows us to guarantee
that the mount point structure will be valid until the last vnode has
lost its last ref.
- Fix a few places that rely on checking v_mount to detect recycling.
Sponsored by: Isilon Systems, Inc.
MFC After: 1 week