Add a new private thread flag to indicate that the thread should
not sleep if runningbufspace is too large.
Set this flag on the bufdaemon and syncer threads so that they skip
the waitrunningbufspace() call in bufwrite() rather than than
checking the proc pointer vs. the known proc pointers for these two
threads. A way of preventing these threads from being starved for
I/O but still placing limits on their outstanding I/O would be
desirable.
Set this flag in ffs_copyonwrite() to prevent bufwrite() calls from
blocking on the runningbufspace check while holding snaplk. This
prevents snaplk from being held for an arbitrarily long period of
time if runningbufspace is high and greatly reduces the contention
for snaplk. The disadvantage is that ffs_copyonwrite() can start
a large amount of I/O if there are a large number of snapshots,
which could cause a deadlock in other parts of the code.
Call runningbufwakeup() in ffs_copyonwrite() to decrement runningbufspace
before attempting to grab snaplk so that I/O requests waiting on
snaplk are not counted in runningbufspace as being in-progress.
Increment runningbufspace again before actually launching the
original I/O request.
Prior to the above two changes, the system could deadlock if enough
I/O requests were blocked by snaplk to prevent runningbufspace from
falling below lorunningspace and one of the bawrite() calls in
ffs_copyonwrite() blocked in waitrunningbufspace() while holding
snaplk.
See <http://www.holm.cc/stress/log/cons143.html>
vlrureclaim() in vfs_subr.c 1.636 because waiting for the vnode
lock aggravates an existing race condition. It is also undesirable
according to the commit log for 1.631.
Fix the tiny race condition that remains by rechecking the vnode
state after grabbing the vnode lock and grabbing the vnode interlock.
Fix the problem of other threads being starved (which 1.636 attempted
to fix by removing LK_NOWAIT) by calling uio_yield() periodically
in vlrureclaim(). This should be more deterministic than hoping
that VOP_LOCK() without LK_NOWAIT will block, which may not happen
in this loop.
Reviewed by: kan
MFC after: 5 days
is a workaround for non-symetric teardown of the file systems at
shutdown with respect to the mount order at boot. The proper long term
fix is to properly detach devfs from the root mount before unmounting
each, and should be implemented, but since the problem is non-harmful,
this temporary band-aid will prevent false positive bug reports and
unnecessary error output for 6.0-RELEASE.
MFC after: 3 days
Tested by: pav, pjd
vnode is inactivated), possibly leading to a NULL dereference when
checking if the mount wants knotes to be activated in the VOP hooks.
So, we add a new vnode flag VV_NOKNOTE that is only set in getnewvnode(),
if necessary, and check it when activating knotes.
Since the flags are not erased when a vnode is being held, we can safely
read them.
Reviewed by: kris@
MFC after: 3 days
vnlru proc is extremely inefficient, potentially iteration over tens of
thousands of vnodes without blocking. Droping Giant allows other threads
to preempt us although we should revisit the algorithm to fix the runtime
problems especially since this may hold up all vnode allocations.
- Remove the LK_NOWAIT from the VOP_LOCK in vlrureclaim. This provides
a natural blocking point to help alleviate the situation described above
although it may not technically be desirable.
- yield after we make a pass on all mount points to prevent us from
blocking other threads which require Giant.
MFC after: 2 weeks
- Introducing the possibility of using locks different than mutexes
for the knlist locking. In order to do this, we add three arguments to
knlist_init() to specify the functions to use to lock, unlock and
check if the lock is owned. If these arguments are NULL, we assume
mtx_lock, mtx_unlock and mtx_owned, respectively.
- Using the vnode lock for the knlist locking, when doing kqueue operations
on a vnode. This way, we don't have to lock the vnode while holding a
mutex, in filt_vfsread.
Reviewed by: jmg
Approved by: re (scottl), scottl (mentor override)
Pointyhat to: ssouhlal
Will be happy: everyone
are actually caused by a buf with both VNCLEAN and VNDIRTY set. In
the traces it is clear that the buf is removed from the dirty queue while
it is actually on the clean queue which leaves the tail pointer set.
Assert that both flags are not set in buf_vlist_add and buf_vlist_remove.
Sponsored by: Isilon Systems, Inc.
Approved by: re (blanket vfs)
ref while we're calling vgone(). This prevents transient refs from
re-adding us to the free list. Previously, a vfree() triggered via
vinvalbuf() getting rid of all of a vnode's pages could place a partially
destructed vnode on the free list where vtryrecycle() could find it. The
first call to vtryrecycle would hang up on the vnode lock, but when it
failed it would place a now dead vnode onto the free list, and another
call to vtryrecycle() would free an already free vnode. There were many
complications of having a zero ref count while freeing which can now go
away.
- Change vdropl() to release the interlock before returning. All callers
now respect this, so vdropl() directly frees VI_DOOMED vnodes once the
last ref is dropped. This means that we'll never have VI_DOOMED vnodes
on the free list.
- Seperate v_incr_usecount() into v_incr_usecount(), v_decr_usecount() and
v_decr_useonly(). The incr/decr split is so that incr usecount can
return with the interlock still held while decr drops the interlock so
it can call vdropl() which will potentially free the vnode. The calling
function can't drop the lock of an already free'd node. v_decr_useonly()
drops a usecount without droping the hold count. This is done so the
usecount reaches zero in vput() before we recycle, however the holdcount
is still 1 which prevents any new references from placing the vnode
back on the free list.
- Fix vnlrureclaim() to vhold the vnode since it doesn't do a vget(). We
wouldn't want vnlrureclaim() to bump the usecount since this has
different semantics. Also change vnlrureclaim() to do a NOWAIT on the
vn_lock. When this function runs we're usually in a desperate situation
and we wouldn't want to wait for any specific vnode to be released.
- Fix a bunch of misc comments to reflect the new behavior.
- Add vhold() and vdrop() to vflush() for the same reasons that we do in
vlrureclaim(). Previously we held no reference and a vnode could have
been freed while we were waiting on the lock.
- Get rid of vlruvp() and vfreehead(). Neither are used. vlruvp() should
really be rethought before it's reintroduced.
- vgonel() always returns with the vnode locked now and never puts the
vnode back on a free list. The vnode will be freed as soon as the last
reference is released.
Sponsored by: Isilon Systems, Inc.
Debugging help from: Kris Kennaway, Peter Holm
Approved by: re (blanket vfs)
of the clean and dirty lists. This is in an attempt to catch the wrong
bufobj problem sooner.
- In vgonel() don't acquire an extra reference in the active case, the
vnode lock and VI_DOOMED protect us from recursively cleaning.
- Also in vgonel() clean up some stale comments.
Sponsored by: Isilon Systems, Inc.
Approved by: re (blanket vfs)
anyway and it's not used outside of vfs_subr.c.
- Change vgonel() to accept a parameter which determines whether or not
we'll put the vnode on the free list when we're done.
- Use the new vgonel() parameter rather than VI_DOOMED to signal our
intentions in vtryrecycle().
- In vgonel() return if VI_DOOMED is already set, this vnode has already
been reclaimed.
Sponsored by: Isilon Systems, Inc.
events could be added to cover other interesting details.
- Add some VNASSERTs to discover places where we access vnodes after
they have been uma_zfree'd before we try to free them again.
- Add a few more VNASSERTs to vdestroy() to be certain that the vnode is
really unused.
Sponsored by: Isilon Systems, Inc.
UFS by:
- Making the pre and post hooks for the VOP functions work even when
DEBUG_VFS_LOCKS is not defined.
- Moving the KNOTE activations into the corresponding VOP hooks.
- Creating a MNTK_NOKNOTE flag for the mnt_kern_flag field of struct
mount that permits filesystems to disable the new behavior.
- Creating a default VOP_KQFILTER function: vfs_kqfilter()
My benchmarks have not revealed any performance degradation.
Reviewed by: jeff, bde
Approved by: rwatson, jmg (kqueue changes), grehan (mentor)
1. Copy a NULL-terminated string into a fixed-length buffer, and
2. copyout that buffer to userland,
we really ought to
0. Zero the entire buffer
first.
Security: FreeBSD-SA-05:08.kmem
drop the check+initialization for a straight initialization. Also
assert that curthread will never be NULL just to be sure.
Discussed with: rwatson, peter
MFC after: 1 week
are set when we attempt to remove a buffer from a queue we should panic.
Hopefully this will catch the source of the wrong bufobj panics.
Sponsored by: Isilon Systems, Inc.
vtryrecycle(). We could sometimes get into situations where two threads
could try to recycle the same vnode before this.
- vtryrecycle() is now responsible for returning the vnode to the free list
if it fails and someone else hasn't done it.
- Make a new function vfreehead() which moves a vnode to the head of the
free list and use it in vgone() to clean up that code a bit.
Sponsored by: Isilon Systems, Inc.
Reported by: pho, kkenn
do not correctly deal with failures. This presently risks deadlock
problems if dependency processing is held up by failures to allocate
a vnode, however, this is better than the situation with the failures.
Sponsored by: Isilon Systems, Inc.
potential lock order reversal. Also, don't unlock the vnode if this
fails, lockmgr has already unlocked it for us.
- Restructure vget() now that vn_lock() does all of VI_DOOMED checking
for us and also handles the case where there is no real lock type.
- If VI_OWEINACT is set, we need to upgrade the lock request to EXCLUSIVE
so that we can call inactive. It's not legal to vget a vnode that hasn't
had INACTIVE called yet.
Sponsored by: Isilon Systems, Inc.
to root cause on exactly how this happens.
- If the assert is disabled, we presently try to handle this case, but the
BUF_UNLOCK was missing. Thus, if this condition ever hit we would leak
a buf lock.
Many thanks to Peter Holm for all his help in finding this bug. He really
put more effort into it than I did.
one to become available for one second and then return ENFILE. We
can run out of vnodes, and there must be a hard limit because without
one we can quickly run out of KVA on x86. Presently the system can
deadlock if there are maxvnodes directories in the namecache. The
original 4.x BSD behavior was to return ENFILE if we reached the max,
but 4.x BSD did not have the vnlru proc so it was less profitable to
wait.
on filesystems which safely support them. It appears that many
network filesystems specifically are not shared lock safe.
Sponsored by: Isilon Systems, Inc.
vnodes whose names it caches, so we no longer need a `generation
number' to tell us if a referenced vnode is invalid. Replace the use
of the parent's v_id in the hash function with the address of the
parent vnode.
Tested by: Peter Holm
Glanced at by: jeff, phk
vhold()s us.
- Avoid an extra mutex acquire and release in the common case of vgonel()
by checking for OWEINACT at the start of the function.
- Fix the case where we set OWEINACT in vput(). LK_EXCLUPGRADE drops our
shared lock if it fails.
Sponsored by: Isilon Systems, Inc.
now always allocates a new vnode.
- Define a new function, vnlru_free, which frees vnodes from the free list.
It takes as a parameter the number of vnodes to free, which is
wantfreevnodes - freevnodes when called from vnlru_proc or 1 when
called from getnewvnode(). For now, getnewvnode() still tries to reclaim
a free vnode before creating a new one when we are near the limit.
- Define a function, vdestroy, which handles the actual release of memory
and teardown of locks, etc. This could become a uma_dtor() routine.
- Get rid of minvnodes. Now wantfreevnodes is 1/4th the max vnodes. This
keeps more unreferenced vnodes around so that files which have only
been stat'd are less likely to be kicked out of the system before we
have a chance to read them, etc. These vnodes may still be freed via
the normal vnlru_proc() routines which may some day become a real lru.
before it can call VOP_INACTIVE(). This must use the EXCLUPGRADE path
because we may violate some lock order with another locked vnode if
we drop and reacquire the lock. If EXCLUPGRADE fails, we mark the
vnode with VI_OWEINACT. This case should be very rare.
- Clear VI_OWEINACT in vinactive() and vbusy().
- If VI_OWEINACT is set in vgone() do the VOP_INACTIVE call here as well.
Sponsored by: Isilon Systems, Inc.
- Move VSHOULDBUSY, VSHOULDFREE, and VTRYRECYCLE into vfs_subr.c so
no one else attempts to grow a dependency on them.
- Now that objects with pages hold the vnode we don't have to do unlocked
checks for the page count in the vm object in VSHOULDFREE. These three
macros could simply check for holdcnt state transitions to determine
whether the vnode is on the free list already, but the extra safety
the flag affords us is probably worth the minimal cost.
- The leafonly sysctl and code have been dead for several years now,
remove the sysctl and the code that employed it from vtryrecycle().
- vtryrecycle() also no longer has to check the object's page count as
the object holds the vnode until it reaches 0.
Sponsored by: Isilon Systems, Inc.
to use only the holdcnt to determine whether a vnode may be recycled,
simplifying the V* macros as well as vtryrecycle(), etc.
Sponsored by: Isilon Systems, Inc.
vtryrecycle(). All obj refs also ref the vnode.
- Consistently use v_incr_usecount() to increment the usecount. This will
be more important later.
Sponsored by: Isilon Systems, Inc.
- Add a vn_start_write/vn_finished_write around vlrureclaim so we don't do
writing ops without suspending. This could suspend the vlruproc which
should not be a problem under normal circumstances.
- Manually implement VMIGHTFREE in vlrureclaim as this was the only instance
where it was used.
- Acquire a lock before calling vgone() as it now requires it.
- Move the acquisition of the vnode interlock from vtryrecycle() to
getnewvnode() so that if it fails we don't drop and reacquire the
vnode_free_list_mtx.
- Check for a usecount or holdcount at the end of vtryrecycle() in case
someone grabbed a ref while we were recycling. Abort the recycle, and
on the final ref drop this vnode will be placed on the head of the free
list.
- Move the redundant VOP_INACTIVE protection code into the local
vinactive() routine to avoid code bloat.
- Keep the vnode lock held across calls to vgone() in several places.
- vgonel() no longer uses XLOCK, instead callers must hold an exclusive
vnode lock. The VI_DOOMED flag is set to allow other threads to detect
a vnode which is no longer valid. This flag is set until the last
reference is gone, and there are no chances for a new ref. vgonel()
holds this lock across the entire function, which greatly simplifies
logic.
_ Only vfree() in one place in vgone() not three.
- Adjust vget() to check the VI_DOOMED flag prior to waiting on the lock
in the LK_NOWAIT case. In other cases, check after we have slept and
acquired an exlusive lock. This will simulate the old vx_wait()
behavior.
Sponsored by: Isilon Systems, Inc.
has been set. Assert that this is the case so that we catch filesystems
who are using naked VOP_LOCKs in illegal cases.
Sponsored by: Isilon Systems, Inc.
considered to be as good as an exclusive lock, although there is still a
possibility of someone acquiring a VOP LOCK while xlock is held.
Sponsored by: Isilon Systems, Inc.
List devfs_dirents rather than vnodes off their shared struct cdev, this
saves a pointer field in the vnode at the expense of a field in the
devfs_dirent. There are often 100 times more vnodes so this is bargain.
In addition it makes it harder for people to try to do stypid things like
"finding the vnode from cdev".
Since DEVFS handles all VCHR nodes now, we can do the vnode related
cleanup in devfs_reclaim() instead of in dev_rel() and vgonel().
Similarly, we can do the struct cdev related cleanup in dev_rel()
instead of devfs_reclaim().
rename idestroy_dev() to destroy_devl() for consistency.
Add LIST_ENTRY de_alias to struct devfs_dirent.
Remove v_specnext from struct vnode.
Change si_hlist to si_alist in struct cdev.
String new devfs vnodes' devfs_dirent on si_alist when
we create them and take them off in devfs_reclaim().
Fix devfs_revoke() accordingly. Also don't clear fields
devfs_reclaim() will clear when called from vgone();
Let devfs_reclaim() call dev_rel() instead of vgonel().
Move the usecount tracking from dev_rel() to devfs_reclaim(),
and let dev_rel() take a struct cdev argument instead of vnode.
Destroy SI_CHEAPCLONE devices in dev_rel() (instead of
devfs_reclaim()) when they are no longer used. (This
should maybe happen in devfs_close() instead.)
patch from kan@).
Pull bufobj_invalbuf() out of vinvalbuf() and make g_vfs call it on
close. This is not yet a generally safe function, but for this very
specific use it is safe. This solves the problem with buffers not
being flushed by unmount or after failed mount attempts.
to off.
- Protect access to mnt_kern_flag with the mointpoint mutex.
- Remove some KASSERTs which are not legal checks without the appropriate
locks held.
- Use VCANRECYCLE() rather than rolling several slightly different
checks together.
- Return from vtryrecycle() with a recycled vnode rather than a locked
vnode. This simplifies some locking.
- Remove several GIANT_REQUIRED lines.
- Add a few KASSERTs to help with INACT debugging.
Sponsored By: Isilon Systems, Inc.
unhappiness lately.
As far as I can tell, no files that have made it safely to disk
have been endangered, but stuff in transit has been in peril.
Pointy hat: phk
TAILQ_FOREACH_SAFE().
Loose the error pointer argument and return any errors the normal way.
Return EAGAIN for the case where more work needs to be done.
I'm not sure why a credential was added to these in the first place, it is
not used anywhere and it doesn't make much sense:
The credentials for syncing a file (ability to write to the
file) should be checked at the system call level.
Credentials for syncing one or more filesystems ("none")
should be checked at the system call level as well.
If the filesystem implementation needs a particular credential
to carry out the syncing it would logically have to the
cached mount credential, or a credential cached along with
any delayed write data.
Discussed with: rwatson
split the conversion of the remaining three filesystems out from the root
mounting changes, so in one go:
cd9660:
Convert to nmount.
Add omount compat shims.
Remove dedicated rootfs mounting code.
Use vfs_mountedfrom()
Rely on vfs_mount.c calling VFS_STATFS()
nfs(client):
Convert to nmount (the simple way, mount_nfs(8) is still necessary).
Add omount compat shims.
Drop COMPAT_PRELITE2 mount arg compatibility.
ffs:
Convert to nmount.
Add omount compat shims.
Remove dedicated rootfs mounting code.
Use vfs_mountedfrom()
Rely on vfs_mount.c calling VFS_STATFS()
Remove vfs_omount() method, all filesystems are now converted.
Remove MNTK_WANTRDWR, handling RO/RW conversions is a filesystem
task, and they all do it now.
Change rootmounting to use DEVFS trampoline:
vfs_mount.c:
Mount devfs on /. Devfs needs no 'from' so this is clean.
symlink /dev to /. This makes it possible to lookup /dev/foo.
Mount "real" root filesystem on /.
Surgically move the devfs mountpoint from under the real root
filesystem onto /dev in the real root filesystem.
Remove now unnecessary getdiskbyname().
kern_init.c:
Don't do devfs mounting and rootvnode assignment here, it was
already handled by vfs_mount.c.
Remove now unused bdevvp(), addaliasu() and addalias(). Put the
few necessary lines in devfs where they belong. This eliminates the
second-last source of bogo vnodes, leaving only the lemming-syncer.
Remove rootdev variable, it doesn't give meaning in a global context and
was not trustworth anyway. Correct information is provided by
statfs(/).
initializations but we did have lofty goals and big ideals.
Adjust to more contemporary circumstances and gain type checking.
Replace the entire vop_t frobbing thing with properly typed
structures. The only casualty is that we can not add a new
VOP_ method with a loadable module. History has not given
us reason to belive this would ever be feasible in the the
first place.
Eliminate in toto VOCALL(), vop_t, VNODEOP_SET() etc.
Give coda correct prototypes and function definitions for
all vop_()s.
Generate a bit more data from the vnode_if.src file: a
struct vop_vector and protype typedefs for all vop methods.
Add a new vop_bypass() and make vop_default be a pointer
to another struct vop_vector.
Remove a lot of vfs_init since vop_vector is ready to use
from the compiler.
Cast various vop_mumble() to void * with uppercase name,
for instance VOP_PANIC, VOP_NULL etc.
Implement VCALL() by making vdesc_offset the offsetof() the
relevant function pointer in vop_vector. This is disgusting
but since the code is generated by a script comparatively
safe. The alternative for nullfs etc. would be much worse.
Fix up all vnode method vectors to remove casts so they
become typesafe. (The bulk of this is generated by scripts)
At some point later the syncer will unlearn about vnodes and the filesystems
method called by the syncer will know enough about what's in bo_private to
do the right thing.
[1] Ok, I know, but I couldn't resist the pun.
We keep si_bsize_phys around for now as that is the simplest way to pull
the number out of disk device drivers in devfs_open(). The correct solution
would be to do an ioctl(DIOCGSECTORSIZE), but the point is probably mooth
when filesystems sit on GEOM, so don't bother for now.
Extend it with a strategy method.
Add bufstrategy() which do the usual VOP_SPECSTRATEGY/VOP_STRATEGY
song and dance.
Rename ibwrite to bufwrite().
Move the two NFS buf_ops to more sensible places, add bufstrategy
to them.
Add inlines for bwrite() and bstrategy() which calls through
buf->b_bufobj->b_ops->b_{write,strategy}().
Replace almost all VOP_STRATEGY()/VOP_SPECSTRATEGY() calls with bstrategy().
without a mountpoint. In this scenario, there's no useful source for
a label on the vnode, since we can't query the mountpoint for the
labeling strategy or default label.
jest, of most excellent fancy: he hath taught me lessons a thousand
times; and now, how abhorred in my imagination it is! my gorge rises
at it. Here were those hacks that I have curs'd I know not how
oft. Where be your kludges now? your workarounds? your layering
violations, that were wont to set the table on a roar?
Move the skeleton of specfs into devfs where it now belongs and
bury the rest.
Initialize b_bufobj for all buffers.
Make incore() and gbincore() take a bufobj instead of a vnode.
Make inmem() local to vfs_bio.c
Change a lot of VI_[UN]LOCK(bp->b_vp) to BO_[UN]LOCK(bp->b_bufobj)
also VI_MTX() to BO_MTX(),
Make buf_vlist_add() take a bufobj instead of a vnode.
Eliminate other uses of bp->b_vp where bp->b_bufobj will do.
Various minor polishing: remove "register", turn panic into KASSERT,
use new function declarations, TAILQ_FOREACH_SAFE() etc.
Add bufobj_wref(), bufobj_wdrop() and bufobj_wwait() to handle the write
count on a bufobj. Bufobj_wdrop() replaces vwakeup().
Use these functions all relevant places except in ffs_softdep.c where
the use if interlocked_sleep() makes this impossible.
Rename b_vnbufs to b_bobufs now that we touch all the relevant files anyway.
Initialize the bo_mtx when we allocate a vnode i getnewvnode() For
now we point to the vnodes interlock mutex, that retains the exact
same locking sematics.
Move v_numoutput from vnode to bufobj. Add renaming macro to
postpone code sweep.
Discussion: this panic (or waning) only occurs when the kernel is
compiled with INVARIANTS. Otherwise the problem (which means that
the vp->v_data field isn't NULL, and represents a coding error and
possibly a memory leak) is silently ignored by setting it to NULL
later on.
Panicking here isn't very helpful: by this time, we can only find
the symptoms. The panic occurs long after the reason for "not
cleaning" has been forgotten; in the case in point, it was the
result of severe file system corruption which left the v_type field
set to VBAD. That issue will be addressed by a separate commit.
of the number of threads which are inside whatever is behind the
cdevsw for this particular cdev.
Make the device mutex visible through dev_lock() and dev_unlock().
We may want finer granularity later.
Replace spechash_mtx use with dev_lock()/dev_unlock().
field.
Replace three instances of longhaired initialization va_filerev fields.
Added XXX comment wondering why we don't use random bits instead of
uptime of the system for this purpose.
shutdown_pre_sync state if the RB_NOSYNC flag is set. This is the
likely cause of hangs after a system panic that are keeping crash
dumps from being done.
This is a MFC candidate for RELENG_5.
MFC after: 3 days
a more complete subsystem, and removes the knowlege of how things are
implemented from the drivers. Include locking around filter ops, so a
module like aio will know when not to be unloaded if there are outstanding
knotes using it's filter ops.
Currently, it uses the MTX_DUPOK even though it is not always safe to
aquire duplicate locks. Witness currently doesn't support the ability
to discover if a dup lock is ok (in some cases).
Reviewed by: green, rwatson (both earlier versions)
we may sleep when doing so; check that we didn't race with another thread
allocating storage for the vnode after allocation is made to a local
pointer, and only update the vnode pointer if it's still NULL. Otherwise,
accept that another thread got there first, and release the local storage.
Discussed with: jmg
somewhat clearer, but more importantly allows for a consistent naming
scheme for suser_cred flags.
The old name is still defined, but will be removed in a few days (unless I
hear any complaints...)
Discussed with: rwatson, scottl
Requested by: jhb
This is to allow filesystems to decide based on the passed thread
which vnode to return.
Several filesystems used curthread, they now use the passed thread.
Rebind the client socket when we experience a timeout. This fixes
the case where our IP changes for some reason.
Signal a VFS event when NFS transitions from up to down and vice
versa.
Add a placeholder vfs_sysctl where we will put status reporting
shortly.
Also:
Make down NFS mounts return EIO instead of EINTR when there is a
soft timeout or force unmount in progress.
so that last_work_seen has a reasonable value at the transition
to the SYNCER_SHUTTING_DOWN state, even if net_worklist_len happened
to be zero at the time.
Initialize last_work_seen to zero as a safety measure in case the
syncer never ran in the SYNCER_RUNNING state.
Tested by: phk
Speed up the syncer when shutting down by sleeping for a shorter
period of time instead of cranking up rushjob and using the
normal one second sleep.
Skip empty worklist slots when shutting down to avoid lengthy
intervals of inactivity.
Give I/O more time to complete between steps by not speeding the
syncer quite as much.
Terminate the syncer after one full pass through the worklist
plus one second with the worklist containing nothing but syncer
vnodes.
Print an indication of shutdown progress to the console.
Add a sysctl, vfs.worklist_len, to allow the size of the syncer worklist
to be monitored.
around in the vnodes surroundings when we allocate a block.
Assign a blocksize when we create a vnode, and yell a warning (and ignore it)
if we got the wrong size.
Please email all such warnings to me.
generic filesystem events to userspace. Currently only mount and unmount
of filesystems are signalled. Soon to be added, up/down status of NFS.
Introduce a sysctl node used to route requests to/from filesystems
based on filesystem ids.
Introduce a new vfsop, vfs_sysctl(mp, req) that is used as the callback/
entrypoint by the sysctl code to change individual filesystems.
ffs_mount -> bdevvp -> getnewvnode(..., mp = NULL, ...) ->
insmntqueue(vp, mp = NULL) -> KASSERT -> panic
Make getnewvnode() only call insmntqueue() if the mountpoint parameter
is not NULL.
our cached 'next vnode' being removed from this mountpoint. If we
find that it was recycled, we restart our traversal from the start
of the list.
Code to do that is in all local disk filesystems (and a few other
places) and looks roughly like this:
MNT_ILOCK(mp);
loop:
for (vp = TAILQ_FIRST(&mp...);
(vp = nvp) != NULL;
nvp = TAILQ_NEXT(vp,...)) {
if (vp->v_mount != mp)
goto loop;
MNT_IUNLOCK(mp);
...
MNT_ILOCK(mp);
}
MNT_IUNLOCK(mp);
The code which takes vnodes off a mountpoint looks like this:
MNT_ILOCK(vp->v_mount);
...
TAILQ_REMOVE(&vp->v_mount->mnt_nvnodelist, vp, v_nmntvnodes);
...
MNT_IUNLOCK(vp->v_mount);
...
vp->v_mount = something;
(Take a moment and try to spot the locking error before you read on.)
On a SMP system, one CPU could have removed nvp from our mountlist
but not yet gotten to assign a new value to vp->v_mount while another
CPU simultaneously get to the top of the traversal loop where it
finds that (vp->v_mount != mp) is not true despite the fact that
the vnode has indeed been removed from our mountpoint.
Fix:
Introduce the macro MNT_VNODE_FOREACH() to traverse the list of
vnodes on a mountpoint while taking into account that vnodes may
be removed from the list as we go. This saves approx 65 lines of
duplicated code.
Split the insmntque() which potentially moves a vnode from one mount
point to another into delmntque() and insmntque() which does just
what the names say.
Fix delmntque() to set vp->v_mount to NULL while holding the
mountpoint lock.
faster and iterate to over its work list a few times in an attempt
to empty the work list before the syncer terminates. This leaves
fewer dirty blocks to be written at the "syncing disks" stage and
keeps the the "giving up on N buffers" problem from being triggered
by the presence of a large soft updates work list at system shutdown
time. The downside is that the syncer takes noticeably longer to
terminate.
Tested by: "Arjan van Leeuwen" <avleeuwen AT piwebs DOT com>
Approved by: mckusick
The big lines are:
NODEV -> NULL
NOUDEV -> NODEV
udev_t -> dev_t
udev2dev() -> findcdev()
Various minor adjustments including handling of userland access to kernel
space struct cdev etc.
sleep queue interface:
- Sleep queues attempt to merge some of the benefits of both sleep queues
and condition variables. Having sleep qeueus in a hash table avoids
having to allocate a queue head for each wait channel. Thus, struct cv
has shrunk down to just a single char * pointer now. However, the
hash table does not hold threads directly, but queue heads. This means
that once you have located a queue in the hash bucket, you no longer have
to walk the rest of the hash chain looking for threads. Instead, you have
a list of all the threads sleeping on that wait channel.
- Outside of the sleepq code and the sleep/cv code the kernel no longer
differentiates between cv's and sleep/wakeup. For example, calls to
abortsleep() and cv_abort() are replaced with a call to sleepq_abort().
Thus, the TDF_CVWAITQ flag is removed. Also, calls to unsleep() and
cv_waitq_remove() have been replaced with calls to sleepq_remove().
- The sched_sleep() function no longer accepts a priority argument as
sleep's no longer inherently bump the priority. Instead, this is soley
a propery of msleep() which explicitly calls sched_prio() before
blocking.
- The TDF_ONSLEEPQ flag has been dropped as it was never used. The
associated TDF_SET_ONSLEEPQ and TDF_CLR_ON_SLEEPQ macros have also been
dropped and replaced with a single explicit clearing of td_wchan.
TD_SET_ONSLEEPQ() would really have only made sense if it had taken
the wait channel and message as arguments anyway. Now that that only
happens in one place, a macro would be overkill.
the syscall arguments and does the suser() permission check, and
kern_mlock(), which does the resource limit checking and calls
vm_map_wire(). Split munlock() in a similar way.
Enable the RLIMIT_MEMLOCK checking code in kern_mlock().
Replace calls to vslock() and vsunlock() in the sysctl code with
calls to kern_mlock() and kern_munlock() so that the sysctl code
will obey the wired memory limits.
Nuke the vslock() and vsunlock() implementations, which are no
longer used.
Add a member to struct sysctl_req to track the amount of memory
that is wired to handle the request.
Modify sysctl_wire_old_buffer() to return an error if its call to
kern_mlock() fails. Only wire the minimum of the length specified
in the sysctl request and the length specified in its argument list.
It is recommended that sysctl handlers that use sysctl_wire_old_buffer()
should specify reasonable estimates for the amount of data they
want to return so that only the minimum amount of memory is wired
no matter what length has been specified by the request.
Modify the callers of sysctl_wire_old_buffer() to look for the
error return.
Modify sysctl_old_user to obey the wired buffer length and clean up
its implementation.
Reviewed by: bms
This is what we came here for: Hang dev_t's from their cdevsw,
refcount cdevsw and dev_t and generally keep track of things a lot
better than we used to:
Hold a cdevsw reference around all entrances into the device driver,
this will be necessary to safely determine when we can unload driver
code.
Hold a dev_t reference while the device is open.
KASSERT that we do not enter the driver on a non-referenced dev_t.
Remove old D_NAG code, anonymous dev_t's are not a problem now.
When destroy_dev() is called on a referenced dev_t, move it to
dead_cdevsw's list. When the refcount drops, free it.
Check that cdevsw->d_version is correct. If not, set all methods
to the dead_*() methods to prevent entrance into driver. Print
warning on console to this effect. The device driver may still
explode if it is also incompatible with newbus, but in that case
we probably didn't get this far in the first place.
Remove the unused second argument from udev2dev().
Convert all remaining users of makedev() to use udev2dev(). The
semantic difference is that udev2dev() will only locate a pre-existing
dev_t, it will not line makedev() create a new one.
Apart from the tiny well controlled windown in D_PSEUDO drivers,
there should no longer be any "anonymous" dev_t's in the system
now, only dev_t's created with make_dev() and make_dev_alias()
Add empty line before first code line in functions with no local
variables.
Properly terminate comment sentences.
Indent lines which are longer that 80 characters.
Move v_addpollinfo closer to the rest of poll-related functions.
Move DEBUG_VFS_LOCKS ifdefed block to the end of file.
Obtained from: bde (partly)
reassigning their v_ops field to specfs, detaching from the mountpoint, etc.
However, this is not sufficient. If we vclean() the vnode the pages owned
by the vnode are lost, potentially while buffers reference them. Implement
parts of vclean() seperately in vgonechrl() so that the pages and bufs
associated with a device vnode are not destroyed while in use.
thread being waken up. The thread waken up can run at a priority as
high as after tsleep().
- Replace selwakeup()s with selwakeuppri()s and pass appropriate
priorities.
- Add cv_broadcastpri() which raises the priority of the broadcast
threads. Used by selwakeuppri() if collision occurs.
Not objected in: -arch, -current
Introduce two new macros MNT_ILOCK(mp)/MNT_IUNLOCK(mp) to
operate on this mutex transparently.
Eventually new mutex will be protecting more fields in
struct mount, not only vnode list.
Discussed with: jeff
LK_RETRY either, we don't want this vnode if it turns into another.
- Remove the code that checks the mount point after acquiring the lock
we are guaranteed to either fail or get the vnode that we wanted.
- In vtryrecycle() try to vgonel the vnode if all of the previous checks
passed. We won't vgonel if someone has either acquired a hold or usecount
or started the vgone process elsewhere. This is because we may have been
removed from the free list while we were inspecting the vnode for
recycling.
- The VI_TRYLOCK stops two threads from entering getnewvnode() and recycling
the same vnode. To further reduce the likelyhood of this event, requeue
the vnode on the tail of the list prior to calling vtryrecycle(). We can
not actually remove the vnode from the list until we know that it's
going to be recycled because other interlock holders may see the VI_FREE
flag and try to remove it from the free list.
- Kill a bogus XXX comment. If XLOCK is set we shouldn't wait for it
regardless of MNT_WAIT because the vnode does not actually belong to
this filesystem.
purge, the purge in vclean, and the filesystems purge, we had 3 purges
per vnode.
- Move the insmntque(vp, 0) to vclean() so that we may remove it from the
two vgone() functions and reduce the number of lock operations required.
whether or not the sync failed. This could potentially get set between
the time that we VOP_UNLOCK and VI_LOCK() but the race would harmelssly
lead to the sync being delayed by an extra 30 seconds. If we do not move
the vnode it could cause an endless loop if it continues to fail to sync.
- Use vhold and vdrop to stop the vnode from changing identities while we
have it unlocked. Other internal vfs lists are likely to follow this
scheme.
- Create a new function, vgonechrl(), which performs vgone for an in-use
character device. Move the code from vflush() that did this into
vgonechrl().
- Hold the xlock across the entirety of vgonel() and vgonechrl() so that
at no point will an invalid vnode exist on any list without XLOCK set.
- Move the xlock code out of vclean() now that it is in the vgone*()
functions.
This is so that we may grab the interlock while still holding the
sync_mtx. We have to VI_TRYLOCK() because in all other cases the lock
order runs the other way.
- If we don't meet any of the preconditions, reinsert the vp into the
list for the next second.
- We don't need to panic if we fail to sync here because each FSYNC
function handles this case. Removing this redundant code also
simplifies locking.
fail. Remove the panic from that case and document why it might fail.
- Document the reason for calling cache_purge() on a newly created vnode.
- In insmntque() order the operations so that we can call mtx_unlock()
one fewer times. This makes the code somewhat clearer as well.
- Add XXX comments in sched_sync() and vflush().
- In vget(), do not sleep while waiting for XLOCK to clear if LK_NOWAIT is
set.
- In vclean() we don't need to acquire a lock around a single TAILQ_FIRST
call. It's ok if we race here, the vinvalbuf will just do nothing.
- Increase the scope of the lock in vgonel() to reduce the number of lock
operations that are performed.
Doing so creates a race where the buf is on neither list.
- Only vfree() in an error case in vclean() if VSHOULDFREE() thinks we
should.
- Convert the error case in vclean() to INVARIANTS from DIAGNOSTIC as this
really should not happen and is fast to check.
size and the kernel's heap size, specifically, vm_kmem_size. This
function allows a maximum of 40% of the vm_kmem_size to be used for
vnodes and vm objects. This is a conservative bound based upon recent
problem reports. (In other words, a slight increase in this percentage
may be safe.)
Finally, machines with less than ~3GB of RAM should be unaffected
by this change, i.e., the maximum number of vnodes should remain
the same. If necessary, machines with 3GB or more of RAM can increase
the maximum number of vnodes by increasing vm_kmem_size.
Desired by: scottl
Tested by: jake
Approved by: re (rwatson,scottl)
the vnode and restart the loop. Vflush() is vulnerable since it does not
hold a reference to the vnode and it holds no other locks while waiting
for the vnode lock. The vnode will no longer be on the list when the
loop is restarted.
Approved by: re (rwatson)
desired buffer is found at one of the roots more than 60% of the time.
Thus, checking both roots before performing either splay eliminates
unnecessary splays on the first tree splayed.
Approved by: re (jhb)
synchronization primitives from inside DDB is generally a bad idea,
and in this case it frequently results in panics due to DDB commands
being executed from the sio fast interrupt context on a serial
console. Replace the locking with a note that a lack of locking
means that DDB may get see inconsistent views of the mount and vnode
lists, which could also result in a panic. More frequently,
though, this avoids a panic than causes it.
Discussed with ages ago: bde
Approved by: re (scottl)
trustworthy for vnode-backed objects.
- Restore the old behavior of vm_object_page_remove() when the end
of the given range is zero. Add a comment to vm_object_page_remove()
regarding this behavior.
Reported by: iedowse
- Eliminate an odd, special-case feature:
if start == end == 0 then all pages are removed. Only one caller
used this feature and that caller can trivially pass the object's
size.
- Assert that the vm_object is locked on entry; don't bother testing
for a NULL vm_object.
- Style: Fix lines that are longer than 80 characters.
- Add a parameter to vm_pageout_flush() that tells vm_pageout_flush()
whether its caller has locked the vm_object. (This is a temporary
measure to bootstrap vm_object locking.)
possible for some time.
- Lock the buf before accessing fields. This should very rarely be locked.
- Assert that B_DELWRI is set after we acquire the buf. This should always
be the case now.
requiring locked bufs in vfs_bio_awrite(). Previously the buf could
have been written out by fsync before we acquired the buf lock if it
weren't for giant. The cluster_wbuild() handles this race properly but
the single write at the end of vfs_bio_awrite() would not.
- Modify flushbufqueues() so there is only one copy of the loop. Pass a
parameter in that says whether or not we should sync bufs with deps.
- Call flushbufqueues() a second time and then break if we couldn't find
any bufs without deps.
Remove extraneous uses of vop_null, instead defering to the default op.
Rename vnode type "vfs" to the more descriptive "syncer".
Fix formatting for various filesystems that use vop_print.
track of the number of dirty buffers held by a vnode. When a
bdwrite is done on a buffer, check the existing number of dirty
buffers associated with its vnode. If the number rises above
vfs.dirtybufthresh (currently 90% of vfs.hidirtybuffers), one
of the other (hopefully older) dirty buffers associated with
the vnode is written (using bawrite). In the event that this
approach fails to curb the growth in it the vnode's number of
dirty buffers (due to soft updates rollback dependencies),
the more drastic approach of doing a VOP_FSYNC on the vnode
is used. This code primarily affects very large and actively
written files such as snapshots. This change should eliminate
hanging when taking snapshots or doing background fsck on
very large filesystems.
Hopefully, one day it will be possible to cache filesystem
metadata in the VM cache as is done with file data. As it
stands, only the buffer cache can be used which limits total
metadata storage to about 20Mb no matter how much memory is
available on the system. This rather small memory gets badly
thrashed causing a lot of extra I/O. For example, taking a
snapshot of a 1Tb filesystem minimally requires about 35,000
write operations, but because of the cache thrashing (we only
have about 350 buffers at our disposal) ends up doing about
237,540 I/O's thus taking twenty-five minutes instead of four
if it could run entirely in the cache.
Reported by: Attila Nagy <bra@fsn.hu>
Sponsored by: DARPA & NAI Labs.
- Remove the buftimelock mutex and acquire the buf's interlock to protect
these fields instead.
- Hold the vnode interlock while locking bufs on the clean/dirty queues.
This reduces some cases from one BUF_LOCK with a LK_NOWAIT and another
BUF_LOCK with a LK_TIMEFAIL to a single lock.
Reviewed by: arch, mckusick
in massive locking issues on diskless systems.
It is also not clear that this sysctl is non-dangerous in its
requirements for locked down memory on large RAM systems.
call is in progress on the vnode. When vput() or vrele() sees a
1->0 reference count transition, it now return without any further
action if this flag is set. This flag is necessary to avoid recursion
into VOP_INACTIVE if the filesystem inactive routine causes the
reference count to increase and then drop back to zero. It is also
used to guarantee that an unlocked vnode will not be recycled while
blocked in VOP_INACTIVE().
There are at least two cases where the recursion can occur: one is
that the softupdates code called by ufs_inactive() via ffs_truncate()
can call vput() on the vnode. This has been reported by many people
as "lockmgr: draining against myself" panics. The other case is
that nfs_inactive() can call vget() and then vrele() on the vnode
to clean up a sillyrename file.
Reviewed by: mckusick (an older version of the patch)
to treat desiredvnodes much more like a limit than as a vague concept.
On a 2GB RAM machine where desired vnodes is 130k, we run out of
kmem_map space when we hit about 190k vnodes.
If we wake up the vnode washer in getnewvnode(), sleep until it is done,
so that it has a chance to offer us a washed vnode. If we don't sleep
here we'll just race ahead and allocate yet a vnode which will never
get freed.
In the vnodewasher, instead of doing 10 vnodes per mountpoint per
rotation, do 10% of the vnodes distributed evenly across the
mountpoints.
"refreshing" the label on the vnode before use, just get the label
right from inception. For single-label file systems, set the label
in the generic VFS getnewvnode() code; for multi-label file systems,
leave the labeling up to the file system. With UFS1/2, this means
reading the extended attribute during vfs_vget() as the inode is
pulled off disk, rather than hitting the extended attributes
frequently during operations later, improving performance. This
also corrects sematics for shared vnode locks, which were not
previously present in the system. This chances the cache
coherrency properties WRT out-of-band access to label data, but in
an acceptable form. With UFS1, there is a small race condition
during automatic extended attribute start -- this is not present
with UFS2, and occurs because EAs aren't available at vnode
inception. We'll introduce a work around for this shortly.
Approved by: re
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
check for and/or report I/O errors. The result is that a VFS_SYNC
or VOP_FSYNC called with MNT_WAIT could loop infinitely on ufs in
the presence of a hard error writing a disk sector or in a filesystem
full condition. This patch ensures that I/O errors will always be
checked and returned. This patch also ensures that every call to
VFS_SYNC or VOP_FSYNC with MNT_WAIT set checks for and takes
appropriate action when an error is returned.
Sponsored by: DARPA & NAI Labs.
be sure to exit the loop with vp == NULL if no candidates are found.
Formerly, this bug would cause the last vnode inspected to be used,
even if it was not available. The result was a panic "vn_finished_write:
neg cnt".
Sponsored by: DARPA & NAI Labs.
vclean() function (e.g., vp->v_vnlock = &vp->v_lock) rather
than requiring filesystems that use alternate locks to do so
in their vop_reclaim functions. This change is a further cleanup
of the vop_stdlock interface.
Submitted by: Poul-Henning Kamp <phk@critter.freebsd.dk>
Sponsored by: DARPA & NAI Labs.
that use it. Specifically, vop_stdlock uses the lock pointed to by
vp->v_vnlock. By default, getnewvnode sets up vp->v_vnlock to
reference vp->v_lock. Filesystems that wish to use the default
do not need to allocate a lock at the front of their node structure
(as some still did) or do a lockinit. They can simply start using
vn_lock/VOP_UNLOCK. Filesystems that wish to manage their own locks,
but still use the vop_stdlock functions (such as nullfs) can simply
replace vp->v_vnlock with a pointer to the lock that they wish to
have used for the vnode. Such filesystems are responsible for
setting the vp->v_vnlock back to the default in their vop_reclaim
routine (e.g., vp->v_vnlock = &vp->v_lock).
In theory, this set of changes cleans up the existing filesystem
lock interface and should have no function change to the existing
locking scheme.
Sponsored by: DARPA & NAI Labs.
vcanrecycle to check a free vnode's availability. If it is
available, vcanrecycle returns an error code of zero and the
vnode in question locked. The getnewvnode routine then used
to call vn_start_write with the V_NOWAIT flag. If the filesystem
was suspended while taking a snapshot, the vn_start_write would
fail but getnewvnode would fail to unlock the vnode, instead
leaving it locked on the freelist. The result would be that the
vnode would be locked forever and would eventually hang the
system with a race to the root when it was attempted to recycle
it. This fix moves the vn_start_write check into vcanrecycle
where it will properly unlock the vnode if it is unavailable
for recycling due to filesystem suspension.
Sponsored by: DARPA & NAI Labs.
interlock in getnewvnode() to avoid possible sleeps while holding
the mutex. Note that the warning from Witness is a slight false
positive since we know there will be no contention on the interlock
since we haven't made the vnode available for use yet, but the theory
is not a bad one.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, Network Associates Laboratories
- Make the VI asserts more orthogonal to the rest of the asserts by using a
new, common vfs_badlock() function and adding a 'str' arg.
- Adjust generated ASSERTS to match the new prototype.
- Adjust explicit ASSERTS to match the new prototype.
- Enable vfs_badlock_mutex by default.
- Assert that the vp is locked in VOP_UNLOCK.
- Use standard interlock macros in remaining code.
- Correct a race in getnewvnode().
- Lock access to v_numoutput with interlock.
- Lock access to buf lists and splay tree with interlock.
- Add VOP and VI asserts.
- Lock b_vnbufs with the vnode interlock.
- Add vrefcnt() for callers who want to retreive the vnode ref without
holding a lock. Add a comment that describes when this is safe.
- Add vholdl() and vdropl() so that callers who already own the interlock
can avoid race conditions and unnecessary unlocking.
- Move the VOP_GETATTR() in vflush() into the WRITECLOSE conditional case.
- Hold the interlock before droping the mntlist_mtx in vflush() to avoid
a race.
- Fix locking in vfs_msync().
v_tag is now const char * and should only be used for debugging.
Additionally:
1. All users of VT_NTS now check vfsconf->vf_type VFCF_NETWORK
2. The user of VT_PROCFS now checks for the new flag VV_PROCDEP, which
is propagated by pseudofs to all child vnodes if the fs sets PFS_PROCDEP.
Suggested by: phk
Reviewed by: bde, rwatson (earlier version)
LK_INTERLOCK. The interlock will never be held on return from these
functions even when there is an error. Errors typically only occur when
the XLOCK is held which means this isn't the vnode we want anyway. Almost
all users of these interfaces expected this behavior even though it was
not provided before.
with interlock held in error conditions when the caller did not specify
LK_INTERLOCK.
- Add several comments to vn_lock() describing the rational behind the code
flow since it was not immediately obvious.
released. vcanrecycle() failed to unlock interlock under this condition.
- Remove an extra VOP_UNLOCK from a failure case in vcanrecycle().
Pointed out by: rwatson
- Use the new VI asserts in place of the old mtx_assert checks.
- Add the VI asserts to the automated lock checking in the VOP calls. The
interlock should not be held across vops with a few exceptions.
- Add the vop_(un)lock_{pre,post} functions to assert that interlock is held
when LK_INTERLOCK is set.
- Make getvfsbyname() take a struct xvfsconf *.
- Convert several consumers of getvfsbyname() to use struct xvfsconf.
- Correct the getvfsbyname.3 manpage.
- Create a new vfs.conflist sysctl to dump all the struct xvfsconf in the
kernel, and rewrite getvfsbyname() to use this instead of the weird
existing API.
- Convert some {set,get,end}vfsent() consumers to use the new vfs.conflist
sysctl.
- Convert a vfsload() call in nfsiod.c to kldload() and remove the useless
vfsisloadable() and endvfsent() calls.
- Add a warning printf() in vfs_sysctl() to tell people they are using
an old userland.
After these changes, it's possible to modify struct vfsconf without
breaking the binary compatibility. Please note that these changes don't
break this compatibility either.
When bp will have updated mount_smbfs(8) with the patch I sent him, there
will be no more consumers of the {set,get,end}vfsent(), vfsisloadable()
and vfsload() API, and I will promptly delete it.
- v_vflag is protected by the vnode lock and is used when synchronization
with VOP calls is needed.
- v_iflag is protected by interlock and is used for dealing with vnode
management issues. These flags include X/O LOCK, FREE, DOOMED, etc.
- All accesses to v_iflag and v_vflag have either been locked or marked with
mp_fixme's.
- Many ASSERT_VOP_LOCKED calls have been added where the locking was not
clear.
- Many functions in vfs_subr.c were restructured to provide for stronger
locking.
Idea stolen from: BSD/OS
kernel access control.
Invoke the necessary MAC entry points to maintain labels on vnodes.
In particular, initialize the label when the vnode is allocated or
reused, and destroy the label when the vnode is going to be released,
or reused. Wow, an object where there really is exactly one place
where it's allocated, and one other where it's freed. Amazing.
Obtained from: TrustedBSD Project
Sponsored by: DARPA, NAI Labs
SYSCTL_OUT() from blocking while locks are held. This should
only be done when it would be inconvenient to make a temporary copy of
the data and defer calling SYSCTL_OUT() until after the locks are
released.
As this code is not actually used by any of the existing
interfaces, it seems unlikely to break anything (famous
last words).
The internal kernel interface to manipulate these attributes
is invoked using two new IO_ flags: IO_NORMAL and IO_EXT.
These flags may be specified in the ioflags word of VOP_READ,
VOP_WRITE, and VOP_TRUNCATE. Specifying IO_NORMAL means that
you want to do I/O to the normal data part of the file and
IO_EXT means that you want to do I/O to the extended attributes
part of the file. IO_NORMAL and IO_EXT are mutually exclusive
for VOP_READ and VOP_WRITE, but may be specified individually
or together in the case of VOP_TRUNCATE. For example, when
removing a file, VOP_TRUNCATE is called with both IO_NORMAL
and IO_EXT set. For backward compatibility, if neither IO_NORMAL
nor IO_EXT is set, then IO_NORMAL is assumed.
Note that the BA_ and IO_ flags have been `merged' so that they
may both be used in the same flags word. This merger is possible
by assigning the IO_ flags to the low sixteen bits and the BA_
flags the high sixteen bits. This works because the high sixteen
bits of the IO_ word is reserved for read-ahead and help with
write clustering so will never be used for flags. This merge
lets us get away from code of the form:
if (ioflags & IO_SYNC)
flags |= BA_SYNC;
For the future, I have considered adding a new field to the
vattr structure, va_extsize. This addition could then be
exported through the stat structure to allow applications to
find out the size of the extended attribute storage and also
would provide a more standard interface for truncating them
(via VOP_SETATTR rather than VOP_TRUNCATE).
I am also contemplating adding a pathconf parameter (for
concreteness, lets call it _PC_MAX_EXTSIZE) which would
let an application determine the maximum size of the extended
atribute storage.
Sponsored by: DARPA & NAI Labs.
support creation times such as UFS2) to the value of the
modification time if the value of the modification time is older
than the current creation time. See utimes(2) for further details.
Sponsored by: DARPA & NAI Labs.
methodology similar to the vm_map_entry splay and the VM splay that Alan
Cox is working on. Extensive testing has appeared to have shown no
increase in overhead.
Disadvantages
Dirties more cache lines during lookups.
Not as fast as a hash table lookup (but still N log N and optimal
when there is locality of reference).
Advantages
vnode->v_dirtyblkhd is now perfectly sorted, making fsync/sync/filesystem
syncer operate more efficiently.
I get to rip out all the old hacks (some of which were mine) that tried
to keep the v_dirtyblkhd tailq sorted.
The per-vnode splay tree should be easier to lock / SMPng pushdown on
vnodes will be easier.
This commit along with another that Alan is working on for the VM page
global hash table will allow me to implement ranged fsync(), optimize
server-side nfs commit rpcs, and implement partial syncs by the
filesystem syncer (aka filesystem syncer would detect that someone is
trying to get the vnode lock, remembers its place, and skip to the
next vnode).
Note that the buffer cache splay is somewhat more complex then other splays
due to special handling of background bitmap writes (multiple buffers with
the same lblkno in the same vnode), and B_INVAL discontinuities between the
old hash table and the existence of the buffer on the v_cleanblkhd list.
Suggested by: alc
Tell vop_strategy_pre() to use this instead.
- Ignore B_CLUSTER bufs. Their components are locked but they don't really
exist so they don't have to be. This isn't ideal but it is safe.
- Cache a pointer to the vnode's object in the buf.
- Hold a reference to that object in addition to the vnode's reference just
to be consistent.
- Cleanup code that got the object indirectly through the vp and VOP calls.
This fixes at least one case where we were calling GETVOBJECT without a lock.
It also avoids an expensive layered call at the cost of another pointer in
struct buf.
- Disable original vop_strategy lock specification.
- Switch to the new vop_strategy_pre for lock validation.
VOP_STRATEGY requires only that the buf is locked UNLESS the block numbers need
to be translated. There may be other reasons, but as long as the underlying
layer uses a VOP to perform the operations they will be caught later.
The file vfs_conf.c which was dealing with root mounting has
been repo-copied into vfs_mount.c to preserve history.
This makes nmount related development easier, and help reducing
the size of vfs_syscalls.c, which is still an enormous file.
Reviewed by: rwatson
Repo-copy by: peter
direct calls for the two places where the kernel calls into soft
updates code. Set up the hooks in softdep_initialize() and NULL
them out in softdep_uninitialize(). This change allows soft updates
to function correctly when ufs is loaded as a module.
Reviewed by: mckusick
- Add vfs_badlock_print to control whether or not we print lock violations
- Add vfs_badlock_panic to control whether we panic on lock violations
Both default to on to mimic the original behavior if DEBUG_VFS_LOCKS is on.
a linked list. This is to allow the merging of the mount
options in the MNT_UPDATE case, as the current data structure
is unsuitable for this.
There are no functional differences in this commit.
Reviewed by: phk
Don't try to create a vm object before the file system has a chance to finish
initializing it. This is incorrect for a number of reasons. Firstly, that
VOP requires a lock which the file system may not have initialized yet. Also,
open and others will create a vm object if it is necessary later.
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.
Tested on: i386, alpha, sparc64
general cleanup of the API. The entire API now consists of two functions
similar to the pre-KSE API. The suser() function takes a thread pointer
as its only argument. The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0. The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.
Discussed on: smp@
new vfs_getopt()/vfs_copyopt() API. This is intended to be used
later, when there will be filesystems implementing the VFS_NMOUNT
operation. The mount(2) system call will disappear when all
filesystems will be converted to the new API. Documentation will
be committed in a while.
Reviewed by: phk
The use of the zone allocator may or may not be overkill.
There is an XXX: over in ufs/ufs/ufs_vnops.c that jlemon may need
to revisit.
This shaves about 60 bytes of struct vnode which on my laptop means
600k less RAM used for vnodes.
this is a low-functionality change that changes the kernel to access the main
thread of a process via the linked list of threads rather than
assuming that it is embedded in the process. It IS still embeded there
but remove all teh code that assumes that in preparation for the next commit
which will actually move it out.
Reviewed by: peter@freebsd.org, gallatin@cs.duke.edu, benno rice,
call VOP_INACTIVE before placing the vnode back on the free list.
Otherwise there is a race condition on SMP machines between
getnewvnode() locking the vnode to reclaim it and vrele()
locking the vnode to inactivate it. This window of vulnerability
becomes exaggerated in the presence of filesystems that have
been suspended as the inactive routine may need to temporarily
release the lock on the vnode to avoid deadlock with the syncer
process.
operation. The vgonel() code has always called vclean() but until we
started proactively freeing vnodes it would never actually be called with
a dirty vnode, so this situation did not occur prior to the vnlru() code.
Now that we proactively free vnodes when kern.maxvnodes is hit, however,
vclean() winds up with work to do and improperly generates the warnings.
Reviewed by: peter
Approved by: re (for MFC)
MFC after: 1 day
involving file removal or file update were not always being fully
committed to disk. The result was lost files or corrupted file data.
This change ensures that the filesystem is properly synced to disk
before the filesystem is down-graded.
This delta also fixes a long standing bug in which a file open for
reading has been unlinked. When the last open reference to the file
is closed, the inode is reclaimed by the filesystem. Previously,
if the filesystem had been down-graded to read-only, the inode could
not be reclaimed, and thus was lost and had to be later recovered
by fsck. With this change, such files are found at the time of the
down-grade. Normally they will result in the filesystem down-grade
failing with `device busy'. If a forcible down-grade is done, then
the affected files will be revoked causing the inode to be released
and the open file descriptors to begin failing on attempts to read.
Submitted by: "Sam Leffler" <sam@errno.com>
We calculate a trigger point that both guarentees we will find a
sufficient number of vnodes to recycle and prevents us from recycling
vnodes with lots of resident pages. This particular section of
code is designed to recycle vnodes, not do unnecessary frees of
cached VM pages.
against VM_WAIT in the pageout code. Both fixes involve adjusting
the lockmgr's timeout capability so locks obtained with timeouts do not
interfere with locks obtained without a timeout.
Hopefully MFC: before the 4.5 release
the shutdown request at reboot/halt time.
Disable the printf 'vnlru process getting nowhere, pausing...' and instead
export the count to the debug.vnlru_nowhere sysctl.
by me to make it more efficient. The original code had serious balancing
problems and could also deadlock easily. This code relegates the vnode
reclamation to its own kproc and relaxes the vnode reclamation requirements
to better maintain kern.maxvnodes. This code still doesn't balance as well
as it could, but it does a much better job then the original code.
Approved by: re@freebsd.org
Obtained from: ps, peter, dillon
MFS Assuming: Assuming no problems crop up in Yahoo testing
MFC after: 7 days
structure changes now rather then piecemeal later on. mnt_nvnodelist
currently holds all the vnodes under the mount point. This will eventually
be split into a 'dirty' and 'clean' list. This way we only break kld's once
rather then twice. nvnodelist will eventually turn into the dirty list
and should remain compatible with the klds.
o POSIX.1e capabilities authorize overriding of VEXEC for VDIR based
on CAP_DAC_READ_SEARCH, but of !VDIR based on CAP_DAC_EXECUTE. Add
appropriate conditionals to vaccess() to take that into account.
o Synchronization cap_check_xxx() -> cap_check() change.
Obtained from: TrustedBSD Project
real effect.
Optimize vfs_msync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. Improves looping case by 500%.
Optimize ffs_sync(). Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list. This makes a couple of assumptions,
which I believe are ok, in regards to vnode stability when the mount list
mutex is held. Improves looping case by 500%.
(more optimization work is needed on top of these fixes)
MFC after: 1 week
wait for both read AND write I/O to complete. Only NFS calls vinvalbuf()
on an active vnode (when the server indicates that the file is stale), so
this bug fix only effects NFS clients.
MFC after: 3 days
to avoid removing higher level directory vnodes from the namecache has
no perceivable effect and will be removed. This is especially true
when vmiodirenable is turned on, which it is by default now. ( vmiodirenable
makes a huge difference in directory caching ). The vfs.vmiodirenable and
vfs.nameileafonly sysctls have been left in to allow further testing, but
I expect to rip out vfs.nameileafonly soon too.
I have also determined through testing that the real problem with numvnodes
getting too large is due to the VM Page cache preventing the vnode from
being reclaimed. The directory stuff made only a tiny dent relative
to Poul's original code, enough so that some tests succeeded. But tests
with several million small files show that the bigger problem is the VM Page
cache. This will have to be addressed by a future commit.
MFC after: 3 days
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.
Sorry john! (your next MFC will be a doosie!)
Reviewed by: peter@freebsd.org, dillon@freebsd.org
X-MFC after: ha ha ha ha
file. ffs will discard any pending dirty pages when it is closed,
so we may as well not waste time trying to clean them. This doesn't
stop other things from writing it out, eg: pageout, fsync(2) etc.
VM caching of disks through mmap() and stopping syncing of open files
that had their last reference in the fs removed (ie: their unsync'ed
pages get discarded on close already, so I made it stop syncing too).
(this commit is just the first stage). Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.
vm_mtx does not recurse and is required for most low level
vm operations.
faults can not be taken without holding Giant.
Memory subsystems can now call the base page allocators safely.
Almost all atomic ops were removed as they are covered under the
vm mutex.
Alpha and ia64 now need to catch up to i386's trap handlers.
FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).
Reviewed (partially) by: jake, jhb
the number of references on the filesystem root vnode to be both
expected and released. Many filesystems hold an extra reference on
the filesystem root vnode, which must be accounted for when
determining if the filesystem is busy and then released if it isn't
busy. The old `skipvp' approach required individual filesystem
xxx_unmount functions to re-implement much of vflush()'s logic to
deal with the root vnode.
All 9 filesystems that hold an extra reference on the root vnode
got the logic wrong in the case of forced unmounts, so `umount -f'
would always fail if there were any extra root vnode references.
Fix this issue centrally in vflush(), now that we can.
This commit also fixes a vnode reference leak in devfs, which could
result in idle devfs filesystems that refuse to unmount.
Reviewed by: phk, bp
KASSERT when vp->v_usecount is zero or negative. In this case, the
"v*: negative ref cnt" panic that follows is much more appropriate.
Reviewed by: mckusick
to struct mount.
This makes the "struct netexport *" paramter to the vfs_export
and vfs_checkexport interface unneeded.
Consequently that all non-stacking filesystems can use
vfs_stdcheckexp().
At the same time, make it a pointer to a struct netexport
in struct mount, so that we can remove the bogus AF_MAX
and #include <net/radix.h> from <sys/mount.h>
available.
Only directory vnodes holding no child directory vnodes held in
v_cache_src are recycled, so that directory vnodes near the root of
the filesystem hierarchy remain in namecache and directory vnodes are
not reclaimed in cascade.
The period of vnode reclaiming attempt and the number of vnodes
attempted to reclaim can be tuned via sysctl(2).
Suggested by: tegge
Approved by: phk
VOP_BWRITE() was a hack which made it possible for NFS client
side to use struct buf with non-bio backing.
This patch takes a more general approach and adds a bp->b_op
vector where more methods can be added.
The success of this patch depends on bp->b_op being initialized
all relevant places for some value of "relevant" which is not
easy to determine. For now the buffers have grown a b_magic
element which will make such issues a tiny bit easier to debug.
actually in the kernel. This structure is a different size than
what is currently in -CURRENT, but should hopefully be the last time
any application breakage is caused there. As soon as any major
inconveniences are removed, the definition of the in-kernel struct
ucred should be conditionalized upon defined(_KERNEL).
This also changes struct export_args to remove dependency on the
constantly-changing struct ucred, as well as limiting the bounds
of the size fields to the correct size. This means: a) mountd and
friends won't break all the time, b) mountd and friends won't crash
the kernel all the time if they don't know what they're doing wrt
actual struct export_args layout.
Reviewed by: bde
mtx_enter(lock, type) becomes:
mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)
similarily, for releasing a lock, we now have:
mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.
The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.
Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:
MTX_QUIET and MTX_NOSWITCH
The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:
mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.
Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.
Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.
Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.
Finally, caught up to the interface changes in all sys code.
Contributors: jake, jhb, jasone (in no particular order)
calls returning EACCES instead of EPERM. This patch modifies vaccess()
to return EPERM instead of EACCES if VADMIN is among the requested
rights. This affects functions normally limited to the owners of
a file, such as chmod(), as EPERM is the error indicating that
privilege would allow the operation, rather than a chance in mandatory
or discretionary rights.
Reported by: bde