when vrele() acquires the directory lock in the wrong order. Fix this
via the following changes:
- Keep the directory locked after VOP_LOOKUP() until we've determined
what we're going to do with the child. This allows us to remove the
complicated post LOOKUP code which determines whether we should lock or
unlock the parent. This means we may have to vput() in the appropriate
cases later, rather than doing an unsafe vrele.
- In NDFREE() keep two flags to indicate whether we need to unlock vp or
dvp. This allows us to vput rather than vrele in the appropriate
cases without rechecking the flags. Move the code to handle dvp after
we handle vp.
- Remove some dead code from namei() that was the result of changes to
VFS_LOCK_GIANT().
Sponsored by: Isilon Systems, Inc.
to root cause on exactly how this happens.
- If the assert is disabled, we presently try to handle this case, but the
BUF_UNLOCK was missing. Thus, if this condition were ever hit we would leak
a buf lock.
Many thanks to Peter Holm for all his help in finding this bug. He really
put more effort into it than I did.
critical_enter() and critical_exit() are now solely a mechanism for
deferring kernel preemptions. They no longer have any effect on
interrupts. This means that standalone critical sections are now very
cheap as they are simply unlocked integer increments and decrements for the
common case.
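
As a rough illustration of why such sections are cheap, here is a minimal
userland model of a nesting counter that defers preemption. The struct and
the model_* names are hypothetical and are not the kernel's definitions; in
particular, the "owepreempt" flag is an assumption about how a deferred
preemption could be remembered.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-thread state; not the kernel's struct thread. */
struct model_thread {
	int  critnest;     /* nesting depth of critical sections */
	bool owepreempt;   /* a preemption was requested while nested */
	int  preemptions;  /* preemptions that have actually been taken */
};

/* Entering is a plain increment: no lock and no interrupt masking. */
static void
model_critical_enter(struct model_thread *td)
{
	td->critnest++;
}

/* A preemption request is deferred while inside a critical section. */
static void
model_request_preempt(struct model_thread *td)
{
	if (td->critnest > 0)
		td->owepreempt = true;
	else
		td->preemptions++;
}

/* Leaving the outermost section takes any preemption that was deferred. */
static void
model_critical_exit(struct model_thread *td)
{
	assert(td->critnest > 0);
	td->critnest--;
	if (td->critnest == 0 && td->owepreempt) {
		td->owepreempt = false;
		td->preemptions++;
	}
}
```

In this model the common case (no preemption requested) costs one increment
and one decrement, and nested sections only act when the outermost one exits.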
Spin mutexes now use a separate KPI implemented in MD code: spinlock_enter()
and spinlock_exit(). This KPI is responsible for providing whatever MD
guarantees are needed to ensure that a thread holding a spin lock won't
be preempted by any other code that will try to lock the same lock. For
now all archs continue to block interrupts in a "spinlock section" as they
did formerly in all critical sections. Note that I've also taken this
opportunity to push a few things into MD code rather than MI. For example,
critical_fork_exit() no longer exists. Instead, MD code ensures that new
threads have the correct state when they are created. Also, we no longer
try to fix up the idlethreads for APs in MI code. Instead, each arch sets
the initial curthread and adjusts the state of the idle thread it borrows
in order to perform the initial context switch.
This change is largely a big NOP, but the cleaner separation it provides
will allow for more efficient alternative locking schemes in other parts
of the kernel (bare critical sections rather than per-CPU spin mutexes
for per-CPU data for example).
Reviewed by: grehan, cognet, arch@, others
Tested on: i386, alpha, sparc64, powerpc, arm, possibly more
instances in a given devclass. This is useful for systems that want to
call code in driver static methods, similar to device_identify().
Reviewed by: dfr
MFC after: 2 weeks
one to become available for one second and then return ENFILE. We
can run out of vnodes, and there must be a hard limit because without
one we can quickly run out of KVA on x86. Presently the system can
deadlock if there are maxvnodes directories in the namecache. The
original 4.x BSD behavior was to return ENFILE if we reached the max,
but 4.x BSD did not have the vnlru proc so it was less profitable to
wait.
in a devclass. All the other uses of maxunit are correct and this one was
safe since it checks the return value of devclass_get_device(), which would
always say that the highest unit device doesn't exist.
Reviewed by: dfr
MFC after: 3 days
generate dirty bufs even with a locked vnode, 100 retries is not that
many. This should probably change from a retry count to an abort when
we are no longer cleaning any buffers.
- Don't call vprint() while we still hold the vnode locked. Move the call
to later in the function.
- Clean up a comment.
the type of object represented by the handle argument.
- Allow vm_mmap() to map device memory via cdev objects in addition to
vnodes and anonymous memory. Note that mmapping a cdev directly does not
currently perform any MAC checks like mapping a vnode does.
- Unbreak the DRM getbufs ioctl by having it call vm_mmap() directly on the
cdev the ioctl is acting on rather than trying to find a suitable vnode
to map from.
Reviewed by: alc, arch@
on filesystems which safely support them. It appears that many
network filesystems specifically are not shared lock safe.
Sponsored by: Isilon Systems, Inc.
since simply unlocking a mutex does not ensure that one of the waiters
will run and acquire it. We're more likely to reacquire the mutex
before anyone else has a chance. It has also bitten me three times now, as
it's not safe to drop the interlock before sleeping in many cases.
Sponsored by: Isilon Systems, Inc.
vnodes whose names it caches, so we no longer need a `generation
number' to tell us if a referenced vnode is invalid. Replace the use
of the parent's v_id in the hash function with the address of the
parent vnode.
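
A sketch of what hashing on the parent's address could look like. This is an
illustration only: the FNV-1a helper, the struct, and the model_* names are
assumptions made for the example, not the actual vfs_cache code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a vnode; only its address matters here. */
struct model_vnode {
	int unused;
};

/* 32-bit FNV-1a over a byte buffer, continuing from a previous hash value. */
static uint32_t
model_fnv1a(const void *buf, size_t len, uint32_t hash)
{
	const unsigned char *p = buf;

	for (size_t i = 0; i < len; i++) {
		hash ^= p[i];
		hash *= 16777619u;
	}
	return (hash);
}

/*
 * Hash the component name, then mix in the bytes of the parent vnode
 * pointer itself (the address) rather than a generation number.
 */
static uint32_t
model_nchash(const struct model_vnode *dvp, const char *name)
{
	uint32_t hash;

	assert(name != NULL);
	hash = model_fnv1a(name, strlen(name), 2166136261u);
	return (model_fnv1a(&dvp, sizeof(dvp), hash));
}
```

The address is only a safe hash input because the cache holds references to
the vnodes it names, so an entry's parent cannot be recycled underneath it.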
Tested by: Peter Holm
Glanced at by: jeff, phk
except for places where people forget to update one of them. We now
collect only one set of stats for both of these routines. Other
changes in this commit include:
- Start acquiring Giant again in vn_fullpath(), since it is required
when crossing a mount point.
- Expand the scope of the cache lock to avoid dropping it and
picking it up again for every pathname component. This also
makes it trivial to avoid races in stats collection.
- Assert that nc_dvp == v_dd for directories instead of returning
an error to userland when this is not true. AFAIK, it should
always be true when v_dd is non-null.
- For vn_fullpath(), handle the first (non-directory) vnode
separately.
Glanced at by: jeff, phk
to cache_lookup(). This allows us to acquire the vnode interlock before
dropping the cache lock. This protects the vnode's identity until we
have locked it.
Sponsored by: Isilon Systems, Inc.
Don't remove the now unused element from cdev yet, wait until
we have a better reason to bump the version.
There is now no longer any upper limit on how many device drivers
a FreeBSD kernel can have.
acquire shared locks on intermediate directories.
- For the LASTCN, we may have to LK_UPGRADE the parent directory before
we look up the last component.
- Acquire VFS_ROOT and dp locks based on the cn_lkflag.
Sponsored by: Isilon Systems, Inc.
vhold()s us.
- Avoid an extra mutex acquire and release in the common case of vgonel()
by checking for OWEINACT at the start of the function.
- Fix the case where we set OWEINACT in vput(). LK_EXCLUPGRADE drops our
shared lock if it fails.
Sponsored by: Isilon Systems, Inc.
- Assert that REMOVE, CREATE, and RENAME callers have WANTPARENT
or LOCKPARENT set. You can't complete any of these operations without
at least a reference to the parent. Many filesystems check for this case
even though it isn't possible in the current system.
calling VOP_LOOKUP(). Rather than having each filesystem check the
LOCKPARENT flag, we simply check it once here and unlock as required.
The only unusual case is ISDOTDOT, where we require an unlocked vnode
on return. Relocking this vnode with the child locked is allowed since
the child is actually its parent.
- Add a few asserts for some unusual conditions that I do not believe can
happen. These will later go away and turn into implementations for these
conditions.
Sponsored by: Isilon Systems, Inc.
now always allocates a new vnode.
- Define a new function, vnlru_free, which frees vnodes from the free list.
It takes as a parameter the number of vnodes to free, which is
wantfreevnodes - freevnodes when called from vnlru_proc or 1 when
called from getnewvnode(). For now, getnewvnode() still tries to reclaim
a free vnode before creating a new one when we are near the limit.
- Define a function, vdestroy, which handles the actual release of memory
and teardown of locks, etc. This could become a uma_dtor() routine.
- Get rid of minvnodes. Now wantfreevnodes is 1/4th the max vnodes. This
keeps more unreferenced vnodes around so that files which have only
been stat'd are less likely to be kicked out of the system before we
have a chance to read them, etc. These vnodes may still be freed via
the normal vnlru_proc() routines which may some day become a real lru.
actual root file system is mounted, the first entry on the mountlist
is not the root file system and the timestamp for that entry is
typically 0. Passing that to inittodr() caused annoying errors on
alpha and ia64.
So, call inittodr() for all file systems on mountlist, but only when
the timestamp (mnt_time) is non-zero.
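
The resulting loop can be modeled roughly as follows. The list type and the
model_* names are hypothetical stand-ins for the mountlist and inittodr(),
written only to show the non-zero check.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical, simplified mount entry; not the kernel's struct mount. */
struct model_mount {
	time_t              mnt_time;  /* timestamp, 0 if none is recorded */
	struct model_mount *next;
};

static int    model_inittodr_calls;  /* how often the clock was seeded */
static time_t model_last_base;       /* last timestamp that was passed */

/* Stand-in for inittodr(): just records what it was given. */
static void
model_inittodr(time_t base)
{
	assert(base != 0);	/* the caller filters out zero timestamps */
	model_inittodr_calls++;
	model_last_base = base;
}

/* Walk every mounted file system, skipping entries without a timestamp. */
static void
model_init_clock_from_mounts(const struct model_mount *list)
{
	const struct model_mount *mp;

	for (mp = list; mp != NULL; mp = mp->next)
		if (mp->mnt_time != 0)
			model_inittodr(mp->mnt_time);
}
```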
lockmgr locks that this thread owns. This is complicated due to
LK_KERNPROC and because lockmgr tolerates unlocking an unlocked lock.
Sponsored by: Isilon Systems, Inc.
before it can call VOP_INACTIVE(). This must use the EXCLUPGRADE path
because we may violate some lock order with another locked vnode if
we drop and reacquire the lock. If EXCLUPGRADE fails, we mark the
vnode with VI_OWEINACT. This case should be very rare.
- Clear VI_OWEINACT in vinactive() and vbusy().
- If VI_OWEINACT is set in vgone() do the VOP_INACTIVE call here as well.
Sponsored by: Isilon Systems, Inc.
necessary since we disable the shared locks in vfs_cache, but it is
preferred that the option not leak out into filesystems when it is
disabled.
Sponsored by: Isilon Systems, Inc.
config option have now been fixed. All filesystems are properly locked
and checked via DEBUG_VFS_LOCKS. Remove the workaround code.
Sponsored by: Isilon Systems, Inc.
last in the list rather than first.
This makes the resources print in the 4.x order rather than the 5.x order
(e.g. fdc0 at 0x3f0-0x3f5,0x3f7 is 4.x, but 0x3f7,0x3f0-0x3f5 is 5.x). This
also means that the pci code will once again print the resources in BAR
ascending order.
w/o problems than I was before... This simply brings back the knote_delete
as knlist_delete which will also drop the knotes, instead of just clearing
the list and seeing _ONESHOT...
Fix a race where if a note was _INFLUX and _DETACHED, it could end up being
modified... whoopse..
MFC after: 1 week
Prodded by: ambrisko and dwhite
alignment restrictive, and help performance on some ethernet cards which
currently copy the entire packet a couple bytes to get the packet aligned
properly...
Wordsmithing by: dwhite
Obtained from: NetBSD (code only)
I'll clean it up later: rwatson
in the window between the beginning of panic() and entering the debugger,
it's possible to receive interrupts. If we receive an interrupt, don't
preempt if panicstr != NULL, as the system is in the process of failing, and
the preempting thread is likely to stumble over the failure. The typical
scenario is during the printf() in panic() prior to entering the debugger,
particularly when running with a slower console type such as serial console.
It could be that the panic string should be passed to the debugger to print,
so that it can run from the debugger's environment rather than a regular
kernel printf.
Glanced at by: jhb
session in tprintf(). SESSRELE() needs to properly dispose of the
session's mutex.
Add sessrele() which does the proper cleanup and have SESSRELE() call it.
Use SESSRELE also in pgdelete().
Found by: Coverity (ID:526)
it to get better hashing in vfs_hash.
In case of an insert collision in vfs_hash_insert(), put the losing vnode
on a special list so that vfs_hash_remove() can just assume that it is on
a list.
Drop the VI_HASHED flag.
instead of failing.
When looking for a region to allocate, we used to check to see if the
start address was < end. In the case where A..B is allocated already,
and one wants to allocate A..C (B < C), then this test would
improperly fail (which means we'd examine that region as a possible
one), and we'd return the region B+1..C+(B-A+1) rather than NULL.
Since C+(B-A+1) is necessarily larger than C (end argument), this is
incorrect behavior for rman_reserve_resource_bound().
The fix is to exclude those regions where r->r_start + count - 1 > end
rather than r->r_start > end. This bug has been in this code for a
very long time. I believe that all other tests against end are
correctly done.
This is why sio0 generated a message about interrupts not being
enabled properly for the device. When fdc had a bug that allocated
from 0x3f7 to 0x3fb, sio0 was then given 0x3fc-0x404 rather than the
0x3f8-0x3ff that it wanted. Now when fdc has the same bug, sio0 fails
to allocate its ports, which is the proper behavior. Since the probe
failed, we never saw the messed up resources reported.
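
The difference between the two tests can be seen in a small model using the
sio0 numbers above. The struct and function names are hypothetical; only the
two comparisons are taken from the description of the fix.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified candidate region; not the real struct resource. */
struct model_region {
	unsigned long r_start;
};

/* Old test: a region is skipped only if it starts beyond 'end'. */
static bool
old_out_of_range(const struct model_region *r, unsigned long count,
    unsigned long end)
{
	(void)count;	/* the old test ignored the requested size */
	return (r->r_start > end);
}

/* New test: skip the region if 'count' units starting there overrun 'end'. */
static bool
new_out_of_range(const struct model_region *r, unsigned long count,
    unsigned long end)
{
	assert(count > 0);
	return (r->r_start + count - 1 > end);
}
```

With a request for 8 ports ending no later than 0x3ff, a candidate starting
at 0x3fc passes the old test and is considered, while the new test rejects
it; a candidate starting at 0x3f8 is still accepted by the new test.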
I suspect that there are other places in the tree that have weird
looping or other odd workarounds to try to cope with the observed
weirdness this bug can introduce. These workarounds should be located
and eliminated.
Minor debug write fix to match the above test done as well.
'nice' by: mdodd
Sponsored by: timing solutions (http://www.timing.com/)