Commit Graph

1441 Commits

Author SHA1 Message Date
Ken Smith
39fac37953 Fix panic() message to give the right function name. 2006-04-17 07:43:56 +00:00
Tor Egge
68e8466655 Eliminate softdep_flush() livelock by accounting for number of worklist items
marked as being in progress.
2006-04-03 22:23:23 +00:00
Jeff Roberson
3bbd6d8ae6 - Release the references acquired by VOP_GETWRITEMOUNT and vfs_getvfs().
Discussed with:	tegge
Tested by:	kris
Sponsored by:	Isilon Systems, Inc.
2006-03-31 03:54:20 +00:00
Tor Egge
700118c72f Allow compilation when not using softupdates. 2006-03-19 22:16:44 +00:00
Tor Egge
7de3839d0d Let snapshots make a copy of old contents for all buffers taking part in a
cluster instead of just the first buffer.

Delay buf_start() calls until snapshots have a copy of old content.

PR:		kern/93942
2006-03-19 21:43:36 +00:00
Tor Egge
30b3a49fab Add kludge to avoid deadlock when unlinking snapshot. 2006-03-19 21:29:20 +00:00
Tor Egge
95e7a3c3ac Reduce probability of unmount failing after having unmounted snapshots. 2006-03-19 21:09:19 +00:00
Tor Egge
8c86028f11 Ensure that vnode for directory isn't reclaimed before ffs_snapshot() has
completed expunging unlinked files.  It could come back at another memory
location causing a lock order reversal.
2006-03-19 21:05:10 +00:00
Jeff Roberson
8db357205c - Remove the call to softdep_waitidle after suspending the filesystem.
This does not do what I wanted as all dirty buffers must be flushed
   by the call to ffs_sync and any remaining dependency work would mean
   that this failed.

Pointed out by: tegge
2006-03-12 05:26:12 +00:00
Jeff Roberson
2eedeb7e60 - Remove the call to softdep_waitidle after suspending the filesystem.
This does not do what I wanted as all dirty buffers must be flushed
   by the call to ffs_sync and any remaining dependency work would mean
   that this failed.

Pointed out by:	tegge
2006-03-12 05:24:14 +00:00
Tor Egge
ca2fa80767 Block secondary writes while expunging active unlinked files.
Fix detection of active unlinked files by checking VI_OWEINACT and
VI_DOINGINACT in addition to v_usecount.

Defer inactive handling for unlinked files if the file system is mostly
suspended (secondary writes being blocked).

Perform deferred inactive handling after the file system is resumed.
2006-03-11 01:08:37 +00:00
Tor Egge
1e70cd7fc7 Remove unneeded (and broken) usage of MNT_REF()/MNT_REL(). 2006-03-10 02:31:12 +00:00
Tor Egge
791dd2fade Use vn_start_secondary_write() and vn_finished_secondary_write() as a
replacement for vn_write_suspend_wait() to better account for secondary write
processing.

Close race where secondary writes could be started after ffs_sync() returned
but before the file system was marked as suspended.

Detect if secondary writes or softdep processing occurred during vnode sync
loop in ffs_sync() and retry the loop if needed.
2006-03-08 23:43:39 +00:00
Tor Egge
a695d54404 Don't set IN_CHANGE and IN_UPDATE on inodes for potentially suspended
file systems.  This could cause deadlocks when creating snapshots.

Reviewed by:	jeff
2006-03-08 02:14:39 +00:00
Tor Egge
3b582b4e72 Eliminate a deadlock when creating snapshots. Blocking vn_start_write() must
be called without any vnode locks held.  Remove calls to vn_start_write() and
vn_finished_write() in vnode_pager_putpages() and add these calls before the
vnode lock is obtained to most of the callers that don't already have them.
2006-03-02 22:13:28 +00:00
Jeff Roberson
b9b12498fd - Acquire lk in softdep_slowdown so that it's owned when we call
softdep_speedup().
 - Assert that lk is held in softdep_speedup() rather than acquiring it.
   This avoids a potential lock recursion.
2006-03-02 08:52:53 +00:00
Jeff Roberson
eb2ea10590 - Move softdep from using a global worklist to per-mount worklists. This
has many positive effects including improved smp locking, reducing
   interdependencies between mounts that can lead to deadlocks, etc.
 - Add the softdep worklist and various counters to the ufsmnt structure.
 - Add a mount pointer to the workitem and remove mount pointers from the
   various structures derived from the workitem as they are now redundant.
 - Remove the poor-man's semaphore protecting softdep_process_worklist and
   softdep_flushworklist.  Several threads may now process the list
   simultaneously.
 - Add softdep_waitidle() to block the thread until all pending
   dependencies being operated on by other threads have been flushed.
 - Use softdep_waitidle() in unmount and snapshots to block either
   operation until the fs is stable.
 - Remove softdep worklist processing from the syncer and move it into the
   softdep_flush() thread.  This thread processes all softdep mounts
   once each second and when it is called via the new softdep_speedup()
   when there is a resource shortage.  This removes the softdep hook
   from the kernel and various hacks in header files to support it.

Reviewed by/Discussed with:	tegge, truckman, mckusick
Tested by:	kris
2006-03-02 05:50:23 +00:00
Jeff Roberson
f5a4db791d - Using LK_NOWAIT in qsync() can get us into infinite loop situations that
lead to deadlocks.  Remove it.

MFC After:	1 week
2006-02-22 06:12:53 +00:00
Robert Watson
5652c15c24 In quotaoff(), lock the vnode instead of asserting it when manipulating
v_vflags.

MFC after:	1 week
Submitted by:	Antoine Brodin <antoine at brodin at laposte dot net>
2006-02-12 13:20:06 +00:00
Robert Watson
4a99d6f90a Instead of asserting the vnode lock before manipulating v_vflag, acquire
it and drop it afterwards.

Found by:	kris
MFC after:	1 week
2006-02-11 21:09:27 +00:00
Jeff Roberson
89b0e10910 - Reorder calls to vrele() after calls to vput() when the vrele is a
directory.  vrele() may lock the passed vnode, which in these cases would
   give an invalid lock order of child -> parent.  These situations are
   deadlock prone although do not typically deadlock because the vrele
   is typically not releasing the last reference to the vnode.  Users of
   vrele must consider it as a call to vn_lock() and order it appropriately.

MFC After: 	1 week
Sponsored by:	Isilon Systems, Inc.
Tested by:	kkenn
2006-02-01 00:25:26 +00:00
Tor Egge
82be0a5a24 Add marker vnodes to ensure that all vnodes associated with the mount point are
iterated over when using MNT_VNODE_FOREACH.

Reviewed by:	truckman
2006-01-09 20:42:19 +00:00
Tor Egge
6c62b2acd0 If the lock passed to getdirtybuf() is the softdep lock then the background
write completed wakeup could be missed.  Close the race by grabbing the lock
normally used for protection of bp->b_xflags.

Reviewed by:	truckman
2006-01-09 19:32:21 +00:00
Tor Egge
c8c7711d66 Broaden scope of softdep_worklist_busy rwlock protection of softdep processing
to avoid some dependencies being missed by softdep_flushworklist().

Reviewed by:	truckman
2006-01-09 19:16:56 +00:00
Warner Losh
5c65ae3a88 New option: NO_FFS_SNAPSHOT. I did this in p4 about the same time
that NetBSD implemented it independently of them (don't know which one
was actually first).  This saves about 24k for those times you don't
need snapshot support (like when running off a ram disk, or in an
embedded environment where size matters).
2006-01-06 04:44:09 +00:00
Xin LI
cd34c8b6a2 Typo. 2005-12-23 15:50:57 +00:00
Dag-Erling Smørgrav
0430a5e289 Eradicate caddr_t from the VFS API. 2005-12-14 00:49:52 +00:00
Craig Rodrigues
b6bd025c35 Fix parsing of atime, clusterr, clusterw, exec, suid, symfollow
mount options.

Noticed by:	Amir Shalem < amir at boom dot org dot il>
2005-11-24 15:06:40 +00:00
Craig Rodrigues
cea903627f If export mount flag is not passed in, set default parameters
for export structure and pass that to vfs_export().
Currently in userland mount(8), an export structure is unconditionally
passed in, only for UFS.  This is an attempt to move that UFS-specific
behavior out of mount(8) and into the UFS filesystem code.
2005-11-20 17:04:50 +00:00
Craig Rodrigues
359d438885 Add more options to ffs_opts, so that vfs_filteropts() will not
complain when we pass these options to a UFS filesystem as strings
via nmount():  noexec, nosuid, nosymfollow, sync, suiddir
2005-11-19 23:28:19 +00:00
Craig Rodrigues
26f59b6455 - Add parsing for the following existing UFS/FFS mount options in the nmount()
callpath via vfs_getopt(), and set the appropriate MNT_* flag:
  -> acls, async, force, multilabel, noasync, noatime,
  -> noclusterr, noclusterw, snapshot, update

- Allow errmsg as a valid mount option via vfs_getopt(),
  so we can later add a hook to propagate mount errors back
  to userspace via vfs_mount_error().
2005-11-18 06:06:10 +00:00
Xin LI
fad951e3a0 Slightly reorganize to reduce duplicated code.
Reviewed by:	rwatson
2005-11-07 18:25:23 +00:00
Paul Saab
e1cef62715 Rate limit filesystem full and out of inodes messages to once a
second.
2005-10-31 20:33:28 +00:00
Robert Watson
5bb84bc84b Normalize a significant number of kernel malloc type names:
- Prefer '_' to ' ', as it results in more easily parsed results in
  memory monitoring tools such as vmstat.

- Remove punctuation that is incompatible with using memory type names
  as file names, such as '/' characters.

- Disambiguate some collisions by adding subsystem prefixes to some
  memory types.

- Generally prefer lower case to upper case.

- If the same type is defined in multiple architecture directories,
  attempt to use the same name in additional cases.

Not all instances were caught in this change, so more work is required to
finish this conversion.  Similar changes are required for UMA zone names.
2005-10-31 15:41:29 +00:00
Xin LI
eb2893ec18 Remove an unneeded "a" from comment. 2005-10-25 19:46:15 +00:00
Nate Lawson
8680d6985f Adjust maxfilesize for UFS1 and old 4.4 FFS. For UFS1, increase the limit
to (max block - 1) * bsize.  For DEV_BSIZE, this doubles the limit from
0.5 TB to 1 TB.  For the old 4.4 FFS case, decrease the limit from 0.5 TB
to 2 GB - 1.  Older systems had a 32 bit off_t so they couldn't access the
larger files anyway.

Collaboration with:	bde
2005-10-21 01:54:00 +00:00
Don Lewis
875e108755 Correct the type of the temporary variable used by ufs_lookup.c:1.78
to fix the race condition in the ufs_lookup() ISDOTDOT code.

Noticed by:	bde
MFC after:	12 days
2005-10-16 21:31:46 +00:00
Don Lewis
12d360453c Close a race in the ufs_lookup() code that handles the ISDOTDOT
case by saving the value of dp->i_ino before unlocking the vnode
for the current directory and passing the saved value to VFS_VGET().

Without this change, another thread can overwrite dp->i_ino after
the current directory is unlocked, causing  ufs_lookup() to lock
and return the wrong vnode in place of the vnode for its parent
directory.  A deadlock can occur if dp->i_ino was changed to a
subdirectory of the current directory because the root to leaf vnode
lock ordering will be violated.  A vnode lock can be leaked if
dp->i_ino was changed to point to the current directory, which
causes the current vnode lock for the current directory to be
recursed, which confuses lookup() into calling vrele() when it
should be calling vput().

The probability of this bug being triggered seems to be quite low
unless the sysctl variable debug.vfscache is set to 0.

Reviewed by:	jhb
MFC after:	2 weeks
2005-10-14 22:13:33 +00:00
Robert Watson
606dcf085f When performing a VOP_LOOKUP() as part of UFS1 extended attribute
auto-start, set cnp.cn_lkflags to LK_EXCLUSIVE.  This flag must now
be set so that lockmgr knows what kind of lock to acquire, and it
will panic if not specified.  This resulted in a panic when using
extended attributes on UFS1 as of locking work present in the 6.x
branch.

This is a RELENG_6_0 merge candidate.

Reported by:	lofi
MFC after:	3 days
2005-10-12 14:18:58 +00:00
Diomidis Spinellis
9f5c1d1955 Move execve's access time update functionality into a new
vfs_mark_atime() function, and use the new function for
performing efficient atime updates in mmap().

Reviewed by:	bde
MFC after:	2 weeks
2005-10-12 06:56:00 +00:00
Tor Egge
48c2ac4539 Avoid unintended VMIO on directories and symlinks due to leftover object
not having been destroyed.
2005-10-10 19:02:04 +00:00
Tor Egge
4e0cd00988 Adjust totread argument passed to cluster_read() to account for offset not
being block aligned.
2005-10-09 21:11:25 +00:00
Tor Egge
9248a8271c Don't pretend that a failed sync write was succesful. 2005-10-09 20:49:01 +00:00
Tor Egge
021869b542 Reduce probability for a deadlock that can occur when a snapshot inode is
updated by a process holding the snapshot lock.  Another process updating a
different inode in the same inodeblock will do copy on write checks and lock in
the opposite direction.

The snapshot code force a copy on write of these blocks manually (cf. start of
expunge_ufs[12]) and these inode blocks are later put on snapblklist.

This partial fix is to 'drain' the relevant ffs_copyonwrite() operation after
installing new snapblklist.  This is not a 100% solution since a failed block
allocation can cause implicit fsync() which might deadlock before the new
snapblklist has been installed.
2005-10-09 20:15:15 +00:00
Tor Egge
d4d530da96 Eliminate a deadlock that can occur when a dirty block belonging to a snapshot
file is flushed by a process not holding snaplk (e.g. bufdaemon).  Another
process might hold snaplk and try to access the block due to ffs_copyonwrite
processing.
2005-10-09 20:07:51 +00:00
Tor Egge
45f91051da Eliminate a deadlock that can occur during the cgaccount() processing due to
the cg map buffer being held when writing indirect blocks.  The process ends up
in ffs_copyonwrite(), attempting to get snaplk while holding the cg map buffer
lock.

Another process might be in ffs_copyonwrite(), trying to allocate a new block
for a copy.  It would hold snaplk while trying to get the cg map buffer lock.

Release the cg map buffer early and use the copy for most of the cgaccount
processing to avoid this deadlock.
2005-10-09 20:00:16 +00:00
Tor Egge
17026ff61a Reduce the probability of low block numbers passed to ffs_snapblkfree() by
skipping the call from ffs_snapremove() if the block number is zero.

Simplify snapshot locking in ffs_copyonwrite() and ffs_snapblkfree() by using
the same locking protocol for low block numbers as for larger block numbers.
This removes a lock leak that could happen if vn_lock() succeeded after
lockmgr() failed in ffs_snapblkfree().

Check if snapshot is gone before retrying a lock in ffs_copyonwrite().
2005-10-09 19:45:01 +00:00
Tor Egge
c73e9e9c7b Reinitialize v_type and v_op fields in case vnode has been reused without
reclamation.  If the vnode previously was a fifo then v_op would point to
ffs_fifoops[12] instead of the expected ffs_vnodeops[12], causing a panic at
the end of ffsext_strategy.
2005-10-09 19:06:34 +00:00
Don Lewis
448434c3c9 Initialize the inode i_flag field in ffs_valloc() to clean up any
stale flag bits left over from before the inode was recycled.

Without this change, a leftover IN_SPACECOUNTED flag could prevent
softdep_freefile() and softdep_releasefile() from incrementing
fs_pendinginodes.  Because handle_workitem_freefile() unconditionally
decrements fs_pendinginodes, a negative value could be reported at
file system unmount time with a message like:
        unmount pending error: blocks 0 files -3
The pending block count in fs_pendingblocks could also be negative
for similar reasons.  These errors can cause the data returned by
statfs() to be slightly incorrect.  Some other cleanup code in
softdep_releasefile() could also be incorrectly bypassed.

MFC after:	3 days
2005-10-03 21:57:43 +00:00
Don Lewis
460858e9ef Correct previous commit to fix the sense of the TDP_NORUNNINGBUF
check in ffs_copyonwrite() that is a precondition for calling
waitrunningbufspace().

Pointed out by:	tegge
Pointy hat to:	truckman
MFC after:	3 days
2005-10-01 19:10:48 +00:00
Don Lewis
bd3c2d867d Un-staticize waitrunningbufspace() and call it before returning from
ffs_copyonwrite() if any async writes were launched.

Restore the threads previous TDP_NORUNNINGBUF state before returning
from ffs_copyonwrite().
2005-09-30 18:07:41 +00:00
Don Lewis
6c8b634f1d Un-staticize runningbufwakeup() and staticize updateproc.
Add a new private thread flag to indicate that the thread should
not sleep if runningbufspace is too large.

Set this flag on the bufdaemon and syncer threads so that they skip
the waitrunningbufspace() call in bufwrite() rather than than
checking the proc pointer vs. the known proc pointers for these two
threads.  A way of preventing these threads from being starved for
I/O but still placing limits on their outstanding I/O would be
desirable.

Set this flag in ffs_copyonwrite() to prevent bufwrite() calls from
blocking on the runningbufspace check while holding snaplk.  This
prevents snaplk from being held for an arbitrarily long period of
time if runningbufspace is high and greatly reduces the contention
for snaplk.  The disadvantage is that ffs_copyonwrite() can start
a large amount of I/O if there are a large number of snapshots,
which could cause a deadlock in other parts of the code.

Call runningbufwakeup() in ffs_copyonwrite() to decrement runningbufspace
before attempting to grab snaplk so that I/O requests waiting on
snaplk are not counted in runningbufspace as being in-progress.
Increment runningbufspace again before actually launching the
original I/O request.

Prior to the above two changes, the system could deadlock if enough
I/O requests were blocked by snaplk to prevent runningbufspace from
falling below lorunningspace and one of the bawrite() calls in
ffs_copyonwrite() blocked in waitrunningbufspace() while holding
snaplk.

See <http://www.holm.cc/stress/log/cons143.html>
2005-09-30 01:30:01 +00:00
Don Lewis
445193b887 After a rmdir()ed directory has been truncated, force an update of
the directory's inode after queuing the dirrem that will decrement
the parent directory's link count.  This will force the update of
the parent directory's actual link to actually be scheduled.  Without
this change the parent directory's actual link count would not be
updated until ufs_inactive() cleared the inode of the newly removed
directory, which might be deferred indefinitely.  ufs_inactive()
will not be called as long as any process holds a reference to the
removed directory, and ufs_inactive() will not clear the inode if
the link count is non-zero, which could be the result of an earlier
system crash.

If a background fsck is run before the update of the parent directory's
actual link count has been performed, or at least scheduled by
putting the dirrem on the leaf directory's inodedep id_bufwait list,
fsck will corrupt the file system by decrementing the parent
directory's effective link count, which was previously correct
because it already took the removal of the leaf directory into
account, and setting the actual link count to the same value as the
effective link count after the dangling, removed, leaf directory
has been removed.  This happens because fsck acts based on the
actual link count, which will be too high when fsck creates the
file system snapshot that it references.

This change has the fortunate side effect of more quickly cleaning
up the large number dirrem structures that linger for an extended
time after the removal of a large directory tree.  It also fixes a
potential problem with the shutdown of the syncer thread timing out
if the system is rebooted immediately after removing a large directory
tree.

Submitted by:	tegge
MFC after:	3 days
2005-09-29 21:50:26 +00:00
Robert Watson
5f419982c2 Back out alpha/alpha/trap.c:1.124, osf1_ioctl.c:1.14, osf1_misc.c:1.57,
osf1_signal.c:1.41, amd64/amd64/trap.c:1.291, linux_socket.c:1.60,
svr4_fcntl.c:1.36, svr4_ioctl.c:1.23, svr4_ipc.c:1.18, svr4_misc.c:1.81,
svr4_signal.c:1.34, svr4_stat.c:1.21, svr4_stream.c:1.55,
svr4_termios.c:1.13, svr4_ttold.c:1.15, svr4_util.h:1.10,
ext2_alloc.c:1.43, i386/i386/trap.c:1.279, vm86.c:1.58,
unaligned.c:1.12, imgact_elf.c:1.164, ffs_alloc.c:1.133:

Now that Giant is acquired in uprintf() and tprintf(), the caller no
longer leads to acquire Giant unless it also holds another mutex that
would generate a lock order reversal when calling into these functions.
Specifically not backed out is the acquisition of Giant in nfs_socket.c
and rpcclnt.c, where local mutexes are held and would otherwise violate
the lock order with Giant.

This aligns this code more with the eventual locking of ttys.

Suggested by:	bde
2005-09-28 07:03:03 +00:00
John Baldwin
7e9e371f2d Use the refcount API to manage the reference count for user credentials
rather than using pool mutexes.

Tested on:	i386, alpha, sparc64
2005-09-27 18:09:42 +00:00
Xin LI
0e4f6eecd7 Restore a historical ufs_inactive behavior that has been changed
in rev. 1.40 of ufs_inode.c, which allows an inode being truncated
even when the filesystem itself is marked RDONLY.  A subsequent
call of UFS_TRUNCATE (ffs_truncate) would panic the system as it
asserts that it can only be called when the filesystem is mounted
read-write (same changeset, rev. 1.74 of sys/ufs/ffs/ffs_inode.c).

Because ffs_mount() already takes care of sync'ing the filesystem
to disk before being downgraded to readonly, it appears to be more
desirable that we should not permit this sort of writes to disk.

This change would fix a panic that occours when read-only mounted
a corrupted filesystem and doing some file operations.

MT6/5/4 candidate

Reviewed by:	mckusick
2005-09-23 20:49:57 +00:00
Robert Watson
84d2b7df26 Add GIANT_REQUIRED and WITNESS sleep warnings to uprintf() and tprintf(),
as they both interact with the tty code (!MPSAFE) and may sleep if the
tty buffer is full (per comment).

Modify all consumers of uprintf() and tprintf() to hold Giant around
calls into these functions.  In most cases, this means adding an
acquisition of Giant immediately around the function.  In some cases
(nfs_timer()), it means acquiring Giant higher up in the callout.

With these changes, UFS no longer panics on SMP when either blocks are
exhausted or inodes are exhausted under load due to races in the tty
code when running without Giant.

NB: Some reduction in calls to uprintf() in the svr4 code is probably
desirable.

NB: In the case of nfs_timer(), calling uprintf() while holding a mutex,
or even in a callout at all, is a bad idea, and will generate warnings
and potential upset.  This needs to be fixed, but was a problem before
this change.

NB: uprintf()/tprintf() sleeping is generally a bad ideas, as is having
non-MPSAFE tty code.

MFC after:	1 week
2005-09-19 16:51:43 +00:00
Tor Egge
2f0ffabcf4 Giant is no longer needed here. 2005-09-12 01:21:42 +00:00
Christian S.J. Peron
d1dfd92177 Convert the primary ACL allocator from malloc(9) to using a UMA zone instead.
Also introduce an aclinit function which will be used to create the UMA zone
for use by file systems at system start up.

MFC after:	1 month
Discussed with:	rwatson
2005-09-06 00:06:30 +00:00
Tor Egge
d536ff2edb Retain generation count when writing zeroes instead of an inode to disk.
Don't free a struct inodedep if another process is allocating saved inode
memory for the same struct inodedep in initiate_write_inodeblock_ufs[12]().

Handle disappearing dependencies in softdep_disk_io_initiation().

Reviewed by:	mckusick
2005-09-05 22:14:33 +00:00
Suleiman Souhlal
fdedad764a ffs_mountfs() needs devvp to be locked, so lock it.
Glanced at by:	phk
Tested by:	pjd
MFC after:	3 days
2005-09-02 13:52:55 +00:00
Suleiman Souhlal
93373c422f Set the mountpoint path in the superblock (fs_fsmnt) at mount-time
so that it appears in the various messages (not cleanly unmounted,
filesystem full, etc). This has been broken since rev 1.261.
2005-08-21 22:06:41 +00:00
Tor Egge
15da51f73c Don't set the COMPLETE flag in an inodedep structure before the related
inode has been written.
2005-08-21 18:19:06 +00:00
Ian Dowse
65ed954554 In the ufsdirhash_build() failure case for corrupted directories
or unreadable blocks, make sure to destroy the mutex we created.
Also fix an unrelated typo in a comment.

Found by:	Peter Holm's stress tests
Reviewed by:	dwmalone
MFC after:	3 days
2005-08-17 08:48:42 +00:00
Stephan Uphoff
ed8938e082 Delay freeing disk space for file system blocks until all dirty buffers
are safely released. This fixes softdep problems on truncation (deletion)
of files with dirty buffers.

Reviewed by:	jeff@, mckusick@, ps@, tegge@
Tested by: 	glebius@, ps@
MFC after:	3 weeks
2005-07-31 20:24:14 +00:00
Alan Cox
ec9c9e7363 Eliminate inconsistency in the setting of the B_DONE flag. Specifically,
make the b_iodone callback responsible for setting it if it is needed.
Previously, it was set unconditionally by bufdone() without holding
whichever lock is shared by the b_iodone callback and the corresponding
top-half function.  Consequently, in a race, the top-half function could
conclude that operation was done before the b_iodone callback finished.
See, for example, aio_physwakeup() and aio_fphysio().

Note: I don't believe that the other, more widely-used b_iodone callbacks
are affected.

Discussed with: jeff
Reviewed by: phk
MFC after: 2 weeks
2005-07-20 19:06:06 +00:00
Suleiman Souhlal
679985d03a Allow EVFILT_VNODE events to work on every filesystem type, not just
UFS by:
- Making the pre and post hooks for the VOP functions work even when
DEBUG_VFS_LOCKS is not defined.
- Moving the KNOTE activations into the corresponding VOP hooks.
- Creating a MNTK_NOKNOTE flag for the mnt_kern_flag field of struct
mount that permits filesystems to disable the new behavior.
- Creating a default VOP_KQFILTER function: vfs_kqfilter()

My benchmarks have not revealed any performance degradation.

Reviewed by:	jeff, bde
Approved by:	rwatson, jmg (kqueue changes), grehan (mentor)
2005-06-09 20:20:31 +00:00
Ken Smith
6341095e0d This patch addresses a standards violation issue. The standards say a
file's access time should be updated when it gets executed.  A while
ago the mechanism used to exec was changed to use a more mmap based
mechanism and this behavior was broken as a side-effect of that.

A new vnode flag is added that gets set when the file gets executed,
and the VOP_SETATTR() vnode operation gets called.  The underlying
filesystem is expected to handle it based on its own semantics, some
filesystems don't support access time at all.  Those that do should
handle it in a way that does not block, does not generate I/O if possible,
etc.  In particular vn_start_write() has not been called.  The UFS code
handles it the same way as it would normally handle the access time if
a file was read - the IN_ACCESS flag gets set in the inode but no other
action happens at this point.  The actual time update will happen later
during a sync (which handles all the necessary locking).

Got me into this:	cperciva
Discussed with:		a lot with bde, a little with kan
Showed patches to:	phk, jeffr, standards@, arch@
Minor discussion on:	arch@
2005-05-31 19:39:52 +00:00
Jeff Roberson
204ec66d38 - Don't set our bio op to be a READ when we've just completed a write. There
are subtle differences in the read and write completion path.  Instead,
   grab an extra write ref so the write path can drop it when we recursively
   call bufdone().  I believe this may be the source of the wrong bufobj
   panics.

Reported by:	pho, kkenn
2005-05-30 07:04:15 +00:00
Kirk McKusick
6a52a06851 Allow removal of empty directories with high link counts. These can
occur on a filesystem running with soft updates after a crash and
before a background fsck has been run. To prevent discrepancies
from arising in a background fsck that may already be running,
the directory is removed but its inode is not freed and is left
with the residual reference count. When encountered by the
background fsck it will be reclaimed.
2005-05-18 22:18:21 +00:00
Jeff Roberson
6c71a2208d - Don't restrict the softdep stats to DEBUG kernels, they cost nothing to
export.  This was happening anyway since this file manually sets DEBUG.
 - Add a sysctl for the number of items on the worklist.
 - Use a more canonical loop restart in softdep_fsync_mountdev, it saves
   some code at the expense of a goto and makes me worry less about
   modifying a variable that should be private to the TAILQ_FOREACH_SAFE
   macro.
2005-05-03 11:03:29 +00:00
Jeff Roberson
2524c26de8 - Use bdone() directly instead of calling it indirectly through
ffs_rawreaddone().

Sponsored by:	Isilon Systems, Inc.
2005-04-30 11:28:19 +00:00
Pawel Jakub Dawidek
231b1be179 - Plug memory leak.
- Fix two style nits.

Found by:	Coverity Prevent analysis tool
Reviewed by:	rwatson
MFC after:	1 week
2005-04-16 10:57:49 +00:00
Jeff Roberson
4585e3ac5a - Change all filesystems and vfs_cache to relock the dvp once the child is
locked in the ISDOTDOT case.  Se vfs_lookup.c r1.79 for details.

Sponsored by:	Isilon Systems, Inc.
2005-04-13 10:59:09 +00:00
Jeff Roberson
ece9473efa - Consistently call 'vp' vp rather than ovp sometimes in ffs_truncate().
Do the same for oip.

Pointed out by:	glebius
2005-04-05 08:49:41 +00:00
Jeff Roberson
bcc8f66c8b - Use M_ZERO rather than explicitly calling bzero().
- Don't intermingle direct calls to lockmgr and indirect calls through
   VOPs.  This will be important in the future.
 - Dont lock the devvp's interlock just to release it on the next line by
   passing LK_INTERLOCK to lockmgr.
 - Restructure ffs_snapshot_unmount so we don't call free() with the
   devvp's interlock locked.
2005-04-03 12:03:44 +00:00
Jeff Roberson
41d4783d49 - In ffs_sync we need to pass LK_SLEEPFAIL in when we lock the vnode
because it may change identities while we're sleeping on the lock.
   Otherwise we may bail out of ffs_sync() early due to an error from
   deadfs.
 - Collapse a VOP_UNLOCK, vrele into a single vput().
2005-04-03 10:38:18 +00:00
Jeff Roberson
153910e0f5 - Move the contents of softdep_disk_prewrite into ffs_geom_strategy to fix
two bugs.
 - ffs_disk_prewrite was pulling the vp from the buf and checking for
   COPYONWRITE, when really it wanted the vp from the bufobj that we're
   writing to, which is the devvp.  This lead to us skipping the copy on
   write to all file data, which significantly broke snapshots for the
   last few months.
 - When the SOFTUPDATES option was not included in the kernel config we
   would also skip the copy on write check, which would effectively disable
   snapshots.
 - Remove an invalid mp_fixme().

Debugging tips from:	mckusick
Reported by:		iedowse, others
Discussed with:		phk
2005-04-03 10:29:55 +00:00
Jeff Roberson
278c5a6efa - Fix botched LK_NOWAIT removal. I mistakenly thought this compiled as
part of GENERIC.
2005-03-31 05:58:14 +00:00
Jeff Roberson
aa7ba42796 - FFS supports shared locks, clear LK_NOSHARE from our vnode locks.
Sponsored by:	Isilon Systems, Inc.
2005-03-31 05:23:20 +00:00
Jeff Roberson
ec3db02a3e - Set LK_NOSHARE for snapshot locks. snapshots require exclusive only
access.
 - Remove the hack from ffs_lock() to implement LK_NOSHARE in a ffs
   specific way.

Sponsored by:	Isilon Systems, Inc.
2005-03-31 05:21:17 +00:00
Jeff Roberson
f247a5240d - LK_NOPAUSE is a nop now.
Sponsored by:   Isilon Systems, Inc.
2005-03-31 04:37:09 +00:00
Jeff Roberson
52f6886551 - Remove wantparent, it is no longer necessary. An assert in vfs_lookup.c
prevents any callers from doing a modifying op without
   LOCKPARENT or WANTPARENT.  It wasn't even properly used in the CREATE
   or DELETE cases.
2005-03-29 13:16:38 +00:00
Jeff Roberson
d6919865fa - Upgrade a shared lock request to exclusive in ffs_vget() if we have
to create the vnode.

Sponsored by:	Isilon Systems, Inc.
2005-03-29 10:10:51 +00:00
Jeff Roberson
a69c43548d - Honor the cn_lkflags passed from namei() when locking the leaf.
Sponsored by:	Isilon Systems, Inc.
2005-03-29 10:10:01 +00:00
Jeff Roberson
e19881ff08 - UFS no longer uses PDIRUNLOCK to track the parent state. Instead, we now
rely on ufs to always leave the parent locked except in the ISDOTDOT
   case.  Adjust asserts to deal with these changes.

Sponsored by:	Isilon Systems, Inc.
2005-03-28 09:35:58 +00:00
Jeff Roberson
eddcb03d02 - We no longer have to bother with PDIRUNLOCK, lookup() handles it for us.
Sponsored by:   Isilon Systems, Inc.
2005-03-28 09:34:36 +00:00
David Schultz
188f6433f6 When the softupdates worklist gets too long, threads that attempt to
add more work are forced to process two worklist items first.
However, processing an item may generate additional work, causing the
unlucky thread to recursively process the worklist.  Add a per-thread
flag to detect this situation and avoid the recursion.  This should
fix the stack overflows that could occur while removing large
directory trees.

Tested by:	kris
Reviewed by:	mckusick
2005-03-25 17:30:31 +00:00
Jeff Roberson
080c061ad0 - Call VFS_ROOT() with LK_EXCLUSIVE.
Sponsored by:	Isilon Systems, Inc.
2005-03-24 07:33:45 +00:00
Jeff Roberson
469ec10c1e - Update the ufs_root() prototype.
- Pass the ufs_root() flags argument to VFS_VGET() to allow callers to
   specify shared locks.

Sponsored by:	Isilon Systems, Inc.
2005-03-24 07:32:50 +00:00
Jeff Roberson
23d15e852d - Lock the clearing of v_data in ufs_reclaim() to prevent a pagefault
in ffs_lock() when it acesses v_data without the vnlock.

Sponsored by:	Isilon Systems, Inc.
2005-03-17 11:58:43 +00:00
Poul-Henning Kamp
51f5ce0c8c Add two arguments to the vfs_hash() KPI so that filesystems which do
not have unique hashes (NFS) can also use it.
2005-03-16 11:20:51 +00:00
Poul-Henning Kamp
de68347b1b Don't hold a reference on the disk vnode for each inode. 2005-03-15 20:50:58 +00:00
Poul-Henning Kamp
45c26fa2b6 Improve the vfs_hash() API: vput() the unneeded vnode centrally to
avoid replicating the vput in all the filesystems.
2005-03-15 20:00:03 +00:00
Poul-Henning Kamp
e82ef95c11 Simplify the vfs_hash calling convention. 2005-03-15 08:07:07 +00:00
Jeff Roberson
4483fe9227 - Destroy the vnode object earlier in VOP_RECLAIM as we need more of
the vnode valid before the vm flushes pages.
 - Get rid of some extraneous uses of the vnode interlock.

Sponsored by:	Isilon Systems, Inc.
2005-03-15 01:42:58 +00:00
Poul-Henning Kamp
14bc0685ac Use vfs_hash instead of home-rolled. 2005-03-14 10:21:16 +00:00
Jeff Roberson
9cbe5da9d5 - It is not legal to access v_data without the vnode lock or interlock
held.  Grab the vnode interlock if LK_INTERLOCK has not been passed in
   so that we can inspect v_data in ffs_lock().

Sponsored by:	Isilon Systems, Inc.
2005-03-13 12:04:12 +00:00
Jeff Roberson
fe68abe291 - The VI_DOOMED flag now signals the end of a vnode's relationship with
the filesystem.  Check that rather than VI_XLOCK.
 - Shorten ffs_reload by one step.  The old check for an inactive vnode
   was slightly racey, and the code which deals with still active vnodes
   is not much more expensive.

Sponsored by:	Isilon Systems, Inc.
2005-03-13 12:03:14 +00:00
Jeff Roberson
fdcc82276e - The VI_DOOMED flag now signals the end of a vnode's relationship with
the filesystem.  Check that rather than VI_XLOCK.

Sponsored by:	Isilon Systems, Inc.
2005-03-13 12:01:50 +00:00