Commit Graph

615 Commits

Author SHA1 Message Date
peter
b1cf415d98 Use a fixed type for times in on-disk structures for ufs rather than
something that could potentially change like time_t.
2001-07-16 00:55:27 +00:00
jhb
62d93ab4ad Fix more mntvnode and vnode interlock order reversals. 2001-06-28 22:21:33 +00:00
jhb
d19d170601 - Fix a mntvnode and vnode interlock reversal.
- Protect the mnt_vnode list with the mntvnode lock.
- Use queue(9) macros.
2001-06-28 04:12:56 +00:00
peter
b38f2aba5e Fix warning:
1973: warning: int format, long int arg (arg 5)
2001-06-15 07:44:39 +00:00
mckusick
ff309fb095 Build on the change in revision 1.98 by Tor.Egge@fast.no.
The symptom being treated in 1.98 was to avoid freeing a
pagedep dependency if there was still a newdirblk dependency
referencing it. That change is correct and no longer prints
a warning message when it occurs. The other part of revision
1.98 was to panic when a newdirblk dependency was encountered
during a file truncation. This fix removes that panic and
replaces it with code to find and delete the newdirblk
dependency so that the truncation can succeed.
2001-06-13 23:13:13 +00:00
obrien
9ef33c8546 There seems to be a problem that the order of disk write operation being
incorrect due to a missing check for some dependency.  This change
avoids the freelist corruption (but not the temporarily inconsistent
state of the file system).

A message is printed as a reminder of the under lying problem when a
pagedep structure is not freed due to the NEWBLOCK flag being set.

Submitted by:	Tor.Egge@fast.no
2001-06-05 01:49:37 +00:00
jhb
80cd95fd1a Revert the previous commit in favor of the fix in rev 1.42 of
ufs/ffs/ffs_extern.h instead.

Requested by:	bde
2001-05-30 23:09:19 +00:00
jhb
38492f2308 Forward declare struct cg to quiet a warning.
Submitted by:	bde
2001-05-30 23:08:40 +00:00
jhb
38e1153eed Include <ufs/ffs/fs.h> to get the definition of struct cg to quiet a
warning.
2001-05-29 23:53:16 +00:00
phk
88aea0d7ff Remove last vestiges of MFS. 2001-05-29 21:21:53 +00:00
mckusick
0f917f00a5 Update softdep_setup_directory_add prototype to reflect changes in
actual function.

Obtained from:	Jim Bloom <bloom@jbloom.jbloom.org>
2001-05-20 15:59:55 +00:00
mckusick
576ed04ba3 Must ensure that all the entries on the pd_pendinghd list have been
committed to disk before clearing them. More specifically, when
free_newdirblk is called, we know that the inode claims the new
directory block. However, if the associated pagedep is still linked
onto the directory buffer dependency chain, then some of the entries
on the pd_pendinghd list may not be committed to disk yet. In this
case, we will simply note that the inode claims the block and let
the pd_pendinghd list be processed when the pagedep is next written.
If the pagedep is no longer on the buffer dependency chain, then
all the entries on the pd_pending list are committed to disk and
we can free them in free_newdirblk. This corrects a window of
vulnerability introduced in the code added in version 1.95.
2001-05-19 19:24:26 +00:00
mckusick
6e83687b77 Must be a bit less aggressive about freeing pagedep structures.
Obtained from:	Robert Watson <rwatson@FreeBSD.org> and
		Matthew Jacob <mjacob@feral.com>
2001-05-18 22:16:28 +00:00
mckusick
a3c514dadf When a new block is allocated to a directory, an fsync of a file
whose name is within that block must ensure not only that the block
containing the file name has been written, but also that the on-disk
directory inode references that block. When a new directory block
is created, we allocate a newdirblk structure which is linked to
the associated allocdirect (on its ad_newdirblk list). When the
allocdirect has been satisfied, the newdirblk structure is moved
to the inodedep id_bufwait list of its directory to await the inode
being written.  When the inode is written, the directory entries
are fully committed and can be deleted from their pagedep->id_pendinghd
and inodedep->id_pendinghd lists.
2001-05-17 07:24:03 +00:00
iedowse
14dd931aa7 Change the second argument of vflush() to an integer that specifies
the number of references on the filesystem root vnode to be both
expected and released. Many filesystems hold an extra reference on
the filesystem root vnode, which must be accounted for when
determining if the filesystem is busy and then released if it isn't
busy. The old `skipvp' approach required individual filesystem
xxx_unmount functions to re-implement much of vflush()'s logic to
deal with the root vnode.

All 9 filesystems that hold an extra reference on the root vnode
got the logic wrong in the case of forced unmounts, so `umount -f'
would always fail if there were any extra root vnode references.
Fix this issue centrally in vflush(), now that we can.

This commit also fixes a vnode reference leak in devfs, which could
result in idle devfs filesystems that refuse to unmount.

Reviewed by:	phk, bp
2001-05-16 18:04:37 +00:00
mckusick
bed12dd525 Further fixes for deadlock in the presence of multiple snapshots.
There are still more to find, but this fix should cover the
common cases that folks are hitting.
2001-05-14 17:16:49 +00:00
mckusick
6f851aa835 Remove yet another deadlock case. 2001-05-11 07:12:03 +00:00
mckusick
f9ea5f696e When running with soft updates, track the number of blocks and files
that are committed to being freed and reflect these blocks in the
counts returned by statfs (and thus also by the `df' command). This
change allows programs such as those that do news expiration to
know when to stop if they are trying to create a certain percentage
of free space. Note that this change does not solve the much harder
problem of making this to-be-freed space available to applications
that want it (thus on a nearly full filesystem, you may still
encounter out-of-space conditions even though the free space will
show up eventually). Hopefully this harder problem will be the
subject of a future enhancement.
2001-05-08 07:42:20 +00:00
mckusick
97d5d5bddb Several fixes for units errors:
1) Do not assume that the superblock will be of size fs->fs_bsize.
   This fixes a panic when taking a snapshot on a filesystem with
   a block size bigger than 8K.
2) Properly calculate the number of fragments that follow the
   superblock summary information. This fixes a bug with inconsistent
   snapshots.
3) When cleaning up a snapshot that is about to be removed, properly
   calculate the number of blocks that need to be checked. This fixes
   a bug that created partially allocated inodes.
4) When moving blocks from a snapshot that is about to be removed
   to another snapshot, properly account for the reduced number of
   blocks in the snapshot from which they are taken. This fixes a
   bug in which the number of blocks released from a snapshot did not
   match the number that it claimed to have.
2001-05-08 07:29:03 +00:00
mckusick
c48e2c8b82 When syncing out snapshot metadata, we must temporarily allow recursive
buffer locking so as to avoid locking against ourselves if we need to
write filesystem metadata.
2001-05-08 07:13:00 +00:00
mckusick
4bd6a98c02 Refinement to revision 1.16 of ufs/ffs/ffs_snapshot.c to reduce
the amount of time that the filesystem must be suspended. The
current snapshot is elided as well as the earlier snapshots.
2001-05-04 05:49:28 +00:00
phk
e249b8764b Remove blatantly pointless call to VOP_BMAP().
Use ufs_bmaparray() rather than VOP_BMAP() on our own vnodes.
2001-05-01 09:12:31 +00:00
phk
9f0c2a965a Implement vop_std{get|put}pages() and add them to the default vop[].
Un-copy&paste all the VOP_{GET|PUT}PAGES() functions which do nothing but
the default.
2001-05-01 08:34:45 +00:00
phk
aa58659992 VOP_BALLOC was never really a VOP in the first place, so convert it
to UFS_BALLOC like the other "between UFS and FFS function interfaces".
2001-04-29 12:36:52 +00:00
phk
aa1facaa59 Remove faint traces of non-existant ffs_bmap(). 2001-04-29 10:23:32 +00:00
grog
609bc7e870 Revert consequences of changes to mount.h, part 2.
Requested by:	bde
2001-04-29 02:45:39 +00:00
mckusick
61ba5bd5a3 Rather than copying all the indirect blocks of the snapshot,
simply mark them as BLK_NOCOPY. This trick cuts the initial
size of the snapshot in half and cuts the time to take a
snapshot by a third.
2001-04-26 00:50:53 +00:00
mckusick
472f7b8265 When closing the last reference to an unlinked file, it is freed
by the inactive routine. Because the freeing causes the filesystem
to be modified, the close must be held up during periods when the
filesystem is suspended.

For snapshots to be consistent across crashes, they must write
blocks that they copy and claim those written blocks in their
on-disk block pointers before the old blocks that they referenced
can be allowed to be written.

Close a loophole that allowed unwritten blocks to be skipped when
doing ffs_sync with a request to wait for all I/O activity to be
completed.
2001-04-25 08:11:18 +00:00
phk
e5523c794f Move the netexport structure from the fs-specific mountstructure
to struct mount.

This makes the "struct netexport *" paramter to the vfs_export
and vfs_checkexport interface unneeded.

Consequently that all non-stacking filesystems can use
vfs_stdcheckexp().

At the same time, make it a pointer to a struct netexport
in struct mount, so that we can remove the bogus AF_MAX
and #include <net/radix.h> from <sys/mount.h>
2001-04-25 07:07:52 +00:00
iedowse
b37cd4dec0 Pre-dirpref versions of fsck may zero out the new superblock fields
fs_contigdirs, fs_avgfilesize and fs_avgfpdir. This could cause
panics if these fields were zeroed while a filesystem was mounted
read-only, and then remounted read-write.

Add code to ffs_reload() which copies the fs_contigdirs pointer
from the previous superblock, and reinitialises fs_avgf* if necessary.

Reviewed by:	mckusick
2001-04-24 00:37:16 +00:00
grog
405d532596 Correct #includes to work with fixed sys/mount.h. 2001-04-23 09:05:15 +00:00
mckusick
4501b8e393 Add debugging option to always read/write cylinder groups as full
sized blocks. To enable this option, use: `sysctl -w debug.bigcgs=1'.
Add debugging option to disable background writes of cylinder
groups. To enable this option, use: `sysctl -w debug.dobkgrdwrite=0'.
These debugging options should be tried on systems that are panicing
with corrupted cylinder group maps to see if it makes the problem
go away. The set of panics in question are:

	ffs_clusteralloc: map mismatch
	ffs_nodealloccg: map corrupted
	ffs_nodealloccg: block not in map
	ffs_alloccg: map corrupted
	ffs_alloccg: block not in map
	ffs_alloccgblk: cyl groups corrupted
	ffs_alloccgblk: can't find blk in cyl
	ffs_checkblk: partially free fragment

The following panics are less likely to be related to this problem,
but might be helped by these debugging options:

	ffs_valloc: dup alloc
	ffs_blkfree: freeing free block
	ffs_blkfree: freeing free frag
	ffs_vfree: freeing free inode

If you try these options, please report whether they helped reduce your
bitmap corruption panics to Kirk McKusick at <mckusick@mckusick.com>
and to Matt Dillon <dillon@earth.backplane.com>.
2001-04-17 05:37:51 +00:00
mckusick
e06af95642 Background fsck sysctl operations must use vn_start_write and
vn_finished_write so that they do not attempt to modify a
suspended filesystem.
2001-04-17 05:06:37 +00:00
mckusick
9bf65b6596 Update to describe use of mdconfig instead of deprecated vnconfig.
Submitted by:	Steve Ames <steve@virtual-voodoo.com>
2001-04-14 18:32:09 +00:00
mckusick
5d96cc9cf4 This checkin adds support in ufs/ffs for the FS_NEEDSFSCK flag.
It is described in ufs/ffs/fs.h as follows:

/*
 * Filesystem flags.
 *
 * Note that the FS_NEEDSFSCK flag is set and cleared only by the
 * fsck utility. It is set when background fsck finds an unexpected
 * inconsistency which requires a traditional foreground fsck to be
 * run. Such inconsistencies should only be found after an uncorrectable
 * disk error. A foreground fsck will clear the FS_NEEDSFSCK flag when
 * it has successfully cleaned up the filesystem. The kernel uses this
 * flag to enforce that inconsistent filesystems be mounted read-only.
 */
#define FS_UNCLEAN    0x01	/* filesystem not clean at mount */
#define FS_DOSOFTDEP  0x02	/* filesystem using soft dependencies */
#define FS_NEEDSFSCK  0x04	/* filesystem needs sync fsck before mount */
2001-04-14 05:26:28 +00:00
mckusick
d84cca13c9 Directory layout preference improvements from Grigoriy Orlov <gluk@ptci.ru>.
His description of the problem and solution follow. My own tests show
speedups on typical filesystem intensive workloads of 5% to 12% which
is very impressive considering the small amount of code change involved.

------

  One day I noticed that some file operations run much faster on
small file systems then on big ones. I've looked at the ffs
algorithms, thought about them, and redesigned the dirpref algorithm.

  First I want to describe the results of my tests. These results are old
and I have improved the algorithm after these tests were done. Nevertheless
they show how big the perfomance speedup may be. I have done two file/directory
intensive tests on a two OpenBSD systems with old and new dirpref algorithm.
The first test is "tar -xzf ports.tar.gz", the second is "rm -rf ports".
The ports.tar.gz file is the ports collection from the OpenBSD 2.8 release.
It contains 6596 directories and 13868 files. The test systems are:

1. Celeron-450, 128Mb, two IDE drives, the system at wd0, file system for
   test is at wd1. Size of test file system is 8 Gb, number of cg=991,
   size of cg is 8m, block size = 8k, fragment size = 1k OpenBSD-current
   from Dec 2000 with BUFCACHEPERCENT=35

2. PIII-600, 128Mb, two IBM DTLA-307045 IDE drives at i815e, the system
   at wd0, file system for test is at wd1. Size of test file system is 40 Gb,
   number of cg=5324, size of cg is 8m, block size = 8k, fragment size = 1k
   OpenBSD-current from Dec 2000 with BUFCACHEPERCENT=50

You can get more info about the test systems and methods at:
http://www.ptci.ru/gluk/dirpref/old/dirpref.html

                              Test Results

             tar -xzf ports.tar.gz               rm -rf ports
  mode  old dirpref new dirpref speedup old dirprefnew dirpref speedup
                             First system
 normal     667         472      1.41       477        331       1.44
 async      285         144      1.98       130         14       9.29
 sync       768         616      1.25       477        334       1.43
 softdep    413         252      1.64       241         38       6.34
                             Second system
 normal     329         81       4.06       263.5       93.5     2.81
 async      302         25.7    11.75       112          2.26   49.56
 sync       281         57.0     4.93       263         90.5     2.9
 softdep    341         40.6     8.4        284          4.76   59.66

"old dirpref" and "new dirpref" columns give a test time in seconds.
speedup - speed increasement in times, ie. old dirpref / new dirpref.

------

Algorithm description

The old dirpref algorithm is described in comments:

/*
 * Find a cylinder to place a directory.
 *
 * The policy implemented by this algorithm is to select from
 * among those cylinder groups with above the average number of
 * free inodes, the one with the smallest number of directories.
 */

A new directory is allocated in a different cylinder groups than its
parent directory resulting in a directory tree that is spreaded across
all the cylinder groups. This spreading out results in a non-optimal
access to the directories and files. When we have a small filesystem
it is not a problem but when the filesystem is big then perfomance
degradation becomes very apparent.

What I mean by a big file system ?

  1. A big filesystem is a filesystem which occupy 20-30 or more percent
     of total drive space, i.e. first and last cylinder are physically
     located relatively far from each other.
  2. It has a relatively large number of cylinder groups, for example
     more cylinder groups than 50% of the buffers in the buffer cache.

The first results in long access times, while the second results in
many buffers being used by metadata operations. Such operations use
cylinder group blocks and on-disk inode blocks. The cylinder group
block (fs->fs_cblkno) contains struct cg, inode and block bit maps.
It is 2k in size for the default filesystem parameters. If new and
parent directories are located in different cylinder groups then the
system performs more input/output operations and uses more buffers.
On filesystems with many cylinder groups, lots of cache buffers are
used for metadata operations.

My solution for this problem is very simple. I allocate many directories
in one cylinder group. I also do some things, so that the new allocation
method does not cause excessive fragmentation and all directory inodes
will not be located at a location far from its file's inodes and data.
The algorithm is:
/*
 * Find a cylinder group to place a directory.
 *
 * The policy implemented by this algorithm is to allocate a
 * directory inode in the same cylinder group as its parent
 * directory, but also to reserve space for its files inodes
 * and data. Restrict the number of directories which may be
 * allocated one after another in the same cylinder group
 * without intervening allocation of files.
 *
 * If we allocate a first level directory then force allocation
 * in another cylinder group.
 */

  My early versions of dirpref give me a good results for a wide range of
file operations and different filesystem capacities except one case:
those applications that create their entire directory structure first
and only later fill this structure with files.

  My solution for such and similar cases is to limit a number of
directories which may be created one after another in the same cylinder
group without intervening file creations. For this purpose, I allocate
an array of counters at mount time. This array is linked to the superblock
fs->fs_contigdirs[cg]. Each time a directory is created the counter
increases and each time a file is created the counter decreases. A 60Gb
filesystem with 8mb/cg requires 10kb of memory for the counters array.

  The maxcontigdirs is a maximum number of directories which may be created
without an intervening file creation. I found in my tests that the best
performance occurs when I restrict the number of directories in one cylinder
group such that all its files may be located in the same cylinder group.
There may be some deterioration in performance if all the file inodes
are in the same cylinder group as its containing directory, but their
data partially resides in a different cylinder group. The maxcontigdirs
value is calculated to try to prevent this condition. Since there is
no way to know how many files and directories will be allocated later
I added two optimization parameters in superblock/tunefs. They are:

        int32_t  fs_avgfilesize;   /* expected average file size */
        int32_t  fs_avgfpdir;      /* expected # of files per directory */

These parameters have reasonable defaults but may be tweeked for special
uses of a filesystem. They are only necessary in rare cases like better
tuning a filesystem being used to store a squid cache.

I have been using this algorithm for about 3 months. I have done
a lot of testing on filesystems with different capacities, average
filesize, average number of files per directory, and so on. I think
this algorithm has no negative impact on filesystem perfomance. It
works better than the default one in all cases. The new dirpref
will greatly improve untarring/removing/coping of big directories,
decrease load on cvs servers and much more. The new dirpref doesn't
speedup a compilation process, but also doesn't slow it down.

Obtained from:	Grigoriy Orlov <gluk@ptci.ru>
2001-04-10 08:38:59 +00:00
asmodai
de38bad452 Fix typo ); -> , 2001-03-24 15:25:04 +00:00
mckusick
9ebaece818 Check that background fsck operation is being done on a ufs filesystem.
Obtained from:	Robert Watson <rwatson@FreeBSD.org>
2001-03-23 20:58:25 +00:00
mckusick
a768b3faf9 Add kernel support for running fsck on active filesystems. 2001-03-21 04:09:01 +00:00
mckusick
bdf6c34569 Clear the fs_clean flag only when the FS_UNCLEAN flag is not set
(as is done in unmount).

Remove a snapshot inode from the superblock list when its last
name goes away rather than when its last reference goes away.
That way it will be properly reclaimed by fsck after a crash
rather than reenabled when the filesystem is mounted.
2001-03-21 04:05:20 +00:00
mckusick
684cbbdfad Report the correct inode number when panicing with freeing free inode.
Report the correct block number when panicing with freeing free block.
2001-03-21 04:01:02 +00:00
rwatson
bf17769feb o Change options FFS_EXTATTR and options FFS_EXTATTR_AUTOSTART to
options UFS_EXTATTR and UFS_EXTATTR_AUTOSTART respectively.  This change
  reflects the fact that our EA support is implemented entirely at the
  UFS layer (modulo FFS start/stop/autostart hooks for mount and unmount
  events).  This also better reflects the fact that [shortly] MFS will also
  support EAs, as well as possibly IFS.

o Consumers of the EA support in FFS are reminded that as a result, they
  must change kernel config files to reflect the new option names.

Obtained from:	TrustedBSD Project
2001-03-19 04:35:40 +00:00
rwatson
37684f8763 o Implement "options FFS_EXTATTR_AUTOSTART", which depends on
"options FFS_EXTATTR".  When extended attribute auto-starting
  is enabled, FFS will scan the .attribute directory off of the
  root of each file system, as it is mounted.  If .attribute
  exists, EA support will be started for the file system.  If
  there are files in the directory, FFS will attempt to start
  them as attribute backing files for attributes baring the same
  name.  All attributes are started before access to the file
  system is permitted, so this permits race-free enabling of
  attributes.  For attributes backing support for security
  features, such as ACLs, MAC, Capabilities, this is vital, as
  it prevents the file system attributes from getting out of
  sync as a result of file system operations between mount-time
  and the enabling of the extended attribute.  The userland
  extattrctl tool will still function exactly as previously.
  Files must be placed directly in .attribute, which must be
  directly off of the file system root: symbolic links are
  not permitted.  FFS_EXTATTR will continue to be able
  to function without FFS_EXTATTR_AUTOSTART for sites that do not
  want/require auto-starting.  If you're using the UFS_ACL code
  available from www.TrustedBSD.org, using FFS_EXTATTR_AUTOSTART
  is recommended.

o This support is implemented by adding an invocation of
  ufs_extattr_autostart() to ffs_mountfs().  In addition,
  several new supporting calls are introduced in
  ufs_extattr.c:

    ufs_extattr_autostart(): start EAs on the specified mount
    ufs_extattr_lookup(): given a directory and filename,
                          return the vnode for the file.
    ufs_extattr_enable_with_open(): invoke ufs_extattr_enable()
                          after doing the equililent of vn_open()
                          on the passed file.
    ufs_extattr_iterate_directory(): iterate over a directory,
                          invoking ufs_extattr_lookup() and
                          ufs_extattr_enable_with_open() on each
                          entry.

o This feature is not widely tested, and therefore may contain
  bugs, caution is advised.  Several changes are in the pipeline
  for this feature, including breaking out of EA namespaces into
  subdirectories of .attribute (this is waiting on the updated
  EA API), as well as a per-filesystem flag indicating whether
  or not EAs should be auto-started.  This is required because
  administrators may not want .attribute auto-started on all
  file systems, especially if non-administrators have write access
  to the root of a file system.

Obtained from:	TrustedBSD Project
2001-03-14 05:32:31 +00:00
mckusick
f01ec099d7 Fixes to track snapshot copy-on-write checking in the specinfo
structure rather than assuming that the device vnode would reside
in the FFS filesystem (which is obviously a broken assumption with
the device filesystem).
2001-03-07 07:09:55 +00:00
mckusick
1e6e2e153e Free lock before returning from process_worklist_item.
Obtained from:	Constantine Sapuntzakis <csapuntz@stanford.edu>
2001-03-01 21:43:46 +00:00
adrian
8650fa6cdb Reviewed by: jlemon
An initial tidyup of the mount() syscall and VFS mount code.

This code replaces the earlier work done by jlemon in an attempt to
make linux_mount() work.

* the guts of the mount work has been moved into vfs_mount().

* move `type', `path' and `flags' from being userland variables into being
  kernel variables in vfs_mount(). `data' remains a pointer into
  userspace.

* Attempt to verify the `type' and `path' strings passed to vfs_mount()
  aren't too long.

* rework mount() and linux_mount() to take the userland parameters
  (besides data, as mentioned) and pass kernel variables to vfs_mount().
  (linux_mount() already did this, I've just tidied it up a little more.)

* remove the copyin*() stuff for `path'. `data' still requires copyin*()
  since its a pointer into userland.

* set `mount->mnt_statf_mntonname' in vfs_mount() rather than in each
  filesystem.  This variable is generally initialised with `path', and
  each filesystem can override it if they want to.

* NOTE: f_mntonname is intiailised with "/" in the case of a root mount.
2001-03-01 21:00:17 +00:00
mckusick
bc35986d9d Free lock before calling panic so that subsequent attempt to write out
buffers does not re-panic with `locking against myself'. This change
should not affect normal operations of soft updates in any way.
2001-02-23 09:01:31 +00:00
mckusick
3c351c2d02 When cleaning up excess inode dependencies, check for being done.
Reviewed by:	Jan Koum <jkb@yahoo-inc.com>
2001-02-22 10:17:57 +00:00
mckusick
b5b51d050c This patch corrects two problems with the rate limiting code
that was introduced in revision 1.80. The problem manifested
itself with a `locking against myself' panic and could also
result in soft updates inconsistences associated with inodedeps.
The two problems are:

1) One of the background operations could manipulate the bitmap
while holding it locked with intent to create. This held lock
results in a `locking against myself' panic, when the background
processing that we have been coopted to do tries to lock the bitmap
which we are already holding locked. To understand how to fix this
problem, first, observe that we can do the background cleanups in
inodedep_lookup only when allocating inodedeps (DEPALLOC is set in
the call to inodedep_lookup). Second observe that calls to
inodedep_lookup with DEPALLOC set can only happen from the following
calls into the softdep code:

        softdep_setup_inomapdep
        softdep_setup_allocdirect
        softdep_setup_remove
        softdep_setup_freeblocks
        softdep_setup_directory_change
        softdep_setup_directory_add
        softdep_change_linkcnt

Only the first two of these can come from ffs_alloc.c while holding
a bitmap locked. Thus, inodedep_lookup must not go off to do
request_cleanups when being called from these functions. This change
adds a flag, NODELAY, that can be passed to inodedep_lookup to let
it know that it should not do background processing in those cases.

2) The return value from request_cleanup when helping out with the
cleanup was 0 instead of 1. This meant that despite the fact that
we may have slept while doing the cleanups, the code did not recheck
for the appearance of an inodedep (e.g., goto top in inodedep_lookup).
This lead to the softdep inconsistency in which we ended up with
two inodedep's for the same inode.

Reviewed by:	Peter Wemm <peter@yahoo-inc.com>,
		Matt Dillon <dillon@earth.backplane.com>
2001-02-20 11:14:38 +00:00
asmodai
d8144cf92b Preceed/preceeding are not english words. Use precede and preceding. 2001-02-18 10:43:53 +00:00
jake
45e3d6a42c Implement a unified run queue and adjust priority levels accordingly.
- All processes go into the same array of queues, with different
  scheduling classes using different portions of the array.  This
  allows user processes to have their priorities propogated up into
  interrupt thread range if need be.
- I chose 64 run queues as an arbitrary number that is greater than
  32.  We used to have 4 separate arrays of 32 queues each, so this
  may not be optimal.  The new run queue code was written with this
  in mind; changing the number of run queues only requires changing
  constants in runq.h and adjusting the priority levels.
- The new run queue code takes the run queue as a parameter.  This
  is intended to be used to create per-cpu run queues.  Implement
  wrappers for compatibility with the old interface which pass in
  the global run queue structure.
- Group the priority level, user priority, native priority (before
  propogation) and the scheduling class into a struct priority.
- Change any hard coded priority levels that I found to use
  symbolic constants (TTIPRI and TTOPRI).
- Remove the curpriority global variable and use that of curproc.
  This was used to detect when a process' priority had lowered and
  it should yield.  We now effectively yield on every interrupt.
- Activate propogate_priority().  It should now have the desired
  effect without needing to also propogate the scheduling class.
- Temporarily comment out the call to vm_page_zero_idle() in the
  idle loop.  It interfered with propogate_priority() because
  the idle process needed to do a non-blocking acquire of Giant
  and then other processes would try to propogate their priority
  onto it.  The idle process should not do anything except idle.
  vm_page_zero_idle() will return in the form of an idle priority
  kernel thread which is woken up at apprioriate times by the vm
  system.
- Update struct kinfo_proc to the new priority interface.  Deliberately
  change its size by adjusting the spare fields.  It remained the same
  size, but the layout has changed, so userland processes that use it
  would parse the data incorrectly.  The size constraint should really
  be changed to an arbitrary version number.  Also add a debug.sizeof
  sysctl node for struct kinfo_proc.
2001-02-12 00:20:08 +00:00
bmilekic
e67bcfcaf3 Change and clean the mutex lock interface.
mtx_enter(lock, type) becomes:

mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)

similarily, for releasing a lock, we now have:

mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.

The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.

Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:

MTX_QUIET and MTX_NOSWITCH

The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:

mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.

Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.

Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.

Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.

Finally, caught up to the interface changes in all sys code.

Contributors: jake, jhb, jasone (in no particular order)
2001-02-09 06:11:45 +00:00
phk
780a12ba92 Another round of the <sys/queue.h> FOREACH transmogriffer.
Created with:   sed(1)
Reviewed by:    md5(1)
2001-02-04 16:08:18 +00:00
phk
8363061cc7 Mechanical change to use <sys/queue.h> macro API instead of
fondling implementation details.

Created with: sed(1)
Reviewed by: md5(1)
2001-02-04 13:13:25 +00:00
phk
5888a3757b Use <sys/queue.h> macro API. 2001-02-04 12:37:48 +00:00
dillon
fa650a9da8 Fix a race between the syncer and umount. When you umount a softupdates
filesystem softdep_process_worklist() is called in a loop until it indicates
that no dependancies remain, but the determination of that fact depends on
there only being one softdep_process_worklist() instance running.  It was
possible for the syncer to also be running softdep_process_worklist()
and the pre-existing checks in the code to prevent this were not sufficient
to prevent the race.  This patch solves the problem.

Approved-by: mckusick
2001-01-30 06:31:59 +00:00
jasone
cd97c8f6f2 Convert all simplelocks to mutexes and remove the simplelock implementations. 2001-01-24 12:35:55 +00:00
iedowse
5c5ba0de81 The ffs superblock includes a 128-byte region for use by temporary
in-core pointers to summary information. An array in this region
(fs_csp) could overflow on filesystems with a very large number of
cylinder groups (~16000 on i386 with 8k blocks). When this happens,
other fields in the superblock get corrupted, and fsck refuses to
check the filesystem.

Solve this problem by replacing the fs_csp array in 'struct fs'
with a single pointer, and add padding to keep the length of the
128-byte region fixed. Update the kernel and userland utilities
to use just this single pointer.

With this change, the kernel no longer makes use of the superblock
fields 'fs_csshift' and 'fs_csmask'. Add a comment to newfs/mkfs.c
to indicate that these fields must be calculated for compatibility
with older kernels.

Reviewed by:	mckusick
2001-01-15 18:30:40 +00:00
mckusick
a33f21633a Properly compute the size of the final block of superblock summary information.
Submitted by:	Ian Dowse <iedowse@maths.tcd.ie>
2001-01-12 21:56:55 +00:00
mckusick
89297e2abf Several small but important fixes for snapshots:
1) Be more tolerant of missing snapshot files by only trying to decrement
   their reference count if they are registered as active.

2) Fix for snapshots of filesystems with block sizes larger than 8K
   (from Ollivier Robert <roberto@eurocontrol.fr>).

3) Fix to avoid losing last block in snapshot file when calculating blocks
   that need to be copied (from Don Coleman <coleman@coleman.org>).
2000-12-19 04:41:09 +00:00
mckusick
d48434fdef Get rid of spurious check in ffs_truncate for i_size == length
which fails to set the modification time on the file. The same
check a few lines later takes the correct action.

Submitted by:	Ian Dowse <iedowse@maths.tcd.ie>
2000-12-19 04:20:13 +00:00
assar
b14e706f98 add a stub for softdep_slowdown so that it's possible to build the
kernel without SOFTUPDATES
2000-12-17 23:59:56 +00:00
tanimura
35edfb12ca Do not race for the lock of an inode hash.
Reviewed by:	jhb
2000-12-13 10:04:01 +00:00
mckusick
79714fbd04 Preventing runaway kernel soft updates memory, take three.
Previously, the syncer process was the only process in the
system that could process the soft updates background work
list. If enough other processes were adding requests to that
list, it would eventually grow without bound. Because some of
the work list requests require vnodes to be locked, it was
not generally safe to let random processes process the work
list while they already held vnodes locked. By adding a flag
to the work list queue processing function to indicate whether
the calling process could safely lock vnodes, it becomes possible
to co-opt other processes into helping out with the work list.
Now when the worklist gets too large, other processes can safely
help out by picking off those work requests that can be handled
without locking a vnode, leaving only the small number of
requests requiring a vnode lock for the syncer process. With
this change, it appears possible to keep even the nastiest
workloads under control.

Submitted by:	Paul Saab <ps@yahoo-inc.com>
2000-12-13 08:30:35 +00:00
dwmalone
418e7e45d7 Convert more malloc+bzero to malloc+M_ZERO.
Submitted by:	josh@zipperup.org
Submitted by:	Robert Drehmel <robd@gmx.net>
2000-12-08 21:51:06 +00:00
phk
3daded5af8 Staticize some malloc M_ instances. 2000-12-08 20:09:00 +00:00
mckusick
1ac4366788 More aggressively rate limit the growth of soft dependency structures
in the face of multiple processes doing massive numbers of filesystem
operations. While this patch will work in nearly all situations, there
are still some perverse workloads that can overwhelm the system.
Detecting and handling these perverse workloads will be the subject
of another patch.

Reviewed by:	Paul Saab <ps@yahoo-inc.com>
Obtained from:	Ethan Solomita <ethan@geocast.com>
2000-11-20 06:22:39 +00:00
dillon
54d2fd2d6a Implement a low-memory deadlock solution.
Removed most of the hacks that were trying to deal with low-memory
    situations prior to now.

    The new code is based on the concept that I/O must be able to function in
    a low memory situation.  All major modules related to I/O (except
    networking) have been adjusted to allow allocation out of the system
    reserve memory pool.  These modules now detect a low memory situation but
    rather then block they instead continue to operate, then return resources
    to the memory pool instead of cache them or leave them wired.

    Code has been added to stall in a low-memory situation prior to a vnode
    being locked.

    Thus situations where a process blocks in a low-memory condition while
    holding a locked vnode have been reduced to near nothing.  Not only will
    I/O continue to operate, but many prior deadlock conditions simply no
    longer exist.

Implement a number of VFS/BIO fixes

	(found by Ian): in biodone(), bogus-page replacement code, the loop
        was not properly incrementing loop variables prior to a continue
        statement.  We do not believe this code can be hit anyway but we
        aren't taking any chances.  We'll turn the whole section into a
        panic (as it already is in brelse()) after the release is rolled.

	In biodone(), the foff calculation was incorrectly
        clamped to the iosize, causing the wrong foff to be calculated
        for pages in the case of an I/O error or biodone() called without
        initiating I/O.  The problem always caused a panic before.  Now it
        doesn't.  The problem is mainly an issue with NFS.

	Fixed casts for ~PAGE_MASK.  This code worked properly before only
        because the calculations use signed arithmatic.  Better to properly
        extend PAGE_MASK first before inverting it for the 64 bit masking
        op.

	In brelse(), the bogus_page fixup code was improperly throwing
        away the original contents of 'm' when it did the j-loop to
        fix the bogus pages.  The result was that it would potentially
        invalidate parts of the *WRONG* page(!), leading to corruption.

	There may still be cases where a background bitmap write is
        being duplicated, causing potential corruption.  We have identified
        a potentially serious bug related to this but the fix is still TBD.
        So instead this patch contains a KASSERT to detect the problem
  	and panic the machine rather then continue to corrupt the filesystem.
	The problem does not occur very often..  it is very hard to
	reproduce, and it may or may not be the cause of the corruption
	people have reported.

Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>)
Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>
2000-11-18 23:06:26 +00:00
mckusick
1b34fd0bd9 When deleting a file, the ordering of events imposed by soft updates
is to first write the deleted directory entry to disk, second write
the zero'ed inode to disk, and finally to release the freed blocks
and the inode back to the cylinder-group map. As this ordering
requires two disk writes to occur which are normally spaced about
30 seconds apart (except when memory is under duress), it takes
about a minute from the time that a file is deleted until its inode
and data blocks show up in the cylinder-group map for reallocation.
If a file has had only a brief lifetime (less than 30 seconds from
creation to deletion), neither its inode nor its directory entry
may have been written to disk. If its directory entry has not been
written to disk, then we need not wait for that directory block to
be written as the on-disk directory block does not reference the
inode. Similarly, if the allocated inode has never been written to
disk, we do not have to wait for it to be written back either as
its on-disk representation is still zero'ed out. Thus, in the case
of a short lived file, we can simply release the blocks and inode
to the cylinder-group map immediately. As the inode and its blocks
are released immediately, they are immediately available for other
uses. If they are not released for a minute, then other inodes and
blocks must be allocated for short lived files, cluttering up the
vnode and buffer caches. The previous code was a bit too aggressive
in trying to release the blocks and inode back to the cylinder-group
map resulting in their being made available when in fact the inode
on disk had not yet been zero'ed. This patch takes a more conservative
approach to doing the release which avoids doing the release prematurely.
2000-11-14 09:00:25 +00:00
adrian
087d6d7297 Initial commit of IFS - a inode-namespaced FFS. Here is a short
description:

How it works:
--

Basically ifs is a copy of ffs, overriding some vfs/vnops. (Yes, hack.)
I didn't see the need in duplicating all of sys/ufs/ffs to get this
off the ground.

File creation is done through a special file - 'newfile' . When newfile
is called, the system allocates and returns an inode. Note that newfile
is done in a cloning fashion:

fd = open("newfile", O_CREAT|O_RDWR, 0644);
fstat(fd, &st);

printf("new file is %d\n", (int)st.st_ino);

Once you have created a file, you can open() and unlink() it by its returned
inode number retrieved from the stat call, ie:

fd = open("5", O_RDWR);

The creation permissions depend entirely if you have write access to the
root directory of the filesystem.

To get the list of currently allocated inodes, VOP_READDIR has been added
which returns a directory listing of those currently allocated.

--

What this entails:

* patching conf/files and conf/options to include IFS as a new compile
  option (and since ifs depends upon FFS, include the FFS routines)

* An entry in i386/conf/NOTES indicating IFS exists and where to go for
  an explanation

* Unstaticize a couple of routines in src/sys/ufs/ffs/ which the IFS
  routines require (ffs_mount() and ffs_reload())

* a new bunch of routines in src/sys/ufs/ifs/ which implement the IFS
  routines. IFS replaces some of the vfsops, and a handful of vnops -
  most notably are VFS_VGET(), VOP_LOOKUP(), VOP_UNLINK() and VOP_READDIR().
  Any other directory operation is marked as invalid.

What this results in:

* an IFS partition's create permissions are controlled by the perm/ownership of
  the root mount point, just like a normal directory

* Each inode has perm and ownership too

* IFS does *NOT* mean an FFS partition can be opened per inode. This is a
  completely seperate filesystem here

* Softupdates doesn't work with IFS, and really I don't think it needs it.
  Besides, fsck's are FAST. (Try it :-)

* Inodes 0 and 1 aren't allocatable because they are special (dump/swap IIRC).
  Inode 2 isn't allocatable since UFS/FFS locks all inodes in the system against
  this particular inode, and unravelling THAT code isn't trivial. Therefore,
  useful inodes start at 3.

Enjoy, and feedback is definitely appreciated!
2000-10-14 03:02:30 +00:00
eivind
e3d38eae8e Blow away the v_specmountpoint define, replacing it with what it was
defined as (rdev->si_mountpoint)
2000-10-09 17:31:39 +00:00
rwatson
c963c0dd92 o Move initialization of ump from mp to the top of the function so that
it is defined whenm used in ufs_extattr_uepm_destroy(), fixing a panic
  due to a NULL pointer dereference.

Submitted by:	Wesley Morgan <morganw@chemicals.tacorp.com>
2000-10-06 15:31:28 +00:00
rwatson
664e07fa25 o Add call to ufs_extattr_uepm_destroy() in ffs_unmount() so as to clean
up lock on extattrs.
o Get for free a comment indicating where auto-starting of extended
  attributes will eventually occur, as it was in my commit tree also.
  No implementation change here, only a comment.
2000-10-04 04:44:51 +00:00
jasone
46ca3ece23 Convert lockmgr locks from using simple locks to using mutexes.
Add lockdestroy() and appropriate invocations, which corresponds to
lockinit() and must be called to clean up after a lockmgr lock is no
longer needed.
2000-10-04 01:29:17 +00:00
bp
e5f935eef3 Add a lock structure to vnode structure. Previously it was either allocated
separately (nfs, cd9660 etc) or keept as a first element of structure
referenced by v_data pointer(ffs). Such organization leads to known problems
with stacked filesystems.

From this point vop_no*lock*() functions maintain only interlock lock.
vop_std*lock*() functions maintain built-in v_lock structure using lockmgr().
vop_sharedlock() is compatible with vop_stdunlock(), but maintains a shared
lock on vnode.

If filesystem wishes to export lockmgr compatible lock, it can put an address
of this lock to v_vnlock field. This indicates that the upper filesystem
can take advantage of it and use single lock structure for entire (or part)
of stack of vnodes. This field shouldn't be examined or modified by VFS code
except for initialization purposes.

Reviewed in general by:	mckusick
2000-09-25 15:24:04 +00:00
rwatson
cb5476bee5 o Permit UFS Extended Attributes to be associated with special devices
and FIFOs.

Obtained from:	TrustedBSD Project
2000-09-21 19:06:02 +00:00
des
5d5e082496 Silence a warning. 2000-09-17 19:41:26 +00:00
mckusick
20521b65fe Cannot do MALLOC with M_WAITOK while holding ACQUIRE_LOCK
Obtained from:	Ethan Solomita <ethan@geocast.com>
2000-09-07 23:02:55 +00:00
jasone
daa58ba7a4 Major update to the way synchronization is done in the kernel. Highlights
include:

* Mutual exclusion is used instead of spl*().  See mutex(9).  (Note: The
  alpha port is still in transition and currently uses both.)

* Per-CPU idle processes.

* Interrupts are run in their own separate kernel threads and can be
  preempted (i386 only).

Partially contributed by:	BSDi (BSD/OS)
Submissions by (at least):	cp, dfr, dillon, grog, jake, jhb, sheldonh
2000-09-07 01:33:02 +00:00
tegge
7042946f94 Initialize *countp to 0 in stub for softdep_flushworklist().
This allows ffs_fsync() to break out of a loop that might otherwise
be infinite on kernels compiled without the SOFTUPDATES option.
The observed symptom was a system hang at the first unmount attempt.
2000-08-09 00:41:54 +00:00
roberto
eaae886bf4 Fix the lockmgr panic everyone is seeing at shutdown time.
vput assumes curproc is the lock holder, but it's not true in this case.

Thanks a lot Luoqi !

Submitted by:	luoqi
Tested by:	phk
2000-08-01 14:15:07 +00:00
peter
c3328f1778 Minor change: fix warning - move a 'struct vnode *vp' declaration inside a
#ifdef DIAGNOSTIC to match its corresponding usage.
2000-07-28 22:27:00 +00:00
mckusick
71a5e91b09 Clean up the snapshot code so that it no longer depends on the use of
the SF_IMMUTABLE flag to prevent writing. Instead put in explicit
checking for the SF_SNAPSHOT flag in the appropriate places. With
this change, it is now possible to rename and link to snapshot files.
It is also possible to set or clear any of the owner, group, or
other read bits on the file, though none of the write or execute
bits can be set. There is also an explicit test to prevent the
setting or clearing of the SF_SNAPSHOT flag via chflags() or
fchflags(). Note also that the modify time cannot be changed as
it needs to accurately reflect the time that the snapshot was taken.

Submitted by:	Robert Watson <rwatson@FreeBSD.org>
2000-07-26 23:07:01 +00:00
mckusick
e6ede683c0 Add stub for softdep_flushworklist() so that kernels compiled
without the SOFTUPDATES option will load correctly.

Obtained from:	John Baldwin <jhb@bsdi.com>
2000-07-25 05:28:59 +00:00
mckusick
29497f482d This patch corrects the first round of panics and hangs reported
with the new snapshot code.

Update addaliasu to correctly implement the semantics of the old
checkalias function. When a device vnode first comes into existence,
check to see if an anonymous vnode for the same device was created
at boot time by bdevvp(). If so, adopt the bdevvp vnode rather than
creating a new vnode for the device. This corrects a problem which
caused the kernel to panic when taking a snapshot of the root
filesystem.

Change the calling convention of vn_write_suspend_wait() to be the
same as vn_start_write().

Split out softdep_flushworklist() from softdep_flushfiles() so that
it can be used to clear the work queue when suspending filesystem
operations.

Access to buffers becomes recursive so that snapshots can recursively
traverse their indirect blocks using ffs_copyonwrite() when checking
for the need for copy on write when flushing one of their own indirect
blocks. This eliminates a deadlock between the syncer daemon and a
process taking a snapshot.

Ensure that softdep_process_worklist() can never block because of a
snapshot being taken. This eliminates a problem with buffer starvation.

Cleanup change in ffs_sync() which did not synchronously wait when
MNT_WAIT was specified. The result was an unclean filesystem panic
when doing forcible unmount with heavy filesystem I/O in progress.

Return a zero'ed block when reading a block that was not in use at
the time that a snapshot was taken. Normally, these blocks should
never be read. However, the readahead code will occationally read
them which can cause unexpected behavior.

Clean up the debugging code that ensures that no blocks be written
on a filesystem while it is suspended. Snapshots must explicitly
label the blocks that they are writing during the suspension so that
they do not cause a `write on suspended filesystem' panic.

Reorganize ffs_copyonwrite() to eliminate a deadlock and also to
prevent a race condition that would permit the same block to be
copied twice. This change eliminates an unexpected soft updates
inconsistency in fsck caused by the double allocation.

Use bqrelse rather than brelse for buffers that will be needed
soon again by the snapshot code. This improves snapshot performance.
2000-07-24 05:28:33 +00:00
bp
041d5e4f8e Prevent possible dereference of NULL pointer.
Submitted by:	Marius Bendiksen <mbendiks@eunet.no>
2000-07-13 02:17:14 +00:00
mckusick
740c0346bc Brain fault, forgot to update ffs_snapshot.c with the new calling convention
for vn_start_write.
2000-07-12 00:27:27 +00:00
mckusick
78cc524a14 Add snapshots to the fast filesystem. Most of the changes support
the gating of system calls that cause modifications to the underlying
filesystem. The gating can be enabled by any filesystem that needs
to consistently suspend operations by adding the vop_stdgetwritemount
to their set of vnops. Once gating is enabled, the function
vfs_write_suspend stops all new write operations to a filesystem,
allows any filesystem modifying system calls already in progress
to complete, then sync's the filesystem to disk and returns. The
function vfs_write_resume allows the suspended write operations to
begin again. Gating is not added by default for all filesystems as
for SMP systems it adds two extra locks to such critical kernel
paths as the write system call. Thus, gating should only be added
as needed.

Details on the use and current status of snapshots in FFS can be
found in /sys/ufs/ffs/README.snapshot so for brevity and timelyness
is not included here. Unless and until you create a snapshot file,
these changes should have no effect on your system (famous last words).
2000-07-11 22:07:57 +00:00
mckusick
58ae6e2298 Clean up warning about undeclared function by declaring softdep_fsync
in mount.h instead of ffs_extern.h. The correct solution is to use
an indirect function pointer so that the kernel does not have to be
built with options FFS, but that will be left for another day.
2000-07-11 19:28:26 +00:00
mckusick
91032d2189 Delete README as it is now obsolete. Relevant information is in
README.softupdates.
2000-07-08 02:32:49 +00:00
mckusick
1b7a66cf46 Update to reflect current status. 2000-07-08 02:31:21 +00:00
mckusick
5af5b62d4e Get userland visible flags added for snapshots to give a few days
advance preparation for them to get migrated into place so that
subsequent changes in utilities will not fail to compile for lack
of up-to-date header files in /usr/include.
2000-07-04 04:58:34 +00:00
phk
c6692f763b Make the two calls from kern/* into softupdates #ifdef SOFTUPDATES,
that is way cleaner than using the softupdates_stub stunt, which
should be killed when convenient.

Discussed with:	mckusick
2000-07-03 13:26:54 +00:00
ache
add87bd36d Remove obsoleted info about linking from contrib 2000-06-24 13:29:25 +00:00
mckusick
1183a4dc99 Update to new copyright. 2000-06-22 00:29:53 +00:00
mckusick
f29217b823 When running with quotas enabled on a filesystem using soft updates,
the system would panic when a user's inode quota was exceeded (see
PR 18959 for details). This fixes that problem.

PR:		18959
Submitted by:	Jason Godsey <jason@unixguy.fidalgo.net>
2000-06-18 22:14:28 +00:00
mckusick
eb0270eca0 Some additional performance improvements. When freeing an inode
check to see if it has been committed to disk. If it has never
been written, it can be freed immediately. For short lived files
this change allows the same inode to be reused repeatedly.
Similarly, when upgrading a fragment to a larger size, if it
has never been claimed by an inode on disk, it too can be freed
immediately making it available for reuse often in the next slowly
growing block of the same file.
2000-06-18 22:05:57 +00:00
phk
59298adb9c Revert part of my bioops change which implemented panic(8). 2000-06-16 14:32:13 +00:00
phk
c1cf1fd414 ARGH! I have too many source trees :-(
Fix prototype errors in last commit.
2000-06-16 13:00:33 +00:00
phk
f26574f541 Virtualizes & untangles the bioops operations vector.
Ref: Message-ID: <18317.961014572@critter.freebsd.dk> To: current@
2000-06-16 08:48:51 +00:00
phk
e91d82a8fb Remove a comment which should never have made it in. 2000-06-14 21:48:19 +00:00
rwatson
a68ba9634d o If FFS_EXTATTR is defined, don't print out an error message on unmount
if an FFS partition returns EOPNOTSUPP, as it just means extended
  attributes weren't enabled on that partition.  Prevents spurious
  warning per-partition at shutdown.
2000-06-04 04:50:36 +00:00
jake
5e208b0c18 Back out the previous change to the queue(3) interface.
It was not discussed and should probably not happen.

Requested by:		msmith and others
2000-05-26 02:09:24 +00:00
jake
1d685644e0 Change the way that the queue(3) structures are declared; don't assume that
the type argument to *_HEAD and *_ENTRY is a struct.

Suggested by:	phk
Reviewed by:	phk
Approved by:	mdodd
2000-05-23 20:41:01 +00:00
rwatson
e92063f247 s/ffs_unmonut/ffs_unmount/ in a gratuitous ufs_extattr printf.
Reported by:	knu
2000-05-07 17:21:08 +00:00
phk
633deb3a69 Separate the struct bio related stuff out of <sys/buf.h> into
<sys/bio.h>.

<sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall
not be made a nested include according to bdes teachings on the
subject of nested includes.

Diskdrivers and similar stuff below specfs::strategy() should no
longer need to include <sys/buf.> unless they need caching of data.

Still a few bogus uses of struct buf to track down.

Repocopy by:    peter
2000-05-05 09:59:14 +00:00
phk
929ca23961 Remove unneeded #include <vm/vm_zone.h>
Generated by:	src/tools/tools/kerninclude
2000-04-30 18:52:11 +00:00
phk
5ba878c33f s/biowait/bufwait/g
Prodded by: several.
2000-04-29 16:25:22 +00:00
phk
d7e981e734 Remove unneeded #include <sys/kernel.h> 2000-04-29 15:36:14 +00:00
phk
43018e3fb6 Remove ~25 unneeded #include <sys/conf.h>
Remove ~60 unneeded #include <sys/malloc.h>
2000-04-19 14:58:28 +00:00
rwatson
60d30a3b82 Introduce extended attribute support for FFS, allowing arbitrary
(name, value) pairs to be associated with inodes.  This support is
used for ACLs, MAC labels, and Capabilities in the TrustedBSD
security extensions, which are currently under development.

In this implementation, attributes are backed to data vnodes in the
style of the quota support in FFS.  Support for FFS extended
attributes may be enabled using the FFS_EXTATTR kernel option
(disabled by default).  Userland utilities and man pages will be
committed in the next batch.  VFS interfaces and man pages have
been in the repo since 4.0-RELEASE and are unchanged.

o ufs/ufs/extattr.h: UFS-specific extattr defines
o ufs/ufs/ufs_extattr.c: bulk of support routines
o ufs/{ufs,ffs,mfs}/*.[ch]: hooks and extattr.h includes
o contrib/softupdates/ffs_softdep.c: extattr.h includes
o conf/options, conf/files, i386/conf/LINT: added FFS_EXTATTR

o coda/coda_vfsops.c: XXX required extattr.h due to ufsmount.h
(This should not be the case, and will be fixed in a future commit)

Currently attributes are not supported in MFS.  This will be fixed.

Reviewed by:	adrian, bp, freebsd-fs, other unthanked souls
Obtained from:	TrustedBSD Project
2000-04-15 03:34:27 +00:00
phk
6746c7cf0d Move B_ERROR flag to b_ioflags and call it BIO_ERROR.
(Much of this done by script)

Move B_ORDERED flag to b_ioflags and call it BIO_ORDERED.

Move b_pblkno and b_iodone_chain to struct bio while we transition, they
will be obsoleted once bio structs chain/stack.

Add bio_queue field for struct bio aware disksort.

Address a lot of stylistic issues brought up by bde.
2000-04-02 15:24:56 +00:00
phk
37454307f3 Rename the existing BUF_STRATEGY() to DEV_STRATEGY()
substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo)

substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo)

This patch is machine generated except for the ccd.c and buf.h parts.
2000-03-20 11:29:10 +00:00
phk
f6b69faae4 Remove B_READ, B_WRITE and B_FREEBUF and replace them with a new
field in struct buf: b_iocmd.  The b_iocmd is enforced to have
exactly one bit set.

B_WRITE was bogusly defined as zero giving rise to obvious coding
mistakes.

Also eliminate the redundant struct buf flag B_CALL, it can just
as efficiently be done by comparing b_iodone to NULL.

Should you get a panic or drop into the debugger, complaining about
"b_iocmd", don't continue.  It is likely to write on your disk
where it should have been reading.

This change is a step in the direction towards a stackable BIO capability.

A lot of this patch were machine generated (Thanks to style(9) compliance!)

Vinum users:  Greg has not had time to test this yet, be careful.
2000-03-20 10:44:49 +00:00
mckusick
be19d58887 Use 64-bit math to calculate if we have hit our freespace limit.
Necessary for coherent results on filesystems bigger than 0.5Tb.
2000-03-17 03:44:47 +00:00
mckusick
3faee8cdde Use 64-bit math to decide if optimization needs to be changed.
Necessary for coherent results on filesystems bigger than 0.5Tb.

Submitted by:	Paul Saab <ps@yahoo-inc.com>
2000-03-15 07:08:36 +00:00
dillon
5d28583063 Fix a 'freeing free block' panic in UFS. The problem occurs when the
filesystem fills up.  If the first indirect block exists and FFS is able
    to allocate deeper indirect blocks, but is not able to allocate the
    data block, FFS improperly unwinds the indirect blocks and leaves a
    block pointer hanging to a freed block.  This will cause a panic later
    when the file is removed.  The solution is to properly account for the
    first block-pointer-to-an-indirect-block we had to create in a balloc
    operation and then unwind it if a failure occurs.

Detective work by: Ian Dowse <iedowse@maths.tcd.ie>
Reviewed by: mckusick, Ian Dowse <iedowse@maths.tcd.ie>
Approved by: jkh
2000-02-24 20:43:20 +00:00
mckusick
65d7ee09df When writing out bitmap buffers, need to skip over ones that already
have a write in progress. Otherwise one can get in an infinite loop
trying to get them all flushed.

Submitted by:	Matthew Dillon <dillon@apollo.backplane.com>
2000-01-30 20:32:59 +00:00
mckusick
bc4b9238bd During fastpath processing for removal of a short-lived inode, the
set of restrictions for cancelling an inode dependency (inodedep)
is somewhat stronger than originally coded. Since this check appears
in two places, we codify it into the function check_inode_unwritten
which we then call from the two sites, one freeing blocks and the
other freeing directory entries.

Submitted by:	Steinar Haug via Matthew Dillon
2000-01-18 01:33:05 +00:00
mckusick
36b3a8c34b Need to reorganize the flushing of directory entry (pagedep) dependencies
so that they never try to lock an inode corresponding to ".." as this
can lead to deadlock. We observe that any inode with an updated link count
is always pushed into its buffer at the time of the link count change, so
we do not need to do a VOP_UPDATE, but merely find its buffer and write it.
The only time we need to get the inode itself is from the result of a
mkdir whose name will never be ".." and hence locking such an inode will
never request a lock above us in the filesystem tree. Thanks to Brian
Fundakowski Feldman for providing the test program that tickled soft updates
into hanging in "inode" sleep.

Submitted by:	Brian Fundakowski Feldman <green@FreeBSD.org>
2000-01-18 01:30:03 +00:00
mckusick
718a55292d Better bounding on softdep_flushfiles; other minor tweeks to checks. 2000-01-17 06:35:11 +00:00
mckusick
22c836fb26 Must track multiple uncommitted renames until one ultimately gets
committed to disk or is removed.
2000-01-17 06:28:18 +00:00
dillon
48383502a9 Non-operational change, fix compiler warning.
Reviewed by:  mckusick
2000-01-14 04:39:28 +00:00
mckusick
3f4c19879a Confirming Peter's fix (locking 101: release the lock before you go
to sleep). Locking 101, part 2: do not look at buffer contents after
you have been asleep. There is no telling what wonderous changes may
have occurred.
2000-01-13 20:03:22 +00:00
peter
bc1be24da9 Free the global softupdates lock prior to tsleep() in getdirtybuf().
This seems to be responsible for a bunch of panics where the process
sleeps and something else finds softupdates "locked" when it shouldn't
be.  This commit is unreviewed, but has been a big help here.
Previously my boxes would panic pretty much on the first fsync() that
wrote something to disk.
2000-01-13 18:48:12 +00:00
mckusick
9c01a8e874 Because cylinder group blocks are now written in background,
it is no longer sufficient to get a lock on a buffer to know
that its write has been completed. We have to first get the
lock on the buffer, then check to see if it is doing a
background write. If it is doing background write, we have
to wait for the background write to finish, then check to see
if that fullfilled our dependency, and if not to start another
write. Luckily the explanation is longer than the fix.
2000-01-13 07:20:01 +00:00
mckusick
320709d357 A panic occurs during an fsync when a dirty block associated with
a vnode has not been written (which would clear certain of its
dependencies). The problems arises because fsync with MNT_NOWAIT
no longer pushes all the dirty blocks associated with a vnode. It
skips those that require rollbacks, since they will just get instantly
dirty again. Such skipped blocks are marked so that they will not be
skipped a second time (otherwise circular dependencies would never
clear). So, we fsync twice to ensure that everything will be written
at least once.
2000-01-13 07:17:39 +00:00
mckusick
235de16d63 The only known cause of this panic is running out of disk space.
The problem occurs when an indirect block and a data block are
being allocated at the same time. For example when the 13th block
of the file is written, the filesystem needs to allocate the first
indirect block and a data block. If the indirect block allocation
succeeds, but the data block allocation fails, the error code
dellocates the indirect block as it has nothing at which to point.
Unfortunately, it does not deallocate the indirect block's associated
dependencies which then fail when they find the block unexpectedly
gone (ptr == 0 instead of its expected value). The fix is to fsync
the file before doing the block rollback, as the fsync will flush
out all of the dependencies. Once the rollback is done the file
must be fsync'ed again so that the soft updates code does not find
unexpected changes. This approach is much slower than writing the
code to back out the extraneous dependencies, but running out of
disk space is not expected to be a common occurence, so just getting
it right is the main criterion.

PR:		kern/15063
Submitted by:	Assar Westerlund <assar@stacken.kth.se>
2000-01-11 08:27:00 +00:00
mckusick
64a267bc5d We cannot proceed to free the blocks of the file until the dependencies
have been cleaned up by deallocte_dependencies(). Once that is done, it
is safe to post the request to free the blocks. A similar change is also
needed for the freefile case.
2000-01-11 06:52:35 +00:00
phk
8ca31b2aa9 Give vn_isdisk() a second argument where it can return a suitable errno.
Suggested by:	bde
2000-01-10 12:04:27 +00:00
mckusick
53d01a7543 Missing FREE_LOCK call before handle_workitem_freeblocks.
Submitted by:	"Kenneth D. Merry" <ken@kdm.org>
2000-01-10 08:39:03 +00:00
mckusick
f4b886a6fa Several performance improvements for soft updates have been added:
1) Fastpath deletions. When a file is being deleted, check to see if it
   was so recently created that its inode has not yet been written to
   disk. If so, the delete can proceed to immediately free the inode.
2) Background writes: No file or block allocations can be done while the
   bitmap is being written to disk. To avoid these stalls, the bitmap is
   copied to another buffer which is written thus leaving the original
   available for futher allocations.
3) Link count tracking. Constantly track the difference in i_effnlink and
   i_nlink so that inodes that have had no change other than i_effnlink
   need not be written.
4) Identify buffers with rollback dependencies so that the buffer flushing
   daemon can choose to skip over them.
2000-01-10 00:24:24 +00:00
mckusick
6a981cef93 Keep tighter control of removal dependencies by limiting the number
of dirrem structure rather than the collaterally created freeblks
and freefile structures. Limit the rate of buffer dirtying by the
syncer process during periods of intense file removal.
2000-01-09 23:35:38 +00:00
mckusick
a8affb76de Reorganize softdep_fsync so that it only does the inode-is-flushed
check before the inode is unlocked while grabbing its parent directory.
Once it is unlocked, other operations may slip in that could make
the inode-is-flushed check fail. Allowing other writes to the inode
before returning from fsync does not break the semantics of fsync
since we have flushed everything that was dirty at the time of the
fsync call.
2000-01-09 23:14:57 +00:00
mckusick
177bd07cb4 Get rid of unreferenced function. 2000-01-09 22:42:42 +00:00
mckusick
acad917516 Make static non-exported functions from soft updates. 2000-01-09 22:40:09 +00:00
peter
4a06465a4e Change #ifdef KERNEL to #ifdef _KERNEL in the public headers. "KERNEL"
is an application space macro and the applications are supposed to be free
to use it as they please (but cannot).  This is consistant with the other
BSD's who made this change quite some time ago.  More commits to come.
1999-12-29 05:07:58 +00:00
bde
b91fa7bd74 Update the unclean flag for mount -u. I forgot to handle this case
when I made the absence of the clean flag sticky in rev.1.88.  This
was a problem main for "mount /".  There is no way to mount "/" for
writing without using mount -u (normally implicitly), so after
"mount -f /" of an unclean filesystem, the absence of the clean flag
was sticky forever.
1999-12-23 15:42:14 +00:00
eivind
1955bfe04b Change incorrect NULLs to 0s 1999-12-21 11:14:12 +00:00
rwatson
4906482d49 Second pass commit to introduce new ACL and Extended Attribute system
calls, vnops, vfsops, both in /kern, and to individual file systems that
require a vfsop_ array entry.

Reviewed by:	eivind
1999-12-19 06:08:07 +00:00
mckusick
c57804a0f6 The function request_cleanup() had a tsleep() with PCATCH. It is
quite dangerous, since the process may hold locks at the point,
and if it is stopped in that tsleep the machine may hang. Because
the sleep is so short, the PCATCH is not required here, so it has
been removed. For the future, the FreeBSD team needs to decide
whether it is still reasonable to stop a process in tsleep, as that
may affect any other code that uses PCATCH while holding kernel locks.

Submitted by:	Dmitrij Tejblum <tejblum@arc.hq.cti.ru>
Reviewed by:	Kirk McKusick <mckusick@mckusick.com>
1999-12-16 22:02:09 +00:00
eivind
e7998d2653 Introduce NDFREE (and remove VOP_ABORTOP) 1999-12-15 23:02:35 +00:00
eivind
b0481fa219 Lock reporting and assertion changes.
* lockstatus() and VOP_ISLOCKED() gets a new process argument and a new
  return value: LK_EXCLOTHER, when the lock is held exclusively by another
  process.
* The ASSERT_VOP_(UN)LOCKED family is extended to use what this gives them
* Extend the vnode_if.src format to allow more exact specification than
  locked/unlocked.

This commit should not do any semantic changes unless you are using
DEBUG_VFS_LOCKS.

Discussed with:	grog, mch, peter, phk
Reviewed by:	peter
1999-12-11 16:13:02 +00:00
billf
f64d5d3aac Remove the 'alpha, use at your own risk' death-statement.
Reviewed by:	mckusick (verbally at FreeBSDcon)
1999-12-03 00:40:31 +00:00
billf
e117e1483c Fix typo, add $FreeBSD$ 1999-12-03 00:34:26 +00:00
mckusick
25d6c38022 Preferentially allocate the first indirect block in the same
cylinder group as the inode. This makes a 15% difference in
read speed for files in the 96K to 500K size range.
1999-12-01 19:33:12 +00:00
phk
5588714124 Convert various pieces of code to use vn_isdisk() rather than checking
for vp->v_type == VBLK.

In ccd: we don't need to call VOP_GETATTR to find the type of a vnode.

Reviewed by:    sos
1999-11-22 10:33:55 +00:00
eivind
7b9b24d95b We do not have ffs_checkexp, so remove the prototype 1999-11-20 16:44:44 +00:00
phk
baaa589c2a struct mountlist and struct mount.mnt_list have no business being
a CIRCLEQ.  Change them to TAILQ_HEAD and TAILQ_ENTRY respectively.

This removes ugly  mp != (void*)&mountlist  comparisons.

Requested by:   phk
Submitted by:   Jake Burkholder jake@checker.org
PR:             14967
1999-11-20 10:00:46 +00:00
phk
2314dc0151 Next step in the device cleanup process.
Correctly lock vnodes when calling VOP_OPEN() from filesystem mount code.

Unify spec_open() for bdev and cdev cases.

Remove the disabled bdev specific read/write code.
1999-11-09 14:15:33 +00:00
bde
a22dee8730 Quick fix for breakage of ext2fs link counts as reported by stat(2) by
the soft updates changes: only report the link count to be i_effnlink
in ufs_getattr() for file systems that maintain i_effnlink.

Tested by:	Mike Dracopoulos <mdraco@math.uoa.gr>
1999-11-03 12:05:39 +00:00
msmith
ad99487831 Newline-terminate the complaint message about not being able to find
the root vnode pointer.
1999-11-01 23:57:28 +00:00
phk
664ff4bcbf useracc() the prequel:
Merge the contents (less some trivial bordering the silly comments)
of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>.  This puts
the #defines for the vm_inherit_t and vm_prot_t types next to their
typedefs.

This paves the road for the commit to follow shortly: change
useracc() to use VM_PROT_{READ|WRITE} rather than B_{READ|WRITE}
as argument.
1999-10-29 18:09:36 +00:00
phk
1782c989a9 Remove the D_NOCLUSTER[RW] options which were added because vn had
problems.  Now that Matt has fixed vn, this can go.  The vn driver
should have used d_maxio (now si_iosize_max) anyway.
1999-09-30 07:11:30 +00:00
phk
763f4e5596 Remove v_maxio from struct vnode.
Replace it with mnt_iosize_max in struct mount.

Nits from:	bde
1999-09-29 20:05:33 +00:00
alfred
a2166e89ae Seperate the export check in VFS_FHTOVP, exports are now checked via
VFS_CHECKEXP.

Add fh(open|stat|stafs) syscalls to allow userland to query filesystems
based on (network) filehandle.

Obtained from:	NetBSD
1999-09-11 00:46:08 +00:00
peter
897b0d59e8 $Id$ -> $FreeBSD$ 1999-08-28 02:16:32 +00:00
peter
e4b04a2b21 $Id$ -> $FreeBSD$ 1999-08-28 01:08:13 +00:00
phk
89cb1b3c74 Introduce vn_isdisk(struct vnode *vp) function, and use it to test for diskness. 1999-08-25 12:24:39 +00:00
sheldonh
9be2025a61 Fix bug introduced in rev 1.28, which causes kernel build to break for
the case where DEBUG is defined but not DIAGNOSTIC. ffs_checkblk is
declared conditionally on DIAGNOSTIC, not DEBUG.

PR:	13314
Reviewed by:	bde
1999-08-24 08:39:41 +00:00
bde
e39b15a169 Use devtoname() to print dev_t's instead of casting them to long or u_long
for misprinting in %lx format.
1999-08-23 20:35:21 +00:00
phk
4e8f5d9a5c The bdevsw() and cdevsw() are now identical, so kill the former. 1999-08-13 10:29:38 +00:00
phk
d0f8b71049 Decommision miscfs/specfs/specdev.h. Most of it goes into <sys/conf.h>,
a few lines into <sys/vnode.h>.

Add a few fields to struct specinfo, paving the way for the fun part.
1999-08-08 18:43:05 +00:00
mckusick
97c9bbb1d7 Create the macro DOINGASYNC to check whether the MNT_ASYNC flag has
been set for a mount point. Insert missing checks to ensure that all
write operations are done asynchronously when the MNT_ASYNC option
has been requested.

Submitted by:	Craig A Soules <soules+@andrew.cmu.edu>
Reviewed by:	Kirk McKusick <mckusick@mckusick.com>
1999-07-13 18:20:13 +00:00
phk
070bab3aaf Use the fsid from the superblock, unless it looks bogus or has already
been taken by some other filesystem.
1999-07-11 19:16:50 +00:00
roberto
d6e4f6885e Add $Id$
Approved by:	kirk
1999-07-07 07:51:04 +00:00
jdp
d7e486f671 Update pathnames for new location of soft-updates sources. 1999-07-03 21:34:05 +00:00
mckusick
7151ad3643 No longer need to set B_ASYNC flag since BUF_KERNPROC now
unconditionally sets the identity of the buffer.
1999-06-29 15:57:40 +00:00
peter
b5d1061c97 Keep the inlines for <sys/buf.h> happy.. 1999-06-27 13:26:23 +00:00
mckusick
cc86997b99 Convert buffer locking from using the B_BUSY and B_WANTED flags to using
lockmgr locks. This commit should be functionally equivalent to the old
semantics. That is, all buffer locking is done with LK_EXCLUSIVE
requests. Changes to take advantage of LK_SHARED and LK_RECURSIVE will
be done in future commits.
1999-06-26 02:47:16 +00:00
mckusick
93ac61452a On our final pass through ffs_fsync, do all I/O synchronously so that
we can find out if our flush is failing because of write errors. This
change avoids a "flush failed" panic during unrecoverable disk errors.
1999-06-18 05:49:46 +00:00
mckusick
7c68e9b5c4 Add a vnode argument to VOP_BWRITE to get rid of the last vnode
operator special case. Delete special case code from vnode_if.sh,
vnode_if.src, umap_vnops.c, and null_vnops.c.
1999-06-16 23:27:55 +00:00
mckusick
f3746e495b Get rid of the global variable rushjob and replace it with a function in
kern/vfs_subr.c named speedup_syncer() which handles the speedup request.
Change the various clients of rushjob to use the new function.
1999-06-15 23:37:29 +00:00
phk
447bf5968f Simplify cdevsw registration.
The cdevsw_add() function now finds the major number(s) in the
struct cdevsw passed to it.  cdevsw_add_generic() is no longer
needed, cdevsw_add() does the same thing.

cdevsw_add() will print an message if the d_maj field looks bogus.

Remove nblkdev and nchrdev variables.  Most places they were used
bogusly.  Instead check a dev_t for validity by seeing if devsw()
or bdevsw() returns NULL.

Move bdevsw() and devsw() functions to kern/kern_conf.c

Bump __FreeBSD_version to 400006

This commit removes:
        72 bogus makedev() calls
        26 bogus SYSINIT functions

if_xe.c bogusly accessed cdevsw[], author/maintainer please fix.

I4b and vinum not changed.  Patches emailed to authors.  LINT
probably broken until they catch up.
1999-05-31 11:29:30 +00:00
julian
284c90ec6d Cosmetic changes to make it compile without errors in gcc -Wall 1999-05-22 04:43:04 +00:00
mckusick
04cee7c227 Add a hook to ffs_fsync to allow soft updates to get first chance at doing
a sync on the block device for the filesystem. That allows it to push the
bitmap blocks before the inode blocks which greatly reduces the number of
inode rollbacks that need to be done.
1999-05-14 01:26:46 +00:00
peter
18fa6ea913 Try and fix a dev_t/major/minor etc nit. 1999-05-12 22:32:07 +00:00
mckusick
06d92ae45f Put back changes that might be causing trouble on Alpha. 1999-05-09 19:39:54 +00:00
phk
f1da116583 I got tired of seeing all the cdevsw[major(foo)] all over the place.
Made a new (inline) function devsw(dev_t dev) and substituted it.

Changed to the BDEV variant to this format as well: bdevsw(dev_t dev)

DEVFS will eventually benefit from this change too.
1999-05-08 06:40:31 +00:00
phk
032245568a Continue where Julian left off in July 1998:
Virtualize bdevsw[] from cdevsw.  bdevsw() is now an (inline)
        function.

        Join CDEV_MODULE and BDEV_MODULE to DEV_MODULE (please pay attention
        to the order of the cmaj/bmaj arguments!)

        Join CDEV_DRIVER_MODULE and BDEV_DRIVER_MODULE to DEV_DRIVER_MODULE
        (ditto!)

(Next step will be to convert all bdev dev_t's to cdev dev_t's
before they get to do any damage^H^H^H^H^H^Hwork in the kernel.)
1999-05-07 10:11:40 +00:00
mckusick
c8d57b6289 Whitespace cleanup. 1999-05-07 05:21:16 +00:00
mckusick
3e9638ef1f Get rid of random debugging cruft; sync up with latest version. 1999-05-07 05:11:31 +00:00
mckusick
dfd46d845f Severe slowdowns have been reported when creating or removing many
files at once on a filesystem running soft updates. The root of
the problem is that soft updates limits the amount of memory that
may be allocated to dependency structures so as to avoid hogging
kernel memory. The original algorithm just waited for the disk I/O
to catch up and reduce the number of dependencies. This new code
takes a much more aggressive approach. Basically there are two
resources that routinely hit the limit. Inode dependencies during
periods with a high file creation rate and file and block removal
dependencies during periods with a high file removal rate. I have
attacked these problems from two fronts. When the inode dependency
limits are reached, I pick a random inode dependency, UFS_UPDATE
it together with all the other dirty inodes contained within its
disk block and then write that disk block. This trick usually
clears 5-50 inode dependencies in a single disk I/O. For block and
file removal dependencies, I pick a random directory page that has
at least one remove pending and VOP_FSYNC its directory. That
releases all its removal dependencies to the work queue. To further
hasten things along, I also immediately start the work queue process
rather than waiting for its next one second scheduled run.
1999-05-07 02:26:47 +00:00
peter
48287d84dc Add sufficient braces to keep egcs happy about potentially ambiguous
if/else nesting.
1999-05-06 18:13:11 +00:00
alc
d1c34fef9c The VFS/BIO subsystem contained a number of hacks in order to optimize
piecemeal, middle-of-file writes for NFS.  These hacks have caused no
end of trouble, especially when combined with mmap().  I've removed
them.  Instead, NFS will issue a read-before-write to fully
instantiate the struct buf containing the write.  NFS does, however,
optimize piecemeal appends to files.  For most common file operations,
you will not notice the difference.  The sole remaining fragment in
the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache
coherency issues with read-merge-write style operations.  NFS also
optimizes the write-covers-entire-buffer case by avoiding the
read-before-write.  There is quite a bit of room for further
optimization in these areas.

The VM system marks pages fully-valid (AKA vm_page_t->valid =
VM_PAGE_BITS_ALL) in several places, most noteably in vm_fault.  This
is not correct operation.  The vm_pager_get_pages() code is now
responsible for marking VM pages all-valid.  A number of VM helper
routines have been added to aid in zeroing-out the invalid portions of
a VM page prior to the page being marked all-valid.  This operation is
necessary to properly support mmap().  The zeroing occurs most often
when dealing with file-EOF situations.  Several bugs have been fixed
in the NFS subsystem, including bits handling file and directory EOF
situations and buf->b_flags consistancy issues relating to clearing
B_ERROR & B_INVAL, and handling B_DONE.

getblk() and allocbuf() have been rewritten.  B_CACHE operation is now
formally defined in comments and more straightforward in
implementation.  B_CACHE for VMIO buffers is based on the validity of
the backing store.  B_CACHE for non-VMIO buffers is based simply on
whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear,
and vise-versa).  biodone() is now responsible for setting B_CACHE
when a successful read completes.  B_CACHE is also set when a bdwrite()
is initiated and when a bwrite() is initiated.  VFS VOP_BWRITE
routines (there are only two - nfs_bwrite() and bwrite()) are now
expected to set B_CACHE.  This means that bowrite() and bawrite() also
set B_CACHE indirectly.

There are a number of places in the code which were previously using
buf->b_bufsize (which is DEV_BSIZE aligned) when they should have
been using buf->b_bcount.  These have been fixed.  getblk() now clears
B_DONE on return because the rest of the system is so bad about
dealing with B_DONE.

Major fixes to NFS/TCP have been made.  A server-side bug could cause
requests to be lost by the server due to nfs_realign() overwriting
other rpc's in the same TCP mbuf chain.  The server's kernel must be
recompiled to get the benefit of the fixes.

Submitted by:	Matthew Dillon <dillon@apollo.backplane.com>
1999-05-02 23:57:16 +00:00
msmith
9e6a8a63ec Simplify the tunefs example, since tunefs uses getfsfile(). Lots of
people complain about working out what device their filesystems are
mounted on.
1999-04-27 21:11:19 +00:00
mckusick
a33170aeb3 Reorganize locking to avoid holding the lock during calls to bdwrite
and brelse (which may sleep in some systems).

Obtained from:	Matthew Dillon <dillon@apollo.backplane.com>
1999-03-02 06:38:07 +00:00
mckusick
b43aa47dc7 When fsync'ing a file on a filesystem using soft updates, we first try
to write all the dirty blocks. If some of those blocks have dependencies,
they will be remarked dirty when the I/O completes. On systems with
really fast I/O systems, it is possible to get in an infinite loop trying
to flush the buffers, because the I/O finishes before we can get all the
dirty buffers off the v_dirtyblkhd list and into the I/O queue. (The
previous algorithm looped over the v_dirtyblkhd list writing out buffers
until the list emptied.) So, now we mark each buffer that we try to
write so that we can distinguish the ones that are being remarked dirty
from those that we have not yet tried to flush. Once we have tried to
push every buffer once, we then push any associated metadata that is
causing the remaining buffers to be redirtied.

Submitted by:	Matthew Dillon <dillon@apollo.backplane.com>
1999-03-02 04:04:31 +00:00
mckusick
6932da536e Ensure that softdep_sync_metadata can handle bmsafemap and mkdir entries
if they ever arise (which should not happen as softdep_sync_metadata is
currently used).
1999-03-02 00:19:47 +00:00
mckusick
9a94f31915 fix double LIST_REMOVE; other cosmetic changes to match version 9.32.
Obtained from: Jeffrey Hsu <hsu@FreeBSD.ORG>
1999-02-17 20:01:20 +00:00
dillon
4089ccceff Fix warnings in preparation for adding -Wall -Wcast-qual to the
kernel compile
1999-01-28 00:57:57 +00:00
dg
b2ee04d692 Gutted softdep_deallocate_dependencies and replaced it with a panic. It
turns out to not be useful to unwind the dependencies and continue in
the face of a fatal error.
Also changed the log() to a printf() in softdep_error() so that it will
be output in the case of a impending panic.
Submitted by:	Kirk McKusick <mckusick@mckusick.com>
1999-01-22 09:07:32 +00:00
eivind
c0cd830ace Silence warning about unused debug function. (I'll turn this function
into a DDB command in my next staticization sweep).
1999-01-12 11:42:41 +00:00
eivind
e079899d1d Add a warning about the copyright restraints. 1999-01-08 16:03:12 +00:00
bde
620faec4dc Don't pass unused unused timestamp args to UFS_UPDATE() or waste
time initializing them.  This almost finishes centralizing (in-core)
timestamp updates in ufs_itimes().
1999-01-07 16:14:19 +00:00
bde
6b4cb42fc5 UFS_UPDATE() takes a boolean `waitfor' arg, so don't pass it the value
MNT_WAIT when we mean boolean `true' or check for that value not being
passed.  There was no problem in practice because MNT_WAIT had the
magic value of 1.
1999-01-06 18:18:06 +00:00
bde
1f6e98afde Ifdefed the conditionally used variable `prtrealloc'. Declare it
as volatile so that there is no chance that the code that it controls
is optimised away.
1999-01-06 17:04:33 +00:00
bde
988dc25f2e Backed out rev.1.47. It just broke my optimisations for lazy syncing
of timestamps in rev.1.45.  The soft updates bug was elsewhere.

Forgotten by:	luoqi
1999-01-06 16:52:38 +00:00
eivind
c9c2af0d4a Remove the 'waslocked' parameter to vfs_object_create(). 1999-01-05 18:50:03 +00:00
eivind
4205e5856a Remove the last clients of vfs_object_create(..., waslocked=1);
waslocked will go away shortly.

Reviewed by:	dg
1999-01-02 01:32:36 +00:00
julian
4252a29a60 Remove some compiler warnings. 1998-12-10 20:11:47 +00:00
bde
c486b59b75 Don't use the strange null pointer constant `(ufs_daddr_t)0' in a call
to VOP_BMAP().  Don't use uncast NULLs in the same call.
1998-11-29 03:12:06 +00:00
dg
2f1d2c0d26 Restored the "reallocblks" code to its former glory. What this does is
basically do a on-the-fly defragmentation of the FFS filesystem, changing
file block allocations to make them contiguous. Thanks to Kirk McKusick
for providing hints on what needed to be done to get this working.
1998-11-13 01:01:44 +00:00
peter
28edd064e3 Change dirty block list handling to use TAILQ macros. 1998-10-31 15:33:32 +00:00
peter
cb150e060c Use TAILQ macros for clean/dirty block list processing. Set b_xflags
rather than abusing the list next pointer with a magic number.
1998-10-31 15:31:29 +00:00
jkh
f9d5811600 Clarify a rather ambiguous debugging message. 1998-10-28 10:37:54 +00:00
bde
286d44364d Oops, the redundant tests for major numbers weren't redundant here.
They checked for the magic major number for the "device" behind mfs
mount points.  Use a more obvious check for this device.

Debugged by:		Andrew Gallatin <gallatin@cs.duke.edu>
1998-10-27 11:47:08 +00:00
bde
74c9f01b71 Don't follow null bdevsw pointers. The `major(dev) < nblkdev' test rotted
when bdevsw[] became sparse.  We still depend on magic to avoid having to
check that (v_rdev) device numbers in vnodes are not NODEV.

Removed redundant `major(dev) < nblkdev' tests instead of updating them.
1998-10-25 19:02:48 +00:00
phk
b1d651e9b4 Nitpicking and dusting performed on a train. Removes trivial warnings
about unused variables, labels and other lint.
1998-10-25 17:44:59 +00:00
nate
0704380261 Fix 'noatime' bug that was unrelated to use of noatime.
The problem is caused when a directory block is compacted.  When this
occurs, softdep_change_directoryentry_offset() is called to relocate each
directory entry and adjust its matching diradd structure, if any, to match
the new location of the entry.  The bug is that while
softdep_change_directoryentry_offset() correctly adjusts the offsets of
the diradd structures on the pd_diraddhd[] lists (which are not yet ready
to be committed to disk), it fails to adjust the offsets of the diradd
structures on the pd_pendinghd list (which are ready to be committed to
disk).  This causes the dependency structures to be inconsistent with
the buf contents.  Now, if the compaction has moved a directory entry to
the same offset as one of the diradd structures on the pd_pendinghd list
*and* a syscall is done that tries to remove this directory entry before
this directory block has been written to disk (which would empty
pd_pendinghd), a sanity check in newdirrem() will call panic() when it
notices that the inode number in the entry that it is to be removed doesn't
match the inode number in the diradd structure with that offset of that
entry.

Reviewed by:	Kirk McKusick <mckusick@McKusick.COM>
Submitted by:	Don Lewis <Don.Lewis@tsc.tdk.com>
1998-10-03 19:17:11 +00:00
bde
39b4f8d0c2 Fixed clean flag handling:
- don't set the clean flag on unmount of an unclean filesystem that was
  (forcibly) mounted rw.
- set the clean flag on rw -> ro update of a mounted initially-clean
  filesystem.
- fixed some style bugs (mostly long lines).

This uses the fs_flags field and FS_UNCLEAN state bit which were
introduced in the softdep changes.  NetBSD uses extra state bits in
fs_clean.

Reviewed by:	luoqui
1998-09-26 04:59:42 +00:00
luoqi
011a265214 Eliminate a race in VOP_FSYNC() when softupdates is enabled.
Submitted by:	Kirk McKusick	<mckusick@McKusick.COM>
Two minor changes are also included,
1. Remove gratuitious checks for error return from vn_lock with LK_RETRY set,
   vn_lock should always succeed in these cases.
2. Back out change rev. 1.36->1.37, which unnecessarily makes async mount
   a little more unstable. It also keeps us in sync with other BSDs.
Suggested by:	Bruce Evans	<bde@zeta.org.au>
1998-09-24 15:02:46 +00:00
luoqi
c4e80a41e3 Restore pre-v1.44 behavior: always copy modified in-core inode to disk
buffer. Otherwise some in-core inode changes might be lost, including
important meta data (e.g. size) if softupdates is enabled.
1998-09-15 14:45:28 +00:00
sos
cac3b1d56d Remove the SLICE code.
This clearly needs alot more thought, and we dont need this to hunt
us down in 3.0-RELEASE.
1998-09-14 19:56:42 +00:00
bde
59b9085350 Don't dereference an uninitialized pointer in dead code. The dead
code gets executed if it is compiled without optimization.
1998-09-12 14:46:15 +00:00
bde
1fa06f3088 Removed statically configured mount type numbers (MOUNT_*) and all
references to them.

The change a couple of days ago to ignore these numbers in statically
configured vfsconf structs was slightly premature because the cd9660,
cfs, devfs, ext2fs, nfs vfs's still used MOUNT_* instead of the number
in their vfsconf struct.
1998-09-07 13:17:06 +00:00
bde
09e7117248 Put the zombie ffs sysctl node in "notyet" state together with its few
remaining children.  Prepare it for MOUNT_UFS going away.
1998-09-07 11:50:19 +00:00
phk
f912bd71ae Add a new vnode op, VOP_FREEBLKS(), which filesystems can use to inform
device drivers about sectors no longer in use.

Device-drivers receive the call through d_strategy, if they have
D_CANFREE in d_flags.

This allows flash based devices to erase the sectors and avoid
pointlessly carrying them around in compactions.

Reviewed by:	Kirk Mckusick, bde
Sponsored by:	M-Systems (www.m-sys.com)
1998-09-05 14:13:12 +00:00
bde
abcb39880b Removed unused includes. 1998-08-17 19:09:36 +00:00
julian
e8b1f3cb31 Handle the case of moving a directory onto the top of a sibling's
child of the same name.

Submitted by:	Kirk Mckusick with fixes from luoqi Chen
Obtained from:   Whistle test tree.
1998-08-12 20:46:47 +00:00
bde
fbb44714a9 Fixed printf format errors. 1998-07-11 07:46:16 +00:00
julian
611dea7485 Don't update superblock if mounted readonly,
also fixes some problems with softupdates on root.
More cleanups are needed here..
Submitted by: Luoqi Chen <luoqi@watermarkgroup.com>
1998-07-08 23:52:27 +00:00
julian
5bfbd9138f VOP_STRATEGY grows an (struct vnode *) argument
as the value in b_vp is often not really what you want.
(and needs to be frobbed). more cleanups will follow this.
Reviewed by: Bruce Evans <bde@freebsd.org>
1998-07-04 20:45:42 +00:00
bde
9261cc2cca Sync timestamp changes for inodes of special files to disk as late
as possible (when the inode is reclaimed).  Temporarily only do
this if option UFS_LAZYMOD configured and softupdates aren't enabled.
UFS_LAZYMOD is intentionally left out of /sys/conf/options.

This is mainly to avoid almost useless disk i/o on battery powered
machines.  It's silly to write to disk (on the next sync or when the
inode becomes inactive) just because someone hit a key or something
wrote to the screen or /dev/null.

PR:		5577
Previous version reviewed by:	phk
1998-07-03 22:17:03 +00:00
bde
cdee4d2fbc Centralized in-core inode update. Update the in-core inode directly
in ufs_setattr() so that there is no need to pass timestamps to
UFS_UPDATE() (everything else just needs the current time).  Ignore
the passed-in timestamps in UFS_UPDATE() and always call ufs_itimes()
(was: itimes()) to do the update.  The timestamps are still passed
so that all the callers don't need to be changed yet.
1998-07-03 18:46:52 +00:00
jkh
9123290e10 Flesh this document out just a little in response to some user
questions and also recommend linking over copying since, at this stage,
a stale copy is a real concern.
1998-06-26 10:35:55 +00:00
julian
62b55f0d8d Slight change to directory cleanup
Makes soft updates a bit cleaner. Eliminates some warnings about
'corrupted directories' from fsck.
1998-06-14 19:31:28 +00:00
julian
7618cf95f3 Note which version of Kirk's sources this corresponds to. 1998-06-12 21:21:26 +00:00
julian
83d0c40d08 Fix the case when renaming to a file that you've just created and deleted,
that had an inode that has not yet been written to disk, when the inode of the
new file is also not yet written to disk, and your old directory entry is not
yet on disk but you need to remove it and the new name exists in memory
but has been deleted but the transaction to write the deleted name to disk
exists and has not yet been cancelled by the request to delete the non
existant name.  I don't know how kirk could have missed such a glaring
problem for so long. :-) Especially since the inconsitency survived on
the disk for a whole 4 second on average before being fixed by other code.
This was not a crashing bug but just led to filesystem inconsitencies
if you crashed.

Submitted by: Kirk McKusick (mckusick@mckusick.com)
1998-06-12 20:48:30 +00:00
julian
9ab92044d4 Add B_NOCACHE to several cases where BSD4.4 only required a B_INVAL.
Change worked out by john and kirk in consort.
1998-06-11 17:44:32 +00:00
julian
1e472dc252 Fix for "live inode" panic.
Submitted by: Kirk McKusick <mckusick@McKusick.COM>
Reviewed by: yeah right...
1998-06-10 20:45:46 +00:00
julian
c16287759e Remove buggy debugging code. 1998-06-10 20:03:16 +00:00
julian
10bd7912f6 Back out John's changes 1.45 -> 1.46
Kirk confirms that the original semantic was what he wanted...
(well, a very slight difference)
May fix "dangling deps" panic with soft updates.
1998-06-10 19:27:56 +00:00
dfr
2c20b9f8a3 Use size_t instead of u_int for sizes. 1998-06-04 17:21:39 +00:00
julian
1daff3f91f Add a reference to the original softupdates paper 1998-06-02 01:30:51 +00:00
julian
231adce91d Add a reference to the Ganger/Patt paper 1998-06-02 01:27:27 +00:00
julian
6cb609db70 A fix to a debug test from Kirk. 1998-05-27 03:32:23 +00:00
julian
7ee5b0760a Ensure that there is enough information here, so that people can use
soft updates should they desire.
1998-05-19 23:18:37 +00:00
julian
1d370e5321 Bring up-to-date with Whistle's current version
Includes some debugging code.
1998-05-19 23:07:25 +00:00
julian
1aca5b515a Merge with Kirk's version as of Feb 20
His version 9.23 == our version 1.5 of ffs_softdep.c
His version 9.5 ==  our version 1.4 of softdep.c
1998-05-19 22:54:53 +00:00
julian
775d9bc55b Merge in Kirk's changes to stop softupdates from hogging all of memory. 1998-05-19 21:45:53 +00:00
julian
03438335f7 Change to stop a silly panic. This should be understood better.
Change a buffer swizzle trick to a bcopy. It would be nice if the efficient
trick could be used in the future.
1998-05-19 20:50:41 +00:00
julian
294f018f94 First published FreeBSD version of soft updates Feb 5. 1998-05-19 20:18:42 +00:00
julian
59d05b8498 This commit was generated by cvs2svn to compensate for changes in r36206,
which included commits to RCS files with non-trunk default branches.
1998-05-19 20:03:29 +00:00
julian
6df1279ada Import the next version received from kirk after some
FreeBSD feedback.
1998-05-19 20:03:29 +00:00
julian
e12c18397d This commit was generated by cvs2svn to compensate for changes in r36201,
which included commits to RCS files with non-trunk default branches.
1998-05-19 19:47:22 +00:00
julian
f9429f2998 Import the earliest version of the soft update code that I have. 1998-05-19 19:47:22 +00:00
julian
d21c93a4ec try stop the user from using mount -u to set the async flag on
a filesystem currently using soft updates.
Also needs a new copy of ffs_softdep.c to complete the fix.
1998-05-18 06:38:18 +00:00
julian
b27e38e1d8 Add missing splx()
Submitted by: Luoqi Chen <luoqi@chen.ml.org>
1998-05-11 21:41:13 +00:00
msmith
54f95d7f2a As described by the submitter:
Reverse the VFS_VRELE patch.  Reference counting of vnodes does not need
to be done per-fs.  I noticed this while fixing vfs layering violations.
Doing reference counting in generic code is also the preference cited by
John Heidemann in recent discussions with him.

The implementation of alternative vnode management per-fs is still a valid
requirement for some filesystems but will be revisited sometime later,
most likely using a different framework.

Submitted by:	Michael Hancock <michaelh@cet.co.jp>
1998-05-06 05:29:41 +00:00