Commit Graph

853 Commits

Author SHA1 Message Date
Ian Dowse
2ed42812bd When compacting directories, ufs_direnter() always trusted DIRSIZ()
to supply the number of bytes to be bcopy()'d to move an entry. If
d_ino == 0 however, DIRSIZ() is not guaranteed to return a sensible
length, so ufs_direnter could end up corrupting a directory during
compaction. In practice I believe this can only happen after fsck_ffs
has fixed a previously-corrupted directory.

We now deal with any mid-block unused entries specially to avoid
using DIRSIZ() or bcopy() on such entries. We also ensure that the
variables 'dsize' and 'spacefree' contain meaningful values at all
times. Add a few comments to describe better this intricate piece
of code.

The special handling of mid-block unused entries makes the dirhash-
specific bugfix in the previous revision (1.53) now uncecessary,
so this change removes it.

Reviewed by:	mckusick
2001-08-26 01:25:12 +00:00
Ian Dowse
7dfb550e0c When compressing directory blocks, the dirhash code didn't check
that the directory entry was in use before attempting to find it
in the hash structures to change its offset. Normally, unused
entries do not need to be moved, but fsck can leave behind some
unused entries that do. A dirhash sanity panic resulted when the
entry to be moved was not found. Add a check that stops entries
with d_ino == 0 from being passed to ufsdirhash_move().
2001-08-22 01:35:17 +00:00
Peter Wemm
61a4237001 Sigh. ufs_lookup() calls ffs_snapgone(), meaning that 'options EXT2FS'
without 'options FFS' would fail to link.
2001-08-18 03:08:48 +00:00
Ian Dowse
9e27954de1 Two recent commits in sys/ufs/ufs interacted badly with ext2fs
because it shares ufs code. In ufs_fhtovp(), the test on i_effnlink
is invalid because ext2fs does not maintain this field. In ufs_close(),
i_effnlink is also tested, to determines whether or not to call
vn_start_write(). The ufs_fhtovp issue breaks NFS exporting of
ext2fs filesystems; I believe the other is harmless.

Fix both cases by checking um_i_effnlink_valid in the ufsmount
struct, and use i_nlink if necessary.

Noticed by:	bde
Reviewed by:	mckusick, bde
2001-07-29 22:26:01 +00:00
Ian Dowse
54d6d2dfaf Disable the dirhash sanity check that panics if an unused directory
entry (d_ino == 0) is found in a position that is not the start of
a DIRBLKSIZ block.

While such entries cannot occur normally (ufs always extends the
previous entry to cover the free space instead), they do not cause
problems and fsck does not fix them, so panicking is bad.
2001-07-27 18:45:41 +00:00
Peter Wemm
815d14ddab Use a fixed type for times in on-disk structures for ufs rather than
something that could potentially change like time_t.
2001-07-16 00:55:27 +00:00
Ian Dowse
50c7c3a7c8 Return a locked struct buf from ufsdirhash_lookup() to avoid one
extra getblk/brelse sequence for each lookup. We already had this
buf in ufsdirhash_lookup(), so there was no point in brelse'ing it
only to have the caller immediately reaquire the same buffer.

This should make the case of sequential lookups marginally faster;
in my tests, sequential lookups with dirhash enabled are now only
around 1% slower than without dirhash.
2001-07-13 20:50:38 +00:00
Ian Dowse
9b5ad47fb7 Bring in dirhash, a simple hash-based lookup optimisation for large
directories. When enabled via "options UFS_DIRHASH", in-core hash
arrays are maintained for large directories. These allow all
directory operations to take place quickly instead of requiring
long linear searches. For now anyway, dirhash is not enabled by
default.

The in-core hash arrays have a memory requirement that is approximately
half the size of the size of the on-disk directory file. A number
of new sysctl variables allow control over which directories get
hashed and over the maximum amount of memory that dirhash will use:

  vfs.ufs.dirhash_minsize
    The minimum on-disk directory size for which hashing should be
    used. The default is 2560 (2.5k).

  vfs.ufs.dirhash_maxmem
    The system-wide maximum total memory to be used by dirhash data
    structures. The default is 2097152 (2MB).

The current amount of memory being used by dirhash is visible
through the read-only sysctl variable vfs.ufs.dirhash_maxmem.
Finally, some extra sanity checks that are enabled by default, but
which may have an impact on performance, can be disabled by setting
vfs.ufs.dirhash_docheck to 0.

Discussed on: -fs, -hackers
2001-07-10 21:21:29 +00:00
Matthew Dillon
0cddd8f023 With Alfred's permission, remove vm_mtx in favor of a fine-grained approach
(this commit is just the first stage).  Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.
2001-07-04 16:20:28 +00:00
John Baldwin
ed87274d16 Fix more mntvnode and vnode interlock order reversals. 2001-06-28 22:21:33 +00:00
John Baldwin
49d2d9f4a4 - Fix a mntvnode and vnode interlock reversal.
- Protect the mnt_vnode list with the mntvnode lock.
- Use queue(9) macros.
2001-06-28 04:12:56 +00:00
Peter Wemm
78236790cd Fix warning:
1973: warning: int format, long int arg (arg 5)
2001-06-15 07:44:39 +00:00
Kirk McKusick
eb87cd754f Build on the change in revision 1.98 by Tor.Egge@fast.no.
The symptom being treated in 1.98 was to avoid freeing a
pagedep dependency if there was still a newdirblk dependency
referencing it. That change is correct and no longer prints
a warning message when it occurs. The other part of revision
1.98 was to panic when a newdirblk dependency was encountered
during a file truncation. This fix removes that panic and
replaces it with code to find and delete the newdirblk
dependency so that the truncation can succeed.
2001-06-13 23:13:13 +00:00
Thomas Moestl
1fbcf0ac65 Call vn_close on the backing file vnode if ufs_extattr_enable failed to
avoid leaking it.

Reviewed by:	rwatson
2001-06-07 00:11:32 +00:00
Jonathan Lemon
3b6e32b01e Add a wrapper for the fifo kqfilter which falls through to the ufs routine.
This permits the fifo to inherit the ufs VNODE kqfilter.
2001-06-06 17:40:57 +00:00
Jonathan Lemon
4e92ec9dd0 Add a kqueue filter for writing to ufs filesystems which always returns
true.  This permits better interoperability with programs which register
filters on their stdin/stdout handles.

Submitted by: Niels Provos <provos@citi.umich.edu>
2001-06-05 13:52:37 +00:00
David E. O'Brien
1239674238 There seems to be a problem that the order of disk write operation being
incorrect due to a missing check for some dependency.  This change
avoids the freelist corruption (but not the temporarily inconsistent
state of the file system).

A message is printed as a reminder of the under lying problem when a
pagedep structure is not freed due to the NEWBLOCK flag being set.

Submitted by:	Tor.Egge@fast.no
2001-06-05 01:49:37 +00:00
John Baldwin
1c11b01562 Revert the previous commit in favor of the fix in rev 1.42 of
ufs/ffs/ffs_extern.h instead.

Requested by:	bde
2001-05-30 23:09:19 +00:00
John Baldwin
55d132317c Forward declare struct cg to quiet a warning.
Submitted by:	bde
2001-05-30 23:08:40 +00:00
John Baldwin
59718ee556 Include <ufs/ffs/fs.h> to get the definition of struct cg to quiet a
warning.
2001-05-29 23:53:16 +00:00
Poul-Henning Kamp
c7a3e2379c Remove last vestiges of MFS. 2001-05-29 21:21:53 +00:00
Poul-Henning Kamp
870b4959b7 Remove MFS from the kernel. 2001-05-29 18:50:30 +00:00
Thomas Moestl
3c436f07b7 Add a check to determine whether extended attributes have been
initialized on the file system before trying to grab the lock of the
per-mount extattr structure, as this lock is unitialized in that case.
This is needed because ufs_extattr_vnode_inactive is called from
ufs_inactive, which is also used by EA-unaware file systems such as
ext2fs.

Reviewed by:	rwatson
2001-05-25 18:24:52 +00:00
Robert Watson
b1fc0ec1a7 o Merge contents of struct pcred into struct ucred. Specifically, add the
real uid, saved uid, real gid, and saved gid to ucred, as well as the
  pcred->pc_uidinfo, which was associated with the real uid, only rename
  it to cr_ruidinfo so as not to conflict with cr_uidinfo, which
  corresponds to the effective uid.
o Remove p_cred from struct proc; add p_ucred to struct proc, replacing
  original macro that pointed.
  p->p_ucred to p->p_cred->pc_ucred.
o Universally update code so that it makes use of ucred instead of pcred,
  p->p_ucred instead of p->p_pcred, cr_ruidinfo instead of p_uidinfo,
  cr_{r,sv}{u,g}id instead of p_*, etc.
o Remove pcred0 and its initialization from init_main.c; initialize
  cr_ruidinfo there.
o Restruction many credential modification chunks to always crdup while
  we figure out locking and optimizations; generally speaking, this
  means moving to a structure like this:
        newcred = crdup(oldcred);
        ...
        p->p_ucred = newcred;
        crfree(oldcred);
  It's not race-free, but better than nothing.  There are also races
  in sys_process.c, all inter-process authorization, fork, exec, and
  exit.
o Remove sigio->sio_ruid since sigio->sio_ucred now contains the ruid;
  remove comments indicating that the old arrangement was a problem.
o Restructure exec1() a little to use newcred/oldcred arrangement, and
  use improved uid management primitives.
o Clean up exit1() so as to do less work in credential cleanup due to
  pcred removal.
o Clean up fork1() so as to do less work in credential cleanup and
  allocation.
o Clean up ktrcanset() to take into account changes, and move to using
  suser_xxx() instead of performing a direct uid==0 comparision.
o Improve commenting in various kern_prot.c credential modification
  calls to better document current behavior.  In a couple of places,
  current behavior is a little questionable and we need to check
  POSIX.1 to make sure it's "right".  More commenting work still
  remains to be done.
o Update credential management calls, such as crfree(), to take into
  account new ruidinfo reference.
o Modify or add the following uid and gid helper routines:
      change_euid()
      change_egid()
      change_ruid()
      change_rgid()
      change_svuid()
      change_svgid()
  In each case, the call now acts on a credential not a process, and as
  such no longer requires more complicated process locking/etc.  They
  now assume the caller will do any necessary allocation of an
  exclusive credential reference.  Each is commented to document its
  reference requirements.
o CANSIGIO() is simplified to require only credentials, not processes
  and pcreds.
o Remove lots of (p_pcred==NULL) checks.
o Add an XXX to authorization code in nfs_lock.c, since it's
  questionable, and needs to be considered carefully.
o Simplify posix4 authorization code to require only credentials, not
  processes and pcreds.  Note that this authorization, as well as
  CANSIGIO(), needs to be updated to use the p_cansignal() and
  p_cansched() centralized authorization routines, as they currently
  do not take into account some desirable restrictions that are handled
  by the centralized routines, as well as being inconsistent with other
  similar authorization instances.
o Update libkvm to take these changes into account.

Obtained from:	TrustedBSD Project
Reviewed by:	green, bde, jhb, freebsd-arch, freebsd-audit
2001-05-25 16:59:11 +00:00
Matthew Dillon
ac8f990bde This patch implements O_DIRECT about 80% of the way. It takes a patchset
Tor created a while ago, removes the raw I/O piece (that has cache coherency
problems), and adds a buffer cache / VM freeing piece.

Essentially this patch causes O_DIRECT I/O to not be left in the cache, but
does not prevent it from going through the cache, hence the 80%.  For
the last 20% we need a method by which the I/O can be issued directly to
buffer supplied by the user process and bypass the buffer cache entirely,
but still maintain cache coherency.

I also have the code working under -stable but the changes made to sys/file.h
may not be MFCable, so an MFC is not on the table yet.

Submitted by:	tegge, dillon
2001-05-24 07:22:27 +00:00
Alfred Perlstein
1752ee59ba ufs_bmaparray() may block on IO, drop vm mutex and aquire Giant when
calling it from the pager routine
2001-05-23 10:30:25 +00:00
Ruslan Ermilov
99d300a1ec - FDESC, FIFO, NULL, PORTAL, PROC, UMAP and UNION file
systems were repo-copied from sys/miscfs to sys/fs.

- Renamed the following file systems and their modules:
  fdesc -> fdescfs, portal -> portalfs, union -> unionfs.

- Renamed corresponding kernel options:
  FDESC -> FDESCFS, PORTAL -> PORTALFS, UNION -> UNIONFS.

- Install header files for the above file systems.

- Removed bogus -I${.CURDIR}/../../sys CFLAGS from userland
  Makefiles.
2001-05-23 09:42:29 +00:00
Kirk McKusick
57042c7f72 Update softdep_setup_directory_add prototype to reflect changes in
actual function.

Obtained from:	Jim Bloom <bloom@jbloom.jbloom.org>
2001-05-20 15:59:55 +00:00
Kirk McKusick
dc01275be9 Must ensure that all the entries on the pd_pendinghd list have been
committed to disk before clearing them. More specifically, when
free_newdirblk is called, we know that the inode claims the new
directory block. However, if the associated pagedep is still linked
onto the directory buffer dependency chain, then some of the entries
on the pd_pendinghd list may not be committed to disk yet. In this
case, we will simply note that the inode claims the block and let
the pd_pendinghd list be processed when the pagedep is next written.
If the pagedep is no longer on the buffer dependency chain, then
all the entries on the pd_pending list are committed to disk and
we can free them in free_newdirblk. This corrects a window of
vulnerability introduced in the code added in version 1.95.
2001-05-19 19:24:26 +00:00
Alfred Perlstein
2395531439 Introduce a global lock for the vm subsystem (vm_mtx).
vm_mtx does not recurse and is required for most low level
vm operations.

faults can not be taken without holding Giant.

Memory subsystems can now call the base page allocators safely.

Almost all atomic ops were removed as they are covered under the
vm mutex.

Alpha and ia64 now need to catch up to i386's trap handlers.

FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).

Reviewed (partially) by: jake, jhb
2001-05-19 01:28:09 +00:00
Kirk McKusick
9f5192ff71 Must be a bit less aggressive about freeing pagedep structures.
Obtained from:	Robert Watson <rwatson@FreeBSD.org> and
		Matthew Jacob <mjacob@feral.com>
2001-05-18 22:16:28 +00:00
Kirk McKusick
24a83a4b3f When a new block is allocated to a directory, an fsync of a file
whose name is within that block must ensure not only that the block
containing the file name has been written, but also that the on-disk
directory inode references that block. When a new directory block
is created, we allocate a newdirblk structure which is linked to
the associated allocdirect (on its ad_newdirblk list). When the
allocdirect has been satisfied, the newdirblk structure is moved
to the inodedep id_bufwait list of its directory to await the inode
being written.  When the inode is written, the directory entries
are fully committed and can be deleted from their pagedep->id_pendinghd
and inodedep->id_pendinghd lists.
2001-05-17 07:24:03 +00:00
Ian Dowse
0864ef1e8a Change the second argument of vflush() to an integer that specifies
the number of references on the filesystem root vnode to be both
expected and released. Many filesystems hold an extra reference on
the filesystem root vnode, which must be accounted for when
determining if the filesystem is busy and then released if it isn't
busy. The old `skipvp' approach required individual filesystem
xxx_unmount functions to re-implement much of vflush()'s logic to
deal with the root vnode.

All 9 filesystems that hold an extra reference on the root vnode
got the logic wrong in the case of forced unmounts, so `umount -f'
would always fail if there were any extra root vnode references.
Fix this issue centrally in vflush(), now that we can.

This commit also fixes a vnode reference leak in devfs, which could
result in idle devfs filesystems that refuse to unmount.

Reviewed by:	phk, bp
2001-05-16 18:04:37 +00:00
Kirk McKusick
7389126d9a Further fixes for deadlock in the presence of multiple snapshots.
There are still more to find, but this fix should cover the
common cases that folks are hitting.
2001-05-14 17:16:49 +00:00
Kirk McKusick
0b04113700 If the effective link count is zero when an NFS file handle request
comes in for it, the file is really gone, so return ESTALE.

The problem arises when the last reference to an FFS file is
released because soft-updates may delay the actual freeing of the
inode for some time. Since there are no filesystem links or open
file descriptors referencing the inode, from the point of view of
the system, the file is inaccessible. However, if the filesystem
is NFS exported, then the remote client can still access the inode
via ufs_fhtovp() until the inode really goes away. To prevent this
anomoly, it is necessary to begin returning ESTALE at the same time
that the file ceases to be accessible to the local filesystem.

Obtained from:	Ian Dowse <iedowse@maths.tcd.ie>
2001-05-13 23:30:45 +00:00
Kirk McKusick
9b35c30cf7 Remove yet another deadlock case. 2001-05-11 07:12:03 +00:00
Kirk McKusick
9ccb939ef0 When running with soft updates, track the number of blocks and files
that are committed to being freed and reflect these blocks in the
counts returned by statfs (and thus also by the `df' command). This
change allows programs such as those that do news expiration to
know when to stop if they are trying to create a certain percentage
of free space. Note that this change does not solve the much harder
problem of making this to-be-freed space available to applications
that want it (thus on a nearly full filesystem, you may still
encounter out-of-space conditions even though the free space will
show up eventually). Hopefully this harder problem will be the
subject of a future enhancement.
2001-05-08 07:42:20 +00:00
Kirk McKusick
27b047acf0 Several fixes for units errors:
1) Do not assume that the superblock will be of size fs->fs_bsize.
   This fixes a panic when taking a snapshot on a filesystem with
   a block size bigger than 8K.
2) Properly calculate the number of fragments that follow the
   superblock summary information. This fixes a bug with inconsistent
   snapshots.
3) When cleaning up a snapshot that is about to be removed, properly
   calculate the number of blocks that need to be checked. This fixes
   a bug that created partially allocated inodes.
4) When moving blocks from a snapshot that is about to be removed
   to another snapshot, properly account for the reduced number of
   blocks in the snapshot from which they are taken. This fixes a
   bug in which the number of blocks released from a snapshot did not
   match the number that it claimed to have.
2001-05-08 07:29:03 +00:00
Kirk McKusick
0c6fbff0a5 When syncing out snapshot metadata, we must temporarily allow recursive
buffer locking so as to avoid locking against ourselves if we need to
write filesystem metadata.
2001-05-08 07:13:00 +00:00
Kirk McKusick
23371b2f22 Refinement to revision 1.16 of ufs/ffs/ffs_snapshot.c to reduce
the amount of time that the filesystem must be suspended. The
current snapshot is elided as well as the earlier snapshots.
2001-05-04 05:49:28 +00:00
Poul-Henning Kamp
3858e5e797 Use ufs_bmaparray() rather than VOP_BMAP() on our own vnodes. 2001-05-01 09:12:39 +00:00
Poul-Henning Kamp
3c7a8027cb Remove blatantly pointless call to VOP_BMAP().
Use ufs_bmaparray() rather than VOP_BMAP() on our own vnodes.
2001-05-01 09:12:31 +00:00
Poul-Henning Kamp
a62615e59b Implement vop_std{get|put}pages() and add them to the default vop[].
Un-copy&paste all the VOP_{GET|PUT}PAGES() functions which do nothing but
the default.
2001-05-01 08:34:45 +00:00
Mark Murray
fb919e4d5a Undo part of the tangle of having sys/lock.h and sys/mutex.h included in
other "system" header files.

Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h form kernel .c files.

Sort sys/*.h includes where possible in affected files.

OK'ed by:	bde (with reservations)
2001-05-01 08:13:21 +00:00
Poul-Henning Kamp
855aa097af VOP_BALLOC was never really a VOP in the first place, so convert it
to UFS_BALLOC like the other "between UFS and FFS function interfaces".
2001-04-29 12:36:52 +00:00
Poul-Henning Kamp
b7ebffbc08 Add a vop_stdbmap(), and make it part of the default vop vector.
Make 7 filesystems which don't really know about VOP_BMAP rely
on the default vector, rather than more or less complete local
vop_nopbmap() implementations.
2001-04-29 11:48:41 +00:00
Poul-Henning Kamp
f2ddd13ad2 Call ufs_bmaparray() directly instead of indirectly via VOP_BMAP(). 2001-04-29 10:25:30 +00:00
Poul-Henning Kamp
954a0e256e Remove two unused arguments from ufs_bmaparray(). 2001-04-29 10:24:58 +00:00
Poul-Henning Kamp
e955479077 Remove faint traces of blind copy&paste. 2001-04-29 10:23:50 +00:00
Poul-Henning Kamp
0c25dbeb17 Remove faint traces of non-existant ffs_bmap(). 2001-04-29 10:23:32 +00:00
Greg Lehey
60fb0ce365 Revert consequences of changes to mount.h, part 2.
Requested by:	bde
2001-04-29 02:45:39 +00:00
Kirk McKusick
c9509f5865 Rather than copying all the indirect blocks of the snapshot,
simply mark them as BLK_NOCOPY. This trick cuts the initial
size of the snapshot in half and cuts the time to take a
snapshot by a third.
2001-04-26 00:50:53 +00:00
Kirk McKusick
112f737245 When closing the last reference to an unlinked file, it is freed
by the inactive routine. Because the freeing causes the filesystem
to be modified, the close must be held up during periods when the
filesystem is suspended.

For snapshots to be consistent across crashes, they must write
blocks that they copy and claim those written blocks in their
on-disk block pointers before the old blocks that they referenced
can be allowed to be written.

Close a loophole that allowed unwritten blocks to be skipped when
doing ffs_sync with a request to wait for all I/O activity to be
completed.
2001-04-25 08:11:18 +00:00
Poul-Henning Kamp
a13234bb35 Move the netexport structure from the fs-specific mountstructure
to struct mount.

This makes the "struct netexport *" paramter to the vfs_export
and vfs_checkexport interface unneeded.

Consequently that all non-stacking filesystems can use
vfs_stdcheckexp().

At the same time, make it a pointer to a struct netexport
in struct mount, so that we can remove the bogus AF_MAX
and #include <net/radix.h> from <sys/mount.h>
2001-04-25 07:07:52 +00:00
Ian Dowse
5d69bac493 Pre-dirpref versions of fsck may zero out the new superblock fields
fs_contigdirs, fs_avgfilesize and fs_avgfpdir. This could cause
panics if these fields were zeroed while a filesystem was mounted
read-only, and then remounted read-write.

Add code to ffs_reload() which copies the fs_contigdirs pointer
from the previous superblock, and reinitialises fs_avgf* if necessary.

Reviewed by:	mckusick
2001-04-24 00:37:16 +00:00
Greg Lehey
d98dc34f52 Correct #includes to work with fixed sys/mount.h. 2001-04-23 09:05:15 +00:00
Poul-Henning Kamp
f84e29a06c This patch removes the VOP_BWRITE() vector.
VOP_BWRITE() was a hack which made it possible for NFS client
side to use struct buf with non-bio backing.

This patch takes a more general approach and adds a bp->b_op
vector where more methods can be added.

The success of this patch depends on bp->b_op being initialized
all relevant places for some value of "relevant" which is not
easy to determine.  For now the buffers have grown a b_magic
element which will make such issues a tiny bit easier to debug.
2001-04-17 08:56:39 +00:00
Kirk McKusick
5819ab3f12 Add debugging option to always read/write cylinder groups as full
sized blocks. To enable this option, use: `sysctl -w debug.bigcgs=1'.
Add debugging option to disable background writes of cylinder
groups. To enable this option, use: `sysctl -w debug.dobkgrdwrite=0'.
These debugging options should be tried on systems that are panicing
with corrupted cylinder group maps to see if it makes the problem
go away. The set of panics in question are:

	ffs_clusteralloc: map mismatch
	ffs_nodealloccg: map corrupted
	ffs_nodealloccg: block not in map
	ffs_alloccg: map corrupted
	ffs_alloccg: block not in map
	ffs_alloccgblk: cyl groups corrupted
	ffs_alloccgblk: can't find blk in cyl
	ffs_checkblk: partially free fragment

The following panics are less likely to be related to this problem,
but might be helped by these debugging options:

	ffs_valloc: dup alloc
	ffs_blkfree: freeing free block
	ffs_blkfree: freeing free frag
	ffs_vfree: freeing free inode

If you try these options, please report whether they helped reduce your
bitmap corruption panics to Kirk McKusick at <mckusick@mckusick.com>
and to Matt Dillon <dillon@earth.backplane.com>.
2001-04-17 05:37:51 +00:00
Kirk McKusick
f0f3f19f05 Background fsck sysctl operations must use vn_start_write and
vn_finished_write so that they do not attempt to modify a
suspended filesystem.
2001-04-17 05:06:37 +00:00
Robert Watson
b114e127e6 In my first reading of POSIX.1e, I misinterpreted handling of the
ACL_USER_OBJ and ACL_GROUP_OBJ fields, believing that modification of the
access ACL could be used by privileged processes to change file/directory
ownership.  In fact, this is incorrect; ACL_*_OBJ (+ ACL_MASK and
ACL_OTHER) should have undefined ae_id fields; this commit attempts
to correct that misunderstanding.

o Modify arguments to vaccess_acl_posix1e() to accept the uid and gid
  associated with the vnode, as those can no longer be extracted from
  the ACL passed as an argument.  Perform all comparisons against
  the passed arguments.  This actually has the effect of simplifying
  a number of components of this call, as well as reducing the indent
  level, but now seperates handling of ACL_GROUP_OBJ from ACL_GROUP.

o Modify acl_posix1e_check() to return EINVAL if the ae_id field of
  any of the ACL_{USER_OBJ,GROUP_OBJ,MASK,OTHER} entries is a value
  other than ACL_UNDEFINED_ID.  As a temporary work-around to allow
  clean upgrades, set the ae_id field to ACL_UNDEFINED_ID before
  each check so that this cannot cause a failure in the short term
  (this work-around will be removed when the userland libraries and
  utilities are updated to take this change into account).

o Modify ufs_sync_acl_from_inode() so that it forces
  ACL_{USER_OBJ,GROUP_OBJ,MASK,OTHER} ae_id fields to ACL_UNDEFINED_ID
  when synchronizing the ACL from the inode.

o Modify ufs_sync_inode_from_acl to not propagate uid and gid
  information to the inode from the ACL during ACL update.  Also
  modify the masking of permission bits that may be set from
  ALLPERMS to (S_IRWXU|S_IRWXG|S_IRWXO), as ACLs currently do not
  carry none-ACCESSPERMS (S_ISUID, S_ISGID, S_ISTXT).

o Modify ufs_getacl() so that when it emulates an access ACL from
  the inode, it initializes the ae_id fields to ACL_UNDEFINED_ID.

o Clean up ufs_setacl() substantially since it is no longer possible
  to perform chown/chgrp operations using vop_setacl(), so all the
  access control for that can be eliminated.

o Modify ufs_access() so that it passes owner uid and gid information
  into vaccess_acl_posix1e().

Pointed out by:	jedger
Obtained from:	TrustedBSD Project
2001-04-17 04:33:34 +00:00
Kirk McKusick
74046077a7 Update to describe use of mdconfig instead of deprecated vnconfig.
Submitted by:	Steve Ames <steve@virtual-voodoo.com>
2001-04-14 18:32:09 +00:00
Kirk McKusick
1a6a661032 This checkin adds support in ufs/ffs for the FS_NEEDSFSCK flag.
It is described in ufs/ffs/fs.h as follows:

/*
 * Filesystem flags.
 *
 * Note that the FS_NEEDSFSCK flag is set and cleared only by the
 * fsck utility. It is set when background fsck finds an unexpected
 * inconsistency which requires a traditional foreground fsck to be
 * run. Such inconsistencies should only be found after an uncorrectable
 * disk error. A foreground fsck will clear the FS_NEEDSFSCK flag when
 * it has successfully cleaned up the filesystem. The kernel uses this
 * flag to enforce that inconsistent filesystems be mounted read-only.
 */
#define FS_UNCLEAN    0x01	/* filesystem not clean at mount */
#define FS_DOSOFTDEP  0x02	/* filesystem using soft dependencies */
#define FS_NEEDSFSCK  0x04	/* filesystem needs sync fsck before mount */
2001-04-14 05:26:28 +00:00
Kirk McKusick
a61ab64ac4 Directory layout preference improvements from Grigoriy Orlov <gluk@ptci.ru>.
His description of the problem and solution follow. My own tests show
speedups on typical filesystem intensive workloads of 5% to 12% which
is very impressive considering the small amount of code change involved.

------

  One day I noticed that some file operations run much faster on
small file systems then on big ones. I've looked at the ffs
algorithms, thought about them, and redesigned the dirpref algorithm.

  First I want to describe the results of my tests. These results are old
and I have improved the algorithm after these tests were done. Nevertheless
they show how big the perfomance speedup may be. I have done two file/directory
intensive tests on a two OpenBSD systems with old and new dirpref algorithm.
The first test is "tar -xzf ports.tar.gz", the second is "rm -rf ports".
The ports.tar.gz file is the ports collection from the OpenBSD 2.8 release.
It contains 6596 directories and 13868 files. The test systems are:

1. Celeron-450, 128Mb, two IDE drives, the system at wd0, file system for
   test is at wd1. Size of test file system is 8 Gb, number of cg=991,
   size of cg is 8m, block size = 8k, fragment size = 1k OpenBSD-current
   from Dec 2000 with BUFCACHEPERCENT=35

2. PIII-600, 128Mb, two IBM DTLA-307045 IDE drives at i815e, the system
   at wd0, file system for test is at wd1. Size of test file system is 40 Gb,
   number of cg=5324, size of cg is 8m, block size = 8k, fragment size = 1k
   OpenBSD-current from Dec 2000 with BUFCACHEPERCENT=50

You can get more info about the test systems and methods at:
http://www.ptci.ru/gluk/dirpref/old/dirpref.html

                              Test Results

             tar -xzf ports.tar.gz               rm -rf ports
  mode  old dirpref new dirpref speedup old dirprefnew dirpref speedup
                             First system
 normal     667         472      1.41       477        331       1.44
 async      285         144      1.98       130         14       9.29
 sync       768         616      1.25       477        334       1.43
 softdep    413         252      1.64       241         38       6.34
                             Second system
 normal     329         81       4.06       263.5       93.5     2.81
 async      302         25.7    11.75       112          2.26   49.56
 sync       281         57.0     4.93       263         90.5     2.9
 softdep    341         40.6     8.4        284          4.76   59.66

"old dirpref" and "new dirpref" columns give a test time in seconds.
speedup - speed increasement in times, ie. old dirpref / new dirpref.

------

Algorithm description

The old dirpref algorithm is described in comments:

/*
 * Find a cylinder to place a directory.
 *
 * The policy implemented by this algorithm is to select from
 * among those cylinder groups with above the average number of
 * free inodes, the one with the smallest number of directories.
 */

A new directory is allocated in a different cylinder groups than its
parent directory resulting in a directory tree that is spreaded across
all the cylinder groups. This spreading out results in a non-optimal
access to the directories and files. When we have a small filesystem
it is not a problem but when the filesystem is big then perfomance
degradation becomes very apparent.

What I mean by a big file system ?

  1. A big filesystem is a filesystem which occupy 20-30 or more percent
     of total drive space, i.e. first and last cylinder are physically
     located relatively far from each other.
  2. It has a relatively large number of cylinder groups, for example
     more cylinder groups than 50% of the buffers in the buffer cache.

The first results in long access times, while the second results in
many buffers being used by metadata operations. Such operations use
cylinder group blocks and on-disk inode blocks. The cylinder group
block (fs->fs_cblkno) contains struct cg, inode and block bit maps.
It is 2k in size for the default filesystem parameters. If new and
parent directories are located in different cylinder groups then the
system performs more input/output operations and uses more buffers.
On filesystems with many cylinder groups, lots of cache buffers are
used for metadata operations.

My solution for this problem is very simple. I allocate many directories
in one cylinder group. I also do some things, so that the new allocation
method does not cause excessive fragmentation and all directory inodes
will not be located at a location far from its file's inodes and data.
The algorithm is:
/*
 * Find a cylinder group to place a directory.
 *
 * The policy implemented by this algorithm is to allocate a
 * directory inode in the same cylinder group as its parent
 * directory, but also to reserve space for its files inodes
 * and data. Restrict the number of directories which may be
 * allocated one after another in the same cylinder group
 * without intervening allocation of files.
 *
 * If we allocate a first level directory then force allocation
 * in another cylinder group.
 */

  My early versions of dirpref give me a good results for a wide range of
file operations and different filesystem capacities except one case:
those applications that create their entire directory structure first
and only later fill this structure with files.

  My solution for such and similar cases is to limit a number of
directories which may be created one after another in the same cylinder
group without intervening file creations. For this purpose, I allocate
an array of counters at mount time. This array is linked to the superblock
fs->fs_contigdirs[cg]. Each time a directory is created the counter
increases and each time a file is created the counter decreases. A 60Gb
filesystem with 8mb/cg requires 10kb of memory for the counters array.

  The maxcontigdirs is a maximum number of directories which may be created
without an intervening file creation. I found in my tests that the best
performance occurs when I restrict the number of directories in one cylinder
group such that all its files may be located in the same cylinder group.
There may be some deterioration in performance if all the file inodes
are in the same cylinder group as its containing directory, but their
data partially resides in a different cylinder group. The maxcontigdirs
value is calculated to try to prevent this condition. Since there is
no way to know how many files and directories will be allocated later
I added two optimization parameters in superblock/tunefs. They are:

        int32_t  fs_avgfilesize;   /* expected average file size */
        int32_t  fs_avgfpdir;      /* expected # of files per directory */

These parameters have reasonable defaults but may be tweeked for special
uses of a filesystem. They are only necessary in rare cases like better
tuning a filesystem being used to store a squid cache.

I have been using this algorithm for about 3 months. I have done
a lot of testing on filesystems with different capacities, average
filesize, average number of files per directory, and so on. I think
this algorithm has no negative impact on filesystem perfomance. It
works better than the default one in all cases. The new dirpref
will greatly improve untarring/removing/coping of big directories,
decrease load on cvs servers and much more. The new dirpref doesn't
speedup a compilation process, but also doesn't slow it down.

Obtained from:	Grigoriy Orlov <gluk@ptci.ru>
2001-04-10 08:38:59 +00:00
Robert Watson
f6958f21cd o Indent sub-section headings to be consistent with README.extattr.
Obtained from:	TrustedBSD Project
2001-04-03 18:05:03 +00:00
Robert Watson
45b4a0b163 o Introduce a README file describing briefly how to use access control
lists, in the style of FFS README files for soft updates and snapshots.

Obtained from:        TrustedBSD Project
2001-04-03 17:58:25 +00:00
Robert Watson
8f0644a765 o Introduce a README file describing briefly how to use extended
attributes, in the style of FFS README files for soft updates and
  snapshots.

Obtained from:	TrustedBSD Project
2001-04-03 17:31:36 +00:00
Robert Watson
bfe2c6ad72 o Change the default from using IO_SYNC on EA set and delete operations
to not using IO_SYNC.  Expose a sysctl (debug.ufs_extattr_sync) for
  enabling the use of IO_SYNC.

    - Use of IO_SYNC substantially degrades ACL performance when a
      default ACL is set on a directory, as there are four synchronous
      writes initiated to define both supporting EAs for new
      sub-directories, and to set the data; two for new files.  Later, this
      may be optimized to two writes for sub-directories, one for new
      files.

    - IO_SYNC does not substantially improve consistency properties due
      to the poor consistency properties of existing permissions (which
      ACLs are a superset of), due to interaction with soft updates,
      and due to differences in handling consistency for data and file
      system meta-data.

    - In macro-benchmarks, this reduces the overhead of setting default
      ACLs down to the same overhead as enabling ACLs on a file system
      and not using them.  Enabling ACLs still introduces a small
      overhead (I measure 7% on a -j 2 buildworld with pre-allocated
      EA backing store, but this is not rigorous testing, nor in any way
      optimized).

    - The sysctl will probably change to another administration method
      (or at least, a better name) in the near future, but consistency
      properties of EAs are still being worked out.  The toggle is defined
      right now to allow easier performance analysis and exploration
      of possible guarantees.

Obtained from:	TrustedBSD Project
2001-04-03 04:09:53 +00:00
Robert Watson
aad65d6f79 o Correct an ACL implementation bug that could result in a system panic
under heavy use when default ACLs were bgin inherited by new files
  or directories.  This is done by removing a bug in default ACL
  reading, and improving error handling for this failure case:

    - Move the setting of the buffer length (len) variable to above the
      ACL type (ap->a_type) switch rather than having it only for
      ACL_TYPE_ACCESS.  Otherwise, the len variable is unitialized in
      the ACL_TYPE_DEFAULT case, which generally worked right, but could
      result in failure.

    - Add a check for a short/long read of the ACL_TYPE_DEFAULT type from
      the underlying EA, resulting in EPERM rather than passing a
      potentially corrupted ACL back to the caller (resulting "cleaner"
      failures if the EA is damaged: right now, the caller will almost
      always panic in the presence of a corrupted EA).  This code is similar
      to code in the ACL_TYPE_ACCESS handling in the previous switch case.

    - While I'm fixing this code, remove a redundant bzero() of the ACL
      reader buffer; it need only be initialized above the acl_type
      switch.

Obtained from:	TrustedBSD Project
2001-04-02 01:02:32 +00:00
Robert Watson
a70f27470f Introduce support for POSIX.1e ACLs on UFS-based file systems. This
implementation is still experimental, and while fairly broadly tested,
is not yet intended for production use.  Support for POSIX.1e ACLs on
UFS will not be MFC'd to RELENG_4.

This implementation works by providing implementations of VOP_[GS]ETACL()
for FFS, as well as modifying the appropriate access control and file
creation routines.  In this implementation, ACLs are backed into extended
attributes; the base ACL (owner, group, other) permissions remain in the
inode for performance and compatibility reasons, so only the extended and
default ACLs are placed in extended attributes.  The logic for ACL
evaluation is provided by the fs-independent kern/kern_acl.c.

o Introduce UFS_ACL, a compile-time configuration option that enables
  support for ACLs on FFS (and potentially other UFS-based file systems).
o Introduce ufs_getacl(), ufs_setacl(), ufs_aclcheck(), which
  respectively get, set, and check the ACLs on the passed vnode.
o Introduce ufs_sync_acl_from_inode(), ufs_sync_inode_from_acl() to
  maintain access control information between inode permissions and
  extended attribute data.
o Modify ufs_access() to load a file access ACL and invoke
  vaccess_acl_posix1e() if ACLs are available on the file system
o Modify ufs_mkdir() and ufs_makeinode() to associate ACLs with newly
  created directories and files, inheriting from the parent directory's
  default ACL.
o Enable these new vnode operations and conditionally compiled code
  paths if UFS_ACL is defined.

A few notes:

o This implementation is fairly widely tested, but still should be
  considered experimental.
o Currently, ACLs are not exported via NFS, instead, the summarizing
  file mode/etc from the inode is.  This results in conservative
  protection behavior, similar to the behavior of ACL-nonaware programs
  acting locally.
o It is possible that underlying binary data formats associated with
  this implementation may change.  Consumers of the implementation
  should expect to find their local configuration obsoleted in the
  next few months, resulting in possible loss of ACL data during an
  upgrade.
o The extended attributes interface and implementation is still
  undergoing modification to address portable interface concerns, as
  well as performance.
o Many applications do not yet correctly handle ACLs.  In general,
  due to the POSIX.1e ACL model, behavior of ACL-unaware applications
  will be conservative with respects to file protection; some caution
  is recommended.
o Instructions for configuring and maintaining ACLs on UFS will be
  committed in the near future; in the mean time it is possible to
  reference the README included in the last UFS ACL distribution
  placed in the TrustedBSD web site:

      http://www.TrustedBSD.org/downloads/

Substantial debugging, hardware, travel, or connectivity support for this
project was provided by: BSDi, Safeport Network Services, and NAI Labs.
Significant coding contributions were made by Chris Faulhaber.  Additional
support was provided by Brian Feldman, Thomas Moestl, and Ilmar Habibulin.

Reviewed by:	jedgar, keichii, mckusick, trustedbsd-discuss, freebsd-fs
Obtained from:	TrustedBSD Project
2001-03-26 17:53:19 +00:00
Poul-Henning Kamp
f83880518b Send the remains (such as I have located) of "block major numbers" to
the bit-bucket.
2001-03-26 12:41:29 +00:00
Jeroen Ruigrok van der Werven
5d0b660f2a Fix typo ); -> , 2001-03-24 15:25:04 +00:00
Kirk McKusick
fca26df055 Check that background fsck operation is being done on a ufs filesystem.
Obtained from:	Robert Watson <rwatson@FreeBSD.org>
2001-03-23 20:58:25 +00:00
Robert Watson
7b4dc13de4 o Remove an unnecessary debugging printf from ufs_extattr_lookup(),
which resulted in the output of warning messages at boot if
  UFS_EXTATTR_AUTOSTART was enabled but ".attribute" and possible
  sub-directories weren't in a mounted MFS or UFS file systems.

Pointed out by:	dcs
Obtained from:	TrustedBSD Project
2001-03-21 23:00:39 +00:00
Kirk McKusick
812b1d416c Add kernel support for running fsck on active filesystems. 2001-03-21 04:09:01 +00:00
Kirk McKusick
31c6ce0aed Clear the fs_clean flag only when the FS_UNCLEAN flag is not set
(as is done in unmount).

Remove a snapshot inode from the superblock list when its last
name goes away rather than when its last reference goes away.
That way it will be properly reclaimed by fsck after a crash
rather than reenabled when the filesystem is mounted.
2001-03-21 04:05:20 +00:00
Kirk McKusick
7e72e9918a Report the correct inode number when panicing with freeing free inode.
Report the correct block number when panicing with freeing free block.
2001-03-21 04:01:02 +00:00
Robert Watson
d4562167ac o Enable UFS-based extended attribute support on MFS. Note that this change
is under-tested, and that MFS appears to be in the process of being
  deprecated in favor of FFS over md.  Note also that UFS_EXTATTR_AUTOSTART
  doesn't make much sense on MFS unless the MFSROOT is compiled in, so
  manual configuration is generally required.

Obtained from:	TrustedBSD Project
2001-03-19 06:44:18 +00:00
Robert Watson
3063207147 o Rename "namespace" argument to "attrnamespace" as namespace is a C++
reserved word.

Submitted by:	jkh
Obtained from:	TrustedBSD Project
2001-03-19 05:44:15 +00:00
Robert Watson
516081f288 o Change options FFS_EXTATTR and options FFS_EXTATTR_AUTOSTART to
options UFS_EXTATTR and UFS_EXTATTR_AUTOSTART respectively.  This change
  reflects the fact that our EA support is implemented entirely at the
  UFS layer (modulo FFS start/stop/autostart hooks for mount and unmount
  events).  This also better reflects the fact that [shortly] MFS will also
  support EAs, as well as possibly IFS.

o Consumers of the EA support in FFS are reminded that as a result, they
  must change kernel config files to reflect the new option names.

Obtained from:	TrustedBSD Project
2001-03-19 04:35:40 +00:00
Robert Watson
3938ca1cc4 o Caused FFS_EXTATTR_AUTOSTART to scan two sub-directories of ".attribute"
off of the file system root: "user" for user attributes, and "system"
  for system attributes.  When the scan occurs, attribute backing files
  discovered in those directories will be started in the respective
  namespaces.  This re-introduces support for auto-starting of user
  attributes, which was removed when the "$" prefix for system attributes
  was replaced with explicit namespacing.

  For users of the TrustedBSD UFS POSIX.1e ACL code, you'll need to:
    mv ${FSROOT}/'$posix1e.acl_access' ${FSROOT}/system/posix1e.acl_access
    mv ${FSROOT}/'$posix1e.acl_default' ${FSROOT}/system/posix1e.acl_default

  For users of the TrustedBSD POSIX.1e Capability code, you'll need to:
    mv ${FSROOT}/'$posix1e.cap' ${FSROOT}/system/posix1e.cap

  For users of the TrustedBSD MAC code, you'll need to:
    mv ${FSROOT}/'$freebsd.mac' ${FSROOT}/system/freebsd.mac

  Updated versions of relevant patches will be released in the near
  future.

Obtained from:	TrustedBSD Project
2001-03-18 04:04:23 +00:00
Robert Watson
70f3685105 o Change the API and ABI of the Extended Attribute kernel interfaces to
introduce a new argument, "namespace", rather than relying on a first-
  character namespace indicator.  This is in line with more recent
  thinking on EA interfaces on various mailing lists, including the
  posix1e, Linux acl-devel, and trustedbsd-discuss forums.  Two namespaces
  are defined by default, EXTATTR_NAMESPACE_SYSTEM and
  EXTATTR_NAMESPACE_USER, where the primary distinction lies in the
  access control model: user EAs are accessible based on the normal
  MAC and DAC file/directory protections, and system attributes are
  limited to kernel-originated or appropriately privileged userland
  requests.

o These API changes occur at several levels: the namespace argument is
  introduced in the extattr_{get,set}_file() system call interfaces,
  at the vnode operation level in the vop_{get,set}extattr() interfaces,
  and in the UFS extended attribute implementation.  Changes are also
  introduced in the VFS extattrctl() interface (system call, VFS,
  and UFS implementation), where the arguments are modified to include
  a namespace field, as well as modified to advoid direct access to
  userspace variables from below the VFS layer (in the style of recent
  changes to mount by adrian@FreeBSD.org).  This required some cleanup
  and bug fixing regarding VFS locks and the VFS interface, as a vnode
  pointer may now be optionally submitted to the VFS_EXTATTRCTL()
  call.  Updated documentation for the VFS interface will be committed
  shortly.

o In the near future, the auto-starting feature will be updated to
  search two sub-directories to the ".attribute" directory in appropriate
  file systems: "user" and "system" to locate attributes intended for
  those namespaces, as the single filename is no longer sufficient
  to indicate what namespace the attribute is intended for.  Until this
  is committed, all attributes auto-started by UFS will be placed in
  the EXTATTR_NAMESPACE_SYSTEM namespace.

o The default POSIX.1e attribute names for ACLs and Capabilities have
  been updated to no longer include the '$' in their filename.  As such,
  if you're using these features, you'll need to rename the attribute
  backing files to the same names without '$' symbols in front.

o Note that these changes will require changes in userland, which will
  be committed shortly.  These include modifications to the extended
  attribute utilities, as well as to libutil for new namespace
  string conversion routines.  Once the matching userland changes are
  committed, a buildworld is recommended to update all the necessary
  include files and verify that the kernel and userland environments
  are in sync.  Note: If you do not use extended attributes (most people
  won't), upgrading is not imperative although since the system call
  API has changed, the new userland extended attribute code will no longer
  compile with old include files.

o Couple of minor cleanups while I'm there: make more code compilation
  conditional on FFS_EXTATTR, which should recover a bit of space on
  kernels running without EA's, as well as update copyright dates.

Obtained from:	TrustedBSD Project
2001-03-15 02:54:29 +00:00
Robert Watson
7e9044a636 o In my merge, missed the one-line patch to ufs_vnops.c that removed
the static prototype for ufs_readdir().  Note that ufs_readdir() was
  actually already non-static, the prototype was incorrect.

Submitted by:	jedgar
2001-03-14 18:27:04 +00:00
Robert Watson
f5161237ad o Implement "options FFS_EXTATTR_AUTOSTART", which depends on
"options FFS_EXTATTR".  When extended attribute auto-starting
  is enabled, FFS will scan the .attribute directory off of the
  root of each file system, as it is mounted.  If .attribute
  exists, EA support will be started for the file system.  If
  there are files in the directory, FFS will attempt to start
  them as attribute backing files for attributes baring the same
  name.  All attributes are started before access to the file
  system is permitted, so this permits race-free enabling of
  attributes.  For attributes backing support for security
  features, such as ACLs, MAC, Capabilities, this is vital, as
  it prevents the file system attributes from getting out of
  sync as a result of file system operations between mount-time
  and the enabling of the extended attribute.  The userland
  extattrctl tool will still function exactly as previously.
  Files must be placed directly in .attribute, which must be
  directly off of the file system root: symbolic links are
  not permitted.  FFS_EXTATTR will continue to be able
  to function without FFS_EXTATTR_AUTOSTART for sites that do not
  want/require auto-starting.  If you're using the UFS_ACL code
  available from www.TrustedBSD.org, using FFS_EXTATTR_AUTOSTART
  is recommended.

o This support is implemented by adding an invocation of
  ufs_extattr_autostart() to ffs_mountfs().  In addition,
  several new supporting calls are introduced in
  ufs_extattr.c:

    ufs_extattr_autostart(): start EAs on the specified mount
    ufs_extattr_lookup(): given a directory and filename,
                          return the vnode for the file.
    ufs_extattr_enable_with_open(): invoke ufs_extattr_enable()
                          after doing the equililent of vn_open()
                          on the passed file.
    ufs_extattr_iterate_directory(): iterate over a directory,
                          invoking ufs_extattr_lookup() and
                          ufs_extattr_enable_with_open() on each
                          entry.

o This feature is not widely tested, and therefore may contain
  bugs, caution is advised.  Several changes are in the pipeline
  for this feature, including breaking out of EA namespaces into
  subdirectories of .attribute (this is waiting on the updated
  EA API), as well as a per-filesystem flag indicating whether
  or not EAs should be auto-started.  This is required because
  administrators may not want .attribute auto-started on all
  file systems, especially if non-administrators have write access
  to the root of a file system.

Obtained from:	TrustedBSD Project
2001-03-14 05:32:31 +00:00
Kirk McKusick
589c7af992 Fixes to track snapshot copy-on-write checking in the specinfo
structure rather than assuming that the device vnode would reside
in the FFS filesystem (which is obviously a broken assumption with
the device filesystem).
2001-03-07 07:09:55 +00:00
John Baldwin
19eb87d22a Grab the process lock while calling psignal and before calling psignal. 2001-03-07 03:37:06 +00:00
John Baldwin
412503ae46 Protect SIGDELSET of p_siglist with the proc lock. 2001-03-07 03:34:55 +00:00
Kirk McKusick
8775e64a5d Free lock before returning from process_worklist_item.
Obtained from:	Constantine Sapuntzakis <csapuntz@stanford.edu>
2001-03-01 21:43:46 +00:00
Adrian Chadd
f3a90da995 Reviewed by: jlemon
An initial tidyup of the mount() syscall and VFS mount code.

This code replaces the earlier work done by jlemon in an attempt to
make linux_mount() work.

* the guts of the mount work has been moved into vfs_mount().

* move `type', `path' and `flags' from being userland variables into being
  kernel variables in vfs_mount(). `data' remains a pointer into
  userspace.

* Attempt to verify the `type' and `path' strings passed to vfs_mount()
  aren't too long.

* rework mount() and linux_mount() to take the userland parameters
  (besides data, as mentioned) and pass kernel variables to vfs_mount().
  (linux_mount() already did this, I've just tidied it up a little more.)

* remove the copyin*() stuff for `path'. `data' still requires copyin*()
  since its a pointer into userland.

* set `mount->mnt_statf_mntonname' in vfs_mount() rather than in each
  filesystem.  This variable is generally initialised with `path', and
  each filesystem can override it if they want to.

* NOTE: f_mntonname is intiailised with "/" in the case of a root mount.
2001-03-01 21:00:17 +00:00
Jonathan Lemon
7df2842dee Add a NOTE_REVOKE flag for vnodes, which is triggered from within vclean().
Use this to tell a filter attached to a vnode that the underlying vnode is
no longer valid, by returning EV_EOF.

PR: kern/25309, kern/25206
2001-02-23 20:06:01 +00:00
Jonathan Lemon
31e7122397 Use correct list pointer when detaching knote from list. 2001-02-23 19:20:21 +00:00
Kirk McKusick
a5a94e3936 Free lock before calling panic so that subsequent attempt to write out
buffers does not re-panic with `locking against myself'. This change
should not affect normal operations of soft updates in any way.
2001-02-23 09:01:31 +00:00
Kirk McKusick
cc686e21c0 When cleaning up excess inode dependencies, check for being done.
Reviewed by:	Jan Koum <jkb@yahoo-inc.com>
2001-02-22 10:17:57 +00:00
Kirk McKusick
2cf5d587a9 This patch corrects two problems with the rate limiting code
that was introduced in revision 1.80. The problem manifested
itself with a `locking against myself' panic and could also
result in soft updates inconsistences associated with inodedeps.
The two problems are:

1) One of the background operations could manipulate the bitmap
while holding it locked with intent to create. This held lock
results in a `locking against myself' panic, when the background
processing that we have been coopted to do tries to lock the bitmap
which we are already holding locked. To understand how to fix this
problem, first, observe that we can do the background cleanups in
inodedep_lookup only when allocating inodedeps (DEPALLOC is set in
the call to inodedep_lookup). Second observe that calls to
inodedep_lookup with DEPALLOC set can only happen from the following
calls into the softdep code:

        softdep_setup_inomapdep
        softdep_setup_allocdirect
        softdep_setup_remove
        softdep_setup_freeblocks
        softdep_setup_directory_change
        softdep_setup_directory_add
        softdep_change_linkcnt

Only the first two of these can come from ffs_alloc.c while holding
a bitmap locked. Thus, inodedep_lookup must not go off to do
request_cleanups when being called from these functions. This change
adds a flag, NODELAY, that can be passed to inodedep_lookup to let
it know that it should not do background processing in those cases.

2) The return value from request_cleanup when helping out with the
cleanup was 0 instead of 1. This meant that despite the fact that
we may have slept while doing the cleanups, the code did not recheck
for the appearance of an inodedep (e.g., goto top in inodedep_lookup).
This lead to the softdep inconsistency in which we ended up with
two inodedep's for the same inode.

Reviewed by:	Peter Wemm <peter@yahoo-inc.com>,
		Matt Dillon <dillon@earth.backplane.com>
2001-02-20 11:14:38 +00:00
Jeroen Ruigrok van der Werven
d7d97eb0aa Preceed/preceeding are not english words. Use precede and preceding. 2001-02-18 10:43:53 +00:00
Jonathan Lemon
608a3ce62a Extend kqueue down to the device layer.
Backwards compatible approach suggested by: peter
2001-02-15 16:34:11 +00:00
Jake Burkholder
d5a08a6065 Implement a unified run queue and adjust priority levels accordingly.
- All processes go into the same array of queues, with different
  scheduling classes using different portions of the array.  This
  allows user processes to have their priorities propogated up into
  interrupt thread range if need be.
- I chose 64 run queues as an arbitrary number that is greater than
  32.  We used to have 4 separate arrays of 32 queues each, so this
  may not be optimal.  The new run queue code was written with this
  in mind; changing the number of run queues only requires changing
  constants in runq.h and adjusting the priority levels.
- The new run queue code takes the run queue as a parameter.  This
  is intended to be used to create per-cpu run queues.  Implement
  wrappers for compatibility with the old interface which pass in
  the global run queue structure.
- Group the priority level, user priority, native priority (before
  propogation) and the scheduling class into a struct priority.
- Change any hard coded priority levels that I found to use
  symbolic constants (TTIPRI and TTOPRI).
- Remove the curpriority global variable and use that of curproc.
  This was used to detect when a process' priority had lowered and
  it should yield.  We now effectively yield on every interrupt.
- Activate propogate_priority().  It should now have the desired
  effect without needing to also propogate the scheduling class.
- Temporarily comment out the call to vm_page_zero_idle() in the
  idle loop.  It interfered with propogate_priority() because
  the idle process needed to do a non-blocking acquire of Giant
  and then other processes would try to propogate their priority
  onto it.  The idle process should not do anything except idle.
  vm_page_zero_idle() will return in the form of an idle priority
  kernel thread which is woken up at apprioriate times by the vm
  system.
- Update struct kinfo_proc to the new priority interface.  Deliberately
  change its size by adjusting the spare fields.  It remained the same
  size, but the layout has changed, so userland processes that use it
  would parse the data incorrectly.  The size constraint should really
  be changed to an arbitrary version number.  Also add a debug.sizeof
  sysctl node for struct kinfo_proc.
2001-02-12 00:20:08 +00:00
Bosko Milekic
9ed346bab0 Change and clean the mutex lock interface.
mtx_enter(lock, type) becomes:

mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks)
mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized)

similarily, for releasing a lock, we now have:

mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN.
We change the caller interface for the two different types of locks
because the semantics are entirely different for each case, and this
makes it explicitly clear and, at the same time, it rids us of the
extra `type' argument.

The enter->lock and exit->unlock change has been made with the idea
that we're "locking data" and not "entering locked code" in mind.

Further, remove all additional "flags" previously passed to the
lock acquire/release routines with the exception of two:

MTX_QUIET and MTX_NOSWITCH

The functionality of these flags is preserved and they can be passed
to the lock/unlock routines by calling the corresponding wrappers:

mtx_{lock, unlock}_flags(lock, flag(s)) and
mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN
locks, respectively.

Re-inline some lock acq/rel code; in the sleep lock case, we only
inline the _obtain_lock()s in order to ensure that the inlined code
fits into a cache line. In the spin lock case, we inline recursion and
actually only perform a function call if we need to spin. This change
has been made with the idea that we generally tend to avoid spin locks
and that also the spin locks that we do have and are heavily used
(i.e. sched_lock) do recurse, and therefore in an effort to reduce
function call overhead for some architectures (such as alpha), we
inline recursion for this case.

Create a new malloc type for the witness code and retire from using
the M_DEV type. The new type is called M_WITNESS and is only declared
if WITNESS is enabled.

Begin cleaning up some machdep/mutex.h code - specifically updated the
"optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN
and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently
need those.

Finally, caught up to the interface changes in all sys code.

Contributors: jake, jhb, jasone (in no particular order)
2001-02-09 06:11:45 +00:00
Poul-Henning Kamp
37d4006626 Another round of the <sys/queue.h> FOREACH transmogriffer.
Created with:   sed(1)
Reviewed by:    md5(1)
2001-02-04 16:08:18 +00:00
Poul-Henning Kamp
fc2ffbe604 Mechanical change to use <sys/queue.h> macro API instead of
fondling implementation details.

Created with: sed(1)
Reviewed by: md5(1)
2001-02-04 13:13:25 +00:00
Poul-Henning Kamp
ef9e85abba Use <sys/queue.h> macro API. 2001-02-04 12:37:48 +00:00