Commit Graph

908 Commits

Author SHA1 Message Date
Robert Watson
74237f55b0 Part I: Update extended attribute API and ABI:
o Modify the system call syntax for extattr_{get,set}_{fd,file}() so
  as not to use the scatter gather API (which appeared not to be used
  by any consumers, and be less portable), rather, accepts 'data'
  and 'nbytes' in the style of other simple read/write interfaces.
  This changes the API and ABI.

o Modify system call semantics so that extattr_get_{fd,file}() return
  a size_t.  When performing a read, the number of bytes read will
  be returned, unless the data pointer is NULL, in which case the
  number of bytes of data are returned.  This changes the API only.

o Modify the VOP_GETEXTATTR() vnode operation to accept a *size_t
  argument so as to return the size, if desirable.  If set to NULL,
  the size will not be returned.

o Update various filesystems (pseodofs, ufs) to DTRT.

These changes should make extended attributes more useful and more
portable.  More commits to rebuild the system call files, as well
as update userland utilities to follow.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, NAI Labs
2002-02-10 04:43:22 +00:00
Poul-Henning Kamp
b6e1c37356 Remove di_inumber since LFS is long gone. 2002-02-10 00:55:49 +00:00
Kirk McKusick
b06051cf7c Occationally background fsck would cause a spurious ``freeing free
inode'' panic. This change corrects that problem by setting the
fs_active flag when the inode map changes to notify the snapshot
code that the cylinder group must be rescanned.

Submitted by:	Robert Watson <rwatson@FreeBSD.org>
2002-02-07 22:13:56 +00:00
Kirk McKusick
cfdaa88697 Occationally deleted files would hang around for hours or days
without being reclaimed. This bug was introduced in revision 1.95
dealing with filenames placed in newly allocated directory blocks,
thus is not present in 4.X systems. The bug is triggered when a
new entry is made in a directory after the data block containing
the original new entry has been written, but before the inode
that references the data block has been written.

Submitted by:	Bill Fenner <fenner@research.att.com>
2002-02-07 00:54:32 +00:00
Kirk McKusick
c9f96392c7 When taking a snapshot, we must check for active files that have
been unlinked (e.g., with a zero link count). We have to expunge
all trace of these files from the snapshot so that they are neither
reclaimed prematurely by fsck nor saved unnecessarily by dump.
2002-02-02 01:42:44 +00:00
Kirk McKusick
7b60855308 Add a stub for softdep_request_cleanup() so that compilation without
SOFTUPDATES option works properly.

Submitted by:	Benno Rice <benno@jeamland.net>
2002-01-23 02:18:56 +00:00
Kirk McKusick
03a2057a5b This patch fixes a long standing complaint with soft updates in
which small and/or nearly full filesystems would fail with `file
system full' messages when trying to replace a number of existing
files (for example during a system installation). When the allocation
routines are about to fail with a file system full condition, they
make a call to softdep_request_cleanup() which attempts to accelerate
the flushing of pending deletion requests in an effort to free up
space. In the face of filesystem I/O requests that exceed the
available disk transfer capacity, the cleanup request could take
an unbounded amount of time. Thus, the softdep_request_cleanup()
routine will only try for tickdelay seconds (default 2 seconds)
before giving up and returning a filesystem full error. Under typical
conditions, the softdep_request_cleanup() routine is able to free
up space in under fifty milliseconds.
2002-01-22 06:17:22 +00:00
Kirk McKusick
99bef8782b Fix a bug introduced in ffs_snapshot.c -r1.25 and fs.h -r1.26
which caused incomplete snapshots to be taken. When background
fsck would run on these snapshots, the result would be files
being incorrectly released which would subsequently panic the
kernel with ``handle_workitem_freefile: inodedep survived'',
``handle_written_inodeblock: live inodedep'', and
``handle_workitem_remove: lost inodedep'' errors.
2002-01-17 08:33:32 +00:00
Kirk McKusick
8af31e7b46 Put write on read-only filesystem panic after we have weeded out
block and character devices, fifo's, etc.

Submitted by:	Bruce Evans <bde@zeta.org.au>
2002-01-16 04:59:09 +00:00
Kirk McKusick
cd6005961f When downgrading a filesystem from read-write to read-only, operations
involving file removal or file update were not always being fully
committed to disk. The result was lost files or corrupted file data.
This change ensures that the filesystem is properly synced to disk
before the filesystem is down-graded.

This delta also fixes a long standing bug in which a file open for
reading has been unlinked. When the last open reference to the file
is closed, the inode is reclaimed by the filesystem. Previously,
if the filesystem had been down-graded to read-only, the inode could
not be reclaimed, and thus was lost and had to be later recovered
by fsck.  With this change, such files are found at the time of the
down-grade.  Normally they will result in the filesystem down-grade
failing with `device busy'. If a forcible down-grade is done, then
the affected files will be revoked causing the inode to be released
and the open file descriptors to begin failing on attempts to read.

Submitted by:	"Sam Leffler" <sam@errno.com>
2002-01-15 07:17:12 +00:00
Alfred Perlstein
426da3bcfb SMP Lock struct file, filedesc and the global file list.
Seigo Tanimura (tanimura) posted the initial delta.

I've polished it quite a bit reducing the need for locking and
adapting it for KSE.

Locks:

1 mutex in each filedesc
   protects all the fields.
   protects "struct file" initialization, while a struct file
     is being changed from &badfileops -> &pipeops or something
     the filedesc should be locked.

1 mutex in each struct file
   protects the refcount fields.
   doesn't protect anything else.
   the flags used for garbage collection have been moved to
     f_gcflag which was the FILLER short, this doesn't need
     locking because the garbage collection is a single threaded
     container.
  could likely be made to use a pool mutex.

1 sx lock for the global filelist.

struct file *	fhold(struct file *fp);
        /* increments reference count on a file */

struct file *	fhold_locked(struct file *fp);
        /* like fhold but expects file to locked */

struct file *	ffind_hold(struct thread *, int fd);
        /* finds the struct file in thread, adds one reference and
                returns it unlocked */

struct file *	ffind_lock(struct thread *, int fd);
        /* ffind_hold, but returns file locked */

I still have to smp-safe the fget cruft, I'll get to that asap.
2002-01-13 11:58:06 +00:00
Kirk McKusick
0bc7a833ec When going to sleep, we must save our SPL so that it does not get
lost if some other process uses the lock while we are sleeping. We
restore it after we have slept. This functionality is provided by
a new routine interlocked_sleep() that wraps the interlocking with
functions that sleep. This function is then used in place of the
old ACQUIRE_LOCK_INTERLOCKED() and FREE_LOCK_INTERLOCKED() macros.

Submitted by:	Debbie Chu <dchu@juniper.net>
2002-01-12 20:57:36 +00:00
Kirk McKusick
794ef3471f Must call drain_output() before checking the dirty block list
in softdep_sync_metadata(). Otherwise we may miss dependencies
that need to be flushed which will result in a later panic
with the message ``vinvalbuf: dirty bufs''.

Submitted by:	Matthew Dillon <dillon@apollo.backplane.com>
MFC after:	1 week
2002-01-11 19:59:27 +00:00
Poul-Henning Kamp
9c643340bb Do not pull quota entries of the cache-list if they have already
been removed from the cache-list as part of a previous unmount.

This would result in panics (page fault in dqflush()) during subsequent
umounts provided that enough distinct UID's to actually make the
hash do something are active.

This can probably explain a number of weird quota related behaviours.

PR:		32331 maybe more.
Reproduced by:	Søren Schrørder <sch@cybercity.dk>
2002-01-10 15:02:57 +00:00
Mike Smith
b9a4338d29 Initialise the bioops vector hack at runtime rather than at link time. This
avoids the use of common variables.

Reviewed by:	mckusick
2002-01-08 19:32:18 +00:00
Matthew Dillon
23b590188f Fix a BUF_TIMELOCK race against BUF_LOCK and fix a deadlock in vget()
against VM_WAIT in the pageout code.  Both fixes involve adjusting
the lockmgr's timeout capability so locks obtained with timeouts do not
interfere with locks obtained without a timeout.

Hopefully MFC: before the 4.5 release
2001-12-20 22:42:27 +00:00
Kirk McKusick
f305c5d199 Change the atomic_set_char to atomic_set_int and atomic_clear_char
to atomic_clear_int to ease the implementation for the sparc64.

Requested by:	Jake Burkholder <jake@locore.ca>
2001-12-18 18:05:17 +00:00
Ian Dowse
143a5346c9 Make sure we ignore the value of `fs_active' when reloading the
superblock, and move the initialisation of it to beside where other
pointer fields are initialised.
2001-12-16 18:54:09 +00:00
Ian Dowse
3fa4044e34 Move the new superblock field `fs_active' into the region of the
superblock that is already set up to handle pointer types. This
fixes an accidental change in the superblock size on 64-bit platforms
caused by revision 1.24.
2001-12-16 18:51:11 +00:00
Kirk McKusick
cc5a92334f Minimize the time necessary to suspend operations on a filesystem
when taking a snapshot. The two time consuming operations are
scanning all the filesystem bitmaps to determine which blocks
are in use and scanning all the other snapshots so as to be able
to expunge their blocks from the view of the current snapshot.
The bitmap scanning is broken into two passes. Before suspending
the filesystem all bitmaps are scanned. After the suspension,
those bitmaps that changed after being scanned the first time
are rescanned. Typically there are few bitmaps that need to be
rescanned. The expunging of other snapshots is now done after
the suspension is released by observing that we can easily
identify any blocks that were allocated to them after the
suspension (they will be maked as `not needing to be copied'
in the just created snapshot). For all the gory details, see
the ``Running fsck in the Background'' paper in the Usenix
BSDCon 2002 Conference Proceedings, pages 55-64.
2001-12-14 00:15:06 +00:00
Kirk McKusick
9db12e5108 When a file is partially truncated, we first check to see if the
new file end will land in the middle of a file hole. Since the last
block of a file must always be allocated, the hole is filled by
allocating a block at that location. If the hole being filled is
a direct block, then the truncation may eventually reduce the
full sized block down to a fragment. When running with soft
updates, it is necessary to FSYNC the file after allocating the
block and before creating the fragment to avoid triggering a
soft updates inconsistency when the block unexpectedly shrinks.

Found by:	Matthew Dillon <dillon@apollo.backplane.com>
MFC after:	1 week
2001-12-13 05:07:48 +00:00
Robert Watson
24373ce6ed Use 'mkdir -p /.attribute/system' instead of breaking it into
two seperate mkdir targets.

Submitted by:	jedgar
2001-11-30 15:32:07 +00:00
Robert Watson
cff9580525 Use 'mkdir -p /.attribute/system' instead of breaking it into
two seperate mkdir targets.
2001-11-30 15:21:20 +00:00
Robert Watson
15f1c8d3d2 README.extattr incorrectly specified sample command lines for
UFS_EXTATTR_AUTOSTART.  Insert the missing 'initattr' arguments
to extattrctl.

Noticed by:	green
2001-11-30 15:15:27 +00:00
Guido van Rooij
40e294f796 When mkdir()-ing, the parent dir gets is linkcount increased.
Fix VN_KNOTE to reflect that.

Found by: tobez@freebsd.org
MFC after:	2 days
2001-11-22 15:33:12 +00:00
Ian Dowse
4202b366fc Oops, when trying the dirhash sequential-access optimisation,
compare the slot offset against the predicted offset, not a boolean
flag. This typo effectively disabled the sequential optimisation,
but was otherwise harmless.

Not surprisingly, fixing this improves performance in the sequential
access case. I am seeing a 7% speedup on one machine here; using
dirhash when sequentially looking up directory entries is now about
5% faster instead of 2% slower than the non-dirhash case.

Submitted by:	KOIE Hidetaka <koie@suri.co.jp>
MFC after:	1 week
2001-11-14 15:08:07 +00:00
Matthew Dillon
7e76bb562e Implement IO_NOWDRAIN and B_NOWDRAIN - prevents the buffer cache from blocking
in wdrain during a write.  This flag needs to be used in devices whos
strategy routines turn-around and issue another high level I/O, such as
when MD turns around and issues a VOP_WRITE to vnode backing store, in order
to avoid deadlocking the dirty buffer draining code.

Remove a vprintf() warning from MD when the backing vnode is found to be
in-use.  The syncer of buf_daemon could be flushing the backing vnode at
the time of an MD operation so the warning is not correct.

MFC after:	1 week
2001-11-05 18:48:54 +00:00
Robert Watson
6d8785434f o Update copyright dates.
o Add reference to TrustedBSD Project in license header.
o Update dated comments, including comment in extattr.h claiming that
  no file systems support extended attributes.
o Improve comment consistency.
2001-11-01 21:37:07 +00:00
Robert Watson
b6e0472987 o Althought this is not specified in POSIX.1e, the UFS ACL implementation
coerces the deletion of a default ACL on a directory when no default
  ACL EA is present to success.  Because the UFS EA implementation doesn't
  disinguish the EA failure modes "that EA name has not been
  administratively enabled" from "that EA name has no defined data",
  there's a potential conflict in error return values.  Normally, the
  lack of administratively configured EA support is coerced to
  EOPNOTSUPP to indicate that ACLs are not available; in this case,
  it is possible to get a successful return, even if ACLs are not
  available because EA support for them has not been enabled.

  Expand the comment in ufs_setacl() to identify this case.

Obtained from:	TrustedBSD Project
2001-10-27 05:39:17 +00:00
Robert Watson
ac8b3dd7dc o Clarify a comment about the locking condition of the vnode upon exit
from ufs_extattr_enable_with_open().
o Print auto-start notifications if (bootverbose).  This was previously
  commented out since it didn't know how to check for bootverbose.
o Drop in comments throughout indicating where ENOENT should be replaced
  with ENOATTR once that is available.

Obtained from:	TrustedBSD Project
2001-10-27 05:19:14 +00:00
Robert Watson
29543004bd o The comment about ordering the destruction of the lock and the removal of
the flag indicating that the structure was initialized didn't need
  an XXX, since it didn't need fixing.

Obtained from:	TrustedBSD Project
2001-10-27 05:05:39 +00:00
Robert Watson
9444746795 o Wrap a number of long lines of code, many of which were introduced
due to KSE-related (p) expansions.

Obtained from:	TrustedBSD Project
2001-10-27 05:03:05 +00:00
Robert Watson
ce5ddec25f Since namespace support was added to the UFS extended attribute
implementation to replace single-character namespace prefixes, '$' is no
longer an invalid attribute name, and the namespace is relevant to
validity determination.

o Remove '$' case from ufs_extattr_valid_attrname()
o Add attrnamespace argument to ufs_extattr_valid_attrname(), and
  fill out appropriately.

Currently no decisions are made based on the namespace argument, but
may be in the future.

Obtained from:	TrustedBSD Project
2001-10-27 04:58:28 +00:00
Matthew Dillon
245df27cee Implement kern.maxvnodes. adjusting kern.maxvnodes now actually has a
real effect.

Optimize vfs_msync().  Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list.  Improves looping case by 500%.

Optimize ffs_sync().  Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list.  This makes a couple of assumptions,
which I believe are ok, in regards to vnode stability when the mount list
mutex is held.  Improves looping case by 500%.

(more optimization work is needed on top of these fixes)

MFC after:	1 week
2001-10-26 00:08:05 +00:00
Ian Dowse
71fc5e11c7 Default to not performing ufs_dirhash's extensive directory-block
sanity check after every directory modification. This check can be
re-enabled at any time by setting the sysctl "vfs.ufs.dirhash_docheck"
to 1.

This group of sanity tests was there to ensure that any UFS_DIRHASH
bugs could be caught by a panic before a potentially corrupted
directory block would be written to disk. It has served its main
purpose now, so disable it in the interest of performance.

MFC after:	1 week
2001-10-25 22:55:59 +00:00
Matthew Dillon
c72ccd014d Change the vnode list under the mount point from a LIST to a TAILQ
in preparation for an implementation of limiting code for kern.maxvnodes.

MFC after:	3 days
2001-10-23 01:21:29 +00:00
John Baldwin
bd78cece5d Change the kernel's ucred API as follows:
- crhold() returns a reference to the ucred whose refcount it bumps.
- crcopy() now simply copies the credentials from one credential to
  another and has no return value.
- a new crshared() primitive is added which returns true if a ucred's
  refcount is > 1 and false (0) otherwise.
2001-10-11 23:38:17 +00:00
John Baldwin
7106ca0d1a Add missing includes of sys/lock.h. 2001-10-11 17:52:20 +00:00
Matthew Dillon
962922dcd2 Remove panics for rename() race conditions. The panics are inappropriate
because the IN_RENAME flag only fixes a few of the huge number of race
conditions that can result in the source path becoming invalid even
prior to the VOP_RENAME() call.  The panics created a serious security
issue whereby an attacker could fairly easily cause the panic to
occur, crashing the machine.

The correct solution requires a great deal of work in the namei
path cache code.

MFC after:	0 days
2001-10-08 00:37:54 +00:00
Robert Watson
ab66aa1468 o Replace two direct uid!=0 comparisons with suser_xxx() calls.
Obtained from:	TrustedBSD Project
2001-10-02 14:41:43 +00:00
Robert Watson
b73d2870cd o Replace two direct uid!=0 comparisons with suser_td() calls.
Obtained from:	TrustedBSD Project
2001-10-02 14:34:22 +00:00
Matthew Dillon
4c94c7bfb9 Backout the last commit. The problem is actually much worse then I
first thought and may require serious work to the VOP_RENAME() api itself.
Basically, by the time the VOP_RENAME() function is called, it's already
too late.
2001-10-02 04:26:58 +00:00
Matthew Dillon
be2a975a9f IN_RENAME should only be cleared by the routine that set it. This fixes
a rename/rmdir race that has been shown to cause a panic.

Bug reported by: Yevgeniy Aleynikov <eugenea@infospace.com>
MFC after:	3 days
2001-10-02 02:58:48 +00:00
John Baldwin
eb46fac565 - Fix some minor whitespace nits.
- Move the SPECIAL_FLAG #define up next to the NOHOLDER #define and fix a
  little nit that caused it to be defined as -(sizeof (struct thread) + 1)
  instead of -2.
2001-09-27 21:04:13 +00:00
Robert Watson
57358f1e93 o Re-enable support of system file flags in jail() by adding back the
PRISON_ROOT to the suser_xxx() check.  Since securelevels may now
  be raised in specific jails, use of system flags can still be
  restricted in jail(), but in a more configurable way.
o Users of jail() expecting system flags (such as schg) to restrict
  jail()'s should be sure to set the securelevel appropriately in
  jail()'s.
o This fixes activities involving automated system flag removal in
  jail(), including installkernel and friends.

Obtained from:	TrustedBSD Project
2001-09-26 20:44:41 +00:00
Robert Watson
6748bcc51e o Modify ufs_setattr() so that it uses securelevel_gt() instead of
direct variable access.

Obtained from:	TrustedBSD Project
2001-09-26 20:31:37 +00:00
Robert Watson
aaef1c3934 o Further clarify comment: ad Udo's request, re-insert the 'if'
refering to securelevels; also, update the unprivileged process text
  to better indicate the scope of actions permittable when any system
  flags are already set (limited).

Submitted by:	Udo Schweigert <udo.schweigert@siemens.com>
2001-09-25 12:02:44 +00:00
Robert Watson
82e83c60b3 o Parallelize the comment on the relationship between privileged un-jailed
processes and the actual securelevel check: make the comment use '> 0'
  instead of inverted '<= 0'.
2001-09-25 02:26:10 +00:00
Ian Dowse
5d76690a7f The addition of i_dirhash to struct inode pushed RELENG_4's
sizeof(struct inode) into a new malloc bucket on the i386. This
didn't happen in -current due to the removal of i_lock, but it does
no harm to apply the workaround to -current first.

Reduce the size of the i_spare[] array in struct inode from 4 to
3 entries, and change ext2fs to use i_din.di_spare[1] so that it
does not need i_spare[3].

Reviewed by:	bde
MFC after:	3 days
2001-09-24 18:29:20 +00:00
Julian Elischer
b40ce4165d KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after:    ha ha ha ha
2001-09-12 08:38:13 +00:00
Ian Dowse
4691e9ead0 The "dirpref" directory layout preference improvements make use of
an array "fs_contigdirs[]" to avoid too many directories getting
created in each cylinder group. The memory required for this and
two other arrays (fs_csp[] and fs_maxcluster[]) is allocated with
a single malloc() call, and divided up afterwards.  However, the
'space' pointer is not advanced correctly, so fs_contigdirs and
fs_maxcluster end up pointing to the same address.

Add the missing code to advance the 'space' pointer, and remove
an unnecessary update of the pointer that follows.

This is likely to fix the "ffs_clusteralloc: map mismatch" panics
that have been reported recently.

Submitted by:		Luke Mewburn <lukem@wasabisystems.com>
2001-09-09 23:48:28 +00:00
Chris D. Faulhaber
dac4a67ce7 Use ACL_PERM_NONE instead of hardcoding 0 when initializing
ACL entry permissions.

Reviewed by:	rwatson
2001-09-01 23:18:15 +00:00
Robert Watson
7df97b6117 o At some point, unmounting a non-EA file system with EA's compiled
in got a bit broken, when ufs_extattr_stop() was called and failed,
  ufs_extattr_destroy() would panic.  This makes the call to destroy()
  conditional on the success of stop().

Submitted by:		Christian Carstensen <cc@devcon.net>
Obtained from:	TrustedBSD Project
2001-09-01 20:11:05 +00:00
Peter Wemm
0f7289022b If a file has been completely unlinked, stop automatically syncing the
file.  ffs will discard any pending dirty pages when it is closed,
so we may as well not waste time trying to clean them.  This doesn't
stop other things from writing it out, eg: pageout, fsync(2) etc.
2001-08-27 06:09:56 +00:00
Ian Dowse
be70fc04ce Stop using dirhash when a directory is removed, and ensure that we
never attempt to hash directories once they are deleted. This fixes
a problem where operations on a deleted directory could trigger
dirhash sanity panics.
2001-08-26 20:47:19 +00:00
Ian Dowse
2ed42812bd When compacting directories, ufs_direnter() always trusted DIRSIZ()
to supply the number of bytes to be bcopy()'d to move an entry. If
d_ino == 0 however, DIRSIZ() is not guaranteed to return a sensible
length, so ufs_direnter could end up corrupting a directory during
compaction. In practice I believe this can only happen after fsck_ffs
has fixed a previously-corrupted directory.

We now deal with any mid-block unused entries specially to avoid
using DIRSIZ() or bcopy() on such entries. We also ensure that the
variables 'dsize' and 'spacefree' contain meaningful values at all
times. Add a few comments to describe better this intricate piece
of code.

The special handling of mid-block unused entries makes the dirhash-
specific bugfix in the previous revision (1.53) now uncecessary,
so this change removes it.

Reviewed by:	mckusick
2001-08-26 01:25:12 +00:00
Ian Dowse
7dfb550e0c When compressing directory blocks, the dirhash code didn't check
that the directory entry was in use before attempting to find it
in the hash structures to change its offset. Normally, unused
entries do not need to be moved, but fsck can leave behind some
unused entries that do. A dirhash sanity panic resulted when the
entry to be moved was not found. Add a check that stops entries
with d_ino == 0 from being passed to ufsdirhash_move().
2001-08-22 01:35:17 +00:00
Peter Wemm
61a4237001 Sigh. ufs_lookup() calls ffs_snapgone(), meaning that 'options EXT2FS'
without 'options FFS' would fail to link.
2001-08-18 03:08:48 +00:00
Ian Dowse
9e27954de1 Two recent commits in sys/ufs/ufs interacted badly with ext2fs
because it shares ufs code. In ufs_fhtovp(), the test on i_effnlink
is invalid because ext2fs does not maintain this field. In ufs_close(),
i_effnlink is also tested, to determines whether or not to call
vn_start_write(). The ufs_fhtovp issue breaks NFS exporting of
ext2fs filesystems; I believe the other is harmless.

Fix both cases by checking um_i_effnlink_valid in the ufsmount
struct, and use i_nlink if necessary.

Noticed by:	bde
Reviewed by:	mckusick, bde
2001-07-29 22:26:01 +00:00
Ian Dowse
54d6d2dfaf Disable the dirhash sanity check that panics if an unused directory
entry (d_ino == 0) is found in a position that is not the start of
a DIRBLKSIZ block.

While such entries cannot occur normally (ufs always extends the
previous entry to cover the free space instead), they do not cause
problems and fsck does not fix them, so panicking is bad.
2001-07-27 18:45:41 +00:00
Peter Wemm
815d14ddab Use a fixed type for times in on-disk structures for ufs rather than
something that could potentially change like time_t.
2001-07-16 00:55:27 +00:00
Ian Dowse
50c7c3a7c8 Return a locked struct buf from ufsdirhash_lookup() to avoid one
extra getblk/brelse sequence for each lookup. We already had this
buf in ufsdirhash_lookup(), so there was no point in brelse'ing it
only to have the caller immediately reaquire the same buffer.

This should make the case of sequential lookups marginally faster;
in my tests, sequential lookups with dirhash enabled are now only
around 1% slower than without dirhash.
2001-07-13 20:50:38 +00:00
Ian Dowse
9b5ad47fb7 Bring in dirhash, a simple hash-based lookup optimisation for large
directories. When enabled via "options UFS_DIRHASH", in-core hash
arrays are maintained for large directories. These allow all
directory operations to take place quickly instead of requiring
long linear searches. For now anyway, dirhash is not enabled by
default.

The in-core hash arrays have a memory requirement that is approximately
half the size of the size of the on-disk directory file. A number
of new sysctl variables allow control over which directories get
hashed and over the maximum amount of memory that dirhash will use:

  vfs.ufs.dirhash_minsize
    The minimum on-disk directory size for which hashing should be
    used. The default is 2560 (2.5k).

  vfs.ufs.dirhash_maxmem
    The system-wide maximum total memory to be used by dirhash data
    structures. The default is 2097152 (2MB).

The current amount of memory being used by dirhash is visible
through the read-only sysctl variable vfs.ufs.dirhash_maxmem.
Finally, some extra sanity checks that are enabled by default, but
which may have an impact on performance, can be disabled by setting
vfs.ufs.dirhash_docheck to 0.

Discussed on: -fs, -hackers
2001-07-10 21:21:29 +00:00
Matthew Dillon
0cddd8f023 With Alfred's permission, remove vm_mtx in favor of a fine-grained approach
(this commit is just the first stage).  Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.
2001-07-04 16:20:28 +00:00
John Baldwin
ed87274d16 Fix more mntvnode and vnode interlock order reversals. 2001-06-28 22:21:33 +00:00
John Baldwin
49d2d9f4a4 - Fix a mntvnode and vnode interlock reversal.
- Protect the mnt_vnode list with the mntvnode lock.
- Use queue(9) macros.
2001-06-28 04:12:56 +00:00
Peter Wemm
78236790cd Fix warning:
1973: warning: int format, long int arg (arg 5)
2001-06-15 07:44:39 +00:00
Kirk McKusick
eb87cd754f Build on the change in revision 1.98 by Tor.Egge@fast.no.
The symptom being treated in 1.98 was to avoid freeing a
pagedep dependency if there was still a newdirblk dependency
referencing it. That change is correct and no longer prints
a warning message when it occurs. The other part of revision
1.98 was to panic when a newdirblk dependency was encountered
during a file truncation. This fix removes that panic and
replaces it with code to find and delete the newdirblk
dependency so that the truncation can succeed.
2001-06-13 23:13:13 +00:00
Thomas Moestl
1fbcf0ac65 Call vn_close on the backing file vnode if ufs_extattr_enable failed to
avoid leaking it.

Reviewed by:	rwatson
2001-06-07 00:11:32 +00:00
Jonathan Lemon
3b6e32b01e Add a wrapper for the fifo kqfilter which falls through to the ufs routine.
This permits the fifo to inherit the ufs VNODE kqfilter.
2001-06-06 17:40:57 +00:00
Jonathan Lemon
4e92ec9dd0 Add a kqueue filter for writing to ufs filesystems which always returns
true.  This permits better interoperability with programs which register
filters on their stdin/stdout handles.

Submitted by: Niels Provos <provos@citi.umich.edu>
2001-06-05 13:52:37 +00:00
David E. O'Brien
1239674238 There seems to be a problem that the order of disk write operation being
incorrect due to a missing check for some dependency.  This change
avoids the freelist corruption (but not the temporarily inconsistent
state of the file system).

A message is printed as a reminder of the under lying problem when a
pagedep structure is not freed due to the NEWBLOCK flag being set.

Submitted by:	Tor.Egge@fast.no
2001-06-05 01:49:37 +00:00
John Baldwin
1c11b01562 Revert the previous commit in favor of the fix in rev 1.42 of
ufs/ffs/ffs_extern.h instead.

Requested by:	bde
2001-05-30 23:09:19 +00:00
John Baldwin
55d132317c Forward declare struct cg to quiet a warning.
Submitted by:	bde
2001-05-30 23:08:40 +00:00
John Baldwin
59718ee556 Include <ufs/ffs/fs.h> to get the definition of struct cg to quiet a
warning.
2001-05-29 23:53:16 +00:00
Poul-Henning Kamp
c7a3e2379c Remove last vestiges of MFS. 2001-05-29 21:21:53 +00:00
Poul-Henning Kamp
870b4959b7 Remove MFS from the kernel. 2001-05-29 18:50:30 +00:00
Thomas Moestl
3c436f07b7 Add a check to determine whether extended attributes have been
initialized on the file system before trying to grab the lock of the
per-mount extattr structure, as this lock is unitialized in that case.
This is needed because ufs_extattr_vnode_inactive is called from
ufs_inactive, which is also used by EA-unaware file systems such as
ext2fs.

Reviewed by:	rwatson
2001-05-25 18:24:52 +00:00
Robert Watson
b1fc0ec1a7 o Merge contents of struct pcred into struct ucred. Specifically, add the
real uid, saved uid, real gid, and saved gid to ucred, as well as the
  pcred->pc_uidinfo, which was associated with the real uid, only rename
  it to cr_ruidinfo so as not to conflict with cr_uidinfo, which
  corresponds to the effective uid.
o Remove p_cred from struct proc; add p_ucred to struct proc, replacing
  original macro that pointed.
  p->p_ucred to p->p_cred->pc_ucred.
o Universally update code so that it makes use of ucred instead of pcred,
  p->p_ucred instead of p->p_pcred, cr_ruidinfo instead of p_uidinfo,
  cr_{r,sv}{u,g}id instead of p_*, etc.
o Remove pcred0 and its initialization from init_main.c; initialize
  cr_ruidinfo there.
o Restruction many credential modification chunks to always crdup while
  we figure out locking and optimizations; generally speaking, this
  means moving to a structure like this:
        newcred = crdup(oldcred);
        ...
        p->p_ucred = newcred;
        crfree(oldcred);
  It's not race-free, but better than nothing.  There are also races
  in sys_process.c, all inter-process authorization, fork, exec, and
  exit.
o Remove sigio->sio_ruid since sigio->sio_ucred now contains the ruid;
  remove comments indicating that the old arrangement was a problem.
o Restructure exec1() a little to use newcred/oldcred arrangement, and
  use improved uid management primitives.
o Clean up exit1() so as to do less work in credential cleanup due to
  pcred removal.
o Clean up fork1() so as to do less work in credential cleanup and
  allocation.
o Clean up ktrcanset() to take into account changes, and move to using
  suser_xxx() instead of performing a direct uid==0 comparision.
o Improve commenting in various kern_prot.c credential modification
  calls to better document current behavior.  In a couple of places,
  current behavior is a little questionable and we need to check
  POSIX.1 to make sure it's "right".  More commenting work still
  remains to be done.
o Update credential management calls, such as crfree(), to take into
  account new ruidinfo reference.
o Modify or add the following uid and gid helper routines:
      change_euid()
      change_egid()
      change_ruid()
      change_rgid()
      change_svuid()
      change_svgid()
  In each case, the call now acts on a credential not a process, and as
  such no longer requires more complicated process locking/etc.  They
  now assume the caller will do any necessary allocation of an
  exclusive credential reference.  Each is commented to document its
  reference requirements.
o CANSIGIO() is simplified to require only credentials, not processes
  and pcreds.
o Remove lots of (p_pcred==NULL) checks.
o Add an XXX to authorization code in nfs_lock.c, since it's
  questionable, and needs to be considered carefully.
o Simplify posix4 authorization code to require only credentials, not
  processes and pcreds.  Note that this authorization, as well as
  CANSIGIO(), needs to be updated to use the p_cansignal() and
  p_cansched() centralized authorization routines, as they currently
  do not take into account some desirable restrictions that are handled
  by the centralized routines, as well as being inconsistent with other
  similar authorization instances.
o Update libkvm to take these changes into account.

Obtained from:	TrustedBSD Project
Reviewed by:	green, bde, jhb, freebsd-arch, freebsd-audit
2001-05-25 16:59:11 +00:00
Matthew Dillon
ac8f990bde This patch implements O_DIRECT about 80% of the way. It takes a patchset
Tor created a while ago, removes the raw I/O piece (that has cache coherency
problems), and adds a buffer cache / VM freeing piece.

Essentially this patch causes O_DIRECT I/O to not be left in the cache, but
does not prevent it from going through the cache, hence the 80%.  For
the last 20% we need a method by which the I/O can be issued directly to
buffer supplied by the user process and bypass the buffer cache entirely,
but still maintain cache coherency.

I also have the code working under -stable but the changes made to sys/file.h
may not be MFCable, so an MFC is not on the table yet.

Submitted by:	tegge, dillon
2001-05-24 07:22:27 +00:00
Alfred Perlstein
1752ee59ba ufs_bmaparray() may block on IO, drop vm mutex and aquire Giant when
calling it from the pager routine
2001-05-23 10:30:25 +00:00
Ruslan Ermilov
99d300a1ec - FDESC, FIFO, NULL, PORTAL, PROC, UMAP and UNION file
systems were repo-copied from sys/miscfs to sys/fs.

- Renamed the following file systems and their modules:
  fdesc -> fdescfs, portal -> portalfs, union -> unionfs.

- Renamed corresponding kernel options:
  FDESC -> FDESCFS, PORTAL -> PORTALFS, UNION -> UNIONFS.

- Install header files for the above file systems.

- Removed bogus -I${.CURDIR}/../../sys CFLAGS from userland
  Makefiles.
2001-05-23 09:42:29 +00:00
Kirk McKusick
57042c7f72 Update softdep_setup_directory_add prototype to reflect changes in
actual function.

Obtained from:	Jim Bloom <bloom@jbloom.jbloom.org>
2001-05-20 15:59:55 +00:00
Kirk McKusick
dc01275be9 Must ensure that all the entries on the pd_pendinghd list have been
committed to disk before clearing them. More specifically, when
free_newdirblk is called, we know that the inode claims the new
directory block. However, if the associated pagedep is still linked
onto the directory buffer dependency chain, then some of the entries
on the pd_pendinghd list may not be committed to disk yet. In this
case, we will simply note that the inode claims the block and let
the pd_pendinghd list be processed when the pagedep is next written.
If the pagedep is no longer on the buffer dependency chain, then
all the entries on the pd_pending list are committed to disk and
we can free them in free_newdirblk. This corrects a window of
vulnerability introduced in the code added in version 1.95.
2001-05-19 19:24:26 +00:00
Alfred Perlstein
2395531439 Introduce a global lock for the vm subsystem (vm_mtx).
vm_mtx does not recurse and is required for most low level
vm operations.

faults can not be taken without holding Giant.

Memory subsystems can now call the base page allocators safely.

Almost all atomic ops were removed as they are covered under the
vm mutex.

Alpha and ia64 now need to catch up to i386's trap handlers.

FFS and NFS have been tested, other filesystems will need minor
changes (grabbing the vm lock when twiddling page properties).

Reviewed (partially) by: jake, jhb
2001-05-19 01:28:09 +00:00
Kirk McKusick
9f5192ff71 Must be a bit less aggressive about freeing pagedep structures.
Obtained from:	Robert Watson <rwatson@FreeBSD.org> and
		Matthew Jacob <mjacob@feral.com>
2001-05-18 22:16:28 +00:00
Kirk McKusick
24a83a4b3f When a new block is allocated to a directory, an fsync of a file
whose name is within that block must ensure not only that the block
containing the file name has been written, but also that the on-disk
directory inode references that block. When a new directory block
is created, we allocate a newdirblk structure which is linked to
the associated allocdirect (on its ad_newdirblk list). When the
allocdirect has been satisfied, the newdirblk structure is moved
to the inodedep id_bufwait list of its directory to await the inode
being written.  When the inode is written, the directory entries
are fully committed and can be deleted from their pagedep->id_pendinghd
and inodedep->id_pendinghd lists.
2001-05-17 07:24:03 +00:00
Ian Dowse
0864ef1e8a Change the second argument of vflush() to an integer that specifies
the number of references on the filesystem root vnode to be both
expected and released. Many filesystems hold an extra reference on
the filesystem root vnode, which must be accounted for when
determining if the filesystem is busy and then released if it isn't
busy. The old `skipvp' approach required individual filesystem
xxx_unmount functions to re-implement much of vflush()'s logic to
deal with the root vnode.

All 9 filesystems that hold an extra reference on the root vnode
got the logic wrong in the case of forced unmounts, so `umount -f'
would always fail if there were any extra root vnode references.
Fix this issue centrally in vflush(), now that we can.

This commit also fixes a vnode reference leak in devfs, which could
result in idle devfs filesystems that refuse to unmount.

Reviewed by:	phk, bp
2001-05-16 18:04:37 +00:00
Kirk McKusick
7389126d9a Further fixes for deadlock in the presence of multiple snapshots.
There are still more to find, but this fix should cover the
common cases that folks are hitting.
2001-05-14 17:16:49 +00:00
Kirk McKusick
0b04113700 If the effective link count is zero when an NFS file handle request
comes in for it, the file is really gone, so return ESTALE.

The problem arises when the last reference to an FFS file is
released because soft-updates may delay the actual freeing of the
inode for some time. Since there are no filesystem links or open
file descriptors referencing the inode, from the point of view of
the system, the file is inaccessible. However, if the filesystem
is NFS exported, then the remote client can still access the inode
via ufs_fhtovp() until the inode really goes away. To prevent this
anomoly, it is necessary to begin returning ESTALE at the same time
that the file ceases to be accessible to the local filesystem.

Obtained from:	Ian Dowse <iedowse@maths.tcd.ie>
2001-05-13 23:30:45 +00:00
Kirk McKusick
9b35c30cf7 Remove yet another deadlock case. 2001-05-11 07:12:03 +00:00
Kirk McKusick
9ccb939ef0 When running with soft updates, track the number of blocks and files
that are committed to being freed and reflect these blocks in the
counts returned by statfs (and thus also by the `df' command). This
change allows programs such as those that do news expiration to
know when to stop if they are trying to create a certain percentage
of free space. Note that this change does not solve the much harder
problem of making this to-be-freed space available to applications
that want it (thus on a nearly full filesystem, you may still
encounter out-of-space conditions even though the free space will
show up eventually). Hopefully this harder problem will be the
subject of a future enhancement.
2001-05-08 07:42:20 +00:00
Kirk McKusick
27b047acf0 Several fixes for units errors:
1) Do not assume that the superblock will be of size fs->fs_bsize.
   This fixes a panic when taking a snapshot on a filesystem with
   a block size bigger than 8K.
2) Properly calculate the number of fragments that follow the
   superblock summary information. This fixes a bug with inconsistent
   snapshots.
3) When cleaning up a snapshot that is about to be removed, properly
   calculate the number of blocks that need to be checked. This fixes
   a bug that created partially allocated inodes.
4) When moving blocks from a snapshot that is about to be removed
   to another snapshot, properly account for the reduced number of
   blocks in the snapshot from which they are taken. This fixes a
   bug in which the number of blocks released from a snapshot did not
   match the number that it claimed to have.
2001-05-08 07:29:03 +00:00
Kirk McKusick
0c6fbff0a5 When syncing out snapshot metadata, we must temporarily allow recursive
buffer locking so as to avoid locking against ourselves if we need to
write filesystem metadata.
2001-05-08 07:13:00 +00:00
Kirk McKusick
23371b2f22 Refinement to revision 1.16 of ufs/ffs/ffs_snapshot.c to reduce
the amount of time that the filesystem must be suspended. The
current snapshot is elided as well as the earlier snapshots.
2001-05-04 05:49:28 +00:00
Poul-Henning Kamp
3858e5e797 Use ufs_bmaparray() rather than VOP_BMAP() on our own vnodes. 2001-05-01 09:12:39 +00:00
Poul-Henning Kamp
3c7a8027cb Remove blatantly pointless call to VOP_BMAP().
Use ufs_bmaparray() rather than VOP_BMAP() on our own vnodes.
2001-05-01 09:12:31 +00:00
Poul-Henning Kamp
a62615e59b Implement vop_std{get|put}pages() and add them to the default vop[].
Un-copy&paste all the VOP_{GET|PUT}PAGES() functions which do nothing but
the default.
2001-05-01 08:34:45 +00:00
Mark Murray
fb919e4d5a Undo part of the tangle of having sys/lock.h and sys/mutex.h included in
other "system" header files.

Also help the deprecation of lockmgr.h by making it a sub-include of
sys/lock.h and removing sys/lockmgr.h form kernel .c files.

Sort sys/*.h includes where possible in affected files.

OK'ed by:	bde (with reservations)
2001-05-01 08:13:21 +00:00
Poul-Henning Kamp
855aa097af VOP_BALLOC was never really a VOP in the first place, so convert it
to UFS_BALLOC like the other "between UFS and FFS function interfaces".
2001-04-29 12:36:52 +00:00