Commit Graph

1225 Commits

Author SHA1 Message Date
phk
72aba11981 Use three UMA zones for FFS/UFS inodes instead of malloc space.
Since inodes are currently 144 bytes, this will save 112 bytes per
inode.  This can amount to up to 10MByte on large systems.
2002-12-27 11:05:05 +00:00
phk
3d9be4e20a Move the allocation of the inode contents into ffs_vfsops.c rather than
passing malloc types around.
2002-12-27 10:23:03 +00:00
phk
afd8ad09d7 Make ffs_mountfs() static.
Remove the malloctype from the ufs mount structure, instead add a callback
to the storage method for freeing inodes: UFS_IFREE().

Add vfs_ifree() method function which frees an inode.

Unvariablelize the malloc type used for allocating inodes.
2002-12-27 10:06:37 +00:00
mckusick
91195ac766 Fix corruption introduced in previous delta.
Reported by:	Aurelien Nephtali <aurelien.nephtali@wanadoo.fr>
Sponsored by:   DARPA & NAI Labs.
2002-12-18 19:50:28 +00:00
mckusick
f86b91ebe0 Keep comments consistent with the code. Minor optimization.
Sponsored by:   DARPA & NAI Labs.
2002-12-18 07:19:41 +00:00
mckusick
345fe7ca1f Cosmetic cleanup of unsigned buglets.
Submitted by:	Bruce Evans <bde@zeta.org.au>
Sponsored by:   DARPA & NAI Labs.
2002-12-18 00:53:45 +00:00
phk
723e3b218d Remove unused lockcnt variable.
Approved by:	mckusick
2002-12-17 20:23:51 +00:00
mckusick
668d95fdc4 Update to previous change (1.54) to use an approperly wide inode field
so as to work correctly on 64-bit platforms.

Reported-by:	Jake Burkholder <jake@locore.ca>
Sponsored by:   DARPA & NAI Labs.
Approved by:	Ian Dowse <iedowse@maths.tcd.ie>
2002-12-15 19:25:59 +00:00
iedowse
3bf3f7056c Undo the adjustment of the total memory used by dirhash in the case
where allocating the dirhash structure fails. Fix a few typos in
comments and update copyright.

MFC after:	1 week
2002-12-14 17:16:16 +00:00
mckusick
069ffa44b5 Only the most recent snapshot contains the complete list of blocks
that were copied in all of the earlier snapshots, thus its precomputed
list must be used in the copyonwrite test. Using incomplete lists may
lead to deadlock. Also do not include the blocks used for the indirect
pointers in the indirect pointers as this may lead to inconsistent
snapshots.

Sponsored by:   DARPA & NAI Labs.
Approved by:	re
2002-12-14 01:36:59 +00:00
trhodes
b45b4e9e9e Remove the comment about dump(8) not working properly with snapshots.
Discussed with:	mckusick
Approved by:	re (rwatson)
2002-12-12 00:31:45 +00:00
mckusick
461d1f1c94 More tightly verify the preference returned for the new inode.
Submitted by:	Kris Kennaway <kris@obsecurity.org>
Sponsored by:   DARPA & NAI Labs.
Approved by:	re
2002-12-06 02:08:46 +00:00
mckusick
6822be5fe2 Have to use bread() rather than UFS_BALLOC() when obtaining a
previously allocated block as the previous use of the block may
have fallen out of the cache. Failure to reread its contents cause
zeroed results to be written instead of the proper contents.
Conversely, when the block is going to be entirely filled in, it
is not necessary reread the old contents.

Sponsored by:   DARPA & NAI Labs.
Approved by:	re
2002-12-03 18:19:27 +00:00
mckusick
69c6a0dc68 Add a check to disable the previous patch so that future filesystems
that choose to place their superblocks in non-standard locations will
not get them smashed.

Sponsored by:   DARPA & NAI Labs.
2002-11-30 19:04:57 +00:00
mckusick
c1f52553a2 Remove a race condition / deadlock from snapshots. When
converting from individual vnode locks to the snapshot
lock, be sure to pass any waiting processes along to the
new lock as well. This transfer is done by a new function
in the lock manager, transferlockers(from_lock, to_lock);
Thanks to Lamont Granquist <lamont@scriptkiddie.org> for
his help in pounding on snapshots beyond all reason and
finding this deadlock.

Sponsored by:   DARPA & NAI Labs.
2002-11-30 19:00:51 +00:00
mckusick
d4a32db4ae Fix two deadlocks in snapshots:
1) Release the snapshot file lock while suspending the system. Otherwise
   a process trying to read the lock may block on its containing directory
   preventing the suspension from completing. Thanks to Sean Kelly
   <smkelly@zombie.org> for finding this deadlock.

2) Replace some bdwrite's with bawrite's so as not to fill all the
   buffers with dirty data. The buffers could not be cleaned as the
   snapshot vnode was locked hence the system could deadlock when
   making snapshots of really massive filesystems. Thanks to
   Hidetoshi Shimokawa <simokawa@sat.t.u-tokyo.ac.jp> for figuring
   this out.

Sponsored by:   DARPA & NAI Labs.
2002-11-30 07:27:12 +00:00
mckusick
78d15991e4 Check to make sure that the fs_sblockloc field was properly updated
before using it to write the superblock. This is to guard against
accidentally trashing the disklabel if the superblock format missed
being upgraded by the new kernel.

Reported by:	Sam Leffler <sam@errno.com>
Sponsored by:   DARPA & NAI Labs.
Approved by:	Murray Stokely <murray@FreeBSD.org>
2002-11-29 19:20:15 +00:00
mckusick
9251693096 Create a new 32-bit fs_flags word in the superblock. Add code to move
the old 8-bit fs_old_flags to the new location the first time that the
filesystem is mounted by a new kernel. One of the unused flags in
fs_old_flags is used to indicate that the flags have been moved.
Leave the fs_old_flags word intact so that it will work properly if
used on an old kernel.

Change the fs_sblockloc superblock location field to be in units
of bytes instead of in units of filesystem fragments. The old units
did not work properly when the fragment size exceeeded the superblock
size (8192). Update old fs_sblockloc values at the same time that
the flags are moved.

Suggested by:	BOUWSMA Barry <freebsd-misuser@netscum.dyndns.dk>
Sponsored by:   DARPA & NAI Labs.
2002-11-27 02:18:58 +00:00
mckusick
a51ebc9dc0 The target for the maximum number of dependencies has been cut
in half because of reports that under heavy load the kernel could
exhaust its memory pool. The limit is now (desiredvnodes * 4)
rather than (desiredvnodes * 8), so it will still scale with
larger systems, just not as quickly.

Sponsored by:   DARPA & NAI Labs.
2002-11-20 05:16:11 +00:00
mckusick
637af64f54 If an error occurs while writing a buffer, then the data will
not have hit the disk and the dependencies cannot be unrolled.
In this case, the system will mark the buffer as dirty again so
that the write can be retried in the future. When the write
succeeds or the system gives up on the buffer and marks it as
invalid (B_INVAL), the dependencies will be cleared.

Sponsored by:   DARPA & NAI Labs.
2002-11-20 05:14:16 +00:00
peter
c841be9bcb Do not assume that time_t is an int.
Approved by:	re (jhb)
2002-11-15 22:36:57 +00:00
jhb
b8daeabf59 Print daddr_t's with %j and intmax_t. 2002-11-08 22:28:35 +00:00
rwatson
3ad18c8074 Update licenses and wording: NAI has authorized the removal of clause three
of their BSD-style license; also, carry out the NAI Labs -> Network
Associates Laboratories renaming in these files.
2002-11-04 02:35:46 +00:00
wollman
ce3867deda Implement the new 1003.1-2001 pathconf() keys, including the Advisory
Information option.  Other filesystem implementations should do something
similar.

With advice from:	mckusick, phk
2002-10-27 18:09:49 +00:00
rwatson
312cab0dee Slightly change the semantics of vnode labels for MAC: rather than
"refreshing" the label on the vnode before use, just get the label
right from inception.  For single-label file systems, set the label
in the generic VFS getnewvnode() code; for multi-label file systems,
leave the labeling up to the file system.  With UFS1/2, this means
reading the extended attribute during vfs_vget() as the inode is
pulled off disk, rather than hitting the extended attributes
frequently during operations later, improving performance.  This
also corrects sematics for shared vnode locks, which were not
previously present in the system.  This chances the cache
coherrency properties WRT out-of-band access to label data, but in
an acceptable form.  With UFS1, there is a small race condition
during automatic extended attribute start -- this is not present
with UFS2, and occurs because EAs aren't available at vnode
inception.  We'll introduce a work around for this shortly.

Approved by:	re
Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2002-10-26 14:38:24 +00:00
mckusick
6b1611bd94 Within ufs, the ffs_sync and ffs_fsync functions did not always
check for and/or report I/O errors. The result is that a VFS_SYNC
or VOP_FSYNC called with MNT_WAIT could loop infinitely on ufs in
the presence of a hard error writing a disk sector or in a filesystem
full condition. This patch ensures that I/O errors will always be
checked and returned.  This patch also ensures that every call to
VFS_SYNC or VOP_FSYNC with MNT_WAIT set checks for and takes
appropriate action when an error is returned.

Sponsored by:   DARPA & NAI Labs.
2002-10-25 00:20:37 +00:00
mckusick
0337df10b7 We must be careful to avoid recursive copy-on-write faults when
trying to clean up during disk-full senarios.

Sponsored by:	DARPA & NAI Labs.
2002-10-23 21:47:02 +00:00
mckusick
3819d46020 Missplaced FREE_LOCK causes a panic when hit while taking a snapshot.
Sponsored by:	DARPA & NAI Labs.
2002-10-23 05:14:06 +00:00
mckusick
04450228c6 This update further fine tunes the locking of snapshot vnodes in
the ffs_copyonwrite routine to avoid a deadlock between the syncer
daemon trying to sync out a snapshot vnode and the bufdaemon
trying to write out a buffer containing the snapshot inode.
With any luck this will be the last snapshot race condition.

Sponsored by:	DARPA & NAI Labs.
2002-10-22 01:23:00 +00:00
mckusick
a515fcf789 This update is a performance improvement when allocating blocks on
a full filesystem. Previously, if the allocation failed, we had to
fsync the file before rolling back any partial allocation of indirect
blocks. Most block allocation requests only need to allocate a single
data block and if that allocation fails, there is nothing to unroll.
So, before doing the fsync, we check to see if any rollback will
really be necessary. If none is necessary, then we simply return.
This update eliminates the flurry of disk activity that got triggered
whenever a filesystem would run out of space.

Sponsored by:	DARPA & NAI Labs.
2002-10-22 01:14:25 +00:00
mckusick
305e5868f3 This checkin reimplements the io-request priority hack in a way
that works in the new threaded kernel. It was commented out of
the disksort routine earlier this year for the reasons given in
kern/subr_disklabel.c (which is where this code used to reside
before it moved to kern/subr_disk.c):

----------------------------
revision 1.65
date: 2002/04/22 06:53:20;  author: phk;  state: Exp;  lines: +5 -0
Comment out Kirks io-request priority hack until we can do this in a
civilized way which doesn't cause grief.

The problem is that it is not generally safe to cast a "struct bio
*" to a "struct buf *".  Things like ccd, vinum, ata-raid and GEOM
constructs bio's which are not entrails of a struct buf.

Also, curthread may or may not have anything to do with the I/O request
at hand.

The correct solution can either be to tag struct bio's with a
priority derived from the requesting threads nice and have disksort
act on this field, this wouldn't address the "silly-seek syndrome"
where two equal processes bang the diskheads from one edge to the
other of the disk repeatedly.

Alternatively, and probably better: a sleep should be introduced
either at the time the I/O is requested or at the time it is completed
where we can be sure to sleep in the right thread.

The sleep also needs to be in constant timeunits, 1/hz can be practicaly
any sub-second size, at high HZ the current code practically doesn't
do anything.
----------------------------

As suggested in this comment, it is no longer located in the disk sort
routine, but rather now resides in spec_strategy where the disk operations
are being queued by the thread that is associated with the process that
is really requesting the I/O. At that point, the disk queues are not
visible, so the I/O for positively niced processes is always slowed
down whether or not there is other activity on the disk.

On the issue of scaling HZ, I believe that the current scheme is
better than using a fixed quantum of time. As machines and I/O
subsystems get faster, the resolution on the clock also rises.
So, ten years from now we will be slowing things down for shorter
periods of time, but the proportional effect on the system will
be about the same as it is today. So, I view this as a feature
rather than a drawback. Hence this patch sticks with using HZ.

Sponsored by:	DARPA & NAI Labs.
Reviewed by:	Poul-Henning Kamp <phk@critter.freebsd.dk>
2002-10-22 00:59:49 +00:00
rwatson
d862ecfee8 Rename _POSIX_FOO_PRESENT and friends from POSIX.1e to _PC_FOO_PRESENT
and related friends.  This would have been corrected had POSIX.1e
progressed to a standard.

Pointed out by:	wollman
2002-10-20 22:11:13 +00:00
rwatson
438835cabb Implement _POSIX_ACL_PATH_MAX, which returns the maximum number of ACL
entries for a file system node using pathconf().

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2002-10-20 22:08:26 +00:00
rwatson
9d17032f64 Teach UFS to respond to pathconf() tests for _POSIX_ACL_EXTENDED and
_POSIX_MAC_PRESENT based on available mount flags, if the services are
available.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2002-10-20 21:49:41 +00:00
rwatson
a2eb2e3662 Clarify that the UFS1 extended attribute configuration steps do not apply
to UFS2 file systems.

Submitted by:	jedgar
Obtained from:	TrustedBSD Project
2002-10-19 16:09:16 +00:00
dillon
d155b8f135 Fix a file-rewrite performance case for UFS[2]. When rewriting portions
of a file in chunks that are less then the filesystem block size, if the
data is not already cached the system will perform a read-before-write.
The problem is that it does this on a block-by-block basis, breaking up the
I/Os and making clustering impossible for the writes.  Programs such
as INN using cyclic file buffers suffer greatly.  This problem is only going
to get worse as we use larger and larger filesystem block sizes.

The solution is to extend the sequential heuristic so UFS[2] can perform
a far larger read and readahead when dealing with this case.

(note: maximum disk write bandwidth is 27MB/sec thru filesystem)
(note: filesystem blocksize in test is 8K (1K frag))
dd if=/dev/zero of=test.dat bs=1k count=2m conv=notrunc

Before:  (note half of these are reads)
      tty             da0              da1             acd0             cpu
 tin tout  KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s  us ni sy in id
   0   76 14.21 598  8.30   0.00   0  0.00   0.00   0  0.00   0  0  7  1 92
   0   76 14.09 813 11.19   0.00   0  0.00   0.00   0  0.00   0  0  9  5 86
   0   76 14.28 821 11.45   0.00   0  0.00   0.00   0  0.00   0  0  8  1 91

After:	(note half of these are reads)
      tty             da0              da1             acd0             cpu
 tin tout  KB/t tps  MB/s   KB/t tps  MB/s   KB/t tps  MB/s  us ni sy in id
   0   76 63.62 434 26.99   0.00   0  0.00   0.00   0  0.00   0  0 18  1 80
   0   76 63.58 424 26.30   0.00   0  0.00   0.00   0  0.00   0  0 17  2 82
   0   76 63.82 438 27.32   0.00   0  0.00   0.00   0  0.00   1  0 19  2 79

Reviewed by:	mckusick
Approved by:	re
X-MFC after:	immediately (was heavily tested in -stable for 4 months)
2002-10-18 22:52:41 +00:00
rwatson
ab9568ccbf Update extended attribute readme file to note that no special configuration
is required to use EAs with UFS2, and that UFS2 is recommend for EA use
for a variety of reasons.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2002-10-18 21:11:36 +00:00
rwatson
10e2a00a6a Update instructions for ACLs given recent tunefs, mount changes. Also
note that UFS2 doesn't require explicit extended attribute configuration,
and is recommends for this and other reasons if you plan to use ACLs.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2002-10-18 21:09:57 +00:00
rwatson
749729a702 Use 'size_t' instead of 'int' for the result of sizeof(). 2002-10-18 21:03:30 +00:00
mckusick
0af0d22682 With the revised single-lock method used in snapshots, the
BA_NOWAIT flag is no longer needed.

Sponsored by:	DARPA & NAI Labs.
2002-10-18 01:17:28 +00:00
mckusick
733bfbdd78 Change locking so that all snapshots on a particular filesystem share
a common lock. This change avoids a deadlock between snapshots when
separate requests cause them to deadlock checking each other for a
need to copy blocks that are close enough together that they fall
into the same indirect block. Although I had anticipated a slowdown
from contention for the single lock, my filesystem benchmarks show
no measurable change in throughput on a uniprocessor system with
three active snapshots. I conjecture that this result is because
every copy-on-write fault must check all the active snapshots, so
the process was inherently serial already. This change removes the
last of the deadlocks of which I am aware in snapshots.

Sponsored by:	DARPA & NAI Labs.
2002-10-16 00:19:23 +00:00
rwatson
174c4a0034 Push most UFS ACL behavior behind a check for MNT_ACLS, permitting ACLs
to be administratively disabled as needed on UFS/UFS2 file systems.  This
also has the effect of preventing the slightly more expensive ACL code
from running on non-ACL file systems, avoiding storage allocation for
ACLs that may be read from disk.  MNT_ACLS may be set at mount-time
using mount -o acls, or implicitly by setting the FS_ACLS flag using
tunefs.  On UFS1, you may also have to configure ACL store.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2002-10-15 21:28:24 +00:00
rwatson
e112f21cae If the FS_MULTILABEL flag is set in a UFS or UFS2 superblock,
automatically set MNT_MULTILABEL in the mount flags.

If FS_ACLS is set in a UFS or UFS2 superblock, automatically
set MNT_ACLS in the mount flags.

If either of these flags is set, but the appropriate kernel option
to support the features associated with the flag isn't available,
then print a warning at mount-time.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
2002-10-15 20:00:06 +00:00
mckusick
750c7540cc When reading or writing the extended attributes of a special device
or fifo in UFS2, the normal ufs_strategy routine needs to be used
rather than the spec_strategy or fifo_strategy routine. Thus the
ffsext_strategy routine is interposed in the ffs_vnops vectors for
special devices and fifo's to pick off this special case. Otherwise
it simply falls through to the usual spec_strategy or fifo_strategy
routine.

Submitted by:	Robert Watson <rwatson@FreeBSD.org>
Sponsored by:	DARPA & NAI Labs.
2002-10-14 23:18:09 +00:00
rwatson
67d568c288 Fix two memory leaks in error conditions involving the UFS ACL code:
if failures occur, make sure that we release both the default ACL
and access ACL storage during new object creation.

Spotted by:	phk and his pet flexelint
Sponsored by:	DARPA, Network Associates Laboratories
2002-10-14 19:55:49 +00:00
rwatson
9c6b9f51d1 Define two new superblock file system flags:
FS_ACLS		Administrative enable/disable of extended ACL support
FS_MULTILABEL	Administrative flag to indicate to the MAC Framework
		that objects in the file system are individually
		labeled using extended attributes.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, Network Associates Laboratories
Reviewed by:	(in principal) mckusick, phk
2002-10-14 17:07:11 +00:00
mckusick
25230d4c6a Regularize the vop_stdlock'ing protocol across all the filesystems
that use it. Specifically, vop_stdlock uses the lock pointed to by
vp->v_vnlock. By default, getnewvnode sets up vp->v_vnlock to
reference vp->v_lock. Filesystems that wish to use the default
do not need to allocate a lock at the front of their node structure
(as some still did) or do a lockinit. They can simply start using
vn_lock/VOP_UNLOCK. Filesystems that wish to manage their own locks,
but still use the vop_stdlock functions (such as nullfs) can simply
replace vp->v_vnlock with a pointer to the lock that they wish to
have used for the vnode. Such filesystems are responsible for
setting the vp->v_vnlock back to the default in their vop_reclaim
routine (e.g., vp->v_vnlock = &vp->v_lock).

In theory, this set of changes cleans up the existing filesystem
lock interface and should have no function change to the existing
locking scheme.

Sponsored by:	DARPA & NAI Labs.
2002-10-14 03:20:36 +00:00
mike
8630abe45f Change iov_base's type from char *' to the standard void *'. All
uses of iov_base which assume its type is `char *' (in order to do
pointer arithmetic) have been updated to cast iov_base to `char *'.
2002-10-11 14:58:34 +00:00
mux
ddefc9d7d6 Fix build of 64 bit platforms. 2002-10-09 12:19:36 +00:00
mckusick
e13d3e9276 When creating a snapshot, create a list of initially allocated blocks.
Whenever doing a copy-on-write check, first look in the list of
initially allocated blocks to see if it is there. If so, no further
check is needed. If not, fall through and do the full check. This
change eliminates one of two known deadlocks caused by snapshots.
Handling the second deadlock will be the subject of another check-in.
This change also reduces the cost of the copy-on-write check by
speeding up the verification of frequently checked blocks.

Sponsored by:	DARPA & NAI Labs.
2002-10-09 07:28:35 +00:00
mckusick
8cecf7b3e5 When creating a snapshot, create a list of initially allocated blocks.
Whenever doing a copy-on-write check, first look in the list of
initially allocated blocks to see if it is there. If so, no further
check is needed. If not, fall through and do the full check. This
change eliminates one of two known deadlocks caused by snapshots.
Handling the second deadlock will be the subject of another check-in.
This change also reduces the cost of the copy-on-write check by
speeding up the verification of frequently checked blocks.

Sponsored by:	DARPA & NAI Labs.
2002-10-09 06:13:48 +00:00
mckusick
d01b53c175 The appropriate units for disk block addresses are always DEV_BSIZE,
even when the underlying device has a larger sector size. Therefore,
the filesystem code should not (and with this patch does not) try to
use the underlying sector size when doing disk block address calculations.

This patch fixes problems in -current when using the swap-based
memory-disk device (mdconfig -a -t swap ...). This bugfix is not
relevant to -stable as -stable does not have the memory-disk device.

Sponsored by:	DARPA & NAI Labs.
2002-10-09 04:01:23 +00:00
jeff
29006c0306 - Remove LK_INTERLOCK from the vn_lock() in ffs_snapshot().
Pointy hat to:	me
Found by:	green
2002-10-08 21:00:52 +00:00
phk
f98c8d3a06 Mark two places where an unsigned number is checked "if (foo < 0)" with
an XXX comment.

Somebody[TM] should look at this in some detail.

Spotted by:	FlexeLint
2002-10-02 09:11:18 +00:00
dd
814e414600 size_t is not a struct (fix mislabelling in a comment). 2002-10-02 05:15:34 +00:00
phk
b55fa4540e Fix some harmless mis-indents.
Spotted by:	FlexeLint
2002-10-01 15:48:31 +00:00
jmallett
a17c9b6f5c When spamming me with a printf(9), under DIAGNOSTIC, at least be nice enough
to include a newline.

MFC after:	4 days
Sponsored by:	Bright Path Solutions
2002-09-28 19:04:49 +00:00
phk
1dfc2c167f Be consistent about "static" functions: if the function is marked
static in its prototype, mark it static at the definition too.

Inspired by:    FlexeLint warning #512
2002-09-28 17:15:38 +00:00
phk
1a999cb3f9 Make it a tad easier to deal with struct inode in userland programs which
fondle /dev/kmem by using "struct cdev *" instead of "dev_t".

Requsted by:	jake
2002-09-27 20:03:05 +00:00
phk
5f5d8287e5 Use our mount-credential if we get a NOCRED when we try to write out EA
space back to disk.

This is wrong in many ways, but not as wrong as a panic.

Pancied on:	rwatson & jmallet
Sponsored by:	DARPA & NAI Labs.
2002-09-27 20:00:03 +00:00
jeff
8bebc5fdba - Convert locks to use standard macros.
- Lock access to the buflists.
 - Document broken locking.
 - Use vrefcnt().
2002-09-25 02:49:48 +00:00
jeff
90e87c8eb5 - Document broken locking.
- Use vrefcnt().
2002-09-25 02:47:49 +00:00
jeff
41b9d1ca5d - Lock accesses to v_usecount.
- Convert interlock locks to use standard macros.
2002-09-25 02:45:50 +00:00
jeff
263f8202f6 - Don't use the interlock to protect v_writecount. 2002-09-25 02:44:55 +00:00
phk
71d1473801 We don't need to #include <sys/disklabel.h>.
We don't need to #include <sys/disklabel.h> second time either.

Sponsored by:	DARPA & NAI Labs.
2002-09-20 16:42:33 +00:00
truckman
f280782003 VOP_FSYNC() requires that it's vnode argument be locked, which nfs_link()
wasn't doing.  Rather than just lock and unlock the vnode around the call
to VOP_FSYNC(), implement rwatson's suggestion to lock the file vnode
in kern_link() before calling VOP_LINK(), since the other filesystems
also locked the file vnode right away in their link methods.  Remove the
locking and and unlocking from the leaf filesystem link methods.

Reviewed by:	rwatson, bde  (except for the unionfs_link() changes)
2002-09-19 13:32:45 +00:00
obrien
a8deaee84b intmax_t is printed with %jd, not %lld. 2002-09-19 03:55:30 +00:00
njl
00c79f5c92 Remove any VOP_PRINT that redundantly prints the tag.
Move lockmgr_printinfo() into vprint() for everyone's benefit.

Suggested by: bde
2002-09-18 20:42:04 +00:00
njl
0590c43070 Remove all use of vnode->v_tag, replacing with appropriate substitutes.
v_tag is now const char * and should only be used for debugging.

Additionally:
1. All users of VT_NTS now check vfsconf->vf_type VFCF_NETWORK
2. The user of VT_PROCFS now checks for the new flag VV_PROCDEP, which
is propagated by pseudofs to all child vnodes if the fs sets PFS_PROCDEP.

Suggested by:   phk
Reviewed by:    bde, rwatson (earlier version)
2002-09-14 09:02:28 +00:00
bde
8aa3df4eb2 vfs_syscalls.c:
Changed rename(2) to follow the letter of the POSIX spec.  POSIX
requires rename() to have no effect if its args "resolve to the same
existing file".  I think "file" can only reasonably be read as referring
to the inode, although the rationale and "resolve" seem to say that
sameness is at the level of (resolved) directory entries.

ext2fs_vnops.c, ufs_vnops.c:
Replaced code that gave the historical BSD behaviour of removing one
link name by checks that this code is now unreachable.  This fixes
some races.  All vnodes needed to be unlocked for the removal, and
locking at another level using something like IN_RENAME was not even
attempted, so it was possible for rename(x, y) to return with both x
and y removed even without any unlink(2) syscalls (one process can
remove x using rename(x, y) and another process can remove y using
rename(y, x)).

Prodded by:	alfred
MFC after:	8 weeks
PR:		42617
2002-09-10 11:09:13 +00:00
phk
87f5667c5a Implement the VOP_OPENEXTATTR() and VOP_CLOSEEXTATTR() methods.
Use extattr_check_cred() to check access to EAs.

This is still a WIP.

Sponsored by:   DARPA & NAI Labs.
2002-09-05 20:59:42 +00:00
phk
db06a743d8 Use canonical extattr_check_cred() instead of private implementation of the
same policy.

Sponsored by:	DARPA & NAI Labs.
2002-09-05 20:39:36 +00:00
phk
80704cd6a8 Fix credentials check: do not leak ENOATTR until we know if they're
supposed to know.

Sponsored by:	DARPA & NAI Labs.
2002-09-05 20:28:24 +00:00
bde
fcab7b8389 Include <sys/malloc.h> instead of depending on namespace pollution 2
layers deep in <sys/proc.h> or <sys/vnode.h>.

Include <sys/vmmeter.h> instead of depending on namespace pollution in
<sys/pcpu.h>.

Sorted includes as much as possible.
2002-09-05 09:43:24 +00:00
rwatson
9c25e3c24e Since we have vp and td cached in local variables, use those instead
of derefencing the VOP arguments again when calling the UFS code.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, NAI Labs
2002-09-01 16:06:40 +00:00
phk
49cb8194a9 Correctly handle setting, getting and deleting EA's with zero length content.
Sponsored by:	DARPA & NAI Labs.
2002-08-30 08:57:09 +00:00
charnier
7dd9d47059 Replace various spelling with FALLTHROUGH which is lint()able 2002-08-25 13:23:09 +00:00
alc
cdcc7b3446 o Retire vm_page_zero_fill() and vm_page_zero_fill_area(). Ever since
pmap_zero_page() and pmap_zero_page_area() were modified to accept
   a struct vm_page * instead of a physical address, vm_page_zero_fill()
   and vm_page_zero_fill_area() have served no purpose.
2002-08-25 00:22:31 +00:00
phk
a1e3931514 Implement list of EA return functionality.
Correctly delete EA's when the content length is set to zero.

Sponsored by:	DARPA & NAI Labs.
2002-08-20 11:34:58 +00:00
phk
d1ede7ccd1 First snapshot of UFS2 EA support.
Sponsored by: DARPA & NAI Labs.
2002-08-19 07:01:55 +00:00
rwatson
44404e4547 In order to better support flexible and extensible access control,
make a series of modifications to the credential arguments relating
to file read and write operations to cliarfy which credential is
used for what:

- Change fo_read() and fo_write() to accept "active_cred" instead of
  "cred", and change the semantics of consumers of fo_read() and
  fo_write() to pass the active credential of the thread requesting
  an operation rather than the cached file cred.  The cached file
  cred is still available in fo_read() and fo_write() consumers
  via fp->f_cred.  These changes largely in sys_generic.c.

For each implementation of fo_read() and fo_write(), update cred
usage to reflect this change and maintain current semantics:

- badfo_readwrite() unchanged
- kqueue_read/write() unchanged
  pipe_read/write() now authorize MAC using active_cred rather
  than td->td_ucred
- soo_read/write() unchanged
- vn_read/write() now authorize MAC using active_cred but
  VOP_READ/WRITE() with fp->f_cred

Modify vn_rdwr() to accept two credential arguments instead of a
single credential: active_cred and file_cred.  Use active_cred
for MAC authorization, and select a credential for use in
VOP_READ/WRITE() based on whether file_cred is NULL or not.  If
file_cred is provided, authorize the VOP using that cred,
otherwise the active credential, matching current semantics.

Modify current vn_rdwr() consumers to pass a file_cred if used
in the context of a struct file, and to always pass active_cred.
When vn_rdwr() is used without a file_cred, pass NOCRED.

These changes should maintain current semantics for read/write,
but avoid a redundant passing of fp->f_cred, as well as making
it more clear what the origin of each credential is in file
descriptor read/write operations.

Follow-up commits will make similar changes to other file descriptor
operations, and modify the MAC framework to pass both credentials
to MAC policy modules so they can implement either semantic for
revocation.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, NAI Labs
2002-08-15 20:55:08 +00:00
phk
40f2376812 Expand the arguments to ffs_ext{read,write}() to their component
parts rather than use vop_{read,write}_args.  Access to these
functions will ultimately not be available through the
"vop_{read,write}+IO_EXT" API but this functionality is retained
for debugging purposes for now.

Sponsored by: DARPA & NAI Labs.
2002-08-13 11:33:01 +00:00
phk
088460429a Unravel the UFS_EXTATTR incest between FFS and UFS: UFS_EXTATTR is an
UFS only thing, and FFS should in principle not know if it is enabled
or not.

This commit cleans ffs_vnops.c for such knowledge, but not ffs_vfsops.c

Sponsored by: DARPA and NAI Labs.
2002-08-13 10:33:57 +00:00
phk
e4f487f25e Introduce typedefs for the member functions of struct vfsops and employ
these in the main filesystems.  This does not change the resulting code
but makes the source a little bit more grepable.

Sponsored by:	DARPA and NAI Labs.
2002-08-13 10:05:50 +00:00
rwatson
b0388fc24a Pass IO_NOMACCHECK to vn_rdwr() in the following checks to prevent
enforcement of MAC policy on the read or write operations:

- In ext2fs, don't enforce MAC on loop-back reads and writes supporting
  directory read operations in lookup(), directory modifications in
  rename(), directory write operations in mkdir(), symlink write
  operations in symlink().

- In the NFS client locking code, perform vn_rdwr() on the NFS locking
  socket without enforcing MAC, since the write is done on behalf of
  the kernel NFS implementation rather than the user process.

- In UFS, don't enforce MAC on loop-back reads and writes supporting
  directory read operations in lookup(), and symlink write operations
  in symlink().

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, NAI Labs
2002-08-12 16:43:04 +00:00
phk
58bc3221a4 Stop pretending that the FFS file ufs_readwrite.c is a UFS file.
Instead of #including it, pull it into ffs_vnops.c and name things
correctly.

Sponsored by:	DARPA & NAI Labs.
2002-08-12 10:32:56 +00:00
phk
d5c53dc7ef Fix a comment. 2002-08-12 09:22:11 +00:00
iedowse
504ae196a9 Don't call softdep_slowdown() if soft updates are not active on the
filesystem. This causes a panic for kernels compiled without
softupdates.

Reported by:	luigi
2002-08-05 17:59:20 +00:00
jeff
02517b6731 - Replace v_flag with v_iflag and v_vflag
- v_vflag is protected by the vnode lock and is used when synchronization
   with VOP calls is needed.
 - v_iflag is protected by interlock and is used for dealing with vnode
   management issues.  These flags include X/O LOCK, FREE, DOOMED, etc.
 - All accesses to v_iflag and v_vflag have either been locked or marked with
   mp_fixme's.
 - Many ASSERT_VOP_LOCKED calls have been added where the locking was not
   clear.
 - Many functions in vfs_subr.c were restructured to provide for stronger
   locking.

Idea stolen from:	BSD/OS
2002-08-04 10:29:36 +00:00
rwatson
85e0975519 Introduce support for Mandatory Access Control and extensible
kernel access control.

Instrument UFS to support per-inode MAC labels.  In particular,
invoke MAC framework entry points for generically supporting the
backing of MAC labels into extended attributes.  This ends up
introducing new vnode operation vector entries point at the MAC
framework entry points, as well as some explicit entry point
invocations for file and directory creation events so that the
MAC framework can push labels to disk before the directory names
become persistent (this will work better once EAs in UFS2 are
hooked into soft updates).  The generic EA MAC entry points
support executing with the file system in either single label
or multilabel operation, and will fall back to the mount label
if multilabel is not specified at mount-time.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, NAI Labs
2002-07-31 16:05:30 +00:00
phk
c1cd9e269d I forgot this bit of uglyness in the fsck_ffs cleanup. 2002-07-31 07:01:18 +00:00
phk
aba97c33f3 Fix braino in last commit. 2002-07-30 12:02:41 +00:00
phk
99c4adc497 Move ffs_isfreeblock() to ffs_alloc.c and make it static.
Sponsored by: DARPA & NAI Labs.
2002-07-30 11:54:48 +00:00
alc
27bb8a7310 Lock page queue accesses by vm_page_free(). 2002-07-28 08:01:48 +00:00
benno
7409589c7a Add a missing argument to the stub for softdep_setup_freeblocks.
Forgotten by:	mckusick
2002-07-20 04:07:15 +00:00
peter
c458732bcf Fix a warning:
ffs_softdep.c:1630: warning: int format, different type arg (arg 2)
2002-07-20 01:09:35 +00:00
mckusick
b44cb5787c Add support to UFS2 to provide storage for extended attributes.
As this code is not actually used by any of the existing
interfaces, it seems unlikely to break anything (famous
last words).

The internal kernel interface to manipulate these attributes
is invoked using two new IO_ flags: IO_NORMAL and IO_EXT.
These flags may be specified in the ioflags word of VOP_READ,
VOP_WRITE, and VOP_TRUNCATE. Specifying IO_NORMAL means that
you want to do I/O to the normal data part of the file and
IO_EXT means that you want to do I/O to the extended attributes
part of the file. IO_NORMAL and IO_EXT are mutually exclusive
for VOP_READ and VOP_WRITE, but may be specified individually
or together in the case of VOP_TRUNCATE. For example, when
removing a file, VOP_TRUNCATE is called with both IO_NORMAL
and IO_EXT set. For backward compatibility, if neither IO_NORMAL
nor IO_EXT is set, then IO_NORMAL is assumed.

Note that the BA_ and IO_ flags have been `merged' so that they
may both be used in the same flags word. This merger is possible
by assigning the IO_ flags to the low sixteen bits and the BA_
flags the high sixteen bits. This works because the high sixteen
bits of the IO_ word is reserved for read-ahead and help with
write clustering so will never be used for flags. This merge
lets us get away from code of the form:

        if (ioflags & IO_SYNC)
                flags |= BA_SYNC;

For the future, I have considered adding a new field to the
vattr structure, va_extsize. This addition could then be
exported through the stat structure to allow applications to
find out the size of the extended attribute storage and also
would provide a more standard interface for truncating them
(via VOP_SETATTR rather than VOP_TRUNCATE).

I am also contemplating adding a pathconf parameter (for
concreteness, lets call it _PC_MAX_EXTSIZE) which would
let an application determine the maximum size of the extended
atribute storage.

Sponsored by:	DARPA & NAI Labs.
2002-07-19 07:29:39 +00:00
mckusick
3abb526f86 Change utimes to set the file creation time (for filesystems that
support creation times such as UFS2) to the value of the
modification time if the value of the modification time is older
than the current creation time. See utimes(2) for further details.

Sponsored by:	DARPA & NAI Labs.
2002-07-17 02:03:19 +00:00
mckusick
a3ff90696f Change the name of st_createtime to st_birthtime. This change is
made to reduce confusion between st_ctime and st_createtime.

Submitted by:	Eric Allman <eric@sendmail.org>
Sponsored by:	DARPA & NAI Labs.
2002-07-16 22:36:00 +00:00
trhodes
931ad6a4ff Fix a type: s/your are/you are/ 2002-07-12 19:56:31 +00:00
bde
ec6a3fc0c3 Fixed some printf format errors (4 new ones reported by gcc and 5 nearby
old ones not reported by gcc).  This helps unbreak LINT.
2002-07-08 12:42:29 +00:00
iedowse
4416f82706 Use indirect function pointer hooks instead of #ifdef SOFTUPDATES
direct calls for the two places where the kernel calls into soft
updates code. Set up the hooks in softdep_initialize() and NULL
them out in softdep_uninitialize(). This change allows soft updates
to function correctly when ufs is loaded as a module.

Reviewed by:	mckusick
2002-07-01 17:59:40 +00:00
iedowse
603a270887 Add the ffs bits necessary to support unloading of the ufs kernel
module. This adds an ffs_uninit() function that calls ufs_uninit()
and also calls a new softdep_uninitialize() function. Add a stub
for softdep_uninitialize() to cover the non-SOFTUPDATES case.

Reviewed by:	mckusick
2002-07-01 11:00:47 +00:00
iedowse
948e368072 Remove the bogus SYSINIT from ufs_dirhash.c and instead add a call
to ufsdirhash_init() from ufs_init(). Add uninit() functions
corresponding the ufs, dirhash, quota and ihash init() functions.
2002-06-30 02:49:39 +00:00
iedowse
c0323a07fb Remove the kernel file-size limit for UFS2, so that only the limit
imposed by the filesystem structure itself remains. With 16k blocks,
the maximum file size is now just over 128TB.

For now, the UFS1 file size limit is left unchanged so as to remain
consistent with RELENG_4, but it too could be removed in the future.

Reviewed by:	mckusick
2002-06-26 18:34:51 +00:00
ken
0d3a835f3f At long last, commit the zero copy sockets code.
MAKEDEV:	Add MAKEDEV glue for the ti(4) device nodes.

ti.4:		Update the ti(4) man page to include information on the
		TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options,
		and also include information about the new character
		device interface and the associated ioctls.

man9/Makefile:	Add jumbo.9 and zero_copy.9 man pages and associated
		links.

jumbo.9:	New man page describing the jumbo buffer allocator
		interface and operation.

zero_copy.9:	New man page describing the general characteristics of
		the zero copy send and receive code, and what an
		application author should do to take advantage of the
		zero copy functionality.

NOTES:		Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS,
		TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT.

conf/files:	Add uipc_jumbo.c and uipc_cow.c.

conf/options:	Add the 5 options mentioned above.

kern_subr.c:	Receive side zero copy implementation.  This takes
		"disposable" pages attached to an mbuf, gives them to
		a user process, and then recycles the user's page.
		This is only active when ZERO_COPY_SOCKETS is turned on
		and the kern.ipc.zero_copy.receive sysctl variable is
		set to 1.

uipc_cow.c:	Send side zero copy functions.  Takes a page written
		by the user and maps it copy on write and assigns it
		kernel virtual address space.  Removes copy on write
		mapping once the buffer has been freed by the network
		stack.

uipc_jumbo.c:	Jumbo disposable page allocator code.  This allocates
		(optionally) disposable pages for network drivers that
		want to give the user the option of doing zero copy
		receive.

uipc_socket.c:	Add kern.ipc.zero_copy.{send,receive} sysctls that are
		enabled if ZERO_COPY_SOCKETS is turned on.

		Add zero copy send support to sosend() -- pages get
		mapped into the kernel instead of getting copied if
		they meet size and alignment restrictions.

uipc_syscalls.c:Un-staticize some of the sf* functions so that they
		can be used elsewhere.  (uipc_cow.c)

if_media.c:	In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid
		calling malloc() with M_WAITOK.  Return an error if
		the M_NOWAIT malloc fails.

		The ti(4) driver and the wi(4) driver, at least, call
		this with a mutex held.  This causes witness warnings
		for 'ifconfig -a' with a wi(4) or ti(4) board in the
		system.  (I've only verified for ti(4)).

ip_output.c:	Fragment large datagrams so that each segment contains
		a multiple of PAGE_SIZE amount of data plus headers.
		This allows the receiver to potentially do page
		flipping on receives.

if_ti.c:	Add zero copy receive support to the ti(4) driver.  If
		TI_PRIVATE_JUMBOS is not defined, it now uses the
		jumbo(9) buffer allocator for jumbo receive buffers.

		Add a new character device interface for the ti(4)
		driver for the new debugging interface.  This allows
		(a patched version of) gdb to talk to the Tigon board
		and debug the firmware.  There are also a few additional
		debugging ioctls available through this interface.

		Add header splitting support to the ti(4) driver.

		Tweak some of the default interrupt coalescing
		parameters to more useful defaults.

		Add hooks for supporting transmit flow control, but
		leave it turned off with a comment describing why it
		is turned off.

if_tireg.h:	Change the firmware rev to 12.4.11, since we're really
		at 12.4.11 plus fixes from 12.4.13.

		Add defines needed for debugging.

		Remove the ti_stats structure, it is now defined in
		sys/tiio.h.

ti_fw.h:	12.4.11 firmware.

ti_fw2.h:	12.4.11 firmware, plus selected fixes from 12.4.13,
		and my header splitting patches.  Revision 12.4.13
		doesn't handle 10/100 negotiation properly.  (This
		firmware is the same as what was in the tree previously,
		with the addition of header splitting support.)

sys/jumbo.h:	Jumbo buffer allocator interface.

sys/mbuf.h:	Add a new external mbuf type, EXT_DISPOSABLE, to
		indicate that the payload buffer can be thrown away /
		flipped to a userland process.

socketvar.h:	Add prototype for socow_setup.

tiio.h:		ioctl interface to the character portion of the ti(4)
		driver, plus associated structure/type definitions.

uio.h:		Change prototype for uiomoveco() so that we'll know
		whether the source page is disposable.

ufs_readwrite.c:Update for new prototype of uiomoveco().

vm_fault.c:	In vm_fault(), check to see whether we need to do a page
		based copy on write fault.

vm_object.c:	Add a new function, vm_object_allocate_wait().  This
		does the same thing that vm_object allocate does, except
		that it gives the caller the opportunity to specify whether
		it should wait on the uma_zalloc() of the object structre.

		This allows vm objects to be allocated while holding a
		mutex.  (Without generating WITNESS warnings.)

		vm_object_allocate() is implemented as a call to
		vm_object_allocate_wait() with the malloc flag set to
		M_WAITOK.

vm_object.h:	Add prototype for vm_object_allocate_wait().

vm_page.c:	Add page-based copy on write setup, clear and fault
		routines.

vm_page.h:	Add page based COW function prototypes and variable in
		the vm_page structure.

Many thanks to Drew Gallatin, who wrote the zero copy send and receive
code, and to all the other folks who have tested and reviewed this code
over the years.
2002-06-26 03:37:47 +00:00
mckusick
ea2f279985 Force the quota update to be done when an inode is released in
ufs_inactive. This avoid a panic when checking a NULL credential
in suser_cred().
2002-06-25 01:02:28 +00:00
jlemon
fe48a4f5d5 Prototype fixes (long newinum --> ino_t newinum). 2002-06-24 17:20:19 +00:00
mux
0a55bc4167 Warning fixes for 64 bits platforms. This eliminates all the
warnings I have had in the FFS code on sparc64.

Reviewed by:	mckusick
2002-06-23 18:17:27 +00:00
dillon
b0b3177408 Rename the BALLOC flags from B_* to BA_* to avoid confusion with the
struct buf B_ flags.

Approved by:	mckusick
2002-06-23 06:12:22 +00:00
mckusick
2730f1975d This patch fixes a problem whereby filesystems that ran
out of inodes in a cylinder group would fail to check for
free inodes in other cylinder groups. This bug was introduced
in the UFS2 code merge two days ago.

An inode is allocated by calling ffs_valloc which calls
ffs_hashalloc to do the filesystem scan. Ffs_hashalloc
walks around the cylinder groups calling its passed allocator
(ffs_nodealloccg in this case) until the allocator returns a
non-zero result. The bug is that ffs_hashalloc expects the
passed allocator function to return a 64-bit ufs2_daddr_t.
When allocating inodes, it calls ffs_nodealloccg which was
returning a 32-bit ino_t. The ffs_hashalloc code checked
a 64-bit return value and usually found random non-zero bits in
the high 32-bits so decided that the allocation had succeeded
(in this case in the only cylinder group that it checked).
When the result was passed back to ffs_valloc it looked at
only the bottom 32-bits, saw zero and declared the system
out of inodes. But ffs_hashalloc had really only checked
one cylinder group.

The fix is to change ffs_nodealloccg to return 64-bit results.

Sponsored by:	DARPA & NAI Labs.
Submitted by:	Poul-Henning Kamp <phk@critter.freebsd.dk>
Reviewed by:	Maxime Henrion <mux@freebsd.org>
2002-06-22 21:24:58 +00:00
mckusick
88d85c15ef This commit adds basic support for the UFS2 filesystem. The UFS2
filesystem expands the inode to 256 bytes to make space for 64-bit
block pointers. It also adds a file-creation time field, an ability
to use jumbo blocks per inode to allow extent like pointer density,
and space for extended attributes (up to twice the filesystem block
size worth of attributes, e.g., on a 16K filesystem, there is space
for 32K of attributes). UFS2 fully supports and runs existing UFS1
filesystems. New filesystems built using newfs can be built in either
UFS1 or UFS2 format using the -O option. In this commit UFS1 is
the default format, so if you want to build UFS2 format filesystems,
you must specify -O 2. This default will be changed to UFS2 when
UFS2 proves itself to be stable. In this commit the boot code for
reading UFS2 filesystems is not compiled (see /sys/boot/common/ufsread.c)
as there is insufficient space in the boot block. Once the size of the
boot block is increased, this code can be defined.

Things to note: the definition of SBSIZE has changed to SBLOCKSIZE.
The header file <ufs/ufs/dinode.h> must be included before
<ufs/ffs/fs.h> so as to get the definitions of ufs2_daddr_t and
ufs_lbn_t.

Still TODO:
Verify that the first level bootstraps work for all the architectures.
Convert the utility ffsinfo to understand UFS2 and test growfs.
Add support for the extended attribute storage. Update soft updates
to ensure integrity of extended attribute storage. Switch the
current extended attribute interfaces to use the extended attribute
storage. Add the extent like functionality (framework is there,
but is currently never used).

Sponsored by: DARPA & NAI Labs.
Reviewed by:	Poul-Henning Kamp <phk@freebsd.org>
2002-06-21 06:18:05 +00:00
dillon
7aa1bb9d05 In rev 1.72 a situation related to write/mmap was fixed which could result
in a user process gaining visibility into the 'old' contents of a filesystem
block.  There were two cases:  (1) when uiomove() fails (user process issues
illegal write), and (2) when uiomove() overlaps a mmap() of the same file at
the same offset (fault -> recursive buffer I/O reads contents of old block).

Unfortunately 1.72 also had the unintended effect of forcing the filesystem
to do a read-before-write in the case of a full-block-write (non append case),
e.g. 'dd if=/dev/zero of=test.dat bs=1m count=256 conv=notrunc'.  This
destroys performance.. not only is a read forced for every write, but
clustering breaks as well.

The solution is to clear the buffer manually in the full-block case rather
then asking BALLOC to do it (BALLOC issues the read-before-write).  In the
partial-block case we want BALLOC to do it because the read-before-write
is necessary.  This patch should greatly improve database and news-feed
server performance.

Found by: MKI <mki@mozone.net>
MFC after:	3 days
2002-06-19 09:39:41 +00:00
semenu
f48bc32790 Fix a typo in my recently added comment: s/beleived/believed/
Submitted by:	keramida
2002-06-06 20:43:03 +00:00
alfred
28ce0e9ae5 Backout/modify previous revision:
"empty default cases shouldn't be removed, they should have a break;
  statement added to them."

Requested by: billf
2002-06-01 20:54:21 +00:00
alfred
04d357d9e8 Silence warnings, remove some empty 'default' switch cases. 2002-06-01 20:40:42 +00:00
semenu
b9960bece1 Remove lock from ffs_vget introduced by v1.24. Instead of locking the
vnode creation globaly, we allow processes to create vnodes concurently.
In case of concurent creation of vnode for the one ino, we allow processes
to race and then check who wins.

Assuming that concurent creation of vnode for same ino is really rare case,
this is belived to be an improvement, as it just allows concurent creation
of vnodes.

Idea by:	bp
Reviewed by:	dillon
MFC after:	1 month
2002-05-30 22:04:17 +00:00
rwatson
930f7599ed Remove IFS from 5.0-CURRENT. This facilitates introducing UFS2 as
IFS had its fingers deep in the belly of the UFS/FFS split.  IFS
will be reimplemented by the maintainer at a later date.

Requested by:	adrian (maintainer)
2002-05-19 00:11:08 +00:00
iedowse
caf2c5a16f Fix two casts to "daddr_t *" that should have been "ufs_daddr_t *". 2002-05-18 19:03:00 +00:00
iedowse
cc9d214396 Fix a typo where sizeof(daddr_t) was specified instead of sizeof(doff_t).
Now that daddr_t is 64-bit, this caused hash blocks to be allocated
twice as large as they need to be.
2002-05-18 18:58:27 +00:00
iedowse
794dddf9bf Remove um_i_effnlink_valid, i_spare[] and the ufsmount_u and inode_u
unions, since these were only necessary when ext2fs used ufs code.

Reviewed by:	mckusick
2002-05-18 18:51:14 +00:00
phk
e127001407 Fix ufs_daddr_t/daddr_t type problems.
Sponsored by:	DARPA & NAI labs.
2002-05-17 18:59:53 +00:00
phk
152c7cfbca Call ufs_bmaparray() with right parameter type.
Sponsored by: DARPA & NAI Labs.
2002-05-17 18:53:29 +00:00
trhodes
28d42899b7 More s/file system/filesystem/g 2002-05-16 21:28:32 +00:00
phk
8536ea3cdb Make daddr_t and u_daddr_t 64bits wide.
Retire daddr64_t and use daddr_t instead.

Sponsored by:	DARPA & NAI Labs.
2002-05-14 11:09:43 +00:00
phk
b591373208 Remove register keyword.
Sponsored by:	DARPA & NAI Labs.
Submitted by:	mckusick
2002-05-13 09:22:31 +00:00
phk
e76663bf87 Remove two "register" and a blank line.
Submitted by:	mckusick
Sponsored by:	DARPA & NAI Labs.
2002-05-12 22:54:48 +00:00
phk
e27fed0950 ARGH! SBLOCK is not unused. Try to get this right.
BBSIZE belongs in <sys/disklabel.h> (but shouldn't be a constant).

Define SBLOCK again, using the right math.

Sponsored by: DARPA & NAI Labs.
2002-05-12 20:21:40 +00:00
phk
4aa59982cb Remove #define for BBOFF, it is assumed == 0 so many places that we might
as well forget about it.  In fact the only thing which used it was the
SBOFF macro.

Sponsored by: DARPA & NAI Labs.
2002-05-12 20:00:21 +00:00
phk
57e4ad9a8b Remove unused BBLOCK and SBLOCK #defines.
Sponsored by: DARPA & NAI Labs.
2002-05-12 19:56:31 +00:00
alc
d60a525036 o Condition the compilation and use of vm_freeze_copyopts()
on ENABLE_VFS_IOOPT.
2002-05-06 05:45:57 +00:00
phk
f075f5e39a Move some UFS related stuff home where it belongs. 2002-05-05 20:04:33 +00:00
jeff
86e123b512 Include systm.h so panic(9) is defined when doing DEBUG_ALL_VFS_LOCKS. 2002-05-04 02:40:37 +00:00
phk
a1f99d3509 Name ufs_vop_[gs]etextattr() consistently with the rest of our VOPs and
put then in the ufs_vnops where they belong, rather than in the ffs_vnops.

Ok'ed by:	rwatson
Sponsored by:	DARPA & NAI Labs.
2002-05-03 08:40:33 +00:00
phk
860ddc59bc Use vop_panic() instead of our home-rolled version. 2002-05-02 19:15:52 +00:00
alfred
d1526107c3 Remove support for using soon to be retired "special" poll(2) ops.
Replace with kevent(2) ops.

This is untested, but the code would rot even further if this wasn't
applied.  I've chosen to apply this to prompt some cleanup.

Submitted by: bde
2002-04-18 14:52:28 +00:00
jeff
b972756970 Don't peak into the malloc_type structure for limits. The desired vnodes
check should be sufficient.  This is required for the pending removal of
malloc_type limits.
2002-04-15 03:35:35 +00:00
phk
33405073ec Move generic disk ioctls from <sys/disklabel.h> to <sys/disk.h>.
Sponsored by:	DARPA & NAI Labs
2002-04-08 09:20:07 +00:00
jhb
db9aa81e23 Change callers of mtx_init() to pass in an appropriate lock type name. In
most cases NULL is passed, but in some cases such as network driver locks
(which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used.

Tested on:	i386, alpha, sparc64
2002-04-04 21:03:38 +00:00
phk
65c4a4cb9d Move the FFS parameter MAXFRAG from <sys/param.h> to <ufs/ffs/fs.h>
Sponsored by:	DARPA & NAI Labs.
2002-04-03 20:39:27 +00:00
phk
79a894e10e Use DIOCGSECTORSIZE instead of the bogus DIOCGPART ioctl. 2002-04-02 11:23:14 +00:00
jhb
dc2e474f79 Change the suser() API to take advantage of td_ucred as well as do a
general cleanup of the API.  The entire API now consists of two functions
similar to the pre-KSE API.  The suser() function takes a thread pointer
as its only argument.  The td_ucred member of this thread must be valid
so the only valid thread pointers are curthread and a few kernel threads
such as thread0.  The suser_cred() function takes a pointer to a struct
ucred as its first argument and an integer flag as its second argument.
The flag is currently only used for the PRISON_ROOT flag.

Discussed on:	smp@
2002-04-01 21:31:13 +00:00
bde
a9b7c63b29 In ffs_mountffs(), set mnt_iosize_max to si_iosize_max unconditionally
provided the latter is nonzero.  At this point, the former is a fairly
arbitrary default value (DFTPHYS), so changing it to any reasonable
value specified by the device driver is safe.  Using the maximum of
these limits broke ffs clustered i/o for devices whose si_iosize_max
is < DFLTPHYS.  Using the minimum would break device drivers' ability
to increase the active limit from DFTLPHYS up to MAXPHYS.

Copied the code for this and the associated (unnecessary?) fixup of
mp_iosize_max to all other filesystems that use clustering (ext2fs and
msdosfs).  It was completely missing.

PR:		36309
MFC-after:	1 week
2002-03-30 15:12:57 +00:00
dwmalone
246869b676 Two minor changes to dirhash, which result in some marginal benchmark
improvements.

1) If deleting an entry results in a chain of deleted slots ending in an
   empty slot, then we can be a bit more aggressive about marking slots as
   empty.

2) The last stage of the FNV hash is to xor the last byte of data
   into the hash. This means that filenames which differ only in
   the last byte will be placed close to one another in the hash
   table, which forms longer chains. To work around this common
   case, we also hash in the address of the dirhash structure.

     news/cancel = news/articles/control/cancel for a tradspool inn server
     squid2 = squid level 2 directory (dirs called 00->FF)
     squid3 = squid level 3 directory (files called 00001F00->00001FFF)

                             mean #probes for
                  home dir  mh inbox  news/cancel  tmp    squid2  squid3
old   successful  1.02      3.19      4.07         1.10    7.85   2.06
new   successful  1.04      1.32      1.27         1.04    1.93   1.17

old unsuccessful  1.08      4.50      5.37         1.17   10.76   2.69
new unsuccessful  1.08      1.73      1.64         1.17    2.89   1.37

Reviewed by:	iedowse
MFC after:	2 weeks
2002-03-20 17:58:02 +00:00
jeff
7bd172afc2 Remove references to vm_zone.h and switch over to the new uma API. 2002-03-20 08:48:07 +00:00
alfred
79061a9306 Remove __P. 2002-03-19 22:40:48 +00:00
bde
df8144b98e Fixed some printf format errors (hopefully all of the remaining daddr64_t
ones for GENERIC, and all others on the same line as those).  Reformat
the printfs if necessary to avoid new long lones or old format printf
errors.
2002-03-19 04:09:21 +00:00
mckusick
14dd08fd15 Add a flags parameter to VFS_VGET to pass through the desired
locking flags when acquiring a vnode. The immediate purpose is
to allow polling lock requests (LK_NOWAIT) needed by soft updates
to avoid deadlock when enlisting other processes to help with
the background cleanup. For the future it will allow the use of
shared locks for read access to vnodes. This change touches a
lot of files as it affects most filesystems within the system.
It has been well tested on FFS, loopback, and CD-ROM filesystems.
only lightly on the others, so if you find a problem there, please
let me (mckusick@mckusick.com) know.
2002-03-17 01:25:47 +00:00
mckusick
e929f2e4f0 Introduce the new 64-bit size disk block, daddr64_t. Change
the bio and buffer structures to have daddr64_t bio_pblkno,
b_blkno, and b_lblkno fields which allows access to disks
larger than a Terabyte in size. This change also requires
that the VOP_BMAP vnode operation accept and return daddr64_t
blocks. This delta should not affect system operation in
any way. It merely sets up the necessary interfaces to allow
the development of disk drivers that work with these larger
disk block addresses. It also allows for the development of
UFS2 which will use 64-bit block addresses.
2002-03-15 18:49:47 +00:00
obrien
b4c5b60736 Quiet a warning on the Alpha. 2002-03-15 04:06:10 +00:00
mckusick
670fad8a9e This corrects the first of two known deadlock conditions that
come from the presence of a snapshot file.
2002-03-14 01:21:13 +00:00
iedowse
cf446bea56 Fix a bug in ufsdirhash_adjfree() that caused it to incorrectly
update the free-space statistics in some cases. The problem affected
directory blocks when the free space dropped below the size of the
maximum allowed entry size. When this happened, the free-space
summary information could claim that there are no further blocks
that can fit a maximum-size entry, even if there are.

The effect of this bug is that the directory may be enlarged even
though there is space within the directory for the new entry. This
wastes disk space and has a negative impact on performance.

Fix it by correctly computing the dh_firstfree array index, adding
a helper macro for clarity. Put an extra sanity check into
ufsdirhash_checkblock() to detect the situation in future.

Found by:	dwmalone
Reviewed by:	dwmalone
MFC after:	1 week
2002-03-11 19:13:22 +00:00
phk
2098322771 I missed one VOP_CLOSE in the previous commit.
Pointed out by:	bde
2002-03-11 16:27:04 +00:00
phk
c68bd5ed71 As a XXX bandaid open the mounted device READ/WRITE even if we only mount
read-only.

The trouble here is that we don't reopen the device in read/write mode
when we remount in read/write mode resulting in a filesystem sending
write requests to a device which was only opened read/only.

I'm not quite sure how such a reopen would best be done and defer
the problem to more agile hackers.
2002-03-11 13:53:00 +00:00
rwatson
1a2700eed9 Update DBA for NAI. We have several. We used the wrong one. :-) 2002-03-07 17:49:06 +00:00
green
965906f60d Add new errno ``ENOATTR''. 2002-03-07 15:13:44 +00:00
dillon
e05fb41e7e cleanup readability syntax prior to ongoing b_resid work commits.
MFC after:	1 day
2002-03-06 00:44:30 +00:00
jhb
b8b3ac8816 Use thread0.td_ucred instead of proc0.p_ucred. This change is cosmetic
and isn't strictly required.  However, it lowers the number of false
positives found when grep'ing the kernel sources for p_ucred to ensure
proper locking.
2002-02-27 19:18:10 +00:00
jhb
3706cd3509 Simple p_ucred -> td_ucred changes to start using the per-thread ucred
reference.
2002-02-27 18:32:23 +00:00
phk
62d248fb9e Replace bowrite() with BUF_WRITE in ufs.
Remove bowrite(), it is now unused.

This is the first step in getting entirely rid of BIO_ORDERED which is
a generally accepted evil thing.

Approved by:	mckusick
2002-02-22 09:03:00 +00:00
rwatson
ad827f33f7 o Minor style fix on #endif, missing '_' in comment. 2002-02-20 15:44:43 +00:00
phk
68389bd8ba Make v_addpollinfo() visible and non-inline.
Have callers only call it as needed.
Add necessary call in ufs_kqfilter().

Test-case found by:	Andrew Gallatin <gallatin@cs.duke.edu>
2002-02-18 16:18:02 +00:00
phk
c2a47cdbe8 Move the stuff related to select and poll out of struct vnode.
The use of the zone allocator may or may not be overkill.
There is an XXX: over in ufs/ufs/ufs_vnops.c that jlemon may need
to revisit.

This shaves about 60 bytes of struct vnode which on my laptop means
600k less RAM used for vnodes.
2002-02-17 21:15:36 +00:00
phk
3348974369 Collect the VN_KNOTE() macro definitions on vnode.h 2002-02-17 21:07:57 +00:00
julian
37369620df In a threaded world, differnt priorirites become properties of
different entities.  Make it so.

Reviewed by:	jhb@freebsd.org (john baldwin)
2002-02-11 20:37:54 +00:00
rwatson
c00ccc43a6 Minor style tweaks.
Remove an unneeded comment and commented out code that won't be
needed.
2002-02-10 04:57:08 +00:00
rwatson
ce1e1df4d1 Copyright + license update. 2002-02-10 04:50:24 +00:00
rwatson
6ca91a055c Part I: Update extended attribute API and ABI:
o Modify the system call syntax for extattr_{get,set}_{fd,file}() so
  as not to use the scatter gather API (which appeared not to be used
  by any consumers, and be less portable), rather, accepts 'data'
  and 'nbytes' in the style of other simple read/write interfaces.
  This changes the API and ABI.

o Modify system call semantics so that extattr_get_{fd,file}() return
  a size_t.  When performing a read, the number of bytes read will
  be returned, unless the data pointer is NULL, in which case the
  number of bytes of data are returned.  This changes the API only.

o Modify the VOP_GETEXTATTR() vnode operation to accept a *size_t
  argument so as to return the size, if desirable.  If set to NULL,
  the size will not be returned.

o Update various filesystems (pseodofs, ufs) to DTRT.

These changes should make extended attributes more useful and more
portable.  More commits to rebuild the system call files, as well
as update userland utilities to follow.

Obtained from:	TrustedBSD Project
Sponsored by:	DARPA, NAI Labs
2002-02-10 04:43:22 +00:00
phk
d2872f5d10 Remove di_inumber since LFS is long gone. 2002-02-10 00:55:49 +00:00
mckusick
88b4f0b921 Occationally background fsck would cause a spurious ``freeing free
inode'' panic. This change corrects that problem by setting the
fs_active flag when the inode map changes to notify the snapshot
code that the cylinder group must be rescanned.

Submitted by:	Robert Watson <rwatson@FreeBSD.org>
2002-02-07 22:13:56 +00:00
mckusick
9d51e5cc24 Occationally deleted files would hang around for hours or days
without being reclaimed. This bug was introduced in revision 1.95
dealing with filenames placed in newly allocated directory blocks,
thus is not present in 4.X systems. The bug is triggered when a
new entry is made in a directory after the data block containing
the original new entry has been written, but before the inode
that references the data block has been written.

Submitted by:	Bill Fenner <fenner@research.att.com>
2002-02-07 00:54:32 +00:00
mckusick
c142ab455c When taking a snapshot, we must check for active files that have
been unlinked (e.g., with a zero link count). We have to expunge
all trace of these files from the snapshot so that they are neither
reclaimed prematurely by fsck nor saved unnecessarily by dump.
2002-02-02 01:42:44 +00:00
mckusick
35edc11259 Add a stub for softdep_request_cleanup() so that compilation without
SOFTUPDATES option works properly.

Submitted by:	Benno Rice <benno@jeamland.net>
2002-01-23 02:18:56 +00:00
mckusick
4f050b1765 This patch fixes a long standing complaint with soft updates in
which small and/or nearly full filesystems would fail with `file
system full' messages when trying to replace a number of existing
files (for example during a system installation). When the allocation
routines are about to fail with a file system full condition, they
make a call to softdep_request_cleanup() which attempts to accelerate
the flushing of pending deletion requests in an effort to free up
space. In the face of filesystem I/O requests that exceed the
available disk transfer capacity, the cleanup request could take
an unbounded amount of time. Thus, the softdep_request_cleanup()
routine will only try for tickdelay seconds (default 2 seconds)
before giving up and returning a filesystem full error. Under typical
conditions, the softdep_request_cleanup() routine is able to free
up space in under fifty milliseconds.
2002-01-22 06:17:22 +00:00
mckusick
4e7dcb216b Fix a bug introduced in ffs_snapshot.c -r1.25 and fs.h -r1.26
which caused incomplete snapshots to be taken. When background
fsck would run on these snapshots, the result would be files
being incorrectly released which would subsequently panic the
kernel with ``handle_workitem_freefile: inodedep survived'',
``handle_written_inodeblock: live inodedep'', and
``handle_workitem_remove: lost inodedep'' errors.
2002-01-17 08:33:32 +00:00
mckusick
0ed7ba2c74 Put write on read-only filesystem panic after we have weeded out
block and character devices, fifo's, etc.

Submitted by:	Bruce Evans <bde@zeta.org.au>
2002-01-16 04:59:09 +00:00
mckusick
b8d6599e4c When downgrading a filesystem from read-write to read-only, operations
involving file removal or file update were not always being fully
committed to disk. The result was lost files or corrupted file data.
This change ensures that the filesystem is properly synced to disk
before the filesystem is down-graded.

This delta also fixes a long standing bug in which a file open for
reading has been unlinked. When the last open reference to the file
is closed, the inode is reclaimed by the filesystem. Previously,
if the filesystem had been down-graded to read-only, the inode could
not be reclaimed, and thus was lost and had to be later recovered
by fsck.  With this change, such files are found at the time of the
down-grade.  Normally they will result in the filesystem down-grade
failing with `device busy'. If a forcible down-grade is done, then
the affected files will be revoked causing the inode to be released
and the open file descriptors to begin failing on attempts to read.

Submitted by:	"Sam Leffler" <sam@errno.com>
2002-01-15 07:17:12 +00:00
alfred
844237b396 SMP Lock struct file, filedesc and the global file list.
Seigo Tanimura (tanimura) posted the initial delta.

I've polished it quite a bit reducing the need for locking and
adapting it for KSE.

Locks:

1 mutex in each filedesc
   protects all the fields.
   protects "struct file" initialization, while a struct file
     is being changed from &badfileops -> &pipeops or something
     the filedesc should be locked.

1 mutex in each struct file
   protects the refcount fields.
   doesn't protect anything else.
   the flags used for garbage collection have been moved to
     f_gcflag which was the FILLER short, this doesn't need
     locking because the garbage collection is a single threaded
     container.
  could likely be made to use a pool mutex.

1 sx lock for the global filelist.

struct file *	fhold(struct file *fp);
        /* increments reference count on a file */

struct file *	fhold_locked(struct file *fp);
        /* like fhold but expects file to locked */

struct file *	ffind_hold(struct thread *, int fd);
        /* finds the struct file in thread, adds one reference and
                returns it unlocked */

struct file *	ffind_lock(struct thread *, int fd);
        /* ffind_hold, but returns file locked */

I still have to smp-safe the fget cruft, I'll get to that asap.
2002-01-13 11:58:06 +00:00
mckusick
c6e526cffc When going to sleep, we must save our SPL so that it does not get
lost if some other process uses the lock while we are sleeping. We
restore it after we have slept. This functionality is provided by
a new routine interlocked_sleep() that wraps the interlocking with
functions that sleep. This function is then used in place of the
old ACQUIRE_LOCK_INTERLOCKED() and FREE_LOCK_INTERLOCKED() macros.

Submitted by:	Debbie Chu <dchu@juniper.net>
2002-01-12 20:57:36 +00:00
mckusick
b86f75b78b Must call drain_output() before checking the dirty block list
in softdep_sync_metadata(). Otherwise we may miss dependencies
that need to be flushed which will result in a later panic
with the message ``vinvalbuf: dirty bufs''.

Submitted by:	Matthew Dillon <dillon@apollo.backplane.com>
MFC after:	1 week
2002-01-11 19:59:27 +00:00
phk
64c925b060 Do not pull quota entries of the cache-list if they have already
been removed from the cache-list as part of a previous unmount.

This would result in panics (page fault in dqflush()) during subsequent
umounts provided that enough distinct UID's to actually make the
hash do something are active.

This can probably explain a number of weird quota related behaviours.

PR:		32331 maybe more.
Reproduced by:	Søren Schrørder <sch@cybercity.dk>
2002-01-10 15:02:57 +00:00
msmith
6fa204fd80 Initialise the bioops vector hack at runtime rather than at link time. This
avoids the use of common variables.

Reviewed by:	mckusick
2002-01-08 19:32:18 +00:00
dillon
ac9876d609 Fix a BUF_TIMELOCK race against BUF_LOCK and fix a deadlock in vget()
against VM_WAIT in the pageout code.  Both fixes involve adjusting
the lockmgr's timeout capability so locks obtained with timeouts do not
interfere with locks obtained without a timeout.

Hopefully MFC: before the 4.5 release
2001-12-20 22:42:27 +00:00
mckusick
eeb2a6d271 Change the atomic_set_char to atomic_set_int and atomic_clear_char
to atomic_clear_int to ease the implementation for the sparc64.

Requested by:	Jake Burkholder <jake@locore.ca>
2001-12-18 18:05:17 +00:00
iedowse
5c873cad57 Make sure we ignore the value of `fs_active' when reloading the
superblock, and move the initialisation of it to beside where other
pointer fields are initialised.
2001-12-16 18:54:09 +00:00
iedowse
64972486c2 Move the new superblock field `fs_active' into the region of the
superblock that is already set up to handle pointer types. This
fixes an accidental change in the superblock size on 64-bit platforms
caused by revision 1.24.
2001-12-16 18:51:11 +00:00
mckusick
735db38be3 Minimize the time necessary to suspend operations on a filesystem
when taking a snapshot. The two time consuming operations are
scanning all the filesystem bitmaps to determine which blocks
are in use and scanning all the other snapshots so as to be able
to expunge their blocks from the view of the current snapshot.
The bitmap scanning is broken into two passes. Before suspending
the filesystem all bitmaps are scanned. After the suspension,
those bitmaps that changed after being scanned the first time
are rescanned. Typically there are few bitmaps that need to be
rescanned. The expunging of other snapshots is now done after
the suspension is released by observing that we can easily
identify any blocks that were allocated to them after the
suspension (they will be maked as `not needing to be copied'
in the just created snapshot). For all the gory details, see
the ``Running fsck in the Background'' paper in the Usenix
BSDCon 2002 Conference Proceedings, pages 55-64.
2001-12-14 00:15:06 +00:00
mckusick
d375501d04 When a file is partially truncated, we first check to see if the
new file end will land in the middle of a file hole. Since the last
block of a file must always be allocated, the hole is filled by
allocating a block at that location. If the hole being filled is
a direct block, then the truncation may eventually reduce the
full sized block down to a fragment. When running with soft
updates, it is necessary to FSYNC the file after allocating the
block and before creating the fragment to avoid triggering a
soft updates inconsistency when the block unexpectedly shrinks.

Found by:	Matthew Dillon <dillon@apollo.backplane.com>
MFC after:	1 week
2001-12-13 05:07:48 +00:00
rwatson
523aaa204d Use 'mkdir -p /.attribute/system' instead of breaking it into
two seperate mkdir targets.

Submitted by:	jedgar
2001-11-30 15:32:07 +00:00
rwatson
df873c5111 Use 'mkdir -p /.attribute/system' instead of breaking it into
two seperate mkdir targets.
2001-11-30 15:21:20 +00:00
rwatson
3014016ccd README.extattr incorrectly specified sample command lines for
UFS_EXTATTR_AUTOSTART.  Insert the missing 'initattr' arguments
to extattrctl.

Noticed by:	green
2001-11-30 15:15:27 +00:00
guido
6ece6a6a1b When mkdir()-ing, the parent dir gets is linkcount increased.
Fix VN_KNOTE to reflect that.

Found by: tobez@freebsd.org
MFC after:	2 days
2001-11-22 15:33:12 +00:00
iedowse
61d89377de Oops, when trying the dirhash sequential-access optimisation,
compare the slot offset against the predicted offset, not a boolean
flag. This typo effectively disabled the sequential optimisation,
but was otherwise harmless.

Not surprisingly, fixing this improves performance in the sequential
access case. I am seeing a 7% speedup on one machine here; using
dirhash when sequentially looking up directory entries is now about
5% faster instead of 2% slower than the non-dirhash case.

Submitted by:	KOIE Hidetaka <koie@suri.co.jp>
MFC after:	1 week
2001-11-14 15:08:07 +00:00
dillon
1147eaf58a Implement IO_NOWDRAIN and B_NOWDRAIN - prevents the buffer cache from blocking
in wdrain during a write.  This flag needs to be used in devices whos
strategy routines turn-around and issue another high level I/O, such as
when MD turns around and issues a VOP_WRITE to vnode backing store, in order
to avoid deadlocking the dirty buffer draining code.

Remove a vprintf() warning from MD when the backing vnode is found to be
in-use.  The syncer of buf_daemon could be flushing the backing vnode at
the time of an MD operation so the warning is not correct.

MFC after:	1 week
2001-11-05 18:48:54 +00:00
rwatson
58469c01a1 o Update copyright dates.
o Add reference to TrustedBSD Project in license header.
o Update dated comments, including comment in extattr.h claiming that
  no file systems support extended attributes.
o Improve comment consistency.
2001-11-01 21:37:07 +00:00
rwatson
5e557739dc o Althought this is not specified in POSIX.1e, the UFS ACL implementation
coerces the deletion of a default ACL on a directory when no default
  ACL EA is present to success.  Because the UFS EA implementation doesn't
  disinguish the EA failure modes "that EA name has not been
  administratively enabled" from "that EA name has no defined data",
  there's a potential conflict in error return values.  Normally, the
  lack of administratively configured EA support is coerced to
  EOPNOTSUPP to indicate that ACLs are not available; in this case,
  it is possible to get a successful return, even if ACLs are not
  available because EA support for them has not been enabled.

  Expand the comment in ufs_setacl() to identify this case.

Obtained from:	TrustedBSD Project
2001-10-27 05:39:17 +00:00
rwatson
82286fef85 o Clarify a comment about the locking condition of the vnode upon exit
from ufs_extattr_enable_with_open().
o Print auto-start notifications if (bootverbose).  This was previously
  commented out since it didn't know how to check for bootverbose.
o Drop in comments throughout indicating where ENOENT should be replaced
  with ENOATTR once that is available.

Obtained from:	TrustedBSD Project
2001-10-27 05:19:14 +00:00
rwatson
b985eb702b o The comment about ordering the destruction of the lock and the removal of
the flag indicating that the structure was initialized didn't need
  an XXX, since it didn't need fixing.

Obtained from:	TrustedBSD Project
2001-10-27 05:05:39 +00:00
rwatson
1d903fb701 o Wrap a number of long lines of code, many of which were introduced
due to KSE-related (p) expansions.

Obtained from:	TrustedBSD Project
2001-10-27 05:03:05 +00:00
rwatson
25c2879cc9 Since namespace support was added to the UFS extended attribute
implementation to replace single-character namespace prefixes, '$' is no
longer an invalid attribute name, and the namespace is relevant to
validity determination.

o Remove '$' case from ufs_extattr_valid_attrname()
o Add attrnamespace argument to ufs_extattr_valid_attrname(), and
  fill out appropriately.

Currently no decisions are made based on the namespace argument, but
may be in the future.

Obtained from:	TrustedBSD Project
2001-10-27 04:58:28 +00:00
dillon
f883ef447a Implement kern.maxvnodes. adjusting kern.maxvnodes now actually has a
real effect.

Optimize vfs_msync().  Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list.  Improves looping case by 500%.

Optimize ffs_sync().  Avoid having to continually drop and re-obtain
mutexes when scanning the vnode list.  This makes a couple of assumptions,
which I believe are ok, in regards to vnode stability when the mount list
mutex is held.  Improves looping case by 500%.

(more optimization work is needed on top of these fixes)

MFC after:	1 week
2001-10-26 00:08:05 +00:00
iedowse
6c4623e897 Default to not performing ufs_dirhash's extensive directory-block
sanity check after every directory modification. This check can be
re-enabled at any time by setting the sysctl "vfs.ufs.dirhash_docheck"
to 1.

This group of sanity tests was there to ensure that any UFS_DIRHASH
bugs could be caught by a panic before a potentially corrupted
directory block would be written to disk. It has served its main
purpose now, so disable it in the interest of performance.

MFC after:	1 week
2001-10-25 22:55:59 +00:00
dillon
45a6fabe87 Change the vnode list under the mount point from a LIST to a TAILQ
in preparation for an implementation of limiting code for kern.maxvnodes.

MFC after:	3 days
2001-10-23 01:21:29 +00:00
jhb
4806d88677 Change the kernel's ucred API as follows:
- crhold() returns a reference to the ucred whose refcount it bumps.
- crcopy() now simply copies the credentials from one credential to
  another and has no return value.
- a new crshared() primitive is added which returns true if a ucred's
  refcount is > 1 and false (0) otherwise.
2001-10-11 23:38:17 +00:00
jhb
4c62ba7c58 Add missing includes of sys/lock.h. 2001-10-11 17:52:20 +00:00
dillon
dbebfe18a1 Remove panics for rename() race conditions. The panics are inappropriate
because the IN_RENAME flag only fixes a few of the huge number of race
conditions that can result in the source path becoming invalid even
prior to the VOP_RENAME() call.  The panics created a serious security
issue whereby an attacker could fairly easily cause the panic to
occur, crashing the machine.

The correct solution requires a great deal of work in the namei
path cache code.

MFC after:	0 days
2001-10-08 00:37:54 +00:00
rwatson
63b7da3bb4 o Replace two direct uid!=0 comparisons with suser_xxx() calls.
Obtained from:	TrustedBSD Project
2001-10-02 14:41:43 +00:00
rwatson
bf46cc0b03 o Replace two direct uid!=0 comparisons with suser_td() calls.
Obtained from:	TrustedBSD Project
2001-10-02 14:34:22 +00:00
dillon
c998ec6a97 Backout the last commit. The problem is actually much worse then I
first thought and may require serious work to the VOP_RENAME() api itself.
Basically, by the time the VOP_RENAME() function is called, it's already
too late.
2001-10-02 04:26:58 +00:00
dillon
5eaf310a14 IN_RENAME should only be cleared by the routine that set it. This fixes
a rename/rmdir race that has been shown to cause a panic.

Bug reported by: Yevgeniy Aleynikov <eugenea@infospace.com>
MFC after:	3 days
2001-10-02 02:58:48 +00:00
jhb
fe3fc4701b - Fix some minor whitespace nits.
- Move the SPECIAL_FLAG #define up next to the NOHOLDER #define and fix a
  little nit that caused it to be defined as -(sizeof (struct thread) + 1)
  instead of -2.
2001-09-27 21:04:13 +00:00
rwatson
9eed33b643 o Re-enable support of system file flags in jail() by adding back the
PRISON_ROOT to the suser_xxx() check.  Since securelevels may now
  be raised in specific jails, use of system flags can still be
  restricted in jail(), but in a more configurable way.
o Users of jail() expecting system flags (such as schg) to restrict
  jail()'s should be sure to set the securelevel appropriately in
  jail()'s.
o This fixes activities involving automated system flag removal in
  jail(), including installkernel and friends.

Obtained from:	TrustedBSD Project
2001-09-26 20:44:41 +00:00
rwatson
4e4d85b5d1 o Modify ufs_setattr() so that it uses securelevel_gt() instead of
direct variable access.

Obtained from:	TrustedBSD Project
2001-09-26 20:31:37 +00:00
rwatson
56db2a389f o Further clarify comment: ad Udo's request, re-insert the 'if'
refering to securelevels; also, update the unprivileged process text
  to better indicate the scope of actions permittable when any system
  flags are already set (limited).

Submitted by:	Udo Schweigert <udo.schweigert@siemens.com>
2001-09-25 12:02:44 +00:00
rwatson
de37df3afa o Parallelize the comment on the relationship between privileged un-jailed
processes and the actual securelevel check: make the comment use '> 0'
  instead of inverted '<= 0'.
2001-09-25 02:26:10 +00:00
iedowse
af5c7958aa The addition of i_dirhash to struct inode pushed RELENG_4's
sizeof(struct inode) into a new malloc bucket on the i386. This
didn't happen in -current due to the removal of i_lock, but it does
no harm to apply the workaround to -current first.

Reduce the size of the i_spare[] array in struct inode from 4 to
3 entries, and change ext2fs to use i_din.di_spare[1] so that it
does not need i_spare[3].

Reviewed by:	bde
MFC after:	3 days
2001-09-24 18:29:20 +00:00
julian
5596676e6c KSE Milestone 2
Note ALL MODULES MUST BE RECOMPILED
make the kernel aware that there are smaller units of scheduling than the
process. (but only allow one thread per process at this time).
This is functionally equivalent to teh previousl -current except
that there is a thread associated with each process.

Sorry john! (your next MFC will be a doosie!)

Reviewed by: peter@freebsd.org, dillon@freebsd.org

X-MFC after:    ha ha ha ha
2001-09-12 08:38:13 +00:00
iedowse
ddaced1703 The "dirpref" directory layout preference improvements make use of
an array "fs_contigdirs[]" to avoid too many directories getting
created in each cylinder group. The memory required for this and
two other arrays (fs_csp[] and fs_maxcluster[]) is allocated with
a single malloc() call, and divided up afterwards.  However, the
'space' pointer is not advanced correctly, so fs_contigdirs and
fs_maxcluster end up pointing to the same address.

Add the missing code to advance the 'space' pointer, and remove
an unnecessary update of the pointer that follows.

This is likely to fix the "ffs_clusteralloc: map mismatch" panics
that have been reported recently.

Submitted by:		Luke Mewburn <lukem@wasabisystems.com>
2001-09-09 23:48:28 +00:00
jedgar
5b9926665b Use ACL_PERM_NONE instead of hardcoding 0 when initializing
ACL entry permissions.

Reviewed by:	rwatson
2001-09-01 23:18:15 +00:00
rwatson
00ec5e8482 o At some point, unmounting a non-EA file system with EA's compiled
in got a bit broken, when ufs_extattr_stop() was called and failed,
  ufs_extattr_destroy() would panic.  This makes the call to destroy()
  conditional on the success of stop().

Submitted by:		Christian Carstensen <cc@devcon.net>
Obtained from:	TrustedBSD Project
2001-09-01 20:11:05 +00:00
peter
4b437abe78 If a file has been completely unlinked, stop automatically syncing the
file.  ffs will discard any pending dirty pages when it is closed,
so we may as well not waste time trying to clean them.  This doesn't
stop other things from writing it out, eg: pageout, fsync(2) etc.
2001-08-27 06:09:56 +00:00
iedowse
bff1ce4c20 Stop using dirhash when a directory is removed, and ensure that we
never attempt to hash directories once they are deleted. This fixes
a problem where operations on a deleted directory could trigger
dirhash sanity panics.
2001-08-26 20:47:19 +00:00
iedowse
c8ef91ce6c When compacting directories, ufs_direnter() always trusted DIRSIZ()
to supply the number of bytes to be bcopy()'d to move an entry. If
d_ino == 0 however, DIRSIZ() is not guaranteed to return a sensible
length, so ufs_direnter could end up corrupting a directory during
compaction. In practice I believe this can only happen after fsck_ffs
has fixed a previously-corrupted directory.

We now deal with any mid-block unused entries specially to avoid
using DIRSIZ() or bcopy() on such entries. We also ensure that the
variables 'dsize' and 'spacefree' contain meaningful values at all
times. Add a few comments to describe better this intricate piece
of code.

The special handling of mid-block unused entries makes the dirhash-
specific bugfix in the previous revision (1.53) now uncecessary,
so this change removes it.

Reviewed by:	mckusick
2001-08-26 01:25:12 +00:00
iedowse
c37a1a0228 When compressing directory blocks, the dirhash code didn't check
that the directory entry was in use before attempting to find it
in the hash structures to change its offset. Normally, unused
entries do not need to be moved, but fsck can leave behind some
unused entries that do. A dirhash sanity panic resulted when the
entry to be moved was not found. Add a check that stops entries
with d_ino == 0 from being passed to ufsdirhash_move().
2001-08-22 01:35:17 +00:00
peter
886fd9ad64 Sigh. ufs_lookup() calls ffs_snapgone(), meaning that 'options EXT2FS'
without 'options FFS' would fail to link.
2001-08-18 03:08:48 +00:00
iedowse
c768482a0f Two recent commits in sys/ufs/ufs interacted badly with ext2fs
because it shares ufs code. In ufs_fhtovp(), the test on i_effnlink
is invalid because ext2fs does not maintain this field. In ufs_close(),
i_effnlink is also tested, to determines whether or not to call
vn_start_write(). The ufs_fhtovp issue breaks NFS exporting of
ext2fs filesystems; I believe the other is harmless.

Fix both cases by checking um_i_effnlink_valid in the ufsmount
struct, and use i_nlink if necessary.

Noticed by:	bde
Reviewed by:	mckusick, bde
2001-07-29 22:26:01 +00:00
iedowse
5857132545 Disable the dirhash sanity check that panics if an unused directory
entry (d_ino == 0) is found in a position that is not the start of
a DIRBLKSIZ block.

While such entries cannot occur normally (ufs always extends the
previous entry to cover the free space instead), they do not cause
problems and fsck does not fix them, so panicking is bad.
2001-07-27 18:45:41 +00:00
peter
57a6887663 Use a fixed type for times in on-disk structures for ufs rather than
something that could potentially change like time_t.
2001-07-16 00:55:27 +00:00
iedowse
991ee42593 Return a locked struct buf from ufsdirhash_lookup() to avoid one
extra getblk/brelse sequence for each lookup. We already had this
buf in ufsdirhash_lookup(), so there was no point in brelse'ing it
only to have the caller immediately reaquire the same buffer.

This should make the case of sequential lookups marginally faster;
in my tests, sequential lookups with dirhash enabled are now only
around 1% slower than without dirhash.
2001-07-13 20:50:38 +00:00
iedowse
ecbac42d61 Bring in dirhash, a simple hash-based lookup optimisation for large
directories. When enabled via "options UFS_DIRHASH", in-core hash
arrays are maintained for large directories. These allow all
directory operations to take place quickly instead of requiring
long linear searches. For now anyway, dirhash is not enabled by
default.

The in-core hash arrays have a memory requirement that is approximately
half the size of the size of the on-disk directory file. A number
of new sysctl variables allow control over which directories get
hashed and over the maximum amount of memory that dirhash will use:

  vfs.ufs.dirhash_minsize
    The minimum on-disk directory size for which hashing should be
    used. The default is 2560 (2.5k).

  vfs.ufs.dirhash_maxmem
    The system-wide maximum total memory to be used by dirhash data
    structures. The default is 2097152 (2MB).

The current amount of memory being used by dirhash is visible
through the read-only sysctl variable vfs.ufs.dirhash_maxmem.
Finally, some extra sanity checks that are enabled by default, but
which may have an impact on performance, can be disabled by setting
vfs.ufs.dirhash_docheck to 0.

Discussed on: -fs, -hackers
2001-07-10 21:21:29 +00:00
dillon
e028603b7e With Alfred's permission, remove vm_mtx in favor of a fine-grained approach
(this commit is just the first stage).  Also add various GIANT_ macros to
formalize the removal of Giant, making it easy to test in a more piecemeal
fashion. These macros will allow us to test fine-grained locks to a degree
before removing Giant, and also after, and to remove Giant in a piecemeal
fashion via sysctl's on those subsystems which the authors believe can
operate without Giant.
2001-07-04 16:20:28 +00:00
jhb
1bc2ddffa0 Fix more mntvnode and vnode interlock order reversals. 2001-06-28 22:21:33 +00:00
jhb
822af43e20 - Fix a mntvnode and vnode interlock reversal.
- Protect the mnt_vnode list with the mntvnode lock.
- Use queue(9) macros.
2001-06-28 04:12:56 +00:00
peter
efc05ef868 Fix warning:
1973: warning: int format, long int arg (arg 5)
2001-06-15 07:44:39 +00:00
mckusick
2f9709c0f1 Build on the change in revision 1.98 by Tor.Egge@fast.no.
The symptom being treated in 1.98 was to avoid freeing a
pagedep dependency if there was still a newdirblk dependency
referencing it. That change is correct and no longer prints
a warning message when it occurs. The other part of revision
1.98 was to panic when a newdirblk dependency was encountered
during a file truncation. This fix removes that panic and
replaces it with code to find and delete the newdirblk
dependency so that the truncation can succeed.
2001-06-13 23:13:13 +00:00
tmm
126158251a Call vn_close on the backing file vnode if ufs_extattr_enable failed to
avoid leaking it.

Reviewed by:	rwatson
2001-06-07 00:11:32 +00:00
jlemon
bf3a0c1bb1 Add a wrapper for the fifo kqfilter which falls through to the ufs routine.
This permits the fifo to inherit the ufs VNODE kqfilter.
2001-06-06 17:40:57 +00:00
jlemon
404f6dd2da Add a kqueue filter for writing to ufs filesystems which always returns
true.  This permits better interoperability with programs which register
filters on their stdin/stdout handles.

Submitted by: Niels Provos <provos@citi.umich.edu>
2001-06-05 13:52:37 +00:00
obrien
c3d076fe28 There seems to be a problem that the order of disk write operation being
incorrect due to a missing check for some dependency.  This change
avoids the freelist corruption (but not the temporarily inconsistent
state of the file system).

A message is printed as a reminder of the under lying problem when a
pagedep structure is not freed due to the NEWBLOCK flag being set.

Submitted by:	Tor.Egge@fast.no
2001-06-05 01:49:37 +00:00
jhb
ff6bb62be3 Revert the previous commit in favor of the fix in rev 1.42 of
ufs/ffs/ffs_extern.h instead.

Requested by:	bde
2001-05-30 23:09:19 +00:00
jhb
b033de0fdc Forward declare struct cg to quiet a warning.
Submitted by:	bde
2001-05-30 23:08:40 +00:00
jhb
3d4bb0e38d Include <ufs/ffs/fs.h> to get the definition of struct cg to quiet a
warning.
2001-05-29 23:53:16 +00:00
phk
d442761285 Remove last vestiges of MFS. 2001-05-29 21:21:53 +00:00
phk
785e1e1e17 Remove MFS from the kernel. 2001-05-29 18:50:30 +00:00
tmm
76ca05f26f Add a check to determine whether extended attributes have been
initialized on the file system before trying to grab the lock of the
per-mount extattr structure, as this lock is unitialized in that case.
This is needed because ufs_extattr_vnode_inactive is called from
ufs_inactive, which is also used by EA-unaware file systems such as
ext2fs.

Reviewed by:	rwatson
2001-05-25 18:24:52 +00:00
rwatson
f504530d9f o Merge contents of struct pcred into struct ucred. Specifically, add the
real uid, saved uid, real gid, and saved gid to ucred, as well as the
  pcred->pc_uidinfo, which was associated with the real uid, only rename
  it to cr_ruidinfo so as not to conflict with cr_uidinfo, which
  corresponds to the effective uid.
o Remove p_cred from struct proc; add p_ucred to struct proc, replacing
  original macro that pointed.
  p->p_ucred to p->p_cred->pc_ucred.
o Universally update code so that it makes use of ucred instead of pcred,
  p->p_ucred instead of p->p_pcred, cr_ruidinfo instead of p_uidinfo,
  cr_{r,sv}{u,g}id instead of p_*, etc.
o Remove pcred0 and its initialization from init_main.c; initialize
  cr_ruidinfo there.
o Restruction many credential modification chunks to always crdup while
  we figure out locking and optimizations; generally speaking, this
  means moving to a structure like this:
        newcred = crdup(oldcred);
        ...
        p->p_ucred = newcred;
        crfree(oldcred);
  It's not race-free, but better than nothing.  There are also races
  in sys_process.c, all inter-process authorization, fork, exec, and
  exit.
o Remove sigio->sio_ruid since sigio->sio_ucred now contains the ruid;
  remove comments indicating that the old arrangement was a problem.
o Restructure exec1() a little to use newcred/oldcred arrangement, and
  use improved uid management primitives.
o Clean up exit1() so as to do less work in credential cleanup due to
  pcred removal.
o Clean up fork1() so as to do less work in credential cleanup and
  allocation.
o Clean up ktrcanset() to take into account changes, and move to using
  suser_xxx() instead of performing a direct uid==0 comparision.
o Improve commenting in various kern_prot.c credential modification
  calls to better document current behavior.  In a couple of places,
  current behavior is a little questionable and we need to check
  POSIX.1 to make sure it's "right".  More commenting work still
  remains to be done.
o Update credential management calls, such as crfree(), to take into
  account new ruidinfo reference.
o Modify or add the following uid and gid helper routines:
      change_euid()
      change_egid()
      change_ruid()
      change_rgid()
      change_svuid()
      change_svgid()
  In each case, the call now acts on a credential not a process, and as
  such no longer requires more complicated process locking/etc.  They
  now assume the caller will do any necessary allocation of an
  exclusive credential reference.  Each is commented to document its
  reference requirements.
o CANSIGIO() is simplified to require only credentials, not processes
  and pcreds.
o Remove lots of (p_pcred==NULL) checks.
o Add an XXX to authorization code in nfs_lock.c, since it's
  questionable, and needs to be considered carefully.
o Simplify posix4 authorization code to require only credentials, not
  processes and pcreds.  Note that this authorization, as well as
  CANSIGIO(), needs to be updated to use the p_cansignal() and
  p_cansched() centralized authorization routines, as they currently
  do not take into account some desirable restrictions that are handled
  by the centralized routines, as well as being inconsistent with other
  similar authorization instances.
o Update libkvm to take these changes into account.

Obtained from:	TrustedBSD Project
Reviewed by:	green, bde, jhb, freebsd-arch, freebsd-audit
2001-05-25 16:59:11 +00:00
dillon
a179ee09ab This patch implements O_DIRECT about 80% of the way. It takes a patchset
Tor created a while ago, removes the raw I/O piece (that has cache coherency
problems), and adds a buffer cache / VM freeing piece.

Essentially this patch causes O_DIRECT I/O to not be left in the cache, but
does not prevent it from going through the cache, hence the 80%.  For
the last 20% we need a method by which the I/O can be issued directly to
buffer supplied by the user process and bypass the buffer cache entirely,
but still maintain cache coherency.

I also have the code working under -stable but the changes made to sys/file.h
may not be MFCable, so an MFC is not on the table yet.

Submitted by:	tegge, dillon
2001-05-24 07:22:27 +00:00
alfred
edd79c680a ufs_bmaparray() may block on IO, drop vm mutex and aquire Giant when
calling it from the pager routine
2001-05-23 10:30:25 +00:00
ru
35437d86aa - FDESC, FIFO, NULL, PORTAL, PROC, UMAP and UNION file
systems were repo-copied from sys/miscfs to sys/fs.

- Renamed the following file systems and their modules:
  fdesc -> fdescfs, portal -> portalfs, union -> unionfs.

- Renamed corresponding kernel options:
  FDESC -> FDESCFS, PORTAL -> PORTALFS, UNION -> UNIONFS.

- Install header files for the above file systems.

- Removed bogus -I${.CURDIR}/../../sys CFLAGS from userland
  Makefiles.
2001-05-23 09:42:29 +00:00
mckusick
c9a89dbc03 Update softdep_setup_directory_add prototype to reflect changes in
actual function.

Obtained from:	Jim Bloom <bloom@jbloom.jbloom.org>
2001-05-20 15:59:55 +00:00