freebsd-nq

Author	SHA1	Message	Date
Matthew Dillon	1b7e3dafdf	Fix a file-rewrite performance case for UFS[2]. When rewriting portions of a file in chunks that are less than the filesystem block size, if the data is not already cached the system will perform a read-before-write. The problem is that it does this on a block-by-block basis, breaking up the I/Os and making clustering impossible for the writes. Programs such as INN using cyclic file buffers suffer greatly. This problem is only going to get worse as we use larger and larger filesystem block sizes. The solution is to extend the sequential heuristic so UFS[2] can perform a far larger read and readahead when dealing with this case. (note: maximum disk write bandwidth is 27MB/sec thru filesystem) (note: filesystem blocksize in test is 8K (1K frag)) dd if=/dev/zero of=test.dat bs=1k count=2m conv=notrunc Before: (note half of these are reads) tty da0 da1 acd0 cpu tin tout KB/t tps MB/s KB/t tps MB/s KB/t tps MB/s us ni sy in id 0 76 14.21 598 8.30 0.00 0 0.00 0.00 0 0.00 0 0 7 1 92 0 76 14.09 813 11.19 0.00 0 0.00 0.00 0 0.00 0 0 9 5 86 0 76 14.28 821 11.45 0.00 0 0.00 0.00 0 0.00 0 0 8 1 91 After: (note half of these are reads) tty da0 da1 acd0 cpu tin tout KB/t tps MB/s KB/t tps MB/s KB/t tps MB/s us ni sy in id 0 76 63.62 434 26.99 0.00 0 0.00 0.00 0 0.00 0 0 18 1 80 0 76 63.58 424 26.30 0.00 0 0.00 0.00 0 0.00 0 0 17 2 82 0 76 63.82 438 27.32 0.00 0 0.00 0.00 0 0.00 1 0 19 2 79 Reviewed by: mckusick Approved by: re X-MFC after: immediately (was heavily tested in -stable for 4 months)	2002-10-18 22:52:41 +00:00
Robert Watson	61eef6c245	Update extended attribute readme file to note that no special configuration is required to use EAs with UFS2, and that UFS2 is recommended for EA use for a variety of reasons. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories	2002-10-18 21:11:36 +00:00
Robert Watson	f5b1000b8f	Update instructions for ACLs given recent tunefs, mount changes. Also note that UFS2 doesn't require explicit extended attribute configuration, and is recommended for this and other reasons if you plan to use ACLs. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories	2002-10-18 21:09:57 +00:00
Robert Watson	16eac5b95c	Use 'size_t' instead of 'int' for the result of sizeof().	2002-10-18 21:03:30 +00:00
Kirk McKusick	ef6c0bb296	With the revised single-lock method used in snapshots, the BA_NOWAIT flag is no longer needed. Sponsored by: DARPA & NAI Labs.	2002-10-18 01:17:28 +00:00
Kirk McKusick	86aeb27fa2	Change locking so that all snapshots on a particular filesystem share a common lock. This change avoids a deadlock between snapshots when separate requests cause them to deadlock checking each other for a need to copy blocks that are close enough together that they fall into the same indirect block. Although I had anticipated a slowdown from contention for the single lock, my filesystem benchmarks show no measurable change in throughput on a uniprocessor system with three active snapshots. I conjecture that this result is because every copy-on-write fault must check all the active snapshots, so the process was inherently serial already. This change removes the last of the deadlocks of which I am aware in snapshots. Sponsored by: DARPA & NAI Labs.	2002-10-16 00:19:23 +00:00
Robert Watson	9e3bf94fd7	Push most UFS ACL behavior behind a check for MNT_ACLS, permitting ACLs to be administratively disabled as needed on UFS/UFS2 file systems. This also has the effect of preventing the slightly more expensive ACL code from running on non-ACL file systems, avoiding storage allocation for ACLs that may be read from disk. MNT_ACLS may be set at mount-time using mount -o acls, or implicitly by setting the FS_ACLS flag using tunefs. On UFS1, you may also have to configure ACL store. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories	2002-10-15 21:28:24 +00:00
Robert Watson	80830407c6	If the FS_MULTILABEL flag is set in a UFS or UFS2 superblock, automatically set MNT_MULTILABEL in the mount flags. If FS_ACLS is set in a UFS or UFS2 superblock, automatically set MNT_ACLS in the mount flags. If either of these flags is set, but the appropriate kernel option to support the features associated with the flag isn't available, then print a warning at mount-time. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories	2002-10-15 20:00:06 +00:00
Kirk McKusick	48f0495d85	When reading or writing the extended attributes of a special device or fifo in UFS2, the normal ufs_strategy routine needs to be used rather than the spec_strategy or fifo_strategy routine. Thus the ffsext_strategy routine is interposed in the ffs_vnops vectors for special devices and fifo's to pick off this special case. Otherwise it simply falls through to the usual spec_strategy or fifo_strategy routine. Submitted by: Robert Watson <rwatson@FreeBSD.org> Sponsored by: DARPA & NAI Labs.	2002-10-14 23:18:09 +00:00
Robert Watson	baeb8a4774	Fix two memory leaks in error conditions involving the UFS ACL code: if failures occur, make sure that we release both the default ACL and access ACL storage during new object creation. Spotted by: phk and his pet flexelint Sponsored by: DARPA, Network Associates Laboratories	2002-10-14 19:55:49 +00:00
Robert Watson	3ceef565b2	Define two new superblock file system flags: FS_ACLS Administrative enable/disable of extended ACL support FS_MULTILABEL Administrative flag to indicate to the MAC Framework that objects in the file system are individually labeled using extended attributes. Obtained from: TrustedBSD Project Sponsored by: DARPA, Network Associates Laboratories Reviewed by: (in principle) mckusick, phk	2002-10-14 17:07:11 +00:00
Kirk McKusick	a5b65058d5	Regularize the vop_stdlock'ing protocol across all the filesystems that use it. Specifically, vop_stdlock uses the lock pointed to by vp->v_vnlock. By default, getnewvnode sets up vp->v_vnlock to reference vp->v_lock. Filesystems that wish to use the default do not need to allocate a lock at the front of their node structure (as some still did) or do a lockinit. They can simply start using vn_lock/VOP_UNLOCK. Filesystems that wish to manage their own locks, but still use the vop_stdlock functions (such as nullfs) can simply replace vp->v_vnlock with a pointer to the lock that they wish to have used for the vnode. Such filesystems are responsible for setting the vp->v_vnlock back to the default in their vop_reclaim routine (e.g., vp->v_vnlock = &vp->v_lock). In theory, this set of changes cleans up the existing filesystem lock interface and should have no function change to the existing locking scheme. Sponsored by: DARPA & NAI Labs.	2002-10-14 03:20:36 +00:00
Mike Barcroft	2b7f24d210	Change iov_base's type from `char *' to the standard `void *'. All uses of iov_base which assume its type is `char *' (in order to do pointer arithmetic) have been updated to cast iov_base to `char *'.	2002-10-11 14:58:34 +00:00
Maxime Henrion	cba63e0291	Fix build of 64 bit platforms.	2002-10-09 12:19:36 +00:00
Kirk McKusick	98d275df37	When creating a snapshot, create a list of initially allocated blocks. Whenever doing a copy-on-write check, first look in the list of initially allocated blocks to see if it is there. If so, no further check is needed. If not, fall through and do the full check. This change eliminates one of two known deadlocks caused by snapshots. Handling the second deadlock will be the subject of another check-in. This change also reduces the cost of the copy-on-write check by speeding up the verification of frequently checked blocks. Sponsored by: DARPA & NAI Labs.	2002-10-09 07:28:35 +00:00
Kirk McKusick	4d533db182	When creating a snapshot, create a list of initially allocated blocks. Whenever doing a copy-on-write check, first look in the list of initially allocated blocks to see if it is there. If so, no further check is needed. If not, fall through and do the full check. This change eliminates one of two known deadlocks caused by snapshots. Handling the second deadlock will be the subject of another check-in. This change also reduces the cost of the copy-on-write check by speeding up the verification of frequently checked blocks. Sponsored by: DARPA & NAI Labs.	2002-10-09 06:13:48 +00:00
Kirk McKusick	b6cef5648d	The appropriate units for disk block addresses are always DEV_BSIZE, even when the underlying device has a larger sector size. Therefore, the filesystem code should not (and with this patch does not) try to use the underlying sector size when doing disk block address calculations. This patch fixes problems in -current when using the swap-based memory-disk device (mdconfig -a -t swap ...). This bugfix is not relevant to -stable as -stable does not have the memory-disk device. Sponsored by: DARPA & NAI Labs.	2002-10-09 04:01:23 +00:00
Jeff Roberson	a2c4ff970b	- Remove LK_INTERLOCK from the vn_lock() in ffs_snapshot(). Pointy hat to: me Found by: green	2002-10-08 21:00:52 +00:00
Poul-Henning Kamp	4f3ee6dcc4	Mark two places where an unsigned number is checked "if (foo < 0)" with an XXX comment. Somebody[TM] should look at this in some detail. Spotted by: FlexeLint	2002-10-02 09:11:18 +00:00
Dima Dorfman	85bba62925	size_t is not a struct (fix mislabelling in a comment).	2002-10-02 05:15:34 +00:00
Poul-Henning Kamp	8d3574c7a4	Fix some harmless mis-indents. Spotted by: FlexeLint	2002-10-01 15:48:31 +00:00
Juli Mallett	85de3147ea	When spamming me with a printf(9), under DIAGNOSTIC, at least be nice enough to include a newline. MFC after: 4 days Sponsored by: Bright Path Solutions	2002-09-28 19:04:49 +00:00
Poul-Henning Kamp	37c841831f	Be consistent about "static" functions: if the function is marked static in its prototype, mark it static at the definition too. Inspired by: FlexeLint warning #512	2002-09-28 17:15:38 +00:00
Poul-Henning Kamp	a8babca268	Make it a tad easier to deal with struct inode in userland programs which fondle /dev/kmem by using "struct cdev *" instead of "dev_t". Requested by: jake	2002-09-27 20:03:05 +00:00
Poul-Henning Kamp	993b0567b2	Use our mount-credential if we get a NOCRED when we try to write out EA space back to disk. This is wrong in many ways, but not as wrong as a panic. Panicked on: rwatson & jmallet Sponsored by: DARPA & NAI Labs.	2002-09-27 20:00:03 +00:00
Jeff Roberson	2ee5711e84	- Convert locks to use standard macros. - Lock access to the buflists. - Document broken locking. - Use vrefcnt().	2002-09-25 02:49:48 +00:00
Jeff Roberson	6ef1763407	- Document broken locking. - Use vrefcnt().	2002-09-25 02:47:49 +00:00
Jeff Roberson	d4820f8036	- Lock accesses to v_usecount. - Convert interlock locks to use standard macros.	2002-09-25 02:45:50 +00:00
Jeff Roberson	8823f1b6db	- Don't use the interlock to protect v_writecount.	2002-09-25 02:44:55 +00:00
Poul-Henning Kamp	cf09d67418	We don't need to #include <sys/disklabel.h>. We don't need to #include <sys/disklabel.h> second time either. Sponsored by: DARPA & NAI Labs.	2002-09-20 16:42:33 +00:00
Don Lewis	fa288043e2	VOP_FSYNC() requires that its vnode argument be locked, which nfs_link() wasn't doing. Rather than just lock and unlock the vnode around the call to VOP_FSYNC(), implement rwatson's suggestion to lock the file vnode in kern_link() before calling VOP_LINK(), since the other filesystems also locked the file vnode right away in their link methods. Remove the locking and unlocking from the leaf filesystem link methods. Reviewed by: rwatson, bde (except for the unionfs_link() changes)	2002-09-19 13:32:45 +00:00
David E. O'Brien	47a561263d	intmax_t is printed with %jd, not %lld.	2002-09-19 03:55:30 +00:00
Nate Lawson	86ed6d45ac	Remove any VOP_PRINT that redundantly prints the tag. Move lockmgr_printinfo() into vprint() for everyone's benefit. Suggested by: bde	2002-09-18 20:42:04 +00:00
Nate Lawson	06be2aaa83	Remove all use of vnode->v_tag, replacing with appropriate substitutes. v_tag is now const char * and should only be used for debugging. Additionally: 1. All users of VT_NTS now check vfsconf->vf_type VFCF_NETWORK 2. The user of VT_PROCFS now checks for the new flag VV_PROCDEP, which is propagated by pseudofs to all child vnodes if the fs sets PFS_PROCDEP. Suggested by: phk Reviewed by: bde, rwatson (earlier version)	2002-09-14 09:02:28 +00:00
Bruce Evans	d3a7b5e70e	vfs_syscalls.c: Changed rename(2) to follow the letter of the POSIX spec. POSIX requires rename() to have no effect if its args "resolve to the same existing file". I think "file" can only reasonably be read as referring to the inode, although the rationale and "resolve" seem to say that sameness is at the level of (resolved) directory entries. ext2fs_vnops.c, ufs_vnops.c: Replaced code that gave the historical BSD behaviour of removing one link name by checks that this code is now unreachable. This fixes some races. All vnodes needed to be unlocked for the removal, and locking at another level using something like IN_RENAME was not even attempted, so it was possible for rename(x, y) to return with both x and y removed even without any unlink(2) syscalls (one process can remove x using rename(x, y) and another process can remove y using rename(y, x)). Prodded by: alfred MFC after: 8 weeks PR: 42617	2002-09-10 11:09:13 +00:00
Poul-Henning Kamp	0e168822b2	Implement the VOP_OPENEXTATTR() and VOP_CLOSEEXTATTR() methods. Use extattr_check_cred() to check access to EAs. This is still a WIP. Sponsored by: DARPA & NAI Labs.	2002-09-05 20:59:42 +00:00
Poul-Henning Kamp	190a4963d0	Use canonical extattr_check_cred() instead of private implementation of the same policy. Sponsored by: DARPA & NAI Labs.	2002-09-05 20:39:36 +00:00
Poul-Henning Kamp	04205dc4be	Fix credentials check: do not leak ENOATTR until we know if they're supposed to know. Sponsored by: DARPA & NAI Labs.	2002-09-05 20:28:24 +00:00
Bruce Evans	8f767abf71	Include <sys/malloc.h> instead of depending on namespace pollution 2 layers deep in <sys/proc.h> or <sys/vnode.h>. Include <sys/vmmeter.h> instead of depending on namespace pollution in <sys/pcpu.h>. Sorted includes as much as possible.	2002-09-05 09:43:24 +00:00
Robert Watson	2fc6567e9a	Since we have vp and td cached in local variables, use those instead of dereferencing the VOP arguments again when calling the UFS code. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs	2002-09-01 16:06:40 +00:00
Poul-Henning Kamp	d0e9b8dbc4	Correctly handle setting, getting and deleting EA's with zero length content. Sponsored by: DARPA & NAI Labs.	2002-08-30 08:57:09 +00:00
Philippe Charnier	93b0017f88	Replace various spelling with FALLTHROUGH which is lint()able	2002-08-25 13:23:09 +00:00
Alan Cox	fff6062ab6	o Retire vm_page_zero_fill() and vm_page_zero_fill_area(). Ever since pmap_zero_page() and pmap_zero_page_area() were modified to accept a struct vm_page * instead of a physical address, vm_page_zero_fill() and vm_page_zero_fill_area() have served no purpose.	2002-08-25 00:22:31 +00:00
Poul-Henning Kamp	7428de69d2	Implement list of EA return functionality. Correctly delete EA's when the content length is set to zero. Sponsored by: DARPA & NAI Labs.	2002-08-20 11:34:58 +00:00
Poul-Henning Kamp	0176455bc8	First snapshot of UFS2 EA support. Sponsored by: DARPA & NAI Labs.	2002-08-19 07:01:55 +00:00
Robert Watson	9ca435893b	In order to better support flexible and extensible access control, make a series of modifications to the credential arguments relating to file read and write operations to clarify which credential is used for what: - Change fo_read() and fo_write() to accept "active_cred" instead of "cred", and change the semantics of consumers of fo_read() and fo_write() to pass the active credential of the thread requesting an operation rather than the cached file cred. The cached file cred is still available in fo_read() and fo_write() consumers via fp->f_cred. These changes largely in sys_generic.c. For each implementation of fo_read() and fo_write(), update cred usage to reflect this change and maintain current semantics: - badfo_readwrite() unchanged - kqueue_read/write() unchanged pipe_read/write() now authorize MAC using active_cred rather than td->td_ucred - soo_read/write() unchanged - vn_read/write() now authorize MAC using active_cred but VOP_READ/WRITE() with fp->f_cred Modify vn_rdwr() to accept two credential arguments instead of a single credential: active_cred and file_cred. Use active_cred for MAC authorization, and select a credential for use in VOP_READ/WRITE() based on whether file_cred is NULL or not. If file_cred is provided, authorize the VOP using that cred, otherwise the active credential, matching current semantics. Modify current vn_rdwr() consumers to pass a file_cred if used in the context of a struct file, and to always pass active_cred. When vn_rdwr() is used without a file_cred, pass NOCRED. These changes should maintain current semantics for read/write, but avoid a redundant passing of fp->f_cred, as well as making it more clear what the origin of each credential is in file descriptor read/write operations. Follow-up commits will make similar changes to other file descriptor operations, and modify the MAC framework to pass both credentials to MAC policy modules so they can implement either semantic for revocation. 
Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs	2002-08-15 20:55:08 +00:00
Poul-Henning Kamp	18280bc653	Expand the arguments to ffs_ext{read,write}() to their component parts rather than use vop_{read,write}_args. Access to these functions will ultimately not be available through the "vop_{read,write}+IO_EXT" API but this functionality is retained for debugging purposes for now. Sponsored by: DARPA & NAI Labs.	2002-08-13 11:33:01 +00:00
Poul-Henning Kamp	d6fe88e475	Unravel the UFS_EXTATTR incest between FFS and UFS: UFS_EXTATTR is a UFS-only thing, and FFS should in principle not know if it is enabled or not. This commit cleans ffs_vnops.c of such knowledge, but not ffs_vfsops.c Sponsored by: DARPA and NAI Labs.	2002-08-13 10:33:57 +00:00
Poul-Henning Kamp	9bf1a75697	Introduce typedefs for the member functions of struct vfsops and employ these in the main filesystems. This does not change the resulting code but makes the source a little bit more grepable. Sponsored by: DARPA and NAI Labs.	2002-08-13 10:05:50 +00:00
Robert Watson	c08b677fb5	Pass IO_NOMACCHECK to vn_rdwr() in the following checks to prevent enforcement of MAC policy on the read or write operations: - In ext2fs, don't enforce MAC on loop-back reads and writes supporting directory read operations in lookup(), directory modifications in rename(), directory write operations in mkdir(), symlink write operations in symlink(). - In the NFS client locking code, perform vn_rdwr() on the NFS locking socket without enforcing MAC, since the write is done on behalf of the kernel NFS implementation rather than the user process. - In UFS, don't enforce MAC on loop-back reads and writes supporting directory read operations in lookup(), and symlink write operations in symlink(). Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs	2002-08-12 16:43:04 +00:00
Poul-Henning Kamp	e179b40f14	Stop pretending that the FFS file ufs_readwrite.c is a UFS file. Instead of #including it, pull it into ffs_vnops.c and name things correctly. Sponsored by: DARPA & NAI Labs.	2002-08-12 10:32:56 +00:00
Poul-Henning Kamp	851da5d6cf	Fix a comment.	2002-08-12 09:22:11 +00:00
Ian Dowse	98caa2e4e9	Don't call softdep_slowdown() if soft updates are not active on the filesystem. This causes a panic for kernels compiled without softupdates. Reported by: luigi	2002-08-05 17:59:20 +00:00
Jeff Roberson	e6e370a7fe	- Replace v_flag with v_iflag and v_vflag - v_vflag is protected by the vnode lock and is used when synchronization with VOP calls is needed. - v_iflag is protected by interlock and is used for dealing with vnode management issues. These flags include X/O LOCK, FREE, DOOMED, etc. - All accesses to v_iflag and v_vflag have either been locked or marked with mp_fixme's. - Many ASSERT_VOP_LOCKED calls have been added where the locking was not clear. - Many functions in vfs_subr.c were restructured to provide for stronger locking. Idea stolen from: BSD/OS	2002-08-04 10:29:36 +00:00
Robert Watson	af05e056ec	Introduce support for Mandatory Access Control and extensible kernel access control. Instrument UFS to support per-inode MAC labels. In particular, invoke MAC framework entry points for generically supporting the backing of MAC labels into extended attributes. This ends up introducing new vnode operation vector entries point at the MAC framework entry points, as well as some explicit entry point invocations for file and directory creation events so that the MAC framework can push labels to disk before the directory names become persistent (this will work better once EAs in UFS2 are hooked into soft updates). The generic EA MAC entry points support executing with the file system in either single label or multilabel operation, and will fall back to the mount label if multilabel is not specified at mount-time. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs	2002-07-31 16:05:30 +00:00
Poul-Henning Kamp	c3a0d1d4e1	I forgot this bit of ugliness in the fsck_ffs cleanup.	2002-07-31 07:01:18 +00:00
Poul-Henning Kamp	9fbc6a330d	Fix braino in last commit.	2002-07-30 12:02:41 +00:00
Poul-Henning Kamp	17b1994bbe	Move ffs_isfreeblock() to ffs_alloc.c and make it static. Sponsored by: DARPA & NAI Labs.	2002-07-30 11:54:48 +00:00
Alan Cox	1e8fabc097	Lock page queue accesses by vm_page_free().	2002-07-28 08:01:48 +00:00
Benno Rice	683eac8dbb	Add a missing argument to the stub for softdep_setup_freeblocks. Forgotten by: mckusick	2002-07-20 04:07:15 +00:00
Peter Wemm	382f95d332	Fix a warning: ffs_softdep.c:1630: warning: int format, different type arg (arg 2)	2002-07-20 01:09:35 +00:00
Kirk McKusick	7aca6291e3	Add support to UFS2 to provide storage for extended attributes. As this code is not actually used by any of the existing interfaces, it seems unlikely to break anything (famous last words). The internal kernel interface to manipulate these attributes is invoked using two new IO_ flags: IO_NORMAL and IO_EXT. These flags may be specified in the ioflags word of VOP_READ, VOP_WRITE, and VOP_TRUNCATE. Specifying IO_NORMAL means that you want to do I/O to the normal data part of the file and IO_EXT means that you want to do I/O to the extended attributes part of the file. IO_NORMAL and IO_EXT are mutually exclusive for VOP_READ and VOP_WRITE, but may be specified individually or together in the case of VOP_TRUNCATE. For example, when removing a file, VOP_TRUNCATE is called with both IO_NORMAL and IO_EXT set. For backward compatibility, if neither IO_NORMAL nor IO_EXT is set, then IO_NORMAL is assumed. Note that the BA_ and IO_ flags have been `merged' so that they may both be used in the same flags word. This merger is possible by assigning the IO_ flags to the low sixteen bits and the BA_ flags the high sixteen bits. This works because the high sixteen bits of the IO_ word is reserved for read-ahead and help with write clustering so will never be used for flags. This merge lets us get away from code of the form: if (ioflags & IO_SYNC) flags |= BA_SYNC; For the future, I have considered adding a new field to the vattr structure, va_extsize. This addition could then be exported through the stat structure to allow applications to find out the size of the extended attribute storage and also would provide a more standard interface for truncating them (via VOP_SETATTR rather than VOP_TRUNCATE). I am also contemplating adding a pathconf parameter (for concreteness, let's call it _PC_MAX_EXTSIZE) which would let an application determine the maximum size of the extended attribute storage. Sponsored by: DARPA & NAI Labs.	2002-07-19 07:29:39 +00:00
Kirk McKusick	fb36a3d847	Change utimes to set the file creation time (for filesystems that support creation times such as UFS2) to the value of the modification time if the value of the modification time is older than the current creation time. See utimes(2) for further details. Sponsored by: DARPA & NAI Labs.	2002-07-17 02:03:19 +00:00
Kirk McKusick	faab4e2722	Change the name of st_createtime to st_birthtime. This change is made to reduce confusion between st_ctime and st_createtime. Submitted by: Eric Allman <eric@sendmail.org> Sponsored by: DARPA & NAI Labs.	2002-07-16 22:36:00 +00:00
Tom Rhodes	ae76f60046	Fix a typo: s/your are/you are/	2002-07-12 19:56:31 +00:00
Bruce Evans	2daf9dc825	Fixed some printf format errors (4 new ones reported by gcc and 5 nearby old ones not reported by gcc). This helps unbreak LINT.	2002-07-08 12:42:29 +00:00
Ian Dowse	6bd521df93	Use indirect function pointer hooks instead of #ifdef SOFTUPDATES direct calls for the two places where the kernel calls into soft updates code. Set up the hooks in softdep_initialize() and NULL them out in softdep_uninitialize(). This change allows soft updates to function correctly when ufs is loaded as a module. Reviewed by: mckusick	2002-07-01 17:59:40 +00:00
Ian Dowse	5346934fe7	Add the ffs bits necessary to support unloading of the ufs kernel module. This adds an ffs_uninit() function that calls ufs_uninit() and also calls a new softdep_uninitialize() function. Add a stub for softdep_uninitialize() to cover the non-SOFTUPDATES case. Reviewed by: mckusick	2002-07-01 11:00:47 +00:00
Ian Dowse	3423b21c09	Remove the bogus SYSINIT from ufs_dirhash.c and instead add a call to ufsdirhash_init() from ufs_init(). Add uninit() functions corresponding to the ufs, dirhash, quota and ihash init() functions.	2002-06-30 02:49:39 +00:00
Ian Dowse	8f42fb8fc9	Remove the kernel file-size limit for UFS2, so that only the limit imposed by the filesystem structure itself remains. With 16k blocks, the maximum file size is now just over 128TB. For now, the UFS1 file size limit is left unchanged so as to remain consistent with RELENG_4, but it too could be removed in the future. Reviewed by: mckusick	2002-06-26 18:34:51 +00:00
Kenneth D. Merry	98cb733c67	At long last, commit the zero copy sockets code. MAKEDEV: Add MAKEDEV glue for the ti(4) device nodes. ti.4: Update the ti(4) man page to include information on the TI_JUMBO_HDRSPLIT and TI_PRIVATE_JUMBOS kernel options, and also include information about the new character device interface and the associated ioctls. man9/Makefile: Add jumbo.9 and zero_copy.9 man pages and associated links. jumbo.9: New man page describing the jumbo buffer allocator interface and operation. zero_copy.9: New man page describing the general characteristics of the zero copy send and receive code, and what an application author should do to take advantage of the zero copy functionality. NOTES: Add entries for ZERO_COPY_SOCKETS, TI_PRIVATE_JUMBOS, TI_JUMBO_HDRSPLIT, MSIZE, and MCLSHIFT. conf/files: Add uipc_jumbo.c and uipc_cow.c. conf/options: Add the 5 options mentioned above. kern_subr.c: Receive side zero copy implementation. This takes "disposable" pages attached to an mbuf, gives them to a user process, and then recycles the user's page. This is only active when ZERO_COPY_SOCKETS is turned on and the kern.ipc.zero_copy.receive sysctl variable is set to 1. uipc_cow.c: Send side zero copy functions. Takes a page written by the user and maps it copy on write and assigns it kernel virtual address space. Removes copy on write mapping once the buffer has been freed by the network stack. uipc_jumbo.c: Jumbo disposable page allocator code. This allocates (optionally) disposable pages for network drivers that want to give the user the option of doing zero copy receive. uipc_socket.c: Add kern.ipc.zero_copy.{send,receive} sysctls that are enabled if ZERO_COPY_SOCKETS is turned on. Add zero copy send support to sosend() -- pages get mapped into the kernel instead of getting copied if they meet size and alignment restrictions. uipc_syscalls.c:Un-staticize some of the sf* functions so that they can be used elsewhere. 
(uipc_cow.c) if_media.c: In the SIOCGIFMEDIA ioctl in ifmedia_ioctl(), avoid calling malloc() with M_WAITOK. Return an error if the M_NOWAIT malloc fails. The ti(4) driver and the wi(4) driver, at least, call this with a mutex held. This causes witness warnings for 'ifconfig -a' with a wi(4) or ti(4) board in the system. (I've only verified for ti(4)). ip_output.c: Fragment large datagrams so that each segment contains a multiple of PAGE_SIZE amount of data plus headers. This allows the receiver to potentially do page flipping on receives. if_ti.c: Add zero copy receive support to the ti(4) driver. If TI_PRIVATE_JUMBOS is not defined, it now uses the jumbo(9) buffer allocator for jumbo receive buffers. Add a new character device interface for the ti(4) driver for the new debugging interface. This allows (a patched version of) gdb to talk to the Tigon board and debug the firmware. There are also a few additional debugging ioctls available through this interface. Add header splitting support to the ti(4) driver. Tweak some of the default interrupt coalescing parameters to more useful defaults. Add hooks for supporting transmit flow control, but leave it turned off with a comment describing why it is turned off. if_tireg.h: Change the firmware rev to 12.4.11, since we're really at 12.4.11 plus fixes from 12.4.13. Add defines needed for debugging. Remove the ti_stats structure, it is now defined in sys/tiio.h. ti_fw.h: 12.4.11 firmware. ti_fw2.h: 12.4.11 firmware, plus selected fixes from 12.4.13, and my header splitting patches. Revision 12.4.13 doesn't handle 10/100 negotiation properly. (This firmware is the same as what was in the tree previously, with the addition of header splitting support.) sys/jumbo.h: Jumbo buffer allocator interface. sys/mbuf.h: Add a new external mbuf type, EXT_DISPOSABLE, to indicate that the payload buffer can be thrown away / flipped to a userland process. socketvar.h: Add prototype for socow_setup. 
tiio.h: ioctl interface to the character portion of the ti(4) driver, plus associated structure/type definitions. uio.h: Change prototype for uiomoveco() so that we'll know whether the source page is disposable. ufs_readwrite.c: Update for new prototype of uiomoveco(). vm_fault.c: In vm_fault(), check to see whether we need to do a page based copy on write fault. vm_object.c: Add a new function, vm_object_allocate_wait(). This does the same thing that vm_object_allocate() does, except that it gives the caller the opportunity to specify whether it should wait on the uma_zalloc() of the object structure. This allows vm objects to be allocated while holding a mutex. (Without generating WITNESS warnings.) vm_object_allocate() is implemented as a call to vm_object_allocate_wait() with the malloc flag set to M_WAITOK. vm_object.h: Add prototype for vm_object_allocate_wait(). vm_page.c: Add page-based copy on write setup, clear and fault routines. vm_page.h: Add page based COW function prototypes and variable in the vm_page structure. Many thanks to Drew Gallatin, who wrote the zero copy send and receive code, and to all the other folks who have tested and reviewed this code over the years.	2002-06-26 03:37:47 +00:00
Kirk McKusick	a7d50c22a6	Force the quota update to be done when an inode is released in ufs_inactive. This avoids a panic when checking a NULL credential in suser_cred().	2002-06-25 01:02:28 +00:00
Jonathan Lemon	c86c4abf99	Prototype fixes (long newinum --> ino_t newinum).	2002-06-24 17:20:19 +00:00
Maxime Henrion	cfbf0a4678	Warning fixes for 64 bits platforms. This eliminates all the warnings I have had in the FFS code on sparc64. Reviewed by: mckusick	2002-06-23 18:17:27 +00:00
Matthew Dillon	10cfbc1978	Rename the BALLOC flags from B_* to BA_* to avoid confusion with the struct buf B_ flags. Approved by: mckusick	2002-06-23 06:12:22 +00:00
Kirk McKusick	5006e77609	This patch fixes a problem whereby filesystems that ran out of inodes in a cylinder group would fail to check for free inodes in other cylinder groups. This bug was introduced in the UFS2 code merge two days ago. An inode is allocated by calling ffs_valloc which calls ffs_hashalloc to do the filesystem scan. Ffs_hashalloc walks around the cylinder groups calling its passed allocator (ffs_nodealloccg in this case) until the allocator returns a non-zero result. The bug is that ffs_hashalloc expects the passed allocator function to return a 64-bit ufs2_daddr_t. When allocating inodes, it calls ffs_nodealloccg which was returning a 32-bit ino_t. The ffs_hashalloc code checked a 64-bit return value and usually found random non-zero bits in the high 32-bits so decided that the allocation had succeeded (in this case in the only cylinder group that it checked). When the result was passed back to ffs_valloc it looked at only the bottom 32-bits, saw zero and declared the system out of inodes. But ffs_hashalloc had really only checked one cylinder group. The fix is to change ffs_nodealloccg to return 64-bit results. Sponsored by: DARPA & NAI Labs. Submitted by: Poul-Henning Kamp <phk@critter.freebsd.dk> Reviewed by: Maxime Henrion <mux@freebsd.org>	2002-06-22 21:24:58 +00:00
Kirk McKusick	1c85e6a35d	This commit adds basic support for the UFS2 filesystem. The UFS2 filesystem expands the inode to 256 bytes to make space for 64-bit block pointers. It also adds a file-creation time field, an ability to use jumbo blocks per inode to allow extent like pointer density, and space for extended attributes (up to twice the filesystem block size worth of attributes, e.g., on a 16K filesystem, there is space for 32K of attributes). UFS2 fully supports and runs existing UFS1 filesystems. New filesystems built using newfs can be built in either UFS1 or UFS2 format using the -O option. In this commit UFS1 is the default format, so if you want to build UFS2 format filesystems, you must specify -O 2. This default will be changed to UFS2 when UFS2 proves itself to be stable. In this commit the boot code for reading UFS2 filesystems is not compiled (see /sys/boot/common/ufsread.c) as there is insufficient space in the boot block. Once the size of the boot block is increased, this code can be defined. Things to note: the definition of SBSIZE has changed to SBLOCKSIZE. The header file <ufs/ufs/dinode.h> must be included before <ufs/ffs/fs.h> so as to get the definitions of ufs2_daddr_t and ufs_lbn_t. Still TODO: Verify that the first level bootstraps work for all the architectures. Convert the utility ffsinfo to understand UFS2 and test growfs. Add support for the extended attribute storage. Update soft updates to ensure integrity of extended attribute storage. Switch the current extended attribute interfaces to use the extended attribute storage. Add the extent like functionality (framework is there, but is currently never used). Sponsored by: DARPA & NAI Labs. Reviewed by: Poul-Henning Kamp <phk@freebsd.org>	2002-06-21 06:18:05 +00:00
Matthew Dillon	a37313d234	In rev 1.72 a situation related to write/mmap was fixed which could result in a user process gaining visibility into the 'old' contents of a filesystem block. There were two cases: (1) when uiomove() fails (user process issues illegal write), and (2) when uiomove() overlaps a mmap() of the same file at the same offset (fault -> recursive buffer I/O reads contents of old block). Unfortunately 1.72 also had the unintended effect of forcing the filesystem to do a read-before-write in the case of a full-block-write (non append case), e.g. 'dd if=/dev/zero of=test.dat bs=1m count=256 conv=notrunc'. This destroys performance: not only is a read forced for every write, but clustering breaks as well. The solution is to clear the buffer manually in the full-block case rather than asking BALLOC to do it (BALLOC issues the read-before-write). In the partial-block case we want BALLOC to do it because the read-before-write is necessary. This patch should greatly improve database and news-feed server performance. Found by: MKI <mki@mozone.net> MFC after: 3 days	2002-06-19 09:39:41 +00:00
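The full-block versus partial-block decision in a37313d234 might be sketched as follows. This is a Python model under stated assumptions, not the UFS write path: `write_block` and `read_old_block` are hypothetical names, and a zeroed `bytearray` stands in for clearing the buffer manually.

```python
def write_block(block_size, offset, data, read_old_block):
    """Return (new block contents, whether a read-before-write was needed)."""
    full_block = offset == 0 and len(data) == block_size
    if full_block:
        # Every byte will be overwritten: start from a cleared buffer so that
        # no old contents can leak, and skip the disk read entirely.
        buf = bytearray(block_size)
        did_read = False
    else:
        # Only part of the block changes: the old contents are required.
        buf = bytearray(read_old_block())
        did_read = True
    buf[offset:offset + len(data)] = data
    return bytes(buf), did_read


old = b"x" * 8
full = write_block(8, 0, b"y" * 8, lambda: old)
partial = write_block(8, 2, b"yy", lambda: old)
```

Starting from a cleared buffer keeps the safety property of rev 1.72 (no stale data is exposed if the copy fails part-way) without the extra read.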
Semen Ustimenko	13866b3fd2	Fix a typo in my recently added comment: s/beleived/believed/ Submitted by: keramida	2002-06-06 20:43:03 +00:00
Alfred Perlstein	ba5a4d6c02	Backout/modify previous revision: "empty default cases shouldn't be removed, they should have a break; statement added to them." Requested by: billf	2002-06-01 20:54:21 +00:00
Alfred Perlstein	37e1dd483d	Silence warnings, remove some empty 'default' switch cases.	2002-06-01 20:40:42 +00:00
Semen Ustimenko	f576a00d1b	Remove lock from ffs_vget introduced by v1.24. Instead of locking the vnode creation globally, we allow processes to create vnodes concurrently. In case of concurrent creation of a vnode for the same ino, we allow processes to race and then check who wins. Assuming that concurrent creation of a vnode for the same ino is a really rare case, this is believed to be an improvement, as it allows concurrent creation of vnodes. Idea by: bp Reviewed by: dillon MFC after: 1 month	2002-05-30 22:04:17 +00:00
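The "race, then check who wins" pattern from f576a00d1b can be illustrated outside the kernel. The sketch below is an assumption-laden Python analogue: `VnodeTable`, `vget`, and `build` are hypothetical, and a small lock around the table stands in for whatever protects the real inode hash.

```python
import threading


class VnodeTable:
    """Per-inode object cache that does not serialise object creation."""

    def __init__(self):
        self._lock = threading.Lock()
        self._by_ino = {}

    def vget(self, ino, build):
        with self._lock:
            found = self._by_ino.get(ino)
        if found is not None:
            return found
        # Slow construction happens without holding the lock, so several
        # callers may build a candidate for the same ino at once.
        candidate = build(ino)
        with self._lock:
            # Whoever inserts first wins; a loser discards its candidate
            # and uses the winner's object instead.
            return self._by_ino.setdefault(ino, candidate)


table = VnodeTable()


def racing_build(ino):
    # Simulate another caller completing vget() for the same ino while
    # this caller is still busy building its own candidate.
    table.vget(ino, lambda _ino: "winner")
    return "loser"


result = table.vget(5, racing_build)
```

The lock is held only for the lookup and the final insert, which is the trade the commit message argues for: duplicate work in the rare collision, no global serialisation in the common case.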
Robert Watson	2bab796d96	Remove IFS from 5.0-CURRENT. This facilitates introducing UFS2 as IFS had its fingers deep in the belly of the UFS/FFS split. IFS will be reimplemented by the maintainer at a later date. Requested by: adrian (maintainer)	2002-05-19 00:11:08 +00:00
Ian Dowse	ed6ca8732c	Fix two casts to "daddr_t " that should have been "ufs_daddr_t ".	2002-05-18 19:03:00 +00:00
Ian Dowse	e116910b8d	Fix a typo where sizeof(daddr_t) was specified instead of sizeof(doff_t). Now that daddr_t is 64-bit, this caused hash blocks to be allocated twice as large as they need to be.	2002-05-18 18:58:27 +00:00
Ian Dowse	00b162d018	Remove um_i_effnlink_valid, i_spare[] and the ufsmount_u and inode_u unions, since these were only necessary when ext2fs used ufs code. Reviewed by: mckusick	2002-05-18 18:51:14 +00:00
Poul-Henning Kamp	8fdbc99b69	Fix ufs_daddr_t/daddr_t type problems. Sponsored by: DARPA & NAI labs.	2002-05-17 18:59:53 +00:00
Poul-Henning Kamp	c7ffbdd995	Call ufs_bmaparray() with right parameter type. Sponsored by: DARPA & NAI Labs.	2002-05-17 18:53:29 +00:00
Tom Rhodes	d394511de3	More s/file system/filesystem/g	2002-05-16 21:28:32 +00:00
Poul-Henning Kamp	98b0c78978	Make daddr_t and u_daddr_t 64bits wide. Retire daddr64_t and use daddr_t instead. Sponsored by: DARPA & NAI Labs.	2002-05-14 11:09:43 +00:00
Poul-Henning Kamp	05f4ff5da1	Remove register keyword. Sponsored by: DARPA & NAI Labs. Submitted by: mckusick	2002-05-13 09:22:31 +00:00
Poul-Henning Kamp	2b2df79fad	Remove two "register" and a blank line. Submitted by: mckusick Sponsored by: DARPA & NAI Labs.	2002-05-12 22:54:48 +00:00
Poul-Henning Kamp	7110af7577	ARGH! SBLOCK is not unused. Try to get this right. BBSIZE belongs in <sys/disklabel.h> (but shouldn't be a constant). Define SBLOCK again, using the right math. Sponsored by: DARPA & NAI Labs.	2002-05-12 20:21:40 +00:00
Poul-Henning Kamp	7cb71b749c	Remove #define for BBOFF, it is assumed == 0 so many places that we might as well forget about it. In fact the only thing which used it was the SBOFF macro. Sponsored by: DARPA & NAI Labs.	2002-05-12 20:00:21 +00:00
Poul-Henning Kamp	16910634dd	Remove unused BBLOCK and SBLOCK #defines. Sponsored by: DARPA & NAI Labs.	2002-05-12 19:56:31 +00:00
Alan Cox	c0b6bbb80b	o Condition the compilation and use of vm_freeze_copyopts() on ENABLE_VFS_IOOPT.	2002-05-06 05:45:57 +00:00
Poul-Henning Kamp	d08961bec3	Move some UFS related stuff home where it belongs.	2002-05-05 20:04:33 +00:00
Jeff Roberson	5df148630f	Include systm.h so panic(9) is defined when doing DEBUG_ALL_VFS_LOCKS.	2002-05-04 02:40:37 +00:00
Poul-Henning Kamp	afe564a200	Name ufs_vop_[gs]etextattr() consistently with the rest of our VOPs and put them in the ufs_vnops where they belong, rather than in the ffs_vnops. Ok'ed by: rwatson Sponsored by: DARPA & NAI Labs.	2002-05-03 08:40:33 +00:00
Poul-Henning Kamp	d65b3c73d7	Use vop_panic() instead of our home-rolled version.	2002-05-02 19:15:52 +00:00
Alfred Perlstein	5a6ce14c42	Remove support for using soon to be retired "special" poll(2) ops. Replace with kevent(2) ops. This is untested, but the code would rot even further if this wasn't applied. I've chosen to apply this to prompt some cleanup. Submitted by: bde	2002-04-18 14:52:28 +00:00
Jeff Roberson	5dacf95488	Don't peek into the malloc_type structure for limits. The desired vnodes check should be sufficient. This is required for the pending removal of malloc_type limits.	2002-04-15 03:35:35 +00:00
Poul-Henning Kamp	2dd527b3ac	Move generic disk ioctls from <sys/disklabel.h> to <sys/disk.h>. Sponsored by: DARPA & NAI Labs	2002-04-08 09:20:07 +00:00
John Baldwin	6008862bc2	Change callers of mtx_init() to pass in an appropriate lock type name. In most cases NULL is passed, but in some cases such as network driver locks (which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used. Tested on: i386, alpha, sparc64	2002-04-04 21:03:38 +00:00
Poul-Henning Kamp	a463023d6d	Move the FFS parameter MAXFRAG from <sys/param.h> to <ufs/ffs/fs.h> Sponsored by: DARPA & NAI Labs.	2002-04-03 20:39:27 +00:00
Poul-Henning Kamp	46a67eaced	Use DIOCGSECTORSIZE instead of the bogus DIOCGPART ioctl.	2002-04-02 11:23:14 +00:00
John Baldwin	44731cab3b	Change the suser() API to take advantage of td_ucred as well as do a general cleanup of the API. The entire API now consists of two functions similar to the pre-KSE API. The suser() function takes a thread pointer as its only argument. The td_ucred member of this thread must be valid so the only valid thread pointers are curthread and a few kernel threads such as thread0. The suser_cred() function takes a pointer to a struct ucred as its first argument and an integer flag as its second argument. The flag is currently only used for the PRISON_ROOT flag. Discussed on: smp@	2002-04-01 21:31:13 +00:00
Bruce Evans	0508986cce	In ffs_mountffs(), set mnt_iosize_max to si_iosize_max unconditionally provided the latter is nonzero. At this point, the former is a fairly arbitrary default value (DFLTPHYS), so changing it to any reasonable value specified by the device driver is safe. Using the maximum of these limits broke ffs clustered i/o for devices whose si_iosize_max is < DFLTPHYS. Using the minimum would break device drivers' ability to increase the active limit from DFLTPHYS up to MAXPHYS. Copied the code for this and the associated (unnecessary?) fixup of mnt_iosize_max to all other filesystems that use clustering (ext2fs and msdosfs). It was completely missing. PR: 36309 MFC-after: 1 week	2002-03-30 15:12:57 +00:00
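The three policies weighed in 0508986cce can be compared side by side. This is a sketch only: the function names are hypothetical, the values 64 KB for DFLTPHYS and 128 KB for MAXPHYS are assumptions about the constants of that era, and the final clamp models the "associated fixup" the message mentions.

```python
DFLTPHYS = 64 * 1024    # assumed default transfer size
MAXPHYS = 128 * 1024    # assumed hard upper bound


def iosize_after_fix(si_iosize_max, mnt_iosize_max=DFLTPHYS):
    """Take the driver's limit whenever it specifies one, then clamp."""
    if si_iosize_max != 0:
        mnt_iosize_max = si_iosize_max
    return min(mnt_iosize_max, MAXPHYS)


def iosize_max_policy(si_iosize_max, mnt_iosize_max=DFLTPHYS):
    """Policy described as broken: can exceed what a small device accepts."""
    return max(mnt_iosize_max, si_iosize_max)


def iosize_min_policy(si_iosize_max, mnt_iosize_max=DFLTPHYS):
    """Rejected alternative: a driver can never raise the limit."""
    if si_iosize_max == 0:
        return mnt_iosize_max
    return min(mnt_iosize_max, si_iosize_max)


small_device = 32 * 1024
large_device = 128 * 1024
```

For the 32 KB device the maximum policy yields 64 KB, more than the device accepts; for the 128 KB device the minimum policy stays stuck at 64 KB. Taking the driver's value unconditionally handles both cases.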
David Malone	527f5ce021	Two minor changes to dirhash, which result in some marginal benchmark improvements. 1) If deleting an entry results in a chain of deleted slots ending in an empty slot, then we can be a bit more aggressive about marking slots as empty. 2) The last stage of the FNV hash is to xor the last byte of data into the hash. This means that filenames which differ only in the last byte will be placed close to one another in the hash table, which forms longer chains. To work around this common case, we also hash in the address of the dirhash structure. news/cancel = news/articles/control/cancel for a tradspool inn server squid2 = squid level 2 directory (dirs called 00->FF) squid3 = squid level 3 directory (files called 00001F00->00001FFF) mean #probes for home dir mh inbox news/cancel tmp squid2 squid3 old successful 1.02 3.19 4.07 1.10 7.85 2.06 new successful 1.04 1.32 1.27 1.04 1.93 1.17 old unsuccessful 1.08 4.50 5.37 1.17 10.76 2.69 new unsuccessful 1.08 1.73 1.64 1.17 2.89 1.37 Reviewed by: iedowse MFC after: 2 weeks	2002-03-20 17:58:02 +00:00
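The clustering effect described in 527f5ce021 follows from the shape of the FNV-1 loop, where each byte is XORed in after the multiply. A minimal Python sketch, assuming the standard 32-bit FNV-1 constants; `dirhash_hash` and the structure address are hypothetical illustrations, not the dirhash code itself:

```python
FNV_32_PRIME = 0x01000193
FNV1_32_INIT = 0x811C9DC5


def fnv1_32(data, hval=FNV1_32_INIT):
    """32-bit FNV-1: multiply by the prime, then XOR in the next byte."""
    for byte in data:
        hval = (hval * FNV_32_PRIME) & 0xFFFFFFFF
        hval ^= byte
    return hval


def dirhash_hash(name, dh_address):
    """Hash the name, then continue hashing over the structure's address."""
    addr_bytes = dh_address.to_bytes(8, "little")
    return fnv1_32(addr_bytes, fnv1_32(name))


a, b = b"00001F00", b"00001F01"         # squid-style names, last byte differs
plain_diff = fnv1_32(a) ^ fnv1_32(b)    # only the low 8 bits can differ
mixed_a = dirhash_hash(a, 0xC1234560)   # hypothetical address
mixed_b = dirhash_hash(b, 0xC1234560)
```

Because the last byte is XORed in with no multiply after it, two names that differ only in that byte produce hashes that differ only in the low 8 bits. Hashing further bytes afterwards pushes that difference through additional multiplies, which is what spreads the entries apart.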
Jeff Roberson	e2f8f8a6b6	Remove references to vm_zone.h and switch over to the new uma API.	2002-03-20 08:48:07 +00:00
Alfred Perlstein	6f1e855112	Remove __P.	2002-03-19 22:40:48 +00:00
Bruce Evans	367b50a28f	Fixed some printf format errors (hopefully all of the remaining daddr64_t ones for GENERIC, and all others on the same line as those). Reformat the printfs if necessary to avoid new long lines or old format printf errors.	2002-03-19 04:09:21 +00:00
Kirk McKusick	a0595d0249	Add a flags parameter to VFS_VGET to pass through the desired locking flags when acquiring a vnode. The immediate purpose is to allow polling lock requests (LK_NOWAIT) needed by soft updates to avoid deadlock when enlisting other processes to help with the background cleanup. For the future it will allow the use of shared locks for read access to vnodes. This change touches a lot of files as it affects most filesystems within the system. It has been well tested on FFS, loopback, and CD-ROM filesystems, but only lightly on the others, so if you find a problem there, please let me (mckusick@mckusick.com) know.	2002-03-17 01:25:47 +00:00
Kirk McKusick	0d2af52141	Introduce the new 64-bit size disk block, daddr64_t. Change the bio and buffer structures to have daddr64_t bio_pblkno, b_blkno, and b_lblkno fields which allows access to disks larger than a Terabyte in size. This change also requires that the VOP_BMAP vnode operation accept and return daddr64_t blocks. This delta should not affect system operation in any way. It merely sets up the necessary interfaces to allow the development of disk drivers that work with these larger disk block addresses. It also allows for the development of UFS2 which will use 64-bit block addresses.	2002-03-15 18:49:47 +00:00
David E. O'Brien	f0c8652ed4	Quiet a warning on the Alpha.	2002-03-15 04:06:10 +00:00
Kirk McKusick	9721068f95	This corrects the first of two known deadlock conditions that come from the presence of a snapshot file.	2002-03-14 01:21:13 +00:00
Ian Dowse	23bd68a426	Fix a bug in ufsdirhash_adjfree() that caused it to incorrectly update the free-space statistics in some cases. The problem affected directory blocks when the free space dropped below the size of the maximum allowed entry size. When this happened, the free-space summary information could claim that there are no further blocks that can fit a maximum-size entry, even if there are. The effect of this bug is that the directory may be enlarged even though there is space within the directory for the new entry. This wastes disk space and has a negative impact on performance. Fix it by correctly computing the dh_firstfree array index, adding a helper macro for clarity. Put an extra sanity check into ufsdirhash_checkblock() to detect the situation in future. Found by: dwmalone Reviewed by: dwmalone MFC after: 1 week	2002-03-11 19:13:22 +00:00
Poul-Henning Kamp	063f776327	I missed one VOP_CLOSE in the previous commit. Pointed out by: bde	2002-03-11 16:27:04 +00:00
Poul-Henning Kamp	3dbceccb78	As a XXX bandaid open the mounted device READ/WRITE even if we only mount read-only. The trouble here is that we don't reopen the device in read/write mode when we remount in read/write mode resulting in a filesystem sending write requests to a device which was only opened read/only. I'm not quite sure how such a reopen would best be done and defer the problem to more agile hackers.	2002-03-11 13:53:00 +00:00
Robert Watson	409b188022	Update DBA for NAI. We have several. We used the wrong one. :-)	2002-03-07 17:49:06 +00:00
Brian Feldman	9d9737ecb2	Add new errno ``ENOATTR''.	2002-03-07 15:13:44 +00:00
Matthew Dillon	2cfaf1e315	cleanup readability syntax prior to ongoing b_resid work commits. MFC after: 1 day	2002-03-06 00:44:30 +00:00
John Baldwin	fdcc1cc09f	Use thread0.td_ucred instead of proc0.p_ucred. This change is cosmetic and isn't strictly required. However, it lowers the number of false positives found when grep'ing the kernel sources for p_ucred to ensure proper locking.	2002-02-27 19:18:10 +00:00
John Baldwin	a854ed9893	Simple p_ucred -> td_ucred changes to start using the per-thread ucred reference.	2002-02-27 18:32:23 +00:00
Poul-Henning Kamp	986066d065	Replace bowrite() with BUF_WRITE in ufs. Remove bowrite(), it is now unused. This is the first step in getting entirely rid of BIO_ORDERED which is a generally accepted evil thing. Approved by: mckusick	2002-02-22 09:03:00 +00:00
Robert Watson	15b27e726e	o Minor style fix on #endif, missing '_' in comment.	2002-02-20 15:44:43 +00:00
Poul-Henning Kamp	68edc1b939	Make v_addpollinfo() visible and non-inline. Have callers only call it as needed. Add necessary call in ufs_kqfilter(). Test-case found by: Andrew Gallatin <gallatin@cs.duke.edu>	2002-02-18 16:18:02 +00:00
Poul-Henning Kamp	4b55dbe36b	Move the stuff related to select and poll out of struct vnode. The use of the zone allocator may or may not be overkill. There is an XXX: over in ufs/ufs/ufs_vnops.c that jlemon may need to revisit. This shaves about 60 bytes of struct vnode which on my laptop means 600k less RAM used for vnodes.	2002-02-17 21:15:36 +00:00
Poul-Henning Kamp	e8b26e995e	Collect the VN_KNOTE() macro definitions on vnode.h	2002-02-17 21:07:57 +00:00
Julian Elischer	2c1007663f	In a threaded world, different priorities become properties of different entities. Make it so. Reviewed by: jhb@freebsd.org (john baldwin)	2002-02-11 20:37:54 +00:00
Robert Watson	cfcd3c783e	Minor style tweaks. Remove an unneeded comment and commented out code that won't be needed.	2002-02-10 04:57:08 +00:00
Robert Watson	41d5a43fa1	Copyright + license update.	2002-02-10 04:50:24 +00:00
Robert Watson	74237f55b0	Part I: Update extended attribute API and ABI: o Modify the system call syntax for extattr_{get,set}_{fd,file}() so as not to use the scatter gather API (which appeared not to be used by any consumers, and to be less portable); instead, they accept 'data' and 'nbytes' in the style of other simple read/write interfaces. This changes the API and ABI. o Modify system call semantics so that extattr_get_{fd,file}() return a size_t. When performing a read, the number of bytes read will be returned, unless the data pointer is NULL, in which case the number of bytes of data available is returned. This changes the API only. o Modify the VOP_GETEXTATTR() vnode operation to accept a size_t * argument so as to return the size, if desirable. If set to NULL, the size will not be returned. o Update various filesystems (pseudofs, ufs) to DTRT. These changes should make extended attributes more useful and more portable. More commits to rebuild the system call files, as well as update userland utilities to follow. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs	2002-02-10 04:43:22 +00:00
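The return-value convention from 74237f55b0 (bytes read, or the attribute's size when no buffer is supplied) can be modelled as below. This is a Python analogue, not the system call: `extattr_get`, the dictionary store, and the truncation behaviour for a short buffer are assumptions made for illustration.

```python
def extattr_get(attrs, name, buf=None):
    """Return bytes copied into buf, or the attribute size if buf is None."""
    value = attrs[name]
    if buf is None:
        # No data pointer: report how many bytes a read would need.
        return len(value)
    # Assumed behaviour: copy as much as fits in the caller's buffer.
    n = min(len(buf), len(value))
    buf[:n] = value[:n]
    return n


attrs = {"comment": b"hello"}
needed = extattr_get(attrs, "comment")          # size query
buf = bytearray(needed)
copied = extattr_get(attrs, "comment", buf)     # actual read
```

A caller can therefore size its buffer with one call and fetch the data with a second, as with other simple read-style interfaces.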
Poul-Henning Kamp	b6e1c37356	Remove di_inumber since LFS is long gone.	2002-02-10 00:55:49 +00:00
Kirk McKusick	b06051cf7c	Occasionally background fsck would cause a spurious ``freeing free inode'' panic. This change corrects that problem by setting the fs_active flag when the inode map changes to notify the snapshot code that the cylinder group must be rescanned. Submitted by: Robert Watson <rwatson@FreeBSD.org>	2002-02-07 22:13:56 +00:00
Kirk McKusick	cfdaa88697	Occasionally deleted files would hang around for hours or days without being reclaimed. This bug was introduced in revision 1.95 dealing with filenames placed in newly allocated directory blocks, thus is not present in 4.X systems. The bug is triggered when a new entry is made in a directory after the data block containing the original new entry has been written, but before the inode that references the data block has been written. Submitted by: Bill Fenner <fenner@research.att.com>	2002-02-07 00:54:32 +00:00
Kirk McKusick	c9f96392c7	When taking a snapshot, we must check for active files that have been unlinked (e.g., with a zero link count). We have to expunge all trace of these files from the snapshot so that they are neither reclaimed prematurely by fsck nor saved unnecessarily by dump.	2002-02-02 01:42:44 +00:00
Kirk McKusick	7b60855308	Add a stub for softdep_request_cleanup() so that compilation without SOFTUPDATES option works properly. Submitted by: Benno Rice <benno@jeamland.net>	2002-01-23 02:18:56 +00:00
Kirk McKusick	03a2057a5b	This patch fixes a long standing complaint with soft updates in which small and/or nearly full filesystems would fail with `file system full' messages when trying to replace a number of existing files (for example during a system installation). When the allocation routines are about to fail with a file system full condition, they make a call to softdep_request_cleanup() which attempts to accelerate the flushing of pending deletion requests in an effort to free up space. In the face of filesystem I/O requests that exceed the available disk transfer capacity, the cleanup request could take an unbounded amount of time. Thus, the softdep_request_cleanup() routine will only try for tickdelay seconds (default 2 seconds) before giving up and returning a filesystem full error. Under typical conditions, the softdep_request_cleanup() routine is able to free up space in under fifty milliseconds.	2002-01-22 06:17:22 +00:00
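The time-bounded cleanup in 03a2057a5b amounts to a retry loop with a deadline. The following Python sketch assumes hypothetical callbacks (`flush_one`, `space_available`) and an injectable clock; only the 2-second default comes from the commit message.

```python
import time


def request_cleanup(flush_one, space_available, budget_seconds=2.0,
                    clock=time.monotonic):
    """Flush pending deletions until space appears or the budget runs out."""
    deadline = clock() + budget_seconds
    while clock() < deadline:
        if space_available():
            return True
        if not flush_one():
            break                       # nothing left that could free space
    # Out of time or out of work: the caller reports "filesystem full"
    # unless space turned up on the last flush.
    return space_available()


# Example: three pending deletions, each freeing one block; two are needed.
pending = [1, 1, 1]
freed = [0]


def flush_one():
    if not pending:
        return False
    freed[0] += pending.pop()
    return True


succeeded = request_cleanup(flush_one, lambda: freed[0] >= 2)
```

Bounding the loop by time rather than by work keeps the allocation path from stalling indefinitely when the disk cannot keep up with the I/O load.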
Kirk McKusick	99bef8782b	Fix a bug introduced in ffs_snapshot.c -r1.25 and fs.h -r1.26 which caused incomplete snapshots to be taken. When background fsck would run on these snapshots, the result would be files being incorrectly released which would subsequently panic the kernel with ``handle_workitem_freefile: inodedep survived'', ``handle_written_inodeblock: live inodedep'', and ``handle_workitem_remove: lost inodedep'' errors.	2002-01-17 08:33:32 +00:00
Kirk McKusick	8af31e7b46	Put write on read-only filesystem panic after we have weeded out block and character devices, fifo's, etc. Submitted by: Bruce Evans <bde@zeta.org.au>	2002-01-16 04:59:09 +00:00
Kirk McKusick	cd6005961f	When downgrading a filesystem from read-write to read-only, operations involving file removal or file update were not always being fully committed to disk. The result was lost files or corrupted file data. This change ensures that the filesystem is properly synced to disk before the filesystem is down-graded. This delta also fixes a long standing bug in which a file open for reading has been unlinked. When the last open reference to the file is closed, the inode is reclaimed by the filesystem. Previously, if the filesystem had been down-graded to read-only, the inode could not be reclaimed, and thus was lost and had to be later recovered by fsck. With this change, such files are found at the time of the down-grade. Normally they will result in the filesystem down-grade failing with `device busy'. If a forcible down-grade is done, then the affected files will be revoked causing the inode to be released and the open file descriptors to begin failing on attempts to read. Submitted by: "Sam Leffler" <sam@errno.com>	2002-01-15 07:17:12 +00:00
Alfred Perlstein	426da3bcfb	SMP Lock struct file, filedesc and the global file list. Seigo Tanimura (tanimura) posted the initial delta. I've polished it quite a bit reducing the need for locking and adapting it for KSE. Locks: 1 mutex in each filedesc protects all the fields. protects "struct file" initialization, while a struct file is being changed from &badfileops -> &pipeops or something the filedesc should be locked. 1 mutex in each struct file protects the refcount fields. doesn't protect anything else. the flags used for garbage collection have been moved to f_gcflag which was the FILLER short, this doesn't need locking because the garbage collection is a single threaded container. could likely be made to use a pool mutex. 1 sx lock for the global filelist. struct file *fhold(struct file *fp); /* increments reference count on a file */ struct file *fhold_locked(struct file *fp); /* like fhold but expects file to be locked */ struct file *ffind_hold(struct thread *, int fd); /* finds the struct file in thread, adds one reference and returns it unlocked */ struct file *ffind_lock(struct thread *, int fd); /* ffind_hold, but returns file locked */ I still have to smp-safe the fget cruft, I'll get to that asap.	2002-01-13 11:58:06 +00:00
Kirk McKusick	0bc7a833ec	When going to sleep, we must save our SPL so that it does not get lost if some other process uses the lock while we are sleeping. We restore it after we have slept. This functionality is provided by a new routine interlocked_sleep() that wraps the interlocking with functions that sleep. This function is then used in place of the old ACQUIRE_LOCK_INTERLOCKED() and FREE_LOCK_INTERLOCKED() macros. Submitted by: Debbie Chu <dchu@juniper.net>	2002-01-12 20:57:36 +00:00
Kirk McKusick	794ef3471f	Must call drain_output() before checking the dirty block list in softdep_sync_metadata(). Otherwise we may miss dependencies that need to be flushed which will result in a later panic with the message ``vinvalbuf: dirty bufs''. Submitted by: Matthew Dillon <dillon@apollo.backplane.com> MFC after: 1 week	2002-01-11 19:59:27 +00:00
Poul-Henning Kamp	9c643340bb	Do not pull quota entries off the cache-list if they have already been removed from the cache-list as part of a previous unmount. This would result in panics (page fault in dqflush()) during subsequent umounts provided that enough distinct UID's to actually make the hash do something are active. This can probably explain a number of weird quota related behaviours. PR: 32331 maybe more. Reproduced by: Søren Schrørder <sch@cybercity.dk>	2002-01-10 15:02:57 +00:00
Mike Smith	b9a4338d29	Initialise the bioops vector hack at runtime rather than at link time. This avoids the use of common variables. Reviewed by: mckusick	2002-01-08 19:32:18 +00:00
Matthew Dillon	23b590188f	Fix a BUF_TIMELOCK race against BUF_LOCK and fix a deadlock in vget() against VM_WAIT in the pageout code. Both fixes involve adjusting the lockmgr's timeout capability so locks obtained with timeouts do not interfere with locks obtained without a timeout. Hopefully MFC: before the 4.5 release	2001-12-20 22:42:27 +00:00
Kirk McKusick	f305c5d199	Change the atomic_set_char to atomic_set_int and atomic_clear_char to atomic_clear_int to ease the implementation for the sparc64. Requested by: Jake Burkholder <jake@locore.ca>	2001-12-18 18:05:17 +00:00
Ian Dowse	143a5346c9	Make sure we ignore the value of `fs_active' when reloading the superblock, and move the initialisation of it to beside where other pointer fields are initialised.	2001-12-16 18:54:09 +00:00
Ian Dowse	3fa4044e34	Move the new superblock field `fs_active' into the region of the superblock that is already set up to handle pointer types. This fixes an accidental change in the superblock size on 64-bit platforms caused by revision 1.24.	2001-12-16 18:51:11 +00:00
Kirk McKusick	cc5a92334f	Minimize the time necessary to suspend operations on a filesystem when taking a snapshot. The two time consuming operations are scanning all the filesystem bitmaps to determine which blocks are in use and scanning all the other snapshots so as to be able to expunge their blocks from the view of the current snapshot. The bitmap scanning is broken into two passes. Before suspending the filesystem all bitmaps are scanned. After the suspension, those bitmaps that changed after being scanned the first time are rescanned. Typically there are few bitmaps that need to be rescanned. The expunging of other snapshots is now done after the suspension is released by observing that we can easily identify any blocks that were allocated to them after the suspension (they will be marked as `not needing to be copied' in the just created snapshot). For all the gory details, see the ``Running fsck in the Background'' paper in the Usenix BSDCon 2002 Conference Proceedings, pages 55-64.	2001-12-14 00:15:06 +00:00
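The two-pass bitmap scan in cc5a92334f can be sketched as follows. This is a simplified Python model, not the snapshot code: `take_snapshot`, the per-group `scanned_ok` flags, and the `suspend`/`resume` callbacks are hypothetical, and writers are assumed to clear a group's flag whenever they change its bitmap.

```python
def take_snapshot(bitmaps, scanned_ok, suspend, resume):
    """Copy every bitmap, holding the filesystem suspended only for rescans."""
    copy = [None] * len(bitmaps)

    # Pass 1: the filesystem is still running while every group is scanned.
    for cg in range(len(bitmaps)):
        scanned_ok[cg] = True           # a later write will clear this flag
        copy[cg] = list(bitmaps[cg])

    # Pass 2: with writes suspended, rescan only the groups that changed.
    suspend()
    rescanned = []
    try:
        for cg in range(len(bitmaps)):
            if not scanned_ok[cg]:
                copy[cg] = list(bitmaps[cg])
                scanned_ok[cg] = True
                rescanned.append(cg)
    finally:
        resume()
    return copy, rescanned


bitmaps = [[0, 0], [0, 0], [0, 0]]
flags = [False, False, False]


def late_write_then_suspend():
    # Model a write to group 1 that lands after pass 1 scanned it but
    # before the suspension takes effect.
    bitmaps[1][0] = 1
    flags[1] = False


snapshot, rescanned = take_snapshot(bitmaps, flags, late_write_then_suspend,
                                    lambda: None)
```

Only the one group touched after its first scan is visited while the filesystem is suspended, which is why the suspension window stays short when few bitmaps change.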
Kirk McKusick	9db12e5108	When a file is partially truncated, we first check to see if the new file end will land in the middle of a file hole. Since the last block of a file must always be allocated, the hole is filled by allocating a block at that location. If the hole being filled is a direct block, then the truncation may eventually reduce the full sized block down to a fragment. When running with soft updates, it is necessary to FSYNC the file after allocating the block and before creating the fragment to avoid triggering a soft updates inconsistency when the block unexpectedly shrinks. Found by: Matthew Dillon <dillon@apollo.backplane.com> MFC after: 1 week	2001-12-13 05:07:48 +00:00
Robert Watson	24373ce6ed	Use 'mkdir -p /.attribute/system' instead of breaking it into two separate mkdir targets. Submitted by: jedgar	2001-11-30 15:32:07 +00:00
Robert Watson	cff9580525	Use 'mkdir -p /.attribute/system' instead of breaking it into two separate mkdir targets.	2001-11-30 15:21:20 +00:00
Robert Watson	15f1c8d3d2	README.extattr incorrectly specified sample command lines for UFS_EXTATTR_AUTOSTART. Insert the missing 'initattr' arguments to extattrctl. Noticed by: green	2001-11-30 15:15:27 +00:00
Guido van Rooij	40e294f796	When mkdir()-ing, the parent dir gets its link count increased. Fix VN_KNOTE to reflect that. Found by: tobez@freebsd.org MFC after: 2 days	2001-11-22 15:33:12 +00:00
Ian Dowse	4202b366fc	Oops, when trying the dirhash sequential-access optimisation, compare the slot offset against the predicted offset, not a boolean flag. This typo effectively disabled the sequential optimisation, but was otherwise harmless. Not surprisingly, fixing this improves performance in the sequential access case. I am seeing a 7% speedup on one machine here; using dirhash when sequentially looking up directory entries is now about 5% faster instead of 2% slower than the non-dirhash case. Submitted by: KOIE Hidetaka <koie@suri.co.jp> MFC after: 1 week	2001-11-14 15:08:07 +00:00
Matthew Dillon	7e76bb562e	Implement IO_NOWDRAIN and B_NOWDRAIN - prevents the buffer cache from blocking in wdrain during a write. This flag needs to be used in devices whose strategy routines turn-around and issue another high level I/O, such as when MD turns around and issues a VOP_WRITE to vnode backing store, in order to avoid deadlocking the dirty buffer draining code. Remove a vprintf() warning from MD when the backing vnode is found to be in-use. The syncer of buf_daemon could be flushing the backing vnode at the time of an MD operation so the warning is not correct. MFC after: 1 week	2001-11-05 18:48:54 +00:00
Robert Watson	6d8785434f	o Update copyright dates. o Add reference to TrustedBSD Project in license header. o Update dated comments, including comment in extattr.h claiming that no file systems support extended attributes. o Improve comment consistency.	2001-11-01 21:37:07 +00:00
Robert Watson	b6e0472987	o Although this is not specified in POSIX.1e, the UFS ACL implementation coerces the deletion of a default ACL on a directory when no default ACL EA is present to success. Because the UFS EA implementation doesn't distinguish the EA failure modes "that EA name has not been administratively enabled" from "that EA name has no defined data", there's a potential conflict in error return values. Normally, the lack of administratively configured EA support is coerced to EOPNOTSUPP to indicate that ACLs are not available; in this case, it is possible to get a successful return, even if ACLs are not available because EA support for them has not been enabled. Expand the comment in ufs_setacl() to identify this case. Obtained from: TrustedBSD Project	2001-10-27 05:39:17 +00:00
Robert Watson	ac8b3dd7dc	o Clarify a comment about the locking condition of the vnode upon exit from ufs_extattr_enable_with_open(). o Print auto-start notifications if (bootverbose). This was previously commented out since it didn't know how to check for bootverbose. o Drop in comments throughout indicating where ENOENT should be replaced with ENOATTR once that is available. Obtained from: TrustedBSD Project	2001-10-27 05:19:14 +00:00
Robert Watson	29543004bd	o The comment about ordering the destruction of the lock and the removal of the flag indicating that the structure was initialized didn't need an XXX, since it didn't need fixing. Obtained from: TrustedBSD Project	2001-10-27 05:05:39 +00:00
Robert Watson	9444746795	o Wrap a number of long lines of code, many of which were introduced due to KSE-related (p) expansions. Obtained from: TrustedBSD Project	2001-10-27 05:03:05 +00:00
Robert Watson	ce5ddec25f	Since namespace support was added to the UFS extended attribute implementation to replace single-character namespace prefixes, '$' is no longer an invalid attribute name, and the namespace is relevant to validity determination. o Remove '$' case from ufs_extattr_valid_attrname() o Add attrnamespace argument to ufs_extattr_valid_attrname(), and fill out appropriately. Currently no decisions are made based on the namespace argument, but may be in the future. Obtained from: TrustedBSD Project	2001-10-27 04:58:28 +00:00
Matthew Dillon	245df27cee	Implement kern.maxvnodes. adjusting kern.maxvnodes now actually has a real effect. Optimize vfs_msync(). Avoid having to continually drop and re-obtain mutexes when scanning the vnode list. Improves looping case by 500%. Optimize ffs_sync(). Avoid having to continually drop and re-obtain mutexes when scanning the vnode list. This makes a couple of assumptions, which I believe are ok, in regards to vnode stability when the mount list mutex is held. Improves looping case by 500%. (more optimization work is needed on top of these fixes) MFC after: 1 week	2001-10-26 00:08:05 +00:00
Ian Dowse	71fc5e11c7	Default to not performing ufs_dirhash's extensive directory-block sanity check after every directory modification. This check can be re-enabled at any time by setting the sysctl "vfs.ufs.dirhash_docheck" to 1. This group of sanity tests was there to ensure that any UFS_DIRHASH bugs could be caught by a panic before a potentially corrupted directory block would be written to disk. It has served its main purpose now, so disable it in the interest of performance. MFC after: 1 week	2001-10-25 22:55:59 +00:00
Matthew Dillon	c72ccd014d	Change the vnode list under the mount point from a LIST to a TAILQ in preparation for an implementation of limiting code for kern.maxvnodes. MFC after: 3 days	2001-10-23 01:21:29 +00:00
John Baldwin	bd78cece5d	Change the kernel's ucred API as follows: - crhold() returns a reference to the ucred whose refcount it bumps. - crcopy() now simply copies the credentials from one credential to another and has no return value. - a new crshared() primitive is added which returns true if a ucred's refcount is > 1 and false (0) otherwise.	2001-10-11 23:38:17 +00:00
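The three primitives described in the commit above can be illustrated with a small userland model. This is only a sketch: `struct cred_sketch` and the `cred_*` functions are hypothetical stand-ins, not the kernel's `struct ucred` or its real functions, and the choice that a copy leaves the destination's reference count untouched is an assumption made for the example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userland stand-in for a credential; not the kernel's struct ucred. */
struct cred_sketch {
    int      refs;   /* reference count */
    unsigned uid;
    unsigned gid;
};

/* Bump the refcount and return the same pointer, mirroring the crhold() description. */
struct cred_sketch *cred_hold(struct cred_sketch *cr)
{
    cr->refs++;
    return cr;
}

/*
 * Copy credential data from src to dst with no return value, mirroring the
 * crcopy() description. Assumption: dst keeps its own reference count.
 */
void cred_copy(struct cred_sketch *dst, const struct cred_sketch *src)
{
    dst->uid = src->uid;
    dst->gid = src->gid;
}

/* True if more than one reference exists, mirroring the crshared() description. */
bool cred_shared(const struct cred_sketch *cr)
{
    return cr->refs > 1;
}
```

Returning the pointer from the hold operation lets a caller take a reference and store it in one expression, for example `other = cred_hold(cr);`.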
John Baldwin	7106ca0d1a	Add missing includes of sys/lock.h.	2001-10-11 17:52:20 +00:00
Matthew Dillon	962922dcd2	Remove panics for rename() race conditions. The panics are inappropriate because the IN_RENAME flag only fixes a few of the huge number of race conditions that can result in the source path becoming invalid even prior to the VOP_RENAME() call. The panics created a serious security issue whereby an attacker could fairly easily cause the panic to occur, crashing the machine. The correct solution requires a great deal of work in the namei path cache code. MFC after: 0 days	2001-10-08 00:37:54 +00:00
Robert Watson	ab66aa1468	o Replace two direct uid!=0 comparisons with suser_xxx() calls. Obtained from: TrustedBSD Project	2001-10-02 14:41:43 +00:00
Robert Watson	b73d2870cd	o Replace two direct uid!=0 comparisons with suser_td() calls. Obtained from: TrustedBSD Project	2001-10-02 14:34:22 +00:00
Matthew Dillon	4c94c7bfb9	Back out the last commit. The problem is actually much worse than I first thought and may require serious work on the VOP_RENAME() API itself. Basically, by the time the VOP_RENAME() function is called, it's already too late.	2001-10-02 04:26:58 +00:00
Matthew Dillon	be2a975a9f	IN_RENAME should only be cleared by the routine that set it. This fixes a rename/rmdir race that has been shown to cause a panic. Bug reported by: Yevgeniy Aleynikov <eugenea@infospace.com> MFC after: 3 days	2001-10-02 02:58:48 +00:00
John Baldwin	eb46fac565	- Fix some minor whitespace nits. - Move the SPECIAL_FLAG #define up next to the NOHOLDER #define and fix a little nit that caused it to be defined as -(sizeof (struct thread) + 1) instead of -2.	2001-09-27 21:04:13 +00:00
Robert Watson	57358f1e93	o Re-enable support of system file flags in jail() by adding back the PRISON_ROOT to the suser_xxx() check. Since securelevels may now be raised in specific jails, use of system flags can still be restricted in jail(), but in a more configurable way. o Users of jail() expecting system flags (such as schg) to restrict jail()'s should be sure to set the securelevel appropriately in jail()'s. o This fixes activities involving automated system flag removal in jail(), including installkernel and friends. Obtained from: TrustedBSD Project	2001-09-26 20:44:41 +00:00
Robert Watson	6748bcc51e	o Modify ufs_setattr() so that it uses securelevel_gt() instead of direct variable access. Obtained from: TrustedBSD Project	2001-09-26 20:31:37 +00:00
Robert Watson	aaef1c3934	o Further clarify comment: at Udo's request, re-insert the 'if' referring to securelevels; also, update the unprivileged process text to better indicate the scope of actions permitted when any system flags are already set (limited). Submitted by: Udo Schweigert <udo.schweigert@siemens.com>	2001-09-25 12:02:44 +00:00
Robert Watson	82e83c60b3	o Parallelize the comment on the relationship between privileged un-jailed processes and the actual securelevel check: make the comment use '> 0' instead of inverted '<= 0'.	2001-09-25 02:26:10 +00:00
Ian Dowse	5d76690a7f	The addition of i_dirhash to struct inode pushed RELENG_4's sizeof(struct inode) into a new malloc bucket on the i386. This didn't happen in -current due to the removal of i_lock, but it does no harm to apply the workaround to -current first. Reduce the size of the i_spare[] array in struct inode from 4 to 3 entries, and change ext2fs to use i_din.di_spare[1] so that it does not need i_spare[3]. Reviewed by: bde MFC after: 3 days	2001-09-24 18:29:20 +00:00
Julian Elischer	b40ce4165d	KSE Milestone 2. Note: ALL MODULES MUST BE RECOMPILED. Make the kernel aware that there are smaller units of scheduling than the process (but only allow one thread per process at this time). This is functionally equivalent to the previous -current except that there is a thread associated with each process. Sorry john! (your next MFC will be a doozy!) Reviewed by: peter@freebsd.org, dillon@freebsd.org X-MFC after: ha ha ha ha	2001-09-12 08:38:13 +00:00
Ian Dowse	4691e9ead0	The "dirpref" directory layout preference improvements make use of an array "fs_contigdirs[]" to avoid too many directories getting created in each cylinder group. The memory required for this and two other arrays (fs_csp[] and fs_maxcluster[]) is allocated with a single malloc() call, and divided up afterwards. However, the 'space' pointer is not advanced correctly, so fs_contigdirs and fs_maxcluster end up pointing to the same address. Add the missing code to advance the 'space' pointer, and remove an unnecessary update of the pointer that follows. This is likely to fix the "ffs_clusteralloc: map mismatch" panics that have been reported recently. Submitted by: Luke Mewburn <lukem@wasabisystems.com>	2001-09-09 23:48:28 +00:00
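The class of bug described in the commit above (one allocation divided into several arrays by a moving pointer) can be sketched in userland C. Everything here is illustrative: `struct carved`, `carve()`, and the element types are hypothetical stand-ins for the three arrays the commit names, and the real kernel code sizes and orders them differently.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the three arrays carved from one allocation. */
struct carved {
    void    *base;        /* start of the single allocation, kept for free() */
    int32_t *csp;         /* stand-in for fs_csp[] */
    int32_t *maxcluster;  /* stand-in for fs_maxcluster[] */
    uint8_t *contigdirs;  /* stand-in for fs_contigdirs[] */
};

/* Allocate once, then hand out consecutive regions by advancing 'space'. */
int carve(struct carved *c, size_t ncg)
{
    size_t csp_bytes = ncg * sizeof(int32_t);
    size_t max_bytes = ncg * sizeof(int32_t);
    size_t dir_bytes = ncg * sizeof(uint8_t);
    uint8_t *space = malloc(csp_bytes + max_bytes + dir_bytes);

    if (space == NULL)
        return -1;
    c->base = space;

    c->csp = (int32_t *)space;
    space += csp_bytes;

    c->maxcluster = (int32_t *)space;
    /*
     * Without this advance the next array would alias maxcluster,
     * which is the kind of mistake the commit message describes.
     */
    space += max_bytes;

    c->contigdirs = space;
    return 0;
}
```

Each region must be followed by an advance of exactly its own size before the next pointer is taken; skipping one advance makes two arrays share memory without any immediate failure.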
Chris D. Faulhaber	dac4a67ce7	Use ACL_PERM_NONE instead of hardcoding 0 when initializing ACL entry permissions. Reviewed by: rwatson	2001-09-01 23:18:15 +00:00
Robert Watson	7df97b6117	o At some point, unmounting a non-EA file system with EA's compiled in got a bit broken: when ufs_extattr_stop() was called and failed, ufs_extattr_destroy() would panic. This makes the call to destroy() conditional on the success of stop(). Submitted by: Christian Carstensen <cc@devcon.net> Obtained from: TrustedBSD Project	2001-09-01 20:11:05 +00:00
Peter Wemm	0f7289022b	If a file has been completely unlinked, stop automatically syncing the file. ffs will discard any pending dirty pages when it is closed, so we may as well not waste time trying to clean them. This doesn't stop other things from writing it out, eg: pageout, fsync(2) etc.	2001-08-27 06:09:56 +00:00
Ian Dowse	be70fc04ce	Stop using dirhash when a directory is removed, and ensure that we never attempt to hash directories once they are deleted. This fixes a problem where operations on a deleted directory could trigger dirhash sanity panics.	2001-08-26 20:47:19 +00:00
Ian Dowse	2ed42812bd	When compacting directories, ufs_direnter() always trusted DIRSIZ() to supply the number of bytes to be bcopy()'d to move an entry. If d_ino == 0 however, DIRSIZ() is not guaranteed to return a sensible length, so ufs_direnter could end up corrupting a directory during compaction. In practice I believe this can only happen after fsck_ffs has fixed a previously-corrupted directory. We now deal with any mid-block unused entries specially to avoid using DIRSIZ() or bcopy() on such entries. We also ensure that the variables 'dsize' and 'spacefree' contain meaningful values at all times. Add a few comments to better describe this intricate piece of code. The special handling of mid-block unused entries makes the dirhash-specific bugfix in the previous revision (1.53) now unnecessary, so this change removes it. Reviewed by: mckusick	2001-08-26 01:25:12 +00:00
Ian Dowse	7dfb550e0c	When compressing directory blocks, the dirhash code didn't check that the directory entry was in use before attempting to find it in the hash structures to change its offset. Normally, unused entries do not need to be moved, but fsck can leave behind some unused entries that do. A dirhash sanity panic resulted when the entry to be moved was not found. Add a check that stops entries with d_ino == 0 from being passed to ufsdirhash_move().	2001-08-22 01:35:17 +00:00
Peter Wemm	61a4237001	Sigh. ufs_lookup() calls ffs_snapgone(), meaning that 'options EXT2FS' without 'options FFS' would fail to link.	2001-08-18 03:08:48 +00:00
Ian Dowse	9e27954de1	Two recent commits in sys/ufs/ufs interacted badly with ext2fs because it shares ufs code. In ufs_fhtovp(), the test on i_effnlink is invalid because ext2fs does not maintain this field. In ufs_close(), i_effnlink is also tested, to determine whether or not to call vn_start_write(). The ufs_fhtovp issue breaks NFS exporting of ext2fs filesystems; I believe the other is harmless. Fix both cases by checking um_i_effnlink_valid in the ufsmount struct, and use i_nlink if necessary. Noticed by: bde Reviewed by: mckusick, bde	2001-07-29 22:26:01 +00:00
Ian Dowse	54d6d2dfaf	Disable the dirhash sanity check that panics if an unused directory entry (d_ino == 0) is found in a position that is not the start of a DIRBLKSIZ block. While such entries cannot occur normally (ufs always extends the previous entry to cover the free space instead), they do not cause problems and fsck does not fix them, so panicking is bad.	2001-07-27 18:45:41 +00:00
Peter Wemm	815d14ddab	Use a fixed type for times in on-disk structures for ufs rather than something that could potentially change like time_t.	2001-07-16 00:55:27 +00:00
Ian Dowse	50c7c3a7c8	Return a locked struct buf from ufsdirhash_lookup() to avoid one extra getblk/brelse sequence for each lookup. We already had this buf in ufsdirhash_lookup(), so there was no point in brelse'ing it only to have the caller immediately reacquire the same buffer. This should make the case of sequential lookups marginally faster; in my tests, sequential lookups with dirhash enabled are now only around 1% slower than without dirhash.	2001-07-13 20:50:38 +00:00
Ian Dowse	9b5ad47fb7	Bring in dirhash, a simple hash-based lookup optimisation for large directories. When enabled via "options UFS_DIRHASH", in-core hash arrays are maintained for large directories. These allow all directory operations to take place quickly instead of requiring long linear searches. For now anyway, dirhash is not enabled by default. The in-core hash arrays have a memory requirement that is approximately half the size of the on-disk directory file. A number of new sysctl variables allow control over which directories get hashed and over the maximum amount of memory that dirhash will use: vfs.ufs.dirhash_minsize The minimum on-disk directory size for which hashing should be used. The default is 2560 (2.5k). vfs.ufs.dirhash_maxmem The system-wide maximum total memory to be used by dirhash data structures. The default is 2097152 (2MB). The current amount of memory being used by dirhash is visible through the read-only sysctl variable vfs.ufs.dirhash_mem. Finally, some extra sanity checks that are enabled by default, but which may have an impact on performance, can be disabled by setting vfs.ufs.dirhash_docheck to 0. Discussed on: -fs, -hackers	2001-07-10 21:21:29 +00:00
Matthew Dillon	0cddd8f023	With Alfred's permission, remove vm_mtx in favor of a fine-grained approach (this commit is just the first stage). Also add various GIANT_ macros to formalize the removal of Giant, making it easy to test in a more piecemeal fashion. These macros will allow us to test fine-grained locks to a degree before removing Giant, and also after, and to remove Giant in a piecemeal fashion via sysctl's on those subsystems which the authors believe can operate without Giant.	2001-07-04 16:20:28 +00:00
John Baldwin	ed87274d16	Fix more mntvnode and vnode interlock order reversals.	2001-06-28 22:21:33 +00:00
John Baldwin	49d2d9f4a4	- Fix a mntvnode and vnode interlock reversal. - Protect the mnt_vnode list with the mntvnode lock. - Use queue(9) macros.	2001-06-28 04:12:56 +00:00
Peter Wemm	78236790cd	Fix warning: 1973: warning: int format, long int arg (arg 5)	2001-06-15 07:44:39 +00:00
Kirk McKusick	eb87cd754f	Build on the change in revision 1.98 by Tor.Egge@fast.no. The symptom being treated in 1.98 was to avoid freeing a pagedep dependency if there was still a newdirblk dependency referencing it. That change is correct and no longer prints a warning message when it occurs. The other part of revision 1.98 was to panic when a newdirblk dependency was encountered during a file truncation. This fix removes that panic and replaces it with code to find and delete the newdirblk dependency so that the truncation can succeed.	2001-06-13 23:13:13 +00:00
Thomas Moestl	1fbcf0ac65	Call vn_close on the backing file vnode if ufs_extattr_enable failed to avoid leaking it. Reviewed by: rwatson	2001-06-07 00:11:32 +00:00
Jonathan Lemon	3b6e32b01e	Add a wrapper for the fifo kqfilter which falls through to the ufs routine. This permits the fifo to inherit the ufs VNODE kqfilter.	2001-06-06 17:40:57 +00:00
Jonathan Lemon	4e92ec9dd0	Add a kqueue filter for writing to ufs filesystems which always returns true. This permits better interoperability with programs which register filters on their stdin/stdout handles. Submitted by: Niels Provos <provos@citi.umich.edu>	2001-06-05 13:52:37 +00:00
David E. O'Brien	1239674238	There seems to be a problem in which the order of disk write operations is incorrect due to a missing check for some dependency. This change avoids the freelist corruption (but not the temporarily inconsistent state of the file system). A message is printed as a reminder of the underlying problem when a pagedep structure is not freed due to the NEWBLOCK flag being set. Submitted by: Tor.Egge@fast.no	2001-06-05 01:49:37 +00:00
John Baldwin	1c11b01562	Revert the previous commit in favor of the fix in rev 1.42 of ufs/ffs/ffs_extern.h instead. Requested by: bde	2001-05-30 23:09:19 +00:00
John Baldwin	55d132317c	Forward declare struct cg to quiet a warning. Submitted by: bde	2001-05-30 23:08:40 +00:00
John Baldwin	59718ee556	Include <ufs/ffs/fs.h> to get the definition of struct cg to quiet a warning.	2001-05-29 23:53:16 +00:00
Poul-Henning Kamp	c7a3e2379c	Remove last vestiges of MFS.	2001-05-29 21:21:53 +00:00
Poul-Henning Kamp	870b4959b7	Remove MFS from the kernel.	2001-05-29 18:50:30 +00:00
Thomas Moestl	3c436f07b7	Add a check to determine whether extended attributes have been initialized on the file system before trying to grab the lock of the per-mount extattr structure, as this lock is uninitialized in that case. This is needed because ufs_extattr_vnode_inactive is called from ufs_inactive, which is also used by EA-unaware file systems such as ext2fs. Reviewed by: rwatson	2001-05-25 18:24:52 +00:00
Robert Watson	b1fc0ec1a7	o Merge contents of struct pcred into struct ucred. Specifically, add the real uid, saved uid, real gid, and saved gid to ucred, as well as the pcred->pc_uidinfo, which was associated with the real uid, only rename it to cr_ruidinfo so as not to conflict with cr_uidinfo, which corresponds to the effective uid. o Remove p_cred from struct proc; add p_ucred to struct proc, replacing the original macro that pointed p->p_ucred to p->p_cred->pc_ucred. o Universally update code so that it makes use of ucred instead of pcred, p->p_ucred instead of p->p_pcred, cr_ruidinfo instead of p_uidinfo, cr_{r,sv}{u,g}id instead of p_*, etc. o Remove pcred0 and its initialization from init_main.c; initialize cr_ruidinfo there. o Restructure many credential modification chunks to always crdup while we figure out locking and optimizations; generally speaking, this means moving to a structure like this: newcred = crdup(oldcred); ... p->p_ucred = newcred; crfree(oldcred); It's not race-free, but better than nothing. There are also races in sys_process.c, all inter-process authorization, fork, exec, and exit. o Remove sigio->sio_ruid since sigio->sio_ucred now contains the ruid; remove comments indicating that the old arrangement was a problem. o Restructure exec1() a little to use newcred/oldcred arrangement, and use improved uid management primitives. o Clean up exit1() so as to do less work in credential cleanup due to pcred removal. o Clean up fork1() so as to do less work in credential cleanup and allocation. o Clean up ktrcanset() to take into account changes, and move to using suser_xxx() instead of performing a direct uid==0 comparison. o Improve commenting in various kern_prot.c credential modification calls to better document current behavior. In a couple of places, current behavior is a little questionable and we need to check POSIX.1 to make sure it's "right". More commenting work still remains to be done. 
o Update credential management calls, such as crfree(), to take into account new ruidinfo reference. o Modify or add the following uid and gid helper routines: change_euid() change_egid() change_ruid() change_rgid() change_svuid() change_svgid() In each case, the call now acts on a credential not a process, and as such no longer requires more complicated process locking/etc. They now assume the caller will do any necessary allocation of an exclusive credential reference. Each is commented to document its reference requirements. o CANSIGIO() is simplified to require only credentials, not processes and pcreds. o Remove lots of (p_pcred==NULL) checks. o Add an XXX to authorization code in nfs_lock.c, since it's questionable, and needs to be considered carefully. o Simplify posix4 authorization code to require only credentials, not processes and pcreds. Note that this authorization, as well as CANSIGIO(), needs to be updated to use the p_cansignal() and p_cansched() centralized authorization routines, as they currently do not take into account some desirable restrictions that are handled by the centralized routines, as well as being inconsistent with other similar authorization instances. o Update libkvm to take these changes into account. Obtained from: TrustedBSD Project Reviewed by: green, bde, jhb, freebsd-arch, freebsd-audit	2001-05-25 16:59:11 +00:00
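The duplicate-then-swap structure quoted in the commit above (`newcred = crdup(oldcred); ... p->p_ucred = newcred; crfree(oldcred);`) can be modelled in userland as follows. This is a sketch under stated assumptions: `struct ucred_model`, `struct proc_model`, and the `model_*` functions are hypothetical names, the reference counting is simplified, and no locking is shown, so it shares the "not race-free" caveat the commit message itself raises.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical simplified credential and process; not the kernel structures. */
struct ucred_model {
    int      refs;
    unsigned euid;
};

struct proc_model {
    struct ucred_model *p_ucred;
};

/* Return a private copy holding exactly one reference (stand-in for crdup()). */
static struct ucred_model *model_crdup(const struct ucred_model *old)
{
    struct ucred_model *cr = malloc(sizeof(*cr));

    if (cr == NULL)
        return NULL;
    *cr = *old;
    cr->refs = 1;
    return cr;
}

/* Drop one reference and free on the last one (stand-in for crfree()). */
static void model_crfree(struct ucred_model *cr)
{
    if (--cr->refs == 0)
        free(cr);
}

/* Change the effective uid without touching a credential others may share. */
int model_change_euid(struct proc_model *p, unsigned euid)
{
    struct ucred_model *oldcred = p->p_ucred;
    struct ucred_model *newcred = model_crdup(oldcred);

    if (newcred == NULL)
        return -1;
    newcred->euid = euid;   /* modify only the private copy */
    p->p_ucred = newcred;   /* publish the new credential */
    model_crfree(oldcred);  /* release the process's old reference */
    return 0;
}
```

Because the modification happens on a fresh copy, any other holder of the old credential continues to see the old values.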
Matthew Dillon	ac8f990bde	This patch implements O_DIRECT about 80% of the way. It takes a patchset Tor created a while ago, removes the raw I/O piece (that has cache coherency problems), and adds a buffer cache / VM freeing piece. Essentially this patch causes O_DIRECT I/O to not be left in the cache, but does not prevent it from going through the cache, hence the 80%. For the last 20% we need a method by which the I/O can be issued directly to buffer supplied by the user process and bypass the buffer cache entirely, but still maintain cache coherency. I also have the code working under -stable but the changes made to sys/file.h may not be MFCable, so an MFC is not on the table yet. Submitted by: tegge, dillon	2001-05-24 07:22:27 +00:00
Alfred Perlstein	1752ee59ba	ufs_bmaparray() may block on IO; drop the vm mutex and acquire Giant when calling it from the pager routine	2001-05-23 10:30:25 +00:00
Ruslan Ermilov	99d300a1ec	- FDESC, FIFO, NULL, PORTAL, PROC, UMAP and UNION file systems were repo-copied from sys/miscfs to sys/fs. - Renamed the following file systems and their modules: fdesc -> fdescfs, portal -> portalfs, union -> unionfs. - Renamed corresponding kernel options: FDESC -> FDESCFS, PORTAL -> PORTALFS, UNION -> UNIONFS. - Install header files for the above file systems. - Removed bogus -I${.CURDIR}/../../sys CFLAGS from userland Makefiles.	2001-05-23 09:42:29 +00:00
Kirk McKusick	57042c7f72	Update softdep_setup_directory_add prototype to reflect changes in actual function. Obtained from: Jim Bloom <bloom@jbloom.jbloom.org>	2001-05-20 15:59:55 +00:00
Kirk McKusick	dc01275be9	Must ensure that all the entries on the pd_pendinghd list have been committed to disk before clearing them. More specifically, when free_newdirblk is called, we know that the inode claims the new directory block. However, if the associated pagedep is still linked onto the directory buffer dependency chain, then some of the entries on the pd_pendinghd list may not be committed to disk yet. In this case, we will simply note that the inode claims the block and let the pd_pendinghd list be processed when the pagedep is next written. If the pagedep is no longer on the buffer dependency chain, then all the entries on the pd_pending list are committed to disk and we can free them in free_newdirblk. This corrects a window of vulnerability introduced in the code added in version 1.95.	2001-05-19 19:24:26 +00:00
Alfred Perlstein	2395531439	Introduce a global lock for the vm subsystem (vm_mtx). vm_mtx does not recurse and is required for most low level vm operations. faults can not be taken without holding Giant. Memory subsystems can now call the base page allocators safely. Almost all atomic ops were removed as they are covered under the vm mutex. Alpha and ia64 now need to catch up to i386's trap handlers. FFS and NFS have been tested, other filesystems will need minor changes (grabbing the vm lock when twiddling page properties). Reviewed (partially) by: jake, jhb	2001-05-19 01:28:09 +00:00
Kirk McKusick	9f5192ff71	Must be a bit less aggressive about freeing pagedep structures. Obtained from: Robert Watson <rwatson@FreeBSD.org> and Matthew Jacob <mjacob@feral.com>	2001-05-18 22:16:28 +00:00
Kirk McKusick	24a83a4b3f	When a new block is allocated to a directory, an fsync of a file whose name is within that block must ensure not only that the block containing the file name has been written, but also that the on-disk directory inode references that block. When a new directory block is created, we allocate a newdirblk structure which is linked to the associated allocdirect (on its ad_newdirblk list). When the allocdirect has been satisfied, the newdirblk structure is moved to the inodedep id_bufwait list of its directory to await the inode being written. When the inode is written, the directory entries are fully committed and can be deleted from their pagedep->id_pendinghd and inodedep->id_pendinghd lists.	2001-05-17 07:24:03 +00:00
Ian Dowse	0864ef1e8a	Change the second argument of vflush() to an integer that specifies the number of references on the filesystem root vnode to be both expected and released. Many filesystems hold an extra reference on the filesystem root vnode, which must be accounted for when determining if the filesystem is busy and then released if it isn't busy. The old `skipvp' approach required individual filesystem xxx_unmount functions to re-implement much of vflush()'s logic to deal with the root vnode. All 9 filesystems that hold an extra reference on the root vnode got the logic wrong in the case of forced unmounts, so `umount -f' would always fail if there were any extra root vnode references. Fix this issue centrally in vflush(), now that we can. This commit also fixes a vnode reference leak in devfs, which could result in idle devfs filesystems that refuse to unmount. Reviewed by: phk, bp	2001-05-16 18:04:37 +00:00
Kirk McKusick	7389126d9a	Further fixes for deadlock in the presence of multiple snapshots. There are still more to find, but this fix should cover the common cases that folks are hitting.	2001-05-14 17:16:49 +00:00
Kirk McKusick	0b04113700	If the effective link count is zero when an NFS file handle request comes in for it, the file is really gone, so return ESTALE. The problem arises when the last reference to an FFS file is released because soft-updates may delay the actual freeing of the inode for some time. Since there are no filesystem links or open file descriptors referencing the inode, from the point of view of the system, the file is inaccessible. However, if the filesystem is NFS exported, then the remote client can still access the inode via ufs_fhtovp() until the inode really goes away. To prevent this anomaly, it is necessary to begin returning ESTALE at the same time that the file ceases to be accessible to the local filesystem. Obtained from: Ian Dowse <iedowse@maths.tcd.ie>	2001-05-13 23:30:45 +00:00
Kirk McKusick	9b35c30cf7	Remove yet another deadlock case.	2001-05-11 07:12:03 +00:00
Kirk McKusick	9ccb939ef0	When running with soft updates, track the number of blocks and files that are committed to being freed and reflect these blocks in the counts returned by statfs (and thus also by the `df' command). This change allows programs such as those that do news expiration to know when to stop if they are trying to create a certain percentage of free space. Note that this change does not solve the much harder problem of making this to-be-freed space available to applications that want it (thus on a nearly full filesystem, you may still encounter out-of-space conditions even though the free space will show up eventually). Hopefully this harder problem will be the subject of a future enhancement.	2001-05-08 07:42:20 +00:00
Kirk McKusick	27b047acf0	Several fixes for units errors: 1) Do not assume that the superblock will be of size fs->fs_bsize. This fixes a panic when taking a snapshot on a filesystem with a block size bigger than 8K. 2) Properly calculate the number of fragments that follow the superblock summary information. This fixes a bug with inconsistent snapshots. 3) When cleaning up a snapshot that is about to be removed, properly calculate the number of blocks that need to be checked. This fixes a bug that created partially allocated inodes. 4) When moving blocks from a snapshot that is about to be removed to another snapshot, properly account for the reduced number of blocks in the snapshot from which they are taken. This fixes a bug in which the number of blocks released from a snapshot did not match the number that it claimed to have.	2001-05-08 07:29:03 +00:00
Kirk McKusick	0c6fbff0a5	When syncing out snapshot metadata, we must temporarily allow recursive buffer locking so as to avoid locking against ourselves if we need to write filesystem metadata.	2001-05-08 07:13:00 +00:00
Kirk McKusick	23371b2f22	Refinement to revision 1.16 of ufs/ffs/ffs_snapshot.c to reduce the amount of time that the filesystem must be suspended. The current snapshot is elided as well as the earlier snapshots.	2001-05-04 05:49:28 +00:00
Poul-Henning Kamp	3858e5e797	Use ufs_bmaparray() rather than VOP_BMAP() on our own vnodes.	2001-05-01 09:12:39 +00:00
Poul-Henning Kamp	3c7a8027cb	Remove blatantly pointless call to VOP_BMAP(). Use ufs_bmaparray() rather than VOP_BMAP() on our own vnodes.	2001-05-01 09:12:31 +00:00
Poul-Henning Kamp	a62615e59b	Implement vop_std{get|put}pages() and add them to the default vop[]. Un-copy&paste all the VOP_{GET|PUT}PAGES() functions which do nothing but the default.	2001-05-01 08:34:45 +00:00
Mark Murray	fb919e4d5a	Undo part of the tangle of having sys/lock.h and sys/mutex.h included in other "system" header files. Also help the deprecation of lockmgr.h by making it a sub-include of sys/lock.h and removing sys/lockmgr.h from kernel .c files. Sort sys/*.h includes where possible in affected files. OK'ed by: bde (with reservations)	2001-05-01 08:13:21 +00:00
Poul-Henning Kamp	855aa097af	VOP_BALLOC was never really a VOP in the first place, so convert it to UFS_BALLOC like the other "between UFS and FFS function interfaces".	2001-04-29 12:36:52 +00:00
Poul-Henning Kamp	b7ebffbc08	Add a vop_stdbmap(), and make it part of the default vop vector. Make 7 filesystems which don't really know about VOP_BMAP rely on the default vector, rather than more or less complete local vop_nopbmap() implementations.	2001-04-29 11:48:41 +00:00
Poul-Henning Kamp	f2ddd13ad2	Call ufs_bmaparray() directly instead of indirectly via VOP_BMAP().	2001-04-29 10:25:30 +00:00
Poul-Henning Kamp	954a0e256e	Remove two unused arguments from ufs_bmaparray().	2001-04-29 10:24:58 +00:00
Poul-Henning Kamp	e955479077	Remove faint traces of blind copy&paste.	2001-04-29 10:23:50 +00:00
Poul-Henning Kamp	0c25dbeb17	Remove faint traces of nonexistent ffs_bmap().	2001-04-29 10:23:32 +00:00
Greg Lehey	60fb0ce365	Revert consequences of changes to mount.h, part 2. Requested by: bde	2001-04-29 02:45:39 +00:00
Kirk McKusick	c9509f5865	Rather than copying all the indirect blocks of the snapshot, simply mark them as BLK_NOCOPY. This trick cuts the initial size of the snapshot in half and cuts the time to take a snapshot by a third.	2001-04-26 00:50:53 +00:00
Kirk McKusick	112f737245	When closing the last reference to an unlinked file, it is freed by the inactive routine. Because the freeing causes the filesystem to be modified, the close must be held up during periods when the filesystem is suspended. For snapshots to be consistent across crashes, they must write blocks that they copy and claim those written blocks in their on-disk block pointers before the old blocks that they referenced can be allowed to be written. Close a loophole that allowed unwritten blocks to be skipped when doing ffs_sync with a request to wait for all I/O activity to be completed.	2001-04-25 08:11:18 +00:00
Poul-Henning Kamp	a13234bb35	Move the netexport structure from the fs-specific mount structure to struct mount. This makes the "struct netexport *" parameter to the vfs_export and vfs_checkexport interface unneeded. Consequently, all non-stacking filesystems can use vfs_stdcheckexp(). At the same time, make it a pointer to a struct netexport in struct mount, so that we can remove the bogus AF_MAX and #include <net/radix.h> from <sys/mount.h>	2001-04-25 07:07:52 +00:00
Ian Dowse	5d69bac493	Pre-dirpref versions of fsck may zero out the new superblock fields fs_contigdirs, fs_avgfilesize and fs_avgfpdir. This could cause panics if these fields were zeroed while a filesystem was mounted read-only, and then remounted read-write. Add code to ffs_reload() which copies the fs_contigdirs pointer from the previous superblock, and reinitialises fs_avgf* if necessary. Reviewed by: mckusick	2001-04-24 00:37:16 +00:00
Greg Lehey	d98dc34f52	Correct #includes to work with fixed sys/mount.h.	2001-04-23 09:05:15 +00:00
Poul-Henning Kamp	f84e29a06c	This patch removes the VOP_BWRITE() vector. VOP_BWRITE() was a hack which made it possible for NFS client side to use struct buf with non-bio backing. This patch takes a more general approach and adds a bp->b_op vector where more methods can be added. The success of this patch depends on bp->b_op being initialized in all relevant places for some value of "relevant" which is not easy to determine. For now the buffers have grown a b_magic element which will make such issues a tiny bit easier to debug.	2001-04-17 08:56:39 +00:00
Kirk McKusick	5819ab3f12	Add debugging option to always read/write cylinder groups as full sized blocks. To enable this option, use: `sysctl -w debug.bigcgs=1'. Add debugging option to disable background writes of cylinder groups. To enable this option, use: `sysctl -w debug.dobkgrdwrite=0'. These debugging options should be tried on systems that are panicking with corrupted cylinder group maps to see if it makes the problem go away. The set of panics in question are: ffs_clusteralloc: map mismatch ffs_nodealloccg: map corrupted ffs_nodealloccg: block not in map ffs_alloccg: map corrupted ffs_alloccg: block not in map ffs_alloccgblk: cyl groups corrupted ffs_alloccgblk: can't find blk in cyl ffs_checkblk: partially free fragment The following panics are less likely to be related to this problem, but might be helped by these debugging options: ffs_valloc: dup alloc ffs_blkfree: freeing free block ffs_blkfree: freeing free frag ffs_vfree: freeing free inode If you try these options, please report whether they helped reduce your bitmap corruption panics to Kirk McKusick at <mckusick@mckusick.com> and to Matt Dillon <dillon@earth.backplane.com>.	2001-04-17 05:37:51 +00:00
Kirk McKusick	f0f3f19f05	Background fsck sysctl operations must use vn_start_write and vn_finished_write so that they do not attempt to modify a suspended filesystem.	2001-04-17 05:06:37 +00:00
Robert Watson	b114e127e6	In my first reading of POSIX.1e, I misinterpreted handling of the ACL_USER_OBJ and ACL_GROUP_OBJ fields, believing that modification of the access ACL could be used by privileged processes to change file/directory ownership. In fact, this is incorrect; ACL_*_OBJ (+ ACL_MASK and ACL_OTHER) should have undefined ae_id fields; this commit attempts to correct that misunderstanding.
o Modify arguments to vaccess_acl_posix1e() to accept the uid and gid associated with the vnode, as those can no longer be extracted from the ACL passed as an argument. Perform all comparisons against the passed arguments. This actually has the effect of simplifying a number of components of this call, as well as reducing the indent level, but now separates handling of ACL_GROUP_OBJ from ACL_GROUP.
o Modify acl_posix1e_check() to return EINVAL if the ae_id field of any of the ACL_{USER_OBJ,GROUP_OBJ,MASK,OTHER} entries is a value other than ACL_UNDEFINED_ID. As a temporary work-around to allow clean upgrades, set the ae_id field to ACL_UNDEFINED_ID before each check so that this cannot cause a failure in the short term (this work-around will be removed when the userland libraries and utilities are updated to take this change into account).
o Modify ufs_sync_acl_from_inode() so that it forces ACL_{USER_OBJ,GROUP_OBJ,MASK,OTHER} ae_id fields to ACL_UNDEFINED_ID when synchronizing the ACL from the inode.
o Modify ufs_sync_inode_from_acl to not propagate uid and gid information to the inode from the ACL during ACL update. Also modify the masking of permission bits that may be set from ALLPERMS to (S_IRWXU|S_IRWXG|S_IRWXO), as ACLs currently do not carry non-ACCESSPERMS bits (S_ISUID, S_ISGID, S_ISTXT).
o Modify ufs_getacl() so that when it emulates an access ACL from the inode, it initializes the ae_id fields to ACL_UNDEFINED_ID.
o Clean up ufs_setacl() substantially since it is no longer possible to perform chown/chgrp operations using vop_setacl(), so all the access control for that can be eliminated.
o Modify ufs_access() so that it passes owner uid and gid information into vaccess_acl_posix1e().
Pointed out by: jedger Obtained from: TrustedBSD Project	2001-04-17 04:33:34 +00:00
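The ae_id rule described in the commit above can be illustrated with a minimal sketch. It is not the kernel's acl_posix1e_check(): the tag enum, the entry struct, the `SKETCH_UNDEFINED_ID` sentinel and the function name are hypothetical stand-ins for the real definitions in sys/acl.h. Only the policy is taken from the commit message: object, mask and other entries must carry an undefined id, otherwise the check returns EINVAL.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tag values; the real ones are the ACL_* constants. */
enum sketch_tag {
	TAG_USER_OBJ, TAG_USER, TAG_GROUP_OBJ, TAG_GROUP, TAG_MASK, TAG_OTHER
};

/* Hypothetical stand-in for ACL_UNDEFINED_ID. */
#define SKETCH_UNDEFINED_ID ((uint32_t)-1)

/* Hypothetical, reduced ACL entry: just a tag and an id. */
struct sketch_entry {
	enum sketch_tag tag;
	uint32_t id;
};

/*
 * Return 0 if every USER_OBJ, GROUP_OBJ, MASK and OTHER entry has an
 * undefined id, EINVAL otherwise. USER and GROUP entries may carry any id.
 */
int
sketch_acl_check(const struct sketch_entry *e, size_t n)
{
	assert(e != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		switch (e[i].tag) {
		case TAG_USER_OBJ:
		case TAG_GROUP_OBJ:
		case TAG_MASK:
		case TAG_OTHER:
			if (e[i].id != SKETCH_UNDEFINED_ID)
				return (EINVAL);
			break;
		case TAG_USER:
		case TAG_GROUP:
			break;
		}
	}
	return (0);
}
```

Because the owner uid and gid no longer travel in the ACL, a caller in this model would pass them separately to the access check, which is the argument change the commit describes for vaccess_acl_posix1e().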
Kirk McKusick	74046077a7	Update to describe use of mdconfig instead of deprecated vnconfig. Submitted by: Steve Ames <steve@virtual-voodoo.com>	2001-04-14 18:32:09 +00:00
Kirk McKusick	1a6a661032	This checkin adds support in ufs/ffs for the FS_NEEDSFSCK flag. It is described in ufs/ffs/fs.h as follows:
/*
 * Filesystem flags.
 *
 * Note that the FS_NEEDSFSCK flag is set and cleared only by the
 * fsck utility. It is set when background fsck finds an unexpected
 * inconsistency which requires a traditional foreground fsck to be
 * run. Such inconsistencies should only be found after an uncorrectable
 * disk error. A foreground fsck will clear the FS_NEEDSFSCK flag when
 * it has successfully cleaned up the filesystem. The kernel uses this
 * flag to enforce that inconsistent filesystems be mounted read-only.
 */
#define FS_UNCLEAN 0x01 /* filesystem not clean at mount */
#define FS_DOSOFTDEP 0x02 /* filesystem using soft dependencies */
#define FS_NEEDSFSCK 0x04 /* filesystem needs sync fsck before mount */	2001-04-14 05:26:28 +00:00
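A minimal sketch of the read-only enforcement described in the comment above. The flag values are the ones quoted from fs.h; the `fs_sketch` struct and `mount_permitted()` are hypothetical names for illustration and do not reproduce the actual mount code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Flag values as quoted in the commit message. */
#define FS_UNCLEAN   0x01 /* filesystem not clean at mount */
#define FS_DOSOFTDEP 0x02 /* filesystem using soft dependencies */
#define FS_NEEDSFSCK 0x04 /* filesystem needs sync fsck before mount */

/* Hypothetical, reduced superblock holding only the flags word. */
struct fs_sketch {
	int32_t fs_flags;
};

/*
 * Hypothetical policy check: a filesystem marked FS_NEEDSFSCK may only
 * be mounted read-only; any other filesystem may be mounted either way.
 */
bool
mount_permitted(const struct fs_sketch *fs, bool want_readonly)
{
	assert(fs != NULL);
	if ((fs->fs_flags & FS_NEEDSFSCK) != 0 && !want_readonly)
		return (false);
	return (true);
}
```

In this model the kernel only reads the flag. Setting and clearing it stays with fsck, as the quoted comment states.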
Kirk McKusick	a61ab64ac4	Directory layout preference improvements from Grigoriy Orlov <gluk@ptci.ru>. His description of the problem and solution follows. My own tests show speedups on typical filesystem intensive workloads of 5% to 12% which is very impressive considering the small amount of code change involved.
------
One day I noticed that some file operations run much faster on small file systems than on big ones. I've looked at the ffs algorithms, thought about them, and redesigned the dirpref algorithm. First I want to describe the results of my tests. These results are old and I have improved the algorithm after these tests were done. Nevertheless they show how big the performance speedup may be. I have done two file/directory intensive tests on two OpenBSD systems with the old and new dirpref algorithms. The first test is "tar -xzf ports.tar.gz", the second is "rm -rf ports". The ports.tar.gz file is the ports collection from the OpenBSD 2.8 release. It contains 6596 directories and 13868 files. The test systems are:
1. Celeron-450, 128Mb, two IDE drives, the system at wd0, file system for test is at wd1. Size of test file system is 8 Gb, number of cg=991, size of cg is 8m, block size = 8k, fragment size = 1k OpenBSD-current from Dec 2000 with BUFCACHEPERCENT=35
2. PIII-600, 128Mb, two IBM DTLA-307045 IDE drives at i815e, the system at wd0, file system for test is at wd1.
Size of test file system is 40 Gb, number of cg=5324, size of cg is 8m, block size = 8k, fragment size = 1k OpenBSD-current from Dec 2000 with BUFCACHEPERCENT=50
You can get more info about the test systems and methods at: http://www.ptci.ru/gluk/dirpref/old/dirpref.html
Test Results
                  tar -xzf ports.tar.gz                  rm -rf ports
mode       old dirpref  new dirpref  speedup    old dirpref  new dirpref  speedup
First system
normal         667          472        1.41        477          331        1.44
async          285          144        1.98        130           14        9.29
sync           768          616        1.25        477          334        1.43
softdep        413          252        1.64        241           38        6.34
Second system
normal         329           81        4.06        263.5         93.5      2.81
async          302           25.7     11.75        112            2.26    49.56
sync           281           57.0      4.93        263           90.5      2.9
softdep        341           40.6      8.4         284            4.76    59.66
The "old dirpref" and "new dirpref" columns give the test time in seconds. speedup is the speed increase as a ratio, i.e. old dirpref / new dirpref.
------
Algorithm description
The old dirpref algorithm is described in comments:
/*
 * Find a cylinder to place a directory.
 *
 * The policy implemented by this algorithm is to select from
 * among those cylinder groups with above the average number of
 * free inodes, the one with the smallest number of directories.
 */
A new directory is allocated in a different cylinder group than its parent directory, resulting in a directory tree that is spread across all the cylinder groups. This spreading out results in non-optimal access to the directories and files. When we have a small filesystem it is not a problem but when the filesystem is big then performance degradation becomes very apparent. What do I mean by a big file system?
1. A big filesystem is a filesystem which occupies 20-30 or more percent of total drive space, i.e. first and last cylinder are physically located relatively far from each other.
2. It has a relatively large number of cylinder groups, for example more cylinder groups than 50% of the buffers in the buffer cache.
The first results in long access times, while the second results in many buffers being used by metadata operations.
Such operations use cylinder group blocks and on-disk inode blocks. The cylinder group block (fs->fs_cblkno) contains struct cg, inode and block bit maps. It is 2k in size for the default filesystem parameters. If new and parent directories are located in different cylinder groups then the system performs more input/output operations and uses more buffers. On filesystems with many cylinder groups, lots of cache buffers are used for metadata operations.
My solution for this problem is very simple. I allocate many directories in one cylinder group. I also take some measures so that the new allocation method does not cause excessive fragmentation and so that directory inodes are not located far from their files' inodes and data. The algorithm is:
/*
 * Find a cylinder group to place a directory.
 *
 * The policy implemented by this algorithm is to allocate a
 * directory inode in the same cylinder group as its parent
 * directory, but also to reserve space for its files inodes
 * and data. Restrict the number of directories which may be
 * allocated one after another in the same cylinder group
 * without intervening allocation of files.
 *
 * If we allocate a first level directory then force allocation
 * in another cylinder group.
 */
My early versions of dirpref gave me good results for a wide range of file operations and different filesystem capacities except one case: those applications that create their entire directory structure first and only later fill this structure with files. My solution for such and similar cases is to limit the number of directories which may be created one after another in the same cylinder group without intervening file creations. For this purpose, I allocate an array of counters at mount time. This array is linked to the superblock fs->fs_contigdirs[cg]. Each time a directory is created the counter increases and each time a file is created the counter decreases. A 60Gb filesystem with 8mb/cg requires 10kb of memory for the counters array.
The maxcontigdirs is the maximum number of directories which may be created without an intervening file creation. I found in my tests that the best performance occurs when I restrict the number of directories in one cylinder group such that all its files may be located in the same cylinder group. There may be some deterioration in performance if all the file inodes are in the same cylinder group as their containing directory, but their data partially resides in a different cylinder group. The maxcontigdirs value is calculated to try to prevent this condition. Since there is no way to know how many files and directories will be allocated later I added two optimization parameters in superblock/tunefs. They are:
int32_t fs_avgfilesize; /* expected average file size */
int32_t fs_avgfpdir; /* expected # of files per directory */
These parameters have reasonable defaults but may be tweaked for special uses of a filesystem. They are only necessary in rare cases like better tuning a filesystem being used to store a squid cache.
I have been using this algorithm for about 3 months. I have done a lot of testing on filesystems with different capacities, average filesize, average number of files per directory, and so on. I think this algorithm has no negative impact on filesystem performance. It works better than the default one in all cases. The new dirpref will greatly improve untarring/removing/copying of big directories, decrease load on cvs servers and much more. The new dirpref doesn't speed up a compilation process, but also doesn't slow it down. Obtained from: Grigoriy Orlov <gluk@ptci.ru>	2001-04-10 08:38:59 +00:00
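The per-cylinder-group counter scheme described in the commit above can be sketched as follows. This is a simplified illustration, not the ffs code: the `dirpref_state` struct and the three function names are hypothetical, the 8-bit counter width is an assumption, and the way maxcontigdirs is derived from fs_avgfilesize and fs_avgfpdir is left out because the message does not give the formula.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-mount state (fs_contigdirs, maxcontigdirs). */
struct dirpref_state {
	uint8_t *contigdirs;    /* one counter per cylinder group (width assumed) */
	uint32_t ncg;           /* number of cylinder groups */
	uint8_t  maxcontigdirs; /* limit on back-to-back directory creations */
};

/* A directory was created in cylinder group cg: raise its counter. */
void
note_dir_created(struct dirpref_state *st, uint32_t cg)
{
	assert(cg < st->ncg);
	if (st->contigdirs[cg] < UINT8_MAX)
		st->contigdirs[cg]++;
}

/* A file was created in cylinder group cg: lower its counter. */
void
note_file_created(struct dirpref_state *st, uint32_t cg)
{
	assert(cg < st->ncg);
	if (st->contigdirs[cg] > 0)
		st->contigdirs[cg]--;
}

/*
 * Return nonzero if another directory may still be placed in cg, i.e.
 * fewer than maxcontigdirs directories were created there without
 * intervening file creations.
 */
int
can_place_dir(const struct dirpref_state *st, uint32_t cg)
{
	assert(cg < st->ncg);
	return (st->contigdirs[cg] < st->maxcontigdirs);
}
```

In this model a workload that creates its whole directory tree before any files soon exhausts the limit in the parent's cylinder group and is pushed to another group, which is the case the counters were introduced to handle.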


1190 Commits