freebsd-dev

Author	SHA1	Message	Date
Bruce Evans	367b50a28f	Fixed some printf format errors (hopefully all of the remaining daddr64_t ones for GENERIC, and all others on the same line as those). Reformat the printfs if necessary to avoid new long lones or old format printf errors.	2002-03-19 04:09:21 +00:00
Kirk McKusick	a0595d0249	Add a flags parameter to VFS_VGET to pass through the desired locking flags when acquiring a vnode. The immediate purpose is to allow polling lock requests (LK_NOWAIT) needed by soft updates to avoid deadlock when enlisting other processes to help with the background cleanup. For the future it will allow the use of shared locks for read access to vnodes. This change touches a lot of files as it affects most filesystems within the system. It has been well tested on FFS, loopback, and CD-ROM filesystems. only lightly on the others, so if you find a problem there, please let me (mckusick@mckusick.com) know.	2002-03-17 01:25:47 +00:00
Kirk McKusick	0d2af52141	Introduce the new 64-bit size disk block, daddr64_t. Change the bio and buffer structures to have daddr64_t bio_pblkno, b_blkno, and b_lblkno fields which allows access to disks larger than a Terabyte in size. This change also requires that the VOP_BMAP vnode operation accept and return daddr64_t blocks. This delta should not affect system operation in any way. It merely sets up the necessary interfaces to allow the development of disk drivers that work with these larger disk block addresses. It also allows for the development of UFS2 which will use 64-bit block addresses.	2002-03-15 18:49:47 +00:00
David E. O'Brien	f0c8652ed4	Quiet a warning on the Alpha.	2002-03-15 04:06:10 +00:00
Kirk McKusick	9721068f95	This corrects the first of two known deadlock conditions that come from the presence of a snapshot file.	2002-03-14 01:21:13 +00:00
Poul-Henning Kamp	063f776327	I missed one VOP_CLOSE in the previous commit. Pointed out by: bde	2002-03-11 16:27:04 +00:00
Poul-Henning Kamp	3dbceccb78	As a XXX bandaid open the mounted device READ/WRITE even if we only mount read-only. The trouble here is that we don't reopen the device in read/write mode when we remount in read/write mode resulting in a filesystem sending write requests to a device which was only opened read/only. I'm not quite sure how such a reopen would best be done and defer the problem to more agile hackers.	2002-03-11 13:53:00 +00:00
John Baldwin	fdcc1cc09f	Use thread0.td_ucred instead of proc0.p_ucred. This change is cosmetic and isn't strictly required. However, it lowers the number of false positives found when grep'ing the kernel sources for p_ucred to ensure proper locking.	2002-02-27 19:18:10 +00:00
John Baldwin	a854ed9893	Simple p_ucred -> td_ucred changes to start using the per-thread ucred reference.	2002-02-27 18:32:23 +00:00
Julian Elischer	2c1007663f	In a threaded world, differnt priorirites become properties of different entities. Make it so. Reviewed by: jhb@freebsd.org (john baldwin)	2002-02-11 20:37:54 +00:00
Kirk McKusick	b06051cf7c	Occationally background fsck would cause a spurious ``freeing free inode'' panic. This change corrects that problem by setting the fs_active flag when the inode map changes to notify the snapshot code that the cylinder group must be rescanned. Submitted by: Robert Watson <rwatson@FreeBSD.org>	2002-02-07 22:13:56 +00:00
Kirk McKusick	cfdaa88697	Occationally deleted files would hang around for hours or days without being reclaimed. This bug was introduced in revision 1.95 dealing with filenames placed in newly allocated directory blocks, thus is not present in 4.X systems. The bug is triggered when a new entry is made in a directory after the data block containing the original new entry has been written, but before the inode that references the data block has been written. Submitted by: Bill Fenner <fenner@research.att.com>	2002-02-07 00:54:32 +00:00
Kirk McKusick	c9f96392c7	When taking a snapshot, we must check for active files that have been unlinked (e.g., with a zero link count). We have to expunge all trace of these files from the snapshot so that they are neither reclaimed prematurely by fsck nor saved unnecessarily by dump.	2002-02-02 01:42:44 +00:00
Kirk McKusick	7b60855308	Add a stub for softdep_request_cleanup() so that compilation without SOFTUPDATES option works properly. Submitted by: Benno Rice <benno@jeamland.net>	2002-01-23 02:18:56 +00:00
Kirk McKusick	03a2057a5b	This patch fixes a long standing complaint with soft updates in which small and/or nearly full filesystems would fail with `file system full' messages when trying to replace a number of existing files (for example during a system installation). When the allocation routines are about to fail with a file system full condition, they make a call to softdep_request_cleanup() which attempts to accelerate the flushing of pending deletion requests in an effort to free up space. In the face of filesystem I/O requests that exceed the available disk transfer capacity, the cleanup request could take an unbounded amount of time. Thus, the softdep_request_cleanup() routine will only try for tickdelay seconds (default 2 seconds) before giving up and returning a filesystem full error. Under typical conditions, the softdep_request_cleanup() routine is able to free up space in under fifty milliseconds.	2002-01-22 06:17:22 +00:00
Kirk McKusick	99bef8782b	Fix a bug introduced in ffs_snapshot.c -r1.25 and fs.h -r1.26 which caused incomplete snapshots to be taken. When background fsck would run on these snapshots, the result would be files being incorrectly released which would subsequently panic the kernel with ``handle_workitem_freefile: inodedep survived'', ``handle_written_inodeblock: live inodedep'', and ``handle_workitem_remove: lost inodedep'' errors.	2002-01-17 08:33:32 +00:00
Kirk McKusick	8af31e7b46	Put write on read-only filesystem panic after we have weeded out block and character devices, fifo's, etc. Submitted by: Bruce Evans <bde@zeta.org.au>	2002-01-16 04:59:09 +00:00
Kirk McKusick	cd6005961f	When downgrading a filesystem from read-write to read-only, operations involving file removal or file update were not always being fully committed to disk. The result was lost files or corrupted file data. This change ensures that the filesystem is properly synced to disk before the filesystem is down-graded. This delta also fixes a long standing bug in which a file open for reading has been unlinked. When the last open reference to the file is closed, the inode is reclaimed by the filesystem. Previously, if the filesystem had been down-graded to read-only, the inode could not be reclaimed, and thus was lost and had to be later recovered by fsck. With this change, such files are found at the time of the down-grade. Normally they will result in the filesystem down-grade failing with `device busy'. If a forcible down-grade is done, then the affected files will be revoked causing the inode to be released and the open file descriptors to begin failing on attempts to read. Submitted by: "Sam Leffler" <sam@errno.com>	2002-01-15 07:17:12 +00:00
Alfred Perlstein	426da3bcfb	SMP Lock struct file, filedesc and the global file list. Seigo Tanimura (tanimura) posted the initial delta. I've polished it quite a bit reducing the need for locking and adapting it for KSE. Locks: 1 mutex in each filedesc protects all the fields. protects "struct file" initialization, while a struct file is being changed from &badfileops -> &pipeops or something the filedesc should be locked. 1 mutex in each struct file protects the refcount fields. doesn't protect anything else. the flags used for garbage collection have been moved to f_gcflag which was the FILLER short, this doesn't need locking because the garbage collection is a single threaded container. could likely be made to use a pool mutex. 1 sx lock for the global filelist. struct file * fhold(struct file fp); / increments reference count on a file / struct file fhold_locked(struct file fp); / like fhold but expects file to locked / struct file ffind_hold(struct thread , int fd); / finds the struct file in thread, adds one reference and returns it unlocked / struct file ffind_lock(struct thread , int fd); / ffind_hold, but returns file locked */ I still have to smp-safe the fget cruft, I'll get to that asap.	2002-01-13 11:58:06 +00:00
Kirk McKusick	0bc7a833ec	When going to sleep, we must save our SPL so that it does not get lost if some other process uses the lock while we are sleeping. We restore it after we have slept. This functionality is provided by a new routine interlocked_sleep() that wraps the interlocking with functions that sleep. This function is then used in place of the old ACQUIRE_LOCK_INTERLOCKED() and FREE_LOCK_INTERLOCKED() macros. Submitted by: Debbie Chu <dchu@juniper.net>	2002-01-12 20:57:36 +00:00
Kirk McKusick	794ef3471f	Must call drain_output() before checking the dirty block list in softdep_sync_metadata(). Otherwise we may miss dependencies that need to be flushed which will result in a later panic with the message ``vinvalbuf: dirty bufs''. Submitted by: Matthew Dillon <dillon@apollo.backplane.com> MFC after: 1 week	2002-01-11 19:59:27 +00:00
Mike Smith	b9a4338d29	Initialise the bioops vector hack at runtime rather than at link time. This avoids the use of common variables. Reviewed by: mckusick	2002-01-08 19:32:18 +00:00
Matthew Dillon	23b590188f	Fix a BUF_TIMELOCK race against BUF_LOCK and fix a deadlock in vget() against VM_WAIT in the pageout code. Both fixes involve adjusting the lockmgr's timeout capability so locks obtained with timeouts do not interfere with locks obtained without a timeout. Hopefully MFC: before the 4.5 release	2001-12-20 22:42:27 +00:00
Kirk McKusick	f305c5d199	Change the atomic_set_char to atomic_set_int and atomic_clear_char to atomic_clear_int to ease the implementation for the sparc64. Requested by: Jake Burkholder <jake@locore.ca>	2001-12-18 18:05:17 +00:00
Ian Dowse	143a5346c9	Make sure we ignore the value of `fs_active' when reloading the superblock, and move the initialisation of it to beside where other pointer fields are initialised.	2001-12-16 18:54:09 +00:00
Ian Dowse	3fa4044e34	Move the new superblock field `fs_active' into the region of the superblock that is already set up to handle pointer types. This fixes an accidental change in the superblock size on 64-bit platforms caused by revision 1.24.	2001-12-16 18:51:11 +00:00
Kirk McKusick	cc5a92334f	Minimize the time necessary to suspend operations on a filesystem when taking a snapshot. The two time consuming operations are scanning all the filesystem bitmaps to determine which blocks are in use and scanning all the other snapshots so as to be able to expunge their blocks from the view of the current snapshot. The bitmap scanning is broken into two passes. Before suspending the filesystem all bitmaps are scanned. After the suspension, those bitmaps that changed after being scanned the first time are rescanned. Typically there are few bitmaps that need to be rescanned. The expunging of other snapshots is now done after the suspension is released by observing that we can easily identify any blocks that were allocated to them after the suspension (they will be maked as `not needing to be copied' in the just created snapshot). For all the gory details, see the ``Running fsck in the Background'' paper in the Usenix BSDCon 2002 Conference Proceedings, pages 55-64.	2001-12-14 00:15:06 +00:00
Kirk McKusick	9db12e5108	When a file is partially truncated, we first check to see if the new file end will land in the middle of a file hole. Since the last block of a file must always be allocated, the hole is filled by allocating a block at that location. If the hole being filled is a direct block, then the truncation may eventually reduce the full sized block down to a fragment. When running with soft updates, it is necessary to FSYNC the file after allocating the block and before creating the fragment to avoid triggering a soft updates inconsistency when the block unexpectedly shrinks. Found by: Matthew Dillon <dillon@apollo.backplane.com> MFC after: 1 week	2001-12-13 05:07:48 +00:00
Matthew Dillon	245df27cee	Implement kern.maxvnodes. adjusting kern.maxvnodes now actually has a real effect. Optimize vfs_msync(). Avoid having to continually drop and re-obtain mutexes when scanning the vnode list. Improves looping case by 500%. Optimize ffs_sync(). Avoid having to continually drop and re-obtain mutexes when scanning the vnode list. This makes a couple of assumptions, which I believe are ok, in regards to vnode stability when the mount list mutex is held. Improves looping case by 500%. (more optimization work is needed on top of these fixes) MFC after: 1 week	2001-10-26 00:08:05 +00:00
Matthew Dillon	c72ccd014d	Change the vnode list under the mount point from a LIST to a TAILQ in preparation for an implementation of limiting code for kern.maxvnodes. MFC after: 3 days	2001-10-23 01:21:29 +00:00
Robert Watson	ab66aa1468	o Replace two direct uid!=0 comparisons with suser_xxx() calls. Obtained from: TrustedBSD Project	2001-10-02 14:41:43 +00:00
Robert Watson	b73d2870cd	o Replace two direct uid!=0 comparisons with suser_td() calls. Obtained from: TrustedBSD Project	2001-10-02 14:34:22 +00:00
John Baldwin	eb46fac565	- Fix some minor whitespace nits. - Move the SPECIAL_FLAG #define up next to the NOHOLDER #define and fix a little nit that caused it to be defined as -(sizeof (struct thread) + 1) instead of -2.	2001-09-27 21:04:13 +00:00
Julian Elischer	b40ce4165d	KSE Milestone 2 Note ALL MODULES MUST BE RECOMPILED make the kernel aware that there are smaller units of scheduling than the process. (but only allow one thread per process at this time). This is functionally equivalent to teh previousl -current except that there is a thread associated with each process. Sorry john! (your next MFC will be a doosie!) Reviewed by: peter@freebsd.org, dillon@freebsd.org X-MFC after: ha ha ha ha	2001-09-12 08:38:13 +00:00
Ian Dowse	4691e9ead0	The "dirpref" directory layout preference improvements make use of an array "fs_contigdirs[]" to avoid too many directories getting created in each cylinder group. The memory required for this and two other arrays (fs_csp[] and fs_maxcluster[]) is allocated with a single malloc() call, and divided up afterwards. However, the 'space' pointer is not advanced correctly, so fs_contigdirs and fs_maxcluster end up pointing to the same address. Add the missing code to advance the 'space' pointer, and remove an unnecessary update of the pointer that follows. This is likely to fix the "ffs_clusteralloc: map mismatch" panics that have been reported recently. Submitted by: Luke Mewburn <lukem@wasabisystems.com>	2001-09-09 23:48:28 +00:00
Robert Watson	7df97b6117	o At some point, unmounting a non-EA file system with EA's compiled in got a bit broken, when ufs_extattr_stop() was called and failed, ufs_extattr_destroy() would panic. This makes the call to destroy() conditional on the success of stop(). Submitted by: Christian Carstensen <cc@devcon.net> Obtained from: TrustedBSD Project	2001-09-01 20:11:05 +00:00
Peter Wemm	815d14ddab	Use a fixed type for times in on-disk structures for ufs rather than something that could potentially change like time_t.	2001-07-16 00:55:27 +00:00
John Baldwin	ed87274d16	Fix more mntvnode and vnode interlock order reversals.	2001-06-28 22:21:33 +00:00
John Baldwin	49d2d9f4a4	- Fix a mntvnode and vnode interlock reversal. - Protect the mnt_vnode list with the mntvnode lock. - Use queue(9) macros.	2001-06-28 04:12:56 +00:00
Peter Wemm	78236790cd	Fix warning: 1973: warning: int format, long int arg (arg 5)	2001-06-15 07:44:39 +00:00
Kirk McKusick	eb87cd754f	Build on the change in revision 1.98 by Tor.Egge@fast.no. The symptom being treated in 1.98 was to avoid freeing a pagedep dependency if there was still a newdirblk dependency referencing it. That change is correct and no longer prints a warning message when it occurs. The other part of revision 1.98 was to panic when a newdirblk dependency was encountered during a file truncation. This fix removes that panic and replaces it with code to find and delete the newdirblk dependency so that the truncation can succeed.	2001-06-13 23:13:13 +00:00
David E. O'Brien	1239674238	There seems to be a problem that the order of disk write operation being incorrect due to a missing check for some dependency. This change avoids the freelist corruption (but not the temporarily inconsistent state of the file system). A message is printed as a reminder of the under lying problem when a pagedep structure is not freed due to the NEWBLOCK flag being set. Submitted by: Tor.Egge@fast.no	2001-06-05 01:49:37 +00:00
John Baldwin	1c11b01562	Revert the previous commit in favor of the fix in rev 1.42 of ufs/ffs/ffs_extern.h instead. Requested by: bde	2001-05-30 23:09:19 +00:00
John Baldwin	55d132317c	Forward declare struct cg to quiet a warning. Submitted by: bde	2001-05-30 23:08:40 +00:00
John Baldwin	59718ee556	Include <ufs/ffs/fs.h> to get the definition of struct cg to quiet a warning.	2001-05-29 23:53:16 +00:00
Poul-Henning Kamp	c7a3e2379c	Remove last vestiges of MFS.	2001-05-29 21:21:53 +00:00
Kirk McKusick	57042c7f72	Update softdep_setup_directory_add prototype to reflect changes in actual function. Obtained from: Jim Bloom <bloom@jbloom.jbloom.org>	2001-05-20 15:59:55 +00:00
Kirk McKusick	dc01275be9	Must ensure that all the entries on the pd_pendinghd list have been committed to disk before clearing them. More specifically, when free_newdirblk is called, we know that the inode claims the new directory block. However, if the associated pagedep is still linked onto the directory buffer dependency chain, then some of the entries on the pd_pendinghd list may not be committed to disk yet. In this case, we will simply note that the inode claims the block and let the pd_pendinghd list be processed when the pagedep is next written. If the pagedep is no longer on the buffer dependency chain, then all the entries on the pd_pending list are committed to disk and we can free them in free_newdirblk. This corrects a window of vulnerability introduced in the code added in version 1.95.	2001-05-19 19:24:26 +00:00
Kirk McKusick	9f5192ff71	Must be a bit less aggressive about freeing pagedep structures. Obtained from: Robert Watson <rwatson@FreeBSD.org> and Matthew Jacob <mjacob@feral.com>	2001-05-18 22:16:28 +00:00
Kirk McKusick	24a83a4b3f	When a new block is allocated to a directory, an fsync of a file whose name is within that block must ensure not only that the block containing the file name has been written, but also that the on-disk directory inode references that block. When a new directory block is created, we allocate a newdirblk structure which is linked to the associated allocdirect (on its ad_newdirblk list). When the allocdirect has been satisfied, the newdirblk structure is moved to the inodedep id_bufwait list of its directory to await the inode being written. When the inode is written, the directory entries are fully committed and can be deleted from their pagedep->id_pendinghd and inodedep->id_pendinghd lists.	2001-05-17 07:24:03 +00:00
Ian Dowse	0864ef1e8a	Change the second argument of vflush() to an integer that specifies the number of references on the filesystem root vnode to be both expected and released. Many filesystems hold an extra reference on the filesystem root vnode, which must be accounted for when determining if the filesystem is busy and then released if it isn't busy. The old `skipvp' approach required individual filesystem xxx_unmount functions to re-implement much of vflush()'s logic to deal with the root vnode. All 9 filesystems that hold an extra reference on the root vnode got the logic wrong in the case of forced unmounts, so `umount -f' would always fail if there were any extra root vnode references. Fix this issue centrally in vflush(), now that we can. This commit also fixes a vnode reference leak in devfs, which could result in idle devfs filesystems that refuse to unmount. Reviewed by: phk, bp	2001-05-16 18:04:37 +00:00
Kirk McKusick	7389126d9a	Further fixes for deadlock in the presence of multiple snapshots. There are still more to find, but this fix should cover the common cases that folks are hitting.	2001-05-14 17:16:49 +00:00
Kirk McKusick	9b35c30cf7	Remove yet another deadlock case.	2001-05-11 07:12:03 +00:00
Kirk McKusick	9ccb939ef0	When running with soft updates, track the number of blocks and files that are committed to being freed and reflect these blocks in the counts returned by statfs (and thus also by the `df' command). This change allows programs such as those that do news expiration to know when to stop if they are trying to create a certain percentage of free space. Note that this change does not solve the much harder problem of making this to-be-freed space available to applications that want it (thus on a nearly full filesystem, you may still encounter out-of-space conditions even though the free space will show up eventually). Hopefully this harder problem will be the subject of a future enhancement.	2001-05-08 07:42:20 +00:00
Kirk McKusick	27b047acf0	Several fixes for units errors: 1) Do not assume that the superblock will be of size fs->fs_bsize. This fixes a panic when taking a snapshot on a filesystem with a block size bigger than 8K. 2) Properly calculate the number of fragments that follow the superblock summary information. This fixes a bug with inconsistent snapshots. 3) When cleaning up a snapshot that is about to be removed, properly calculate the number of blocks that need to be checked. This fixes a bug that created partially allocated inodes. 4) When moving blocks from a snapshot that is about to be removed to another snapshot, properly account for the reduced number of blocks in the snapshot from which they are taken. This fixes a bug in which the number of blocks released from a snapshot did not match the number that it claimed to have.	2001-05-08 07:29:03 +00:00
Kirk McKusick	0c6fbff0a5	When syncing out snapshot metadata, we must temporarily allow recursive buffer locking so as to avoid locking against ourselves if we need to write filesystem metadata.	2001-05-08 07:13:00 +00:00
Kirk McKusick	23371b2f22	Refinement to revision 1.16 of ufs/ffs/ffs_snapshot.c to reduce the amount of time that the filesystem must be suspended. The current snapshot is elided as well as the earlier snapshots.	2001-05-04 05:49:28 +00:00
Poul-Henning Kamp	3c7a8027cb	Remove blatantly pointless call to VOP_BMAP(). Use ufs_bmaparray() rather than VOP_BMAP() on our own vnodes.	2001-05-01 09:12:31 +00:00
Poul-Henning Kamp	a62615e59b	Implement vop_std{get\|put}pages() and add them to the default vop[]. Un-copy&paste all the VOP_{GET\|PUT}PAGES() functions which do nothing but the default.	2001-05-01 08:34:45 +00:00
Poul-Henning Kamp	855aa097af	VOP_BALLOC was never really a VOP in the first place, so convert it to UFS_BALLOC like the other "between UFS and FFS function interfaces".	2001-04-29 12:36:52 +00:00
Poul-Henning Kamp	0c25dbeb17	Remove faint traces of non-existant ffs_bmap().	2001-04-29 10:23:32 +00:00
Greg Lehey	60fb0ce365	Revert consequences of changes to mount.h, part 2. Requested by: bde	2001-04-29 02:45:39 +00:00
Kirk McKusick	c9509f5865	Rather than copying all the indirect blocks of the snapshot, simply mark them as BLK_NOCOPY. This trick cuts the initial size of the snapshot in half and cuts the time to take a snapshot by a third.	2001-04-26 00:50:53 +00:00
Kirk McKusick	112f737245	When closing the last reference to an unlinked file, it is freed by the inactive routine. Because the freeing causes the filesystem to be modified, the close must be held up during periods when the filesystem is suspended. For snapshots to be consistent across crashes, they must write blocks that they copy and claim those written blocks in their on-disk block pointers before the old blocks that they referenced can be allowed to be written. Close a loophole that allowed unwritten blocks to be skipped when doing ffs_sync with a request to wait for all I/O activity to be completed.	2001-04-25 08:11:18 +00:00
Poul-Henning Kamp	a13234bb35	Move the netexport structure from the fs-specific mountstructure to struct mount. This makes the "struct netexport *" paramter to the vfs_export and vfs_checkexport interface unneeded. Consequently that all non-stacking filesystems can use vfs_stdcheckexp(). At the same time, make it a pointer to a struct netexport in struct mount, so that we can remove the bogus AF_MAX and #include <net/radix.h> from <sys/mount.h>	2001-04-25 07:07:52 +00:00
Ian Dowse	5d69bac493	Pre-dirpref versions of fsck may zero out the new superblock fields fs_contigdirs, fs_avgfilesize and fs_avgfpdir. This could cause panics if these fields were zeroed while a filesystem was mounted read-only, and then remounted read-write. Add code to ffs_reload() which copies the fs_contigdirs pointer from the previous superblock, and reinitialises fs_avgf* if necessary. Reviewed by: mckusick	2001-04-24 00:37:16 +00:00
Greg Lehey	d98dc34f52	Correct #includes to work with fixed sys/mount.h.	2001-04-23 09:05:15 +00:00
Kirk McKusick	5819ab3f12	Add debugging option to always read/write cylinder groups as full sized blocks. To enable this option, use: `sysctl -w debug.bigcgs=1'. Add debugging option to disable background writes of cylinder groups. To enable this option, use: `sysctl -w debug.dobkgrdwrite=0'. These debugging options should be tried on systems that are panicing with corrupted cylinder group maps to see if it makes the problem go away. The set of panics in question are: ffs_clusteralloc: map mismatch ffs_nodealloccg: map corrupted ffs_nodealloccg: block not in map ffs_alloccg: map corrupted ffs_alloccg: block not in map ffs_alloccgblk: cyl groups corrupted ffs_alloccgblk: can't find blk in cyl ffs_checkblk: partially free fragment The following panics are less likely to be related to this problem, but might be helped by these debugging options: ffs_valloc: dup alloc ffs_blkfree: freeing free block ffs_blkfree: freeing free frag ffs_vfree: freeing free inode If you try these options, please report whether they helped reduce your bitmap corruption panics to Kirk McKusick at <mckusick@mckusick.com> and to Matt Dillon <dillon@earth.backplane.com>.	2001-04-17 05:37:51 +00:00
Kirk McKusick	f0f3f19f05	Background fsck sysctl operations must use vn_start_write and vn_finished_write so that they do not attempt to modify a suspended filesystem.	2001-04-17 05:06:37 +00:00
Kirk McKusick	74046077a7	Update to describe use of mdconfig instead of deprecated vnconfig. Submitted by: Steve Ames <steve@virtual-voodoo.com>	2001-04-14 18:32:09 +00:00
Kirk McKusick	1a6a661032	This checkin adds support in ufs/ffs for the FS_NEEDSFSCK flag. It is described in ufs/ffs/fs.h as follows: /* * Filesystem flags. * * Note that the FS_NEEDSFSCK flag is set and cleared only by the * fsck utility. It is set when background fsck finds an unexpected * inconsistency which requires a traditional foreground fsck to be * run. Such inconsistencies should only be found after an uncorrectable * disk error. A foreground fsck will clear the FS_NEEDSFSCK flag when * it has successfully cleaned up the filesystem. The kernel uses this * flag to enforce that inconsistent filesystems be mounted read-only. / #define FS_UNCLEAN 0x01 / filesystem not clean at mount / #define FS_DOSOFTDEP 0x02 / filesystem using soft dependencies / #define FS_NEEDSFSCK 0x04 / filesystem needs sync fsck before mount */	2001-04-14 05:26:28 +00:00
Kirk McKusick	a61ab64ac4	Directory layout preference improvements from Grigoriy Orlov <gluk@ptci.ru>. His description of the problem and solution follow. My own tests show speedups on typical filesystem intensive workloads of 5% to 12% which is very impressive considering the small amount of code change involved. ------ One day I noticed that some file operations run much faster on small file systems then on big ones. I've looked at the ffs algorithms, thought about them, and redesigned the dirpref algorithm. First I want to describe the results of my tests. These results are old and I have improved the algorithm after these tests were done. Nevertheless they show how big the perfomance speedup may be. I have done two file/directory intensive tests on a two OpenBSD systems with old and new dirpref algorithm. The first test is "tar -xzf ports.tar.gz", the second is "rm -rf ports". The ports.tar.gz file is the ports collection from the OpenBSD 2.8 release. It contains 6596 directories and 13868 files. The test systems are: 1. Celeron-450, 128Mb, two IDE drives, the system at wd0, file system for test is at wd1. Size of test file system is 8 Gb, number of cg=991, size of cg is 8m, block size = 8k, fragment size = 1k OpenBSD-current from Dec 2000 with BUFCACHEPERCENT=35 2. PIII-600, 128Mb, two IBM DTLA-307045 IDE drives at i815e, the system at wd0, file system for test is at wd1. Size of test file system is 40 Gb, number of cg=5324, size of cg is 8m, block size = 8k, fragment size = 1k OpenBSD-current from Dec 2000 with BUFCACHEPERCENT=50 You can get more info about the test systems and methods at: http://www.ptci.ru/gluk/dirpref/old/dirpref.html Test Results tar -xzf ports.tar.gz rm -rf ports mode old dirpref new dirpref speedup old dirprefnew dirpref speedup First system normal 667 472 1.41 477 331 1.44 async 285 144 1.98 130 14 9.29 sync 768 616 1.25 477 334 1.43 softdep 413 252 1.64 241 38 6.34 Second system normal 329 81 4.06 263.5 93.5 2.81 async 302 25.7 11.75 112 2.26 49.56 sync 281 57.0 4.93 263 90.5 2.9 softdep 341 40.6 8.4 284 4.76 59.66 "old dirpref" and "new dirpref" columns give a test time in seconds. speedup - speed increasement in times, ie. old dirpref / new dirpref. ------ Algorithm description The old dirpref algorithm is described in comments: /* * Find a cylinder to place a directory. * * The policy implemented by this algorithm is to select from * among those cylinder groups with above the average number of * free inodes, the one with the smallest number of directories. / A new directory is allocated in a different cylinder groups than its parent directory resulting in a directory tree that is spreaded across all the cylinder groups. This spreading out results in a non-optimal access to the directories and files. When we have a small filesystem it is not a problem but when the filesystem is big then perfomance degradation becomes very apparent. What I mean by a big file system ? 1. A big filesystem is a filesystem which occupy 20-30 or more percent of total drive space, i.e. first and last cylinder are physically located relatively far from each other. 2. It has a relatively large number of cylinder groups, for example more cylinder groups than 50% of the buffers in the buffer cache. The first results in long access times, while the second results in many buffers being used by metadata operations. Such operations use cylinder group blocks and on-disk inode blocks. The cylinder group block (fs->fs_cblkno) contains struct cg, inode and block bit maps. It is 2k in size for the default filesystem parameters. If new and parent directories are located in different cylinder groups then the system performs more input/output operations and uses more buffers. On filesystems with many cylinder groups, lots of cache buffers are used for metadata operations. My solution for this problem is very simple. I allocate many directories in one cylinder group. I also do some things, so that the new allocation method does not cause excessive fragmentation and all directory inodes will not be located at a location far from its file's inodes and data. The algorithm is: / * Find a cylinder group to place a directory. * * The policy implemented by this algorithm is to allocate a * directory inode in the same cylinder group as its parent * directory, but also to reserve space for its files inodes * and data. Restrict the number of directories which may be * allocated one after another in the same cylinder group * without intervening allocation of files. * * If we allocate a first level directory then force allocation * in another cylinder group. / My early versions of dirpref give me a good results for a wide range of file operations and different filesystem capacities except one case: those applications that create their entire directory structure first and only later fill this structure with files. My solution for such and similar cases is to limit a number of directories which may be created one after another in the same cylinder group without intervening file creations. For this purpose, I allocate an array of counters at mount time. This array is linked to the superblock fs->fs_contigdirs[cg]. Each time a directory is created the counter increases and each time a file is created the counter decreases. A 60Gb filesystem with 8mb/cg requires 10kb of memory for the counters array. The maxcontigdirs is a maximum number of directories which may be created without an intervening file creation. I found in my tests that the best performance occurs when I restrict the number of directories in one cylinder group such that all its files may be located in the same cylinder group. There may be some deterioration in performance if all the file inodes are in the same cylinder group as its containing directory, but their data partially resides in a different cylinder group. The maxcontigdirs value is calculated to try to prevent this condition. Since there is no way to know how many files and directories will be allocated later I added two optimization parameters in superblock/tunefs. They are: int32_t fs_avgfilesize; / expected average file size / int32_t fs_avgfpdir; / expected # of files per directory */ These parameters have reasonable defaults but may be tweeked for special uses of a filesystem. They are only necessary in rare cases like better tuning a filesystem being used to store a squid cache. I have been using this algorithm for about 3 months. I have done a lot of testing on filesystems with different capacities, average filesize, average number of files per directory, and so on. I think this algorithm has no negative impact on filesystem perfomance. It works better than the default one in all cases. The new dirpref will greatly improve untarring/removing/coping of big directories, decrease load on cvs servers and much more. The new dirpref doesn't speedup a compilation process, but also doesn't slow it down. Obtained from: Grigoriy Orlov <gluk@ptci.ru>	2001-04-10 08:38:59 +00:00
Jeroen Ruigrok van der Werven	5d0b660f2a	Fix typo ); -> ,	2001-03-24 15:25:04 +00:00
Kirk McKusick	fca26df055	Check that background fsck operation is being done on a ufs filesystem. Obtained from: Robert Watson <rwatson@FreeBSD.org>	2001-03-23 20:58:25 +00:00
Kirk McKusick	812b1d416c	Add kernel support for running fsck on active filesystems.	2001-03-21 04:09:01 +00:00
Kirk McKusick	31c6ce0aed	Clear the fs_clean flag only when the FS_UNCLEAN flag is not set (as is done in unmount). Remove a snapshot inode from the superblock list when its last name goes away rather than when its last reference goes away. That way it will be properly reclaimed by fsck after a crash rather than reenabled when the filesystem is mounted.	2001-03-21 04:05:20 +00:00
Kirk McKusick	7e72e9918a	Report the correct inode number when panicing with freeing free inode. Report the correct block number when panicing with freeing free block.	2001-03-21 04:01:02 +00:00
Robert Watson	516081f288	o Change options FFS_EXTATTR and options FFS_EXTATTR_AUTOSTART to options UFS_EXTATTR and UFS_EXTATTR_AUTOSTART respectively. This change reflects the fact that our EA support is implemented entirely at the UFS layer (modulo FFS start/stop/autostart hooks for mount and unmount events). This also better reflects the fact that [shortly] MFS will also support EAs, as well as possibly IFS. o Consumers of the EA support in FFS are reminded that as a result, they must change kernel config files to reflect the new option names. Obtained from: TrustedBSD Project	2001-03-19 04:35:40 +00:00
Robert Watson	f5161237ad	o Implement "options FFS_EXTATTR_AUTOSTART", which depends on "options FFS_EXTATTR". When extended attribute auto-starting is enabled, FFS will scan the .attribute directory off of the root of each file system, as it is mounted. If .attribute exists, EA support will be started for the file system. If there are files in the directory, FFS will attempt to start them as attribute backing files for attributes baring the same name. All attributes are started before access to the file system is permitted, so this permits race-free enabling of attributes. For attributes backing support for security features, such as ACLs, MAC, Capabilities, this is vital, as it prevents the file system attributes from getting out of sync as a result of file system operations between mount-time and the enabling of the extended attribute. The userland extattrctl tool will still function exactly as previously. Files must be placed directly in .attribute, which must be directly off of the file system root: symbolic links are not permitted. FFS_EXTATTR will continue to be able to function without FFS_EXTATTR_AUTOSTART for sites that do not want/require auto-starting. If you're using the UFS_ACL code available from www.TrustedBSD.org, using FFS_EXTATTR_AUTOSTART is recommended. o This support is implemented by adding an invocation of ufs_extattr_autostart() to ffs_mountfs(). In addition, several new supporting calls are introduced in ufs_extattr.c: ufs_extattr_autostart(): start EAs on the specified mount ufs_extattr_lookup(): given a directory and filename, return the vnode for the file. ufs_extattr_enable_with_open(): invoke ufs_extattr_enable() after doing the equililent of vn_open() on the passed file. ufs_extattr_iterate_directory(): iterate over a directory, invoking ufs_extattr_lookup() and ufs_extattr_enable_with_open() on each entry. o This feature is not widely tested, and therefore may contain bugs, caution is advised. Several changes are in the pipeline for this feature, including breaking out of EA namespaces into subdirectories of .attribute (this is waiting on the updated EA API), as well as a per-filesystem flag indicating whether or not EAs should be auto-started. This is required because administrators may not want .attribute auto-started on all file systems, especially if non-administrators have write access to the root of a file system. Obtained from: TrustedBSD Project	2001-03-14 05:32:31 +00:00
Kirk McKusick	589c7af992	Fixes to track snapshot copy-on-write checking in the specinfo structure rather than assuming that the device vnode would reside in the FFS filesystem (which is obviously a broken assumption with the device filesystem).	2001-03-07 07:09:55 +00:00
Kirk McKusick	8775e64a5d	Free lock before returning from process_worklist_item. Obtained from: Constantine Sapuntzakis <csapuntz@stanford.edu>	2001-03-01 21:43:46 +00:00
Adrian Chadd	f3a90da995	Reviewed by: jlemon An initial tidyup of the mount() syscall and VFS mount code. This code replaces the earlier work done by jlemon in an attempt to make linux_mount() work. * the guts of the mount work has been moved into vfs_mount(). * move `type', `path' and `flags' from being userland variables into being kernel variables in vfs_mount(). `data' remains a pointer into userspace. * Attempt to verify the `type' and `path' strings passed to vfs_mount() aren't too long. * rework mount() and linux_mount() to take the userland parameters (besides data, as mentioned) and pass kernel variables to vfs_mount(). (linux_mount() already did this, I've just tidied it up a little more.) * remove the copyin() stuff for `path'. `data' still requires copyin() since its a pointer into userland. * set `mount->mnt_statf_mntonname' in vfs_mount() rather than in each filesystem. This variable is generally initialised with `path', and each filesystem can override it if they want to. * NOTE: f_mntonname is intiailised with "/" in the case of a root mount.	2001-03-01 21:00:17 +00:00
Kirk McKusick	a5a94e3936	Free lock before calling panic so that subsequent attempt to write out buffers does not re-panic with `locking against myself'. This change should not affect normal operations of soft updates in any way.	2001-02-23 09:01:31 +00:00
Kirk McKusick	cc686e21c0	When cleaning up excess inode dependencies, check for being done. Reviewed by: Jan Koum <jkb@yahoo-inc.com>	2001-02-22 10:17:57 +00:00
Kirk McKusick	2cf5d587a9	This patch corrects two problems with the rate limiting code that was introduced in revision 1.80. The problem manifested itself with a `locking against myself' panic and could also result in soft updates inconsistences associated with inodedeps. The two problems are: 1) One of the background operations could manipulate the bitmap while holding it locked with intent to create. This held lock results in a `locking against myself' panic, when the background processing that we have been coopted to do tries to lock the bitmap which we are already holding locked. To understand how to fix this problem, first, observe that we can do the background cleanups in inodedep_lookup only when allocating inodedeps (DEPALLOC is set in the call to inodedep_lookup). Second observe that calls to inodedep_lookup with DEPALLOC set can only happen from the following calls into the softdep code: softdep_setup_inomapdep softdep_setup_allocdirect softdep_setup_remove softdep_setup_freeblocks softdep_setup_directory_change softdep_setup_directory_add softdep_change_linkcnt Only the first two of these can come from ffs_alloc.c while holding a bitmap locked. Thus, inodedep_lookup must not go off to do request_cleanups when being called from these functions. This change adds a flag, NODELAY, that can be passed to inodedep_lookup to let it know that it should not do background processing in those cases. 2) The return value from request_cleanup when helping out with the cleanup was 0 instead of 1. This meant that despite the fact that we may have slept while doing the cleanups, the code did not recheck for the appearance of an inodedep (e.g., goto top in inodedep_lookup). This lead to the softdep inconsistency in which we ended up with two inodedep's for the same inode. Reviewed by: Peter Wemm <peter@yahoo-inc.com>, Matt Dillon <dillon@earth.backplane.com>	2001-02-20 11:14:38 +00:00
Jeroen Ruigrok van der Werven	d7d97eb0aa	Preceed/preceeding are not english words. Use precede and preceding.	2001-02-18 10:43:53 +00:00
Jake Burkholder	d5a08a6065	Implement a unified run queue and adjust priority levels accordingly. - All processes go into the same array of queues, with different scheduling classes using different portions of the array. This allows user processes to have their priorities propogated up into interrupt thread range if need be. - I chose 64 run queues as an arbitrary number that is greater than 32. We used to have 4 separate arrays of 32 queues each, so this may not be optimal. The new run queue code was written with this in mind; changing the number of run queues only requires changing constants in runq.h and adjusting the priority levels. - The new run queue code takes the run queue as a parameter. This is intended to be used to create per-cpu run queues. Implement wrappers for compatibility with the old interface which pass in the global run queue structure. - Group the priority level, user priority, native priority (before propogation) and the scheduling class into a struct priority. - Change any hard coded priority levels that I found to use symbolic constants (TTIPRI and TTOPRI). - Remove the curpriority global variable and use that of curproc. This was used to detect when a process' priority had lowered and it should yield. We now effectively yield on every interrupt. - Activate propogate_priority(). It should now have the desired effect without needing to also propogate the scheduling class. - Temporarily comment out the call to vm_page_zero_idle() in the idle loop. It interfered with propogate_priority() because the idle process needed to do a non-blocking acquire of Giant and then other processes would try to propogate their priority onto it. The idle process should not do anything except idle. vm_page_zero_idle() will return in the form of an idle priority kernel thread which is woken up at apprioriate times by the vm system. - Update struct kinfo_proc to the new priority interface. Deliberately change its size by adjusting the spare fields. It remained the same size, but the layout has changed, so userland processes that use it would parse the data incorrectly. The size constraint should really be changed to an arbitrary version number. Also add a debug.sizeof sysctl node for struct kinfo_proc.	2001-02-12 00:20:08 +00:00
Bosko Milekic	9ed346bab0	Change and clean the mutex lock interface. mtx_enter(lock, type) becomes: mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks) mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized) similarily, for releasing a lock, we now have: mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN. We change the caller interface for the two different types of locks because the semantics are entirely different for each case, and this makes it explicitly clear and, at the same time, it rids us of the extra `type' argument. The enter->lock and exit->unlock change has been made with the idea that we're "locking data" and not "entering locked code" in mind. Further, remove all additional "flags" previously passed to the lock acquire/release routines with the exception of two: MTX_QUIET and MTX_NOSWITCH The functionality of these flags is preserved and they can be passed to the lock/unlock routines by calling the corresponding wrappers: mtx_{lock, unlock}_flags(lock, flag(s)) and mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN locks, respectively. Re-inline some lock acq/rel code; in the sleep lock case, we only inline the _obtain_lock()s in order to ensure that the inlined code fits into a cache line. In the spin lock case, we inline recursion and actually only perform a function call if we need to spin. This change has been made with the idea that we generally tend to avoid spin locks and that also the spin locks that we do have and are heavily used (i.e. sched_lock) do recurse, and therefore in an effort to reduce function call overhead for some architectures (such as alpha), we inline recursion for this case. Create a new malloc type for the witness code and retire from using the M_DEV type. The new type is called M_WITNESS and is only declared if WITNESS is enabled. Begin cleaning up some machdep/mutex.h code - specifically updated the "optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently need those. Finally, caught up to the interface changes in all sys code. Contributors: jake, jhb, jasone (in no particular order)	2001-02-09 06:11:45 +00:00
Poul-Henning Kamp	37d4006626	Another round of the <sys/queue.h> FOREACH transmogriffer. Created with: sed(1) Reviewed by: md5(1)	2001-02-04 16:08:18 +00:00
Poul-Henning Kamp	fc2ffbe604	Mechanical change to use <sys/queue.h> macro API instead of fondling implementation details. Created with: sed(1) Reviewed by: md5(1)	2001-02-04 13:13:25 +00:00
Poul-Henning Kamp	ef9e85abba	Use <sys/queue.h> macro API.	2001-02-04 12:37:48 +00:00
Matthew Dillon	f8e071a1eb	Fix a race between the syncer and umount. When you umount a softupdates filesystem softdep_process_worklist() is called in a loop until it indicates that no dependancies remain, but the determination of that fact depends on there only being one softdep_process_worklist() instance running. It was possible for the syncer to also be running softdep_process_worklist() and the pre-existing checks in the code to prevent this were not sufficient to prevent the race. This patch solves the problem. Approved-by: mckusick	2001-01-30 06:31:59 +00:00
Jason Evans	1b367556b5	Convert all simplelocks to mutexes and remove the simplelock implementations.	2001-01-24 12:35:55 +00:00
Ian Dowse	f55ff3f3ef	The ffs superblock includes a 128-byte region for use by temporary in-core pointers to summary information. An array in this region (fs_csp) could overflow on filesystems with a very large number of cylinder groups (~16000 on i386 with 8k blocks). When this happens, other fields in the superblock get corrupted, and fsck refuses to check the filesystem. Solve this problem by replacing the fs_csp array in 'struct fs' with a single pointer, and add padding to keep the length of the 128-byte region fixed. Update the kernel and userland utilities to use just this single pointer. With this change, the kernel no longer makes use of the superblock fields 'fs_csshift' and 'fs_csmask'. Add a comment to newfs/mkfs.c to indicate that these fields must be calculated for compatibility with older kernels. Reviewed by: mckusick	2001-01-15 18:30:40 +00:00
Kirk McKusick	cb3ab5aaf7	Properly compute the size of the final block of superblock summary information. Submitted by: Ian Dowse <iedowse@maths.tcd.ie>	2001-01-12 21:56:55 +00:00
Kirk McKusick	48d617487d	Several small but important fixes for snapshots: 1) Be more tolerant of missing snapshot files by only trying to decrement their reference count if they are registered as active. 2) Fix for snapshots of filesystems with block sizes larger than 8K (from Ollivier Robert <roberto@eurocontrol.fr>). 3) Fix to avoid losing last block in snapshot file when calculating blocks that need to be copied (from Don Coleman <coleman@coleman.org>).	2000-12-19 04:41:09 +00:00
Kirk McKusick	6da443cb22	Get rid of spurious check in ffs_truncate for i_size == length which fails to set the modification time on the file. The same check a few lines later takes the correct action. Submitted by: Ian Dowse <iedowse@maths.tcd.ie>	2000-12-19 04:20:13 +00:00
Assar Westerlund	ca85ca6099	add a stub for softdep_slowdown so that it's possible to build the kernel without SOFTUPDATES	2000-12-17 23:59:56 +00:00
Seigo Tanimura	937c4dfa08	Do not race for the lock of an inode hash. Reviewed by: jhb	2000-12-13 10:04:01 +00:00
Kirk McKusick	1d733bbd10	Preventing runaway kernel soft updates memory, take three. Previously, the syncer process was the only process in the system that could process the soft updates background work list. If enough other processes were adding requests to that list, it would eventually grow without bound. Because some of the work list requests require vnodes to be locked, it was not generally safe to let random processes process the work list while they already held vnodes locked. By adding a flag to the work list queue processing function to indicate whether the calling process could safely lock vnodes, it becomes possible to co-opt other processes into helping out with the work list. Now when the worklist gets too large, other processes can safely help out by picking off those work requests that can be handled without locking a vnode, leaving only the small number of requests requiring a vnode lock for the syncer process. With this change, it appears possible to keep even the nastiest workloads under control. Submitted by: Paul Saab <ps@yahoo-inc.com>	2000-12-13 08:30:35 +00:00

1 2 3 4 5 ...

501 Commits