freebsd-skq

Author	SHA1	Message	Date
mckusick	670fad8a9e	This corrects the first of two known deadlock conditions that come from the presence of a snapshot file.	2002-03-14 01:21:13 +00:00
iedowse	cf446bea56	Fix a bug in ufsdirhash_adjfree() that caused it to incorrectly update the free-space statistics in some cases. The problem affected directory blocks when the free space dropped below the size of the maximum allowed entry size. When this happened, the free-space summary information could claim that there are no further blocks that can fit a maximum-size entry, even if there are. The effect of this bug is that the directory may be enlarged even though there is space within the directory for the new entry. This wastes disk space and has a negative impact on performance. Fix it by correctly computing the dh_firstfree array index, adding a helper macro for clarity. Put an extra sanity check into ufsdirhash_checkblock() to detect the situation in future. Found by: dwmalone Reviewed by: dwmalone MFC after: 1 week	2002-03-11 19:13:22 +00:00
phk	2098322771	I missed one VOP_CLOSE in the previous commit. Pointed out by: bde	2002-03-11 16:27:04 +00:00
phk	c68bd5ed71	As a XXX bandaid open the mounted device READ/WRITE even if we only mount read-only. The trouble here is that we don't reopen the device in read/write mode when we remount in read/write mode resulting in a filesystem sending write requests to a device which was only opened read/only. I'm not quite sure how such a reopen would best be done and defer the problem to more agile hackers.	2002-03-11 13:53:00 +00:00
rwatson	1a2700eed9	Update DBA for NAI. We have several. We used the wrong one. :-)	2002-03-07 17:49:06 +00:00
green	965906f60d	Add new errno ``ENOATTR''.	2002-03-07 15:13:44 +00:00
dillon	e05fb41e7e	cleanup readability syntax prior to ongoing b_resid work commits. MFC after: 1 day	2002-03-06 00:44:30 +00:00
jhb	b8b3ac8816	Use thread0.td_ucred instead of proc0.p_ucred. This change is cosmetic and isn't strictly required. However, it lowers the number of false positives found when grep'ing the kernel sources for p_ucred to ensure proper locking.	2002-02-27 19:18:10 +00:00
jhb	3706cd3509	Simple p_ucred -> td_ucred changes to start using the per-thread ucred reference.	2002-02-27 18:32:23 +00:00
phk	62d248fb9e	Replace bowrite() with BUF_WRITE in ufs. Remove bowrite(), it is now unused. This is the first step in getting entirely rid of BIO_ORDERED which is a generally accepted evil thing. Approved by: mckusick	2002-02-22 09:03:00 +00:00
rwatson	ad827f33f7	o Minor style fix on #endif, missing '_' in comment.	2002-02-20 15:44:43 +00:00
phk	68389bd8ba	Make v_addpollinfo() visible and non-inline. Have callers only call it as needed. Add necessary call in ufs_kqfilter(). Test-case found by: Andrew Gallatin <gallatin@cs.duke.edu>	2002-02-18 16:18:02 +00:00
phk	c2a47cdbe8	Move the stuff related to select and poll out of struct vnode. The use of the zone allocator may or may not be overkill. There is an XXX: over in ufs/ufs/ufs_vnops.c that jlemon may need to revisit. This shaves about 60 bytes of struct vnode which on my laptop means 600k less RAM used for vnodes.	2002-02-17 21:15:36 +00:00
phk	3348974369	Collect the VN_KNOTE() macro definitions on vnode.h	2002-02-17 21:07:57 +00:00
julian	37369620df	In a threaded world, differnt priorirites become properties of different entities. Make it so. Reviewed by: jhb@freebsd.org (john baldwin)	2002-02-11 20:37:54 +00:00
rwatson	c00ccc43a6	Minor style tweaks. Remove an unneeded comment and commented out code that won't be needed.	2002-02-10 04:57:08 +00:00
rwatson	ce1e1df4d1	Copyright + license update.	2002-02-10 04:50:24 +00:00
rwatson	6ca91a055c	Part I: Update extended attribute API and ABI: o Modify the system call syntax for extattr_{get,set}_{fd,file}() so as not to use the scatter gather API (which appeared not to be used by any consumers, and be less portable), rather, accepts 'data' and 'nbytes' in the style of other simple read/write interfaces. This changes the API and ABI. o Modify system call semantics so that extattr_get_{fd,file}() return a size_t. When performing a read, the number of bytes read will be returned, unless the data pointer is NULL, in which case the number of bytes of data are returned. This changes the API only. o Modify the VOP_GETEXTATTR() vnode operation to accept a *size_t argument so as to return the size, if desirable. If set to NULL, the size will not be returned. o Update various filesystems (pseodofs, ufs) to DTRT. These changes should make extended attributes more useful and more portable. More commits to rebuild the system call files, as well as update userland utilities to follow. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs	2002-02-10 04:43:22 +00:00
phk	d2872f5d10	Remove di_inumber since LFS is long gone.	2002-02-10 00:55:49 +00:00
mckusick	88b4f0b921	Occationally background fsck would cause a spurious ``freeing free inode'' panic. This change corrects that problem by setting the fs_active flag when the inode map changes to notify the snapshot code that the cylinder group must be rescanned. Submitted by: Robert Watson <rwatson@FreeBSD.org>	2002-02-07 22:13:56 +00:00
mckusick	9d51e5cc24	Occationally deleted files would hang around for hours or days without being reclaimed. This bug was introduced in revision 1.95 dealing with filenames placed in newly allocated directory blocks, thus is not present in 4.X systems. The bug is triggered when a new entry is made in a directory after the data block containing the original new entry has been written, but before the inode that references the data block has been written. Submitted by: Bill Fenner <fenner@research.att.com>	2002-02-07 00:54:32 +00:00
mckusick	c142ab455c	When taking a snapshot, we must check for active files that have been unlinked (e.g., with a zero link count). We have to expunge all trace of these files from the snapshot so that they are neither reclaimed prematurely by fsck nor saved unnecessarily by dump.	2002-02-02 01:42:44 +00:00
mckusick	35edc11259	Add a stub for softdep_request_cleanup() so that compilation without SOFTUPDATES option works properly. Submitted by: Benno Rice <benno@jeamland.net>	2002-01-23 02:18:56 +00:00
mckusick	4f050b1765	This patch fixes a long standing complaint with soft updates in which small and/or nearly full filesystems would fail with `file system full' messages when trying to replace a number of existing files (for example during a system installation). When the allocation routines are about to fail with a file system full condition, they make a call to softdep_request_cleanup() which attempts to accelerate the flushing of pending deletion requests in an effort to free up space. In the face of filesystem I/O requests that exceed the available disk transfer capacity, the cleanup request could take an unbounded amount of time. Thus, the softdep_request_cleanup() routine will only try for tickdelay seconds (default 2 seconds) before giving up and returning a filesystem full error. Under typical conditions, the softdep_request_cleanup() routine is able to free up space in under fifty milliseconds.	2002-01-22 06:17:22 +00:00
mckusick	4e7dcb216b	Fix a bug introduced in ffs_snapshot.c -r1.25 and fs.h -r1.26 which caused incomplete snapshots to be taken. When background fsck would run on these snapshots, the result would be files being incorrectly released which would subsequently panic the kernel with ``handle_workitem_freefile: inodedep survived'', ``handle_written_inodeblock: live inodedep'', and ``handle_workitem_remove: lost inodedep'' errors.	2002-01-17 08:33:32 +00:00
mckusick	0ed7ba2c74	Put write on read-only filesystem panic after we have weeded out block and character devices, fifo's, etc. Submitted by: Bruce Evans <bde@zeta.org.au>	2002-01-16 04:59:09 +00:00
mckusick	b8d6599e4c	When downgrading a filesystem from read-write to read-only, operations involving file removal or file update were not always being fully committed to disk. The result was lost files or corrupted file data. This change ensures that the filesystem is properly synced to disk before the filesystem is down-graded. This delta also fixes a long standing bug in which a file open for reading has been unlinked. When the last open reference to the file is closed, the inode is reclaimed by the filesystem. Previously, if the filesystem had been down-graded to read-only, the inode could not be reclaimed, and thus was lost and had to be later recovered by fsck. With this change, such files are found at the time of the down-grade. Normally they will result in the filesystem down-grade failing with `device busy'. If a forcible down-grade is done, then the affected files will be revoked causing the inode to be released and the open file descriptors to begin failing on attempts to read. Submitted by: "Sam Leffler" <sam@errno.com>	2002-01-15 07:17:12 +00:00
alfred	844237b396	SMP Lock struct file, filedesc and the global file list. Seigo Tanimura (tanimura) posted the initial delta. I've polished it quite a bit reducing the need for locking and adapting it for KSE. Locks: 1 mutex in each filedesc protects all the fields. protects "struct file" initialization, while a struct file is being changed from &badfileops -> &pipeops or something the filedesc should be locked. 1 mutex in each struct file protects the refcount fields. doesn't protect anything else. the flags used for garbage collection have been moved to f_gcflag which was the FILLER short, this doesn't need locking because the garbage collection is a single threaded container. could likely be made to use a pool mutex. 1 sx lock for the global filelist. struct file * fhold(struct file fp); / increments reference count on a file / struct file fhold_locked(struct file fp); / like fhold but expects file to locked / struct file ffind_hold(struct thread , int fd); / finds the struct file in thread, adds one reference and returns it unlocked / struct file ffind_lock(struct thread , int fd); / ffind_hold, but returns file locked */ I still have to smp-safe the fget cruft, I'll get to that asap.	2002-01-13 11:58:06 +00:00
mckusick	c6e526cffc	When going to sleep, we must save our SPL so that it does not get lost if some other process uses the lock while we are sleeping. We restore it after we have slept. This functionality is provided by a new routine interlocked_sleep() that wraps the interlocking with functions that sleep. This function is then used in place of the old ACQUIRE_LOCK_INTERLOCKED() and FREE_LOCK_INTERLOCKED() macros. Submitted by: Debbie Chu <dchu@juniper.net>	2002-01-12 20:57:36 +00:00
mckusick	b86f75b78b	Must call drain_output() before checking the dirty block list in softdep_sync_metadata(). Otherwise we may miss dependencies that need to be flushed which will result in a later panic with the message ``vinvalbuf: dirty bufs''. Submitted by: Matthew Dillon <dillon@apollo.backplane.com> MFC after: 1 week	2002-01-11 19:59:27 +00:00
phk	64c925b060	Do not pull quota entries of the cache-list if they have already been removed from the cache-list as part of a previous unmount. This would result in panics (page fault in dqflush()) during subsequent umounts provided that enough distinct UID's to actually make the hash do something are active. This can probably explain a number of weird quota related behaviours. PR: 32331 maybe more. Reproduced by: Søren Schrørder <sch@cybercity.dk>	2002-01-10 15:02:57 +00:00
msmith	6fa204fd80	Initialise the bioops vector hack at runtime rather than at link time. This avoids the use of common variables. Reviewed by: mckusick	2002-01-08 19:32:18 +00:00
dillon	ac9876d609	Fix a BUF_TIMELOCK race against BUF_LOCK and fix a deadlock in vget() against VM_WAIT in the pageout code. Both fixes involve adjusting the lockmgr's timeout capability so locks obtained with timeouts do not interfere with locks obtained without a timeout. Hopefully MFC: before the 4.5 release	2001-12-20 22:42:27 +00:00
mckusick	eeb2a6d271	Change the atomic_set_char to atomic_set_int and atomic_clear_char to atomic_clear_int to ease the implementation for the sparc64. Requested by: Jake Burkholder <jake@locore.ca>	2001-12-18 18:05:17 +00:00
iedowse	5c873cad57	Make sure we ignore the value of `fs_active' when reloading the superblock, and move the initialisation of it to beside where other pointer fields are initialised.	2001-12-16 18:54:09 +00:00
iedowse	64972486c2	Move the new superblock field `fs_active' into the region of the superblock that is already set up to handle pointer types. This fixes an accidental change in the superblock size on 64-bit platforms caused by revision 1.24.	2001-12-16 18:51:11 +00:00
mckusick	735db38be3	Minimize the time necessary to suspend operations on a filesystem when taking a snapshot. The two time consuming operations are scanning all the filesystem bitmaps to determine which blocks are in use and scanning all the other snapshots so as to be able to expunge their blocks from the view of the current snapshot. The bitmap scanning is broken into two passes. Before suspending the filesystem all bitmaps are scanned. After the suspension, those bitmaps that changed after being scanned the first time are rescanned. Typically there are few bitmaps that need to be rescanned. The expunging of other snapshots is now done after the suspension is released by observing that we can easily identify any blocks that were allocated to them after the suspension (they will be maked as `not needing to be copied' in the just created snapshot). For all the gory details, see the ``Running fsck in the Background'' paper in the Usenix BSDCon 2002 Conference Proceedings, pages 55-64.	2001-12-14 00:15:06 +00:00
mckusick	d375501d04	When a file is partially truncated, we first check to see if the new file end will land in the middle of a file hole. Since the last block of a file must always be allocated, the hole is filled by allocating a block at that location. If the hole being filled is a direct block, then the truncation may eventually reduce the full sized block down to a fragment. When running with soft updates, it is necessary to FSYNC the file after allocating the block and before creating the fragment to avoid triggering a soft updates inconsistency when the block unexpectedly shrinks. Found by: Matthew Dillon <dillon@apollo.backplane.com> MFC after: 1 week	2001-12-13 05:07:48 +00:00
rwatson	523aaa204d	Use 'mkdir -p /.attribute/system' instead of breaking it into two seperate mkdir targets. Submitted by: jedgar	2001-11-30 15:32:07 +00:00
rwatson	df873c5111	Use 'mkdir -p /.attribute/system' instead of breaking it into two seperate mkdir targets.	2001-11-30 15:21:20 +00:00
rwatson	3014016ccd	README.extattr incorrectly specified sample command lines for UFS_EXTATTR_AUTOSTART. Insert the missing 'initattr' arguments to extattrctl. Noticed by: green	2001-11-30 15:15:27 +00:00
guido	6ece6a6a1b	When mkdir()-ing, the parent dir gets is linkcount increased. Fix VN_KNOTE to reflect that. Found by: tobez@freebsd.org MFC after: 2 days	2001-11-22 15:33:12 +00:00
iedowse	61d89377de	Oops, when trying the dirhash sequential-access optimisation, compare the slot offset against the predicted offset, not a boolean flag. This typo effectively disabled the sequential optimisation, but was otherwise harmless. Not surprisingly, fixing this improves performance in the sequential access case. I am seeing a 7% speedup on one machine here; using dirhash when sequentially looking up directory entries is now about 5% faster instead of 2% slower than the non-dirhash case. Submitted by: KOIE Hidetaka <koie@suri.co.jp> MFC after: 1 week	2001-11-14 15:08:07 +00:00
dillon	1147eaf58a	Implement IO_NOWDRAIN and B_NOWDRAIN - prevents the buffer cache from blocking in wdrain during a write. This flag needs to be used in devices whos strategy routines turn-around and issue another high level I/O, such as when MD turns around and issues a VOP_WRITE to vnode backing store, in order to avoid deadlocking the dirty buffer draining code. Remove a vprintf() warning from MD when the backing vnode is found to be in-use. The syncer of buf_daemon could be flushing the backing vnode at the time of an MD operation so the warning is not correct. MFC after: 1 week	2001-11-05 18:48:54 +00:00
rwatson	58469c01a1	o Update copyright dates. o Add reference to TrustedBSD Project in license header. o Update dated comments, including comment in extattr.h claiming that no file systems support extended attributes. o Improve comment consistency.	2001-11-01 21:37:07 +00:00
rwatson	5e557739dc	o Althought this is not specified in POSIX.1e, the UFS ACL implementation coerces the deletion of a default ACL on a directory when no default ACL EA is present to success. Because the UFS EA implementation doesn't disinguish the EA failure modes "that EA name has not been administratively enabled" from "that EA name has no defined data", there's a potential conflict in error return values. Normally, the lack of administratively configured EA support is coerced to EOPNOTSUPP to indicate that ACLs are not available; in this case, it is possible to get a successful return, even if ACLs are not available because EA support for them has not been enabled. Expand the comment in ufs_setacl() to identify this case. Obtained from: TrustedBSD Project	2001-10-27 05:39:17 +00:00
rwatson	82286fef85	o Clarify a comment about the locking condition of the vnode upon exit from ufs_extattr_enable_with_open(). o Print auto-start notifications if (bootverbose). This was previously commented out since it didn't know how to check for bootverbose. o Drop in comments throughout indicating where ENOENT should be replaced with ENOATTR once that is available. Obtained from: TrustedBSD Project	2001-10-27 05:19:14 +00:00
rwatson	b985eb702b	o The comment about ordering the destruction of the lock and the removal of the flag indicating that the structure was initialized didn't need an XXX, since it didn't need fixing. Obtained from: TrustedBSD Project	2001-10-27 05:05:39 +00:00
rwatson	1d903fb701	o Wrap a number of long lines of code, many of which were introduced due to KSE-related (p) expansions. Obtained from: TrustedBSD Project	2001-10-27 05:03:05 +00:00
rwatson	25c2879cc9	Since namespace support was added to the UFS extended attribute implementation to replace single-character namespace prefixes, '$' is no longer an invalid attribute name, and the namespace is relevant to validity determination. o Remove '$' case from ufs_extattr_valid_attrname() o Add attrnamespace argument to ufs_extattr_valid_attrname(), and fill out appropriately. Currently no decisions are made based on the namespace argument, but may be in the future. Obtained from: TrustedBSD Project	2001-10-27 04:58:28 +00:00
dillon	f883ef447a	Implement kern.maxvnodes. adjusting kern.maxvnodes now actually has a real effect. Optimize vfs_msync(). Avoid having to continually drop and re-obtain mutexes when scanning the vnode list. Improves looping case by 500%. Optimize ffs_sync(). Avoid having to continually drop and re-obtain mutexes when scanning the vnode list. This makes a couple of assumptions, which I believe are ok, in regards to vnode stability when the mount list mutex is held. Improves looping case by 500%. (more optimization work is needed on top of these fixes) MFC after: 1 week	2001-10-26 00:08:05 +00:00
iedowse	6c4623e897	Default to not performing ufs_dirhash's extensive directory-block sanity check after every directory modification. This check can be re-enabled at any time by setting the sysctl "vfs.ufs.dirhash_docheck" to 1. This group of sanity tests was there to ensure that any UFS_DIRHASH bugs could be caught by a panic before a potentially corrupted directory block would be written to disk. It has served its main purpose now, so disable it in the interest of performance. MFC after: 1 week	2001-10-25 22:55:59 +00:00
dillon	45a6fabe87	Change the vnode list under the mount point from a LIST to a TAILQ in preparation for an implementation of limiting code for kern.maxvnodes. MFC after: 3 days	2001-10-23 01:21:29 +00:00
jhb	4806d88677	Change the kernel's ucred API as follows: - crhold() returns a reference to the ucred whose refcount it bumps. - crcopy() now simply copies the credentials from one credential to another and has no return value. - a new crshared() primitive is added which returns true if a ucred's refcount is > 1 and false (0) otherwise.	2001-10-11 23:38:17 +00:00
jhb	4c62ba7c58	Add missing includes of sys/lock.h.	2001-10-11 17:52:20 +00:00
dillon	dbebfe18a1	Remove panics for rename() race conditions. The panics are inappropriate because the IN_RENAME flag only fixes a few of the huge number of race conditions that can result in the source path becoming invalid even prior to the VOP_RENAME() call. The panics created a serious security issue whereby an attacker could fairly easily cause the panic to occur, crashing the machine. The correct solution requires a great deal of work in the namei path cache code. MFC after: 0 days	2001-10-08 00:37:54 +00:00
rwatson	63b7da3bb4	o Replace two direct uid!=0 comparisons with suser_xxx() calls. Obtained from: TrustedBSD Project	2001-10-02 14:41:43 +00:00
rwatson	bf46cc0b03	o Replace two direct uid!=0 comparisons with suser_td() calls. Obtained from: TrustedBSD Project	2001-10-02 14:34:22 +00:00
dillon	c998ec6a97	Backout the last commit. The problem is actually much worse then I first thought and may require serious work to the VOP_RENAME() api itself. Basically, by the time the VOP_RENAME() function is called, it's already too late.	2001-10-02 04:26:58 +00:00
dillon	5eaf310a14	IN_RENAME should only be cleared by the routine that set it. This fixes a rename/rmdir race that has been shown to cause a panic. Bug reported by: Yevgeniy Aleynikov <eugenea@infospace.com> MFC after: 3 days	2001-10-02 02:58:48 +00:00
jhb	fe3fc4701b	- Fix some minor whitespace nits. - Move the SPECIAL_FLAG #define up next to the NOHOLDER #define and fix a little nit that caused it to be defined as -(sizeof (struct thread) + 1) instead of -2.	2001-09-27 21:04:13 +00:00
rwatson	9eed33b643	o Re-enable support of system file flags in jail() by adding back the PRISON_ROOT to the suser_xxx() check. Since securelevels may now be raised in specific jails, use of system flags can still be restricted in jail(), but in a more configurable way. o Users of jail() expecting system flags (such as schg) to restrict jail()'s should be sure to set the securelevel appropriately in jail()'s. o This fixes activities involving automated system flag removal in jail(), including installkernel and friends. Obtained from: TrustedBSD Project	2001-09-26 20:44:41 +00:00
rwatson	4e4d85b5d1	o Modify ufs_setattr() so that it uses securelevel_gt() instead of direct variable access. Obtained from: TrustedBSD Project	2001-09-26 20:31:37 +00:00
rwatson	56db2a389f	o Further clarify comment: ad Udo's request, re-insert the 'if' refering to securelevels; also, update the unprivileged process text to better indicate the scope of actions permittable when any system flags are already set (limited). Submitted by: Udo Schweigert <udo.schweigert@siemens.com>	2001-09-25 12:02:44 +00:00
rwatson	de37df3afa	o Parallelize the comment on the relationship between privileged un-jailed processes and the actual securelevel check: make the comment use '> 0' instead of inverted '<= 0'.	2001-09-25 02:26:10 +00:00
iedowse	af5c7958aa	The addition of i_dirhash to struct inode pushed RELENG_4's sizeof(struct inode) into a new malloc bucket on the i386. This didn't happen in -current due to the removal of i_lock, but it does no harm to apply the workaround to -current first. Reduce the size of the i_spare[] array in struct inode from 4 to 3 entries, and change ext2fs to use i_din.di_spare[1] so that it does not need i_spare[3]. Reviewed by: bde MFC after: 3 days	2001-09-24 18:29:20 +00:00
julian	5596676e6c	KSE Milestone 2 Note ALL MODULES MUST BE RECOMPILED make the kernel aware that there are smaller units of scheduling than the process. (but only allow one thread per process at this time). This is functionally equivalent to teh previousl -current except that there is a thread associated with each process. Sorry john! (your next MFC will be a doosie!) Reviewed by: peter@freebsd.org, dillon@freebsd.org X-MFC after: ha ha ha ha	2001-09-12 08:38:13 +00:00
iedowse	ddaced1703	The "dirpref" directory layout preference improvements make use of an array "fs_contigdirs[]" to avoid too many directories getting created in each cylinder group. The memory required for this and two other arrays (fs_csp[] and fs_maxcluster[]) is allocated with a single malloc() call, and divided up afterwards. However, the 'space' pointer is not advanced correctly, so fs_contigdirs and fs_maxcluster end up pointing to the same address. Add the missing code to advance the 'space' pointer, and remove an unnecessary update of the pointer that follows. This is likely to fix the "ffs_clusteralloc: map mismatch" panics that have been reported recently. Submitted by: Luke Mewburn <lukem@wasabisystems.com>	2001-09-09 23:48:28 +00:00
jedgar	5b9926665b	Use ACL_PERM_NONE instead of hardcoding 0 when initializing ACL entry permissions. Reviewed by: rwatson	2001-09-01 23:18:15 +00:00
rwatson	00ec5e8482	o At some point, unmounting a non-EA file system with EA's compiled in got a bit broken, when ufs_extattr_stop() was called and failed, ufs_extattr_destroy() would panic. This makes the call to destroy() conditional on the success of stop(). Submitted by: Christian Carstensen <cc@devcon.net> Obtained from: TrustedBSD Project	2001-09-01 20:11:05 +00:00
peter	4b437abe78	If a file has been completely unlinked, stop automatically syncing the file. ffs will discard any pending dirty pages when it is closed, so we may as well not waste time trying to clean them. This doesn't stop other things from writing it out, eg: pageout, fsync(2) etc.	2001-08-27 06:09:56 +00:00
iedowse	bff1ce4c20	Stop using dirhash when a directory is removed, and ensure that we never attempt to hash directories once they are deleted. This fixes a problem where operations on a deleted directory could trigger dirhash sanity panics.	2001-08-26 20:47:19 +00:00
iedowse	c8ef91ce6c	When compacting directories, ufs_direnter() always trusted DIRSIZ() to supply the number of bytes to be bcopy()'d to move an entry. If d_ino == 0 however, DIRSIZ() is not guaranteed to return a sensible length, so ufs_direnter could end up corrupting a directory during compaction. In practice I believe this can only happen after fsck_ffs has fixed a previously-corrupted directory. We now deal with any mid-block unused entries specially to avoid using DIRSIZ() or bcopy() on such entries. We also ensure that the variables 'dsize' and 'spacefree' contain meaningful values at all times. Add a few comments to describe better this intricate piece of code. The special handling of mid-block unused entries makes the dirhash- specific bugfix in the previous revision (1.53) now uncecessary, so this change removes it. Reviewed by: mckusick	2001-08-26 01:25:12 +00:00
iedowse	c37a1a0228	When compressing directory blocks, the dirhash code didn't check that the directory entry was in use before attempting to find it in the hash structures to change its offset. Normally, unused entries do not need to be moved, but fsck can leave behind some unused entries that do. A dirhash sanity panic resulted when the entry to be moved was not found. Add a check that stops entries with d_ino == 0 from being passed to ufsdirhash_move().	2001-08-22 01:35:17 +00:00
peter	886fd9ad64	Sigh. ufs_lookup() calls ffs_snapgone(), meaning that 'options EXT2FS' without 'options FFS' would fail to link.	2001-08-18 03:08:48 +00:00
iedowse	c768482a0f	Two recent commits in sys/ufs/ufs interacted badly with ext2fs because it shares ufs code. In ufs_fhtovp(), the test on i_effnlink is invalid because ext2fs does not maintain this field. In ufs_close(), i_effnlink is also tested, to determines whether or not to call vn_start_write(). The ufs_fhtovp issue breaks NFS exporting of ext2fs filesystems; I believe the other is harmless. Fix both cases by checking um_i_effnlink_valid in the ufsmount struct, and use i_nlink if necessary. Noticed by: bde Reviewed by: mckusick, bde	2001-07-29 22:26:01 +00:00
iedowse	5857132545	Disable the dirhash sanity check that panics if an unused directory entry (d_ino == 0) is found in a position that is not the start of a DIRBLKSIZ block. While such entries cannot occur normally (ufs always extends the previous entry to cover the free space instead), they do not cause problems and fsck does not fix them, so panicking is bad.	2001-07-27 18:45:41 +00:00
peter	57a6887663	Use a fixed type for times in on-disk structures for ufs rather than something that could potentially change like time_t.	2001-07-16 00:55:27 +00:00
iedowse	991ee42593	Return a locked struct buf from ufsdirhash_lookup() to avoid one extra getblk/brelse sequence for each lookup. We already had this buf in ufsdirhash_lookup(), so there was no point in brelse'ing it only to have the caller immediately reaquire the same buffer. This should make the case of sequential lookups marginally faster; in my tests, sequential lookups with dirhash enabled are now only around 1% slower than without dirhash.	2001-07-13 20:50:38 +00:00
iedowse	ecbac42d61	Bring in dirhash, a simple hash-based lookup optimisation for large directories. When enabled via "options UFS_DIRHASH", in-core hash arrays are maintained for large directories. These allow all directory operations to take place quickly instead of requiring long linear searches. For now anyway, dirhash is not enabled by default. The in-core hash arrays have a memory requirement that is approximately half the size of the size of the on-disk directory file. A number of new sysctl variables allow control over which directories get hashed and over the maximum amount of memory that dirhash will use: vfs.ufs.dirhash_minsize The minimum on-disk directory size for which hashing should be used. The default is 2560 (2.5k). vfs.ufs.dirhash_maxmem The system-wide maximum total memory to be used by dirhash data structures. The default is 2097152 (2MB). The current amount of memory being used by dirhash is visible through the read-only sysctl variable vfs.ufs.dirhash_maxmem. Finally, some extra sanity checks that are enabled by default, but which may have an impact on performance, can be disabled by setting vfs.ufs.dirhash_docheck to 0. Discussed on: -fs, -hackers	2001-07-10 21:21:29 +00:00
dillon	e028603b7e	With Alfred's permission, remove vm_mtx in favor of a fine-grained approach (this commit is just the first stage). Also add various GIANT_ macros to formalize the removal of Giant, making it easy to test in a more piecemeal fashion. These macros will allow us to test fine-grained locks to a degree before removing Giant, and also after, and to remove Giant in a piecemeal fashion via sysctl's on those subsystems which the authors believe can operate without Giant.	2001-07-04 16:20:28 +00:00
jhb	1bc2ddffa0	Fix more mntvnode and vnode interlock order reversals.	2001-06-28 22:21:33 +00:00
jhb	822af43e20	- Fix a mntvnode and vnode interlock reversal. - Protect the mnt_vnode list with the mntvnode lock. - Use queue(9) macros.	2001-06-28 04:12:56 +00:00
peter	efc05ef868	Fix warning: 1973: warning: int format, long int arg (arg 5)	2001-06-15 07:44:39 +00:00
mckusick	2f9709c0f1	Build on the change in revision 1.98 by Tor.Egge@fast.no. The symptom being treated in 1.98 was to avoid freeing a pagedep dependency if there was still a newdirblk dependency referencing it. That change is correct and no longer prints a warning message when it occurs. The other part of revision 1.98 was to panic when a newdirblk dependency was encountered during a file truncation. This fix removes that panic and replaces it with code to find and delete the newdirblk dependency so that the truncation can succeed.	2001-06-13 23:13:13 +00:00
tmm	126158251a	Call vn_close on the backing file vnode if ufs_extattr_enable failed to avoid leaking it. Reviewed by: rwatson	2001-06-07 00:11:32 +00:00
jlemon	bf3a0c1bb1	Add a wrapper for the fifo kqfilter which falls through to the ufs routine. This permits the fifo to inherit the ufs VNODE kqfilter.	2001-06-06 17:40:57 +00:00
jlemon	404f6dd2da	Add a kqueue filter for writing to ufs filesystems which always returns true. This permits better interoperability with programs which register filters on their stdin/stdout handles. Submitted by: Niels Provos <provos@citi.umich.edu>	2001-06-05 13:52:37 +00:00
obrien	c3d076fe28	There seems to be a problem that the order of disk write operation being incorrect due to a missing check for some dependency. This change avoids the freelist corruption (but not the temporarily inconsistent state of the file system). A message is printed as a reminder of the under lying problem when a pagedep structure is not freed due to the NEWBLOCK flag being set. Submitted by: Tor.Egge@fast.no	2001-06-05 01:49:37 +00:00
jhb	ff6bb62be3	Revert the previous commit in favor of the fix in rev 1.42 of ufs/ffs/ffs_extern.h instead. Requested by: bde	2001-05-30 23:09:19 +00:00
jhb	b033de0fdc	Forward declare struct cg to quiet a warning. Submitted by: bde	2001-05-30 23:08:40 +00:00
jhb	3d4bb0e38d	Include <ufs/ffs/fs.h> to get the definition of struct cg to quiet a warning.	2001-05-29 23:53:16 +00:00
phk	d442761285	Remove last vestiges of MFS.	2001-05-29 21:21:53 +00:00
phk	785e1e1e17	Remove MFS from the kernel.	2001-05-29 18:50:30 +00:00
tmm	76ca05f26f	Add a check to determine whether extended attributes have been initialized on the file system before trying to grab the lock of the per-mount extattr structure, as this lock is unitialized in that case. This is needed because ufs_extattr_vnode_inactive is called from ufs_inactive, which is also used by EA-unaware file systems such as ext2fs. Reviewed by: rwatson	2001-05-25 18:24:52 +00:00
rwatson	f504530d9f	o Merge contents of struct pcred into struct ucred. Specifically, add the real uid, saved uid, real gid, and saved gid to ucred, as well as the pcred->pc_uidinfo, which was associated with the real uid, only rename it to cr_ruidinfo so as not to conflict with cr_uidinfo, which corresponds to the effective uid. o Remove p_cred from struct proc; add p_ucred to struct proc, replacing original macro that pointed. p->p_ucred to p->p_cred->pc_ucred. o Universally update code so that it makes use of ucred instead of pcred, p->p_ucred instead of p->p_pcred, cr_ruidinfo instead of p_uidinfo, cr_{r,sv}{u,g}id instead of p_*, etc. o Remove pcred0 and its initialization from init_main.c; initialize cr_ruidinfo there. o Restruction many credential modification chunks to always crdup while we figure out locking and optimizations; generally speaking, this means moving to a structure like this: newcred = crdup(oldcred); ... p->p_ucred = newcred; crfree(oldcred); It's not race-free, but better than nothing. There are also races in sys_process.c, all inter-process authorization, fork, exec, and exit. o Remove sigio->sio_ruid since sigio->sio_ucred now contains the ruid; remove comments indicating that the old arrangement was a problem. o Restructure exec1() a little to use newcred/oldcred arrangement, and use improved uid management primitives. o Clean up exit1() so as to do less work in credential cleanup due to pcred removal. o Clean up fork1() so as to do less work in credential cleanup and allocation. o Clean up ktrcanset() to take into account changes, and move to using suser_xxx() instead of performing a direct uid==0 comparision. o Improve commenting in various kern_prot.c credential modification calls to better document current behavior. In a couple of places, current behavior is a little questionable and we need to check POSIX.1 to make sure it's "right". More commenting work still remains to be done. o Update credential management calls, such as crfree(), to take into account new ruidinfo reference. o Modify or add the following uid and gid helper routines: change_euid() change_egid() change_ruid() change_rgid() change_svuid() change_svgid() In each case, the call now acts on a credential not a process, and as such no longer requires more complicated process locking/etc. They now assume the caller will do any necessary allocation of an exclusive credential reference. Each is commented to document its reference requirements. o CANSIGIO() is simplified to require only credentials, not processes and pcreds. o Remove lots of (p_pcred==NULL) checks. o Add an XXX to authorization code in nfs_lock.c, since it's questionable, and needs to be considered carefully. o Simplify posix4 authorization code to require only credentials, not processes and pcreds. Note that this authorization, as well as CANSIGIO(), needs to be updated to use the p_cansignal() and p_cansched() centralized authorization routines, as they currently do not take into account some desirable restrictions that are handled by the centralized routines, as well as being inconsistent with other similar authorization instances. o Update libkvm to take these changes into account. Obtained from: TrustedBSD Project Reviewed by: green, bde, jhb, freebsd-arch, freebsd-audit	2001-05-25 16:59:11 +00:00
dillon	a179ee09ab	This patch implements O_DIRECT about 80% of the way. It takes a patchset Tor created a while ago, removes the raw I/O piece (that has cache coherency problems), and adds a buffer cache / VM freeing piece. Essentially this patch causes O_DIRECT I/O to not be left in the cache, but does not prevent it from going through the cache, hence the 80%. For the last 20% we need a method by which the I/O can be issued directly to buffer supplied by the user process and bypass the buffer cache entirely, but still maintain cache coherency. I also have the code working under -stable but the changes made to sys/file.h may not be MFCable, so an MFC is not on the table yet. Submitted by: tegge, dillon	2001-05-24 07:22:27 +00:00
alfred	edd79c680a	ufs_bmaparray() may block on IO, drop vm mutex and aquire Giant when calling it from the pager routine	2001-05-23 10:30:25 +00:00
ru	35437d86aa	- FDESC, FIFO, NULL, PORTAL, PROC, UMAP and UNION file systems were repo-copied from sys/miscfs to sys/fs. - Renamed the following file systems and their modules: fdesc -> fdescfs, portal -> portalfs, union -> unionfs. - Renamed corresponding kernel options: FDESC -> FDESCFS, PORTAL -> PORTALFS, UNION -> UNIONFS. - Install header files for the above file systems. - Removed bogus -I${.CURDIR}/../../sys CFLAGS from userland Makefiles.	2001-05-23 09:42:29 +00:00
mckusick	c9a89dbc03	Update softdep_setup_directory_add prototype to reflect changes in actual function. Obtained from: Jim Bloom <bloom@jbloom.jbloom.org>	2001-05-20 15:59:55 +00:00
mckusick	aac9daff0f	Must ensure that all the entries on the pd_pendinghd list have been committed to disk before clearing them. More specifically, when free_newdirblk is called, we know that the inode claims the new directory block. However, if the associated pagedep is still linked onto the directory buffer dependency chain, then some of the entries on the pd_pendinghd list may not be committed to disk yet. In this case, we will simply note that the inode claims the block and let the pd_pendinghd list be processed when the pagedep is next written. If the pagedep is no longer on the buffer dependency chain, then all the entries on the pd_pending list are committed to disk and we can free them in free_newdirblk. This corrects a window of vulnerability introduced in the code added in version 1.95.	2001-05-19 19:24:26 +00:00
alfred	a3f0842419	Introduce a global lock for the vm subsystem (vm_mtx). vm_mtx does not recurse and is required for most low level vm operations. faults can not be taken without holding Giant. Memory subsystems can now call the base page allocators safely. Almost all atomic ops were removed as they are covered under the vm mutex. Alpha and ia64 now need to catch up to i386's trap handlers. FFS and NFS have been tested, other filesystems will need minor changes (grabbing the vm lock when twiddling page properties). Reviewed (partially) by: jake, jhb	2001-05-19 01:28:09 +00:00
mckusick	c8947d0f03	Must be a bit less aggressive about freeing pagedep structures. Obtained from: Robert Watson <rwatson@FreeBSD.org> and Matthew Jacob <mjacob@feral.com>	2001-05-18 22:16:28 +00:00
mckusick	5411edc0fb	When a new block is allocated to a directory, an fsync of a file whose name is within that block must ensure not only that the block containing the file name has been written, but also that the on-disk directory inode references that block. When a new directory block is created, we allocate a newdirblk structure which is linked to the associated allocdirect (on its ad_newdirblk list). When the allocdirect has been satisfied, the newdirblk structure is moved to the inodedep id_bufwait list of its directory to await the inode being written. When the inode is written, the directory entries are fully committed and can be deleted from their pagedep->id_pendinghd and inodedep->id_pendinghd lists.	2001-05-17 07:24:03 +00:00
iedowse	dafd513732	Change the second argument of vflush() to an integer that specifies the number of references on the filesystem root vnode to be both expected and released. Many filesystems hold an extra reference on the filesystem root vnode, which must be accounted for when determining if the filesystem is busy and then released if it isn't busy. The old `skipvp' approach required individual filesystem xxx_unmount functions to re-implement much of vflush()'s logic to deal with the root vnode. All 9 filesystems that hold an extra reference on the root vnode got the logic wrong in the case of forced unmounts, so `umount -f' would always fail if there were any extra root vnode references. Fix this issue centrally in vflush(), now that we can. This commit also fixes a vnode reference leak in devfs, which could result in idle devfs filesystems that refuse to unmount. Reviewed by: phk, bp	2001-05-16 18:04:37 +00:00
mckusick	dca0cbadc3	Further fixes for deadlock in the presence of multiple snapshots. There are still more to find, but this fix should cover the common cases that folks are hitting.	2001-05-14 17:16:49 +00:00
mckusick	0c59ac9908	If the effective link count is zero when an NFS file handle request comes in for it, the file is really gone, so return ESTALE. The problem arises when the last reference to an FFS file is released because soft-updates may delay the actual freeing of the inode for some time. Since there are no filesystem links or open file descriptors referencing the inode, from the point of view of the system, the file is inaccessible. However, if the filesystem is NFS exported, then the remote client can still access the inode via ufs_fhtovp() until the inode really goes away. To prevent this anomoly, it is necessary to begin returning ESTALE at the same time that the file ceases to be accessible to the local filesystem. Obtained from: Ian Dowse <iedowse@maths.tcd.ie>	2001-05-13 23:30:45 +00:00
mckusick	8b011b16b3	Remove yet another deadlock case.	2001-05-11 07:12:03 +00:00
mckusick	d8ef9cc3b5	When running with soft updates, track the number of blocks and files that are committed to being freed and reflect these blocks in the counts returned by statfs (and thus also by the `df' command). This change allows programs such as those that do news expiration to know when to stop if they are trying to create a certain percentage of free space. Note that this change does not solve the much harder problem of making this to-be-freed space available to applications that want it (thus on a nearly full filesystem, you may still encounter out-of-space conditions even though the free space will show up eventually). Hopefully this harder problem will be the subject of a future enhancement.	2001-05-08 07:42:20 +00:00
mckusick	1be6156a3d	Several fixes for units errors: 1) Do not assume that the superblock will be of size fs->fs_bsize. This fixes a panic when taking a snapshot on a filesystem with a block size bigger than 8K. 2) Properly calculate the number of fragments that follow the superblock summary information. This fixes a bug with inconsistent snapshots. 3) When cleaning up a snapshot that is about to be removed, properly calculate the number of blocks that need to be checked. This fixes a bug that created partially allocated inodes. 4) When moving blocks from a snapshot that is about to be removed to another snapshot, properly account for the reduced number of blocks in the snapshot from which they are taken. This fixes a bug in which the number of blocks released from a snapshot did not match the number that it claimed to have.	2001-05-08 07:29:03 +00:00
mckusick	36984adaef	When syncing out snapshot metadata, we must temporarily allow recursive buffer locking so as to avoid locking against ourselves if we need to write filesystem metadata.	2001-05-08 07:13:00 +00:00
mckusick	547825394d	Refinement to revision 1.16 of ufs/ffs/ffs_snapshot.c to reduce the amount of time that the filesystem must be suspended. The current snapshot is elided as well as the earlier snapshots.	2001-05-04 05:49:28 +00:00
phk	c41ac84ca2	Use ufs_bmaparray() rather than VOP_BMAP() on our own vnodes.	2001-05-01 09:12:39 +00:00
phk	279d435d13	Remove blatantly pointless call to VOP_BMAP(). Use ufs_bmaparray() rather than VOP_BMAP() on our own vnodes.	2001-05-01 09:12:31 +00:00
phk	5948c9ed5b	Implement vop_std{get\|put}pages() and add them to the default vop[]. Un-copy&paste all the VOP_{GET\|PUT}PAGES() functions which do nothing but the default.	2001-05-01 08:34:45 +00:00
markm	bcca5847d5	Undo part of the tangle of having sys/lock.h and sys/mutex.h included in other "system" header files. Also help the deprecation of lockmgr.h by making it a sub-include of sys/lock.h and removing sys/lockmgr.h form kernel .c files. Sort sys/*.h includes where possible in affected files. OK'ed by: bde (with reservations)	2001-05-01 08:13:21 +00:00
phk	8e3fa89968	VOP_BALLOC was never really a VOP in the first place, so convert it to UFS_BALLOC like the other "between UFS and FFS function interfaces".	2001-04-29 12:36:52 +00:00
phk	608c1caf3b	Add a vop_stdbmap(), and make it part of the default vop vector. Make 7 filesystems which don't really know about VOP_BMAP rely on the default vector, rather than more or less complete local vop_nopbmap() implementations.	2001-04-29 11:48:41 +00:00
phk	a86fd16038	Call ufs_bmaparray() directly instead of indirectly via VOP_BMAP().	2001-04-29 10:25:30 +00:00
phk	cc0f43bb51	Remove two unused arguments from ufs_bmaparray().	2001-04-29 10:24:58 +00:00
phk	cc62125a67	Remove faint traces of blind copy&paste.	2001-04-29 10:23:50 +00:00
phk	d2b1930b66	Remove faint traces of non-existant ffs_bmap().	2001-04-29 10:23:32 +00:00
grog	4b9d9cbaac	Revert consequences of changes to mount.h, part 2. Requested by: bde	2001-04-29 02:45:39 +00:00
mckusick	ed86738068	Rather than copying all the indirect blocks of the snapshot, simply mark them as BLK_NOCOPY. This trick cuts the initial size of the snapshot in half and cuts the time to take a snapshot by a third.	2001-04-26 00:50:53 +00:00
mckusick	f863141979	When closing the last reference to an unlinked file, it is freed by the inactive routine. Because the freeing causes the filesystem to be modified, the close must be held up during periods when the filesystem is suspended. For snapshots to be consistent across crashes, they must write blocks that they copy and claim those written blocks in their on-disk block pointers before the old blocks that they referenced can be allowed to be written. Close a loophole that allowed unwritten blocks to be skipped when doing ffs_sync with a request to wait for all I/O activity to be completed.	2001-04-25 08:11:18 +00:00
phk	cdc83afc7f	Move the netexport structure from the fs-specific mountstructure to struct mount. This makes the "struct netexport *" paramter to the vfs_export and vfs_checkexport interface unneeded. Consequently that all non-stacking filesystems can use vfs_stdcheckexp(). At the same time, make it a pointer to a struct netexport in struct mount, so that we can remove the bogus AF_MAX and #include <net/radix.h> from <sys/mount.h>	2001-04-25 07:07:52 +00:00
iedowse	383dd0a265	Pre-dirpref versions of fsck may zero out the new superblock fields fs_contigdirs, fs_avgfilesize and fs_avgfpdir. This could cause panics if these fields were zeroed while a filesystem was mounted read-only, and then remounted read-write. Add code to ffs_reload() which copies the fs_contigdirs pointer from the previous superblock, and reinitialises fs_avgf* if necessary. Reviewed by: mckusick	2001-04-24 00:37:16 +00:00
grog	1f5de30718	Correct #includes to work with fixed sys/mount.h.	2001-04-23 09:05:15 +00:00
phk	378e561228	This patch removes the VOP_BWRITE() vector. VOP_BWRITE() was a hack which made it possible for NFS client side to use struct buf with non-bio backing. This patch takes a more general approach and adds a bp->b_op vector where more methods can be added. The success of this patch depends on bp->b_op being initialized all relevant places for some value of "relevant" which is not easy to determine. For now the buffers have grown a b_magic element which will make such issues a tiny bit easier to debug.	2001-04-17 08:56:39 +00:00
mckusick	ba66879022	Add debugging option to always read/write cylinder groups as full sized blocks. To enable this option, use: `sysctl -w debug.bigcgs=1'. Add debugging option to disable background writes of cylinder groups. To enable this option, use: `sysctl -w debug.dobkgrdwrite=0'. These debugging options should be tried on systems that are panicing with corrupted cylinder group maps to see if it makes the problem go away. The set of panics in question are: ffs_clusteralloc: map mismatch ffs_nodealloccg: map corrupted ffs_nodealloccg: block not in map ffs_alloccg: map corrupted ffs_alloccg: block not in map ffs_alloccgblk: cyl groups corrupted ffs_alloccgblk: can't find blk in cyl ffs_checkblk: partially free fragment The following panics are less likely to be related to this problem, but might be helped by these debugging options: ffs_valloc: dup alloc ffs_blkfree: freeing free block ffs_blkfree: freeing free frag ffs_vfree: freeing free inode If you try these options, please report whether they helped reduce your bitmap corruption panics to Kirk McKusick at <mckusick@mckusick.com> and to Matt Dillon <dillon@earth.backplane.com>.	2001-04-17 05:37:51 +00:00
mckusick	6ea67910b6	Background fsck sysctl operations must use vn_start_write and vn_finished_write so that they do not attempt to modify a suspended filesystem.	2001-04-17 05:06:37 +00:00
rwatson	678b28a532	In my first reading of POSIX.1e, I misinterpreted handling of the ACL_USER_OBJ and ACL_GROUP_OBJ fields, believing that modification of the access ACL could be used by privileged processes to change file/directory ownership. In fact, this is incorrect; ACL_*_OBJ (+ ACL_MASK and ACL_OTHER) should have undefined ae_id fields; this commit attempts to correct that misunderstanding. o Modify arguments to vaccess_acl_posix1e() to accept the uid and gid associated with the vnode, as those can no longer be extracted from the ACL passed as an argument. Perform all comparisons against the passed arguments. This actually has the effect of simplifying a number of components of this call, as well as reducing the indent level, but now seperates handling of ACL_GROUP_OBJ from ACL_GROUP. o Modify acl_posix1e_check() to return EINVAL if the ae_id field of any of the ACL_{USER_OBJ,GROUP_OBJ,MASK,OTHER} entries is a value other than ACL_UNDEFINED_ID. As a temporary work-around to allow clean upgrades, set the ae_id field to ACL_UNDEFINED_ID before each check so that this cannot cause a failure in the short term (this work-around will be removed when the userland libraries and utilities are updated to take this change into account). o Modify ufs_sync_acl_from_inode() so that it forces ACL_{USER_OBJ,GROUP_OBJ,MASK,OTHER} ae_id fields to ACL_UNDEFINED_ID when synchronizing the ACL from the inode. o Modify ufs_sync_inode_from_acl to not propagate uid and gid information to the inode from the ACL during ACL update. Also modify the masking of permission bits that may be set from ALLPERMS to (S_IRWXU\|S_IRWXG\|S_IRWXO), as ACLs currently do not carry none-ACCESSPERMS (S_ISUID, S_ISGID, S_ISTXT). o Modify ufs_getacl() so that when it emulates an access ACL from the inode, it initializes the ae_id fields to ACL_UNDEFINED_ID. o Clean up ufs_setacl() substantially since it is no longer possible to perform chown/chgrp operations using vop_setacl(), so all the access control for that can be eliminated. o Modify ufs_access() so that it passes owner uid and gid information into vaccess_acl_posix1e(). Pointed out by: jedger Obtained from: TrustedBSD Project	2001-04-17 04:33:34 +00:00
mckusick	27094e6d21	Update to describe use of mdconfig instead of deprecated vnconfig. Submitted by: Steve Ames <steve@virtual-voodoo.com>	2001-04-14 18:32:09 +00:00
mckusick	6a7a6ab20d	This checkin adds support in ufs/ffs for the FS_NEEDSFSCK flag. It is described in ufs/ffs/fs.h as follows: /* * Filesystem flags. * * Note that the FS_NEEDSFSCK flag is set and cleared only by the * fsck utility. It is set when background fsck finds an unexpected * inconsistency which requires a traditional foreground fsck to be * run. Such inconsistencies should only be found after an uncorrectable * disk error. A foreground fsck will clear the FS_NEEDSFSCK flag when * it has successfully cleaned up the filesystem. The kernel uses this * flag to enforce that inconsistent filesystems be mounted read-only. / #define FS_UNCLEAN 0x01 / filesystem not clean at mount / #define FS_DOSOFTDEP 0x02 / filesystem using soft dependencies / #define FS_NEEDSFSCK 0x04 / filesystem needs sync fsck before mount */	2001-04-14 05:26:28 +00:00
mckusick	3931e94b1f	Directory layout preference improvements from Grigoriy Orlov <gluk@ptci.ru>. His description of the problem and solution follow. My own tests show speedups on typical filesystem intensive workloads of 5% to 12% which is very impressive considering the small amount of code change involved. ------ One day I noticed that some file operations run much faster on small file systems then on big ones. I've looked at the ffs algorithms, thought about them, and redesigned the dirpref algorithm. First I want to describe the results of my tests. These results are old and I have improved the algorithm after these tests were done. Nevertheless they show how big the perfomance speedup may be. I have done two file/directory intensive tests on a two OpenBSD systems with old and new dirpref algorithm. The first test is "tar -xzf ports.tar.gz", the second is "rm -rf ports". The ports.tar.gz file is the ports collection from the OpenBSD 2.8 release. It contains 6596 directories and 13868 files. The test systems are: 1. Celeron-450, 128Mb, two IDE drives, the system at wd0, file system for test is at wd1. Size of test file system is 8 Gb, number of cg=991, size of cg is 8m, block size = 8k, fragment size = 1k OpenBSD-current from Dec 2000 with BUFCACHEPERCENT=35 2. PIII-600, 128Mb, two IBM DTLA-307045 IDE drives at i815e, the system at wd0, file system for test is at wd1. Size of test file system is 40 Gb, number of cg=5324, size of cg is 8m, block size = 8k, fragment size = 1k OpenBSD-current from Dec 2000 with BUFCACHEPERCENT=50 You can get more info about the test systems and methods at: http://www.ptci.ru/gluk/dirpref/old/dirpref.html Test Results tar -xzf ports.tar.gz rm -rf ports mode old dirpref new dirpref speedup old dirprefnew dirpref speedup First system normal 667 472 1.41 477 331 1.44 async 285 144 1.98 130 14 9.29 sync 768 616 1.25 477 334 1.43 softdep 413 252 1.64 241 38 6.34 Second system normal 329 81 4.06 263.5 93.5 2.81 async 302 25.7 11.75 112 2.26 49.56 sync 281 57.0 4.93 263 90.5 2.9 softdep 341 40.6 8.4 284 4.76 59.66 "old dirpref" and "new dirpref" columns give a test time in seconds. speedup - speed increasement in times, ie. old dirpref / new dirpref. ------ Algorithm description The old dirpref algorithm is described in comments: /* * Find a cylinder to place a directory. * * The policy implemented by this algorithm is to select from * among those cylinder groups with above the average number of * free inodes, the one with the smallest number of directories. / A new directory is allocated in a different cylinder groups than its parent directory resulting in a directory tree that is spreaded across all the cylinder groups. This spreading out results in a non-optimal access to the directories and files. When we have a small filesystem it is not a problem but when the filesystem is big then perfomance degradation becomes very apparent. What I mean by a big file system ? 1. A big filesystem is a filesystem which occupy 20-30 or more percent of total drive space, i.e. first and last cylinder are physically located relatively far from each other. 2. It has a relatively large number of cylinder groups, for example more cylinder groups than 50% of the buffers in the buffer cache. The first results in long access times, while the second results in many buffers being used by metadata operations. Such operations use cylinder group blocks and on-disk inode blocks. The cylinder group block (fs->fs_cblkno) contains struct cg, inode and block bit maps. It is 2k in size for the default filesystem parameters. If new and parent directories are located in different cylinder groups then the system performs more input/output operations and uses more buffers. On filesystems with many cylinder groups, lots of cache buffers are used for metadata operations. My solution for this problem is very simple. I allocate many directories in one cylinder group. I also do some things, so that the new allocation method does not cause excessive fragmentation and all directory inodes will not be located at a location far from its file's inodes and data. The algorithm is: / * Find a cylinder group to place a directory. * * The policy implemented by this algorithm is to allocate a * directory inode in the same cylinder group as its parent * directory, but also to reserve space for its files inodes * and data. Restrict the number of directories which may be * allocated one after another in the same cylinder group * without intervening allocation of files. * * If we allocate a first level directory then force allocation * in another cylinder group. / My early versions of dirpref give me a good results for a wide range of file operations and different filesystem capacities except one case: those applications that create their entire directory structure first and only later fill this structure with files. My solution for such and similar cases is to limit a number of directories which may be created one after another in the same cylinder group without intervening file creations. For this purpose, I allocate an array of counters at mount time. This array is linked to the superblock fs->fs_contigdirs[cg]. Each time a directory is created the counter increases and each time a file is created the counter decreases. A 60Gb filesystem with 8mb/cg requires 10kb of memory for the counters array. The maxcontigdirs is a maximum number of directories which may be created without an intervening file creation. I found in my tests that the best performance occurs when I restrict the number of directories in one cylinder group such that all its files may be located in the same cylinder group. There may be some deterioration in performance if all the file inodes are in the same cylinder group as its containing directory, but their data partially resides in a different cylinder group. The maxcontigdirs value is calculated to try to prevent this condition. Since there is no way to know how many files and directories will be allocated later I added two optimization parameters in superblock/tunefs. They are: int32_t fs_avgfilesize; / expected average file size / int32_t fs_avgfpdir; / expected # of files per directory */ These parameters have reasonable defaults but may be tweeked for special uses of a filesystem. They are only necessary in rare cases like better tuning a filesystem being used to store a squid cache. I have been using this algorithm for about 3 months. I have done a lot of testing on filesystems with different capacities, average filesize, average number of files per directory, and so on. I think this algorithm has no negative impact on filesystem perfomance. It works better than the default one in all cases. The new dirpref will greatly improve untarring/removing/coping of big directories, decrease load on cvs servers and much more. The new dirpref doesn't speedup a compilation process, but also doesn't slow it down. Obtained from: Grigoriy Orlov <gluk@ptci.ru>	2001-04-10 08:38:59 +00:00
rwatson	2208cab11f	o Indent sub-section headings to be consistent with README.extattr. Obtained from: TrustedBSD Project	2001-04-03 18:05:03 +00:00
rwatson	f39773137b	o Introduce a README file describing briefly how to use access control lists, in the style of FFS README files for soft updates and snapshots. Obtained from: TrustedBSD Project	2001-04-03 17:58:25 +00:00
rwatson	d43ef707ba	o Introduce a README file describing briefly how to use extended attributes, in the style of FFS README files for soft updates and snapshots. Obtained from: TrustedBSD Project	2001-04-03 17:31:36 +00:00
rwatson	6805eb2bf4	o Change the default from using IO_SYNC on EA set and delete operations to not using IO_SYNC. Expose a sysctl (debug.ufs_extattr_sync) for enabling the use of IO_SYNC. - Use of IO_SYNC substantially degrades ACL performance when a default ACL is set on a directory, as there are four synchronous writes initiated to define both supporting EAs for new sub-directories, and to set the data; two for new files. Later, this may be optimized to two writes for sub-directories, one for new files. - IO_SYNC does not substantially improve consistency properties due to the poor consistency properties of existing permissions (which ACLs are a superset of), due to interaction with soft updates, and due to differences in handling consistency for data and file system meta-data. - In macro-benchmarks, this reduces the overhead of setting default ACLs down to the same overhead as enabling ACLs on a file system and not using them. Enabling ACLs still introduces a small overhead (I measure 7% on a -j 2 buildworld with pre-allocated EA backing store, but this is not rigorous testing, nor in any way optimized). - The sysctl will probably change to another administration method (or at least, a better name) in the near future, but consistency properties of EAs are still being worked out. The toggle is defined right now to allow easier performance analysis and exploration of possible guarantees. Obtained from: TrustedBSD Project	2001-04-03 04:09:53 +00:00
rwatson	3100bf9079	o Correct an ACL implementation bug that could result in a system panic under heavy use when default ACLs were bgin inherited by new files or directories. This is done by removing a bug in default ACL reading, and improving error handling for this failure case: - Move the setting of the buffer length (len) variable to above the ACL type (ap->a_type) switch rather than having it only for ACL_TYPE_ACCESS. Otherwise, the len variable is unitialized in the ACL_TYPE_DEFAULT case, which generally worked right, but could result in failure. - Add a check for a short/long read of the ACL_TYPE_DEFAULT type from the underlying EA, resulting in EPERM rather than passing a potentially corrupted ACL back to the caller (resulting "cleaner" failures if the EA is damaged: right now, the caller will almost always panic in the presence of a corrupted EA). This code is similar to code in the ACL_TYPE_ACCESS handling in the previous switch case. - While I'm fixing this code, remove a redundant bzero() of the ACL reader buffer; it need only be initialized above the acl_type switch. Obtained from: TrustedBSD Project	2001-04-02 01:02:32 +00:00
rwatson	737ae0941e	Introduce support for POSIX.1e ACLs on UFS-based file systems. This implementation is still experimental, and while fairly broadly tested, is not yet intended for production use. Support for POSIX.1e ACLs on UFS will not be MFC'd to RELENG_4. This implementation works by providing implementations of VOP_[GS]ETACL() for FFS, as well as modifying the appropriate access control and file creation routines. In this implementation, ACLs are backed into extended attributes; the base ACL (owner, group, other) permissions remain in the inode for performance and compatibility reasons, so only the extended and default ACLs are placed in extended attributes. The logic for ACL evaluation is provided by the fs-independent kern/kern_acl.c. o Introduce UFS_ACL, a compile-time configuration option that enables support for ACLs on FFS (and potentially other UFS-based file systems). o Introduce ufs_getacl(), ufs_setacl(), ufs_aclcheck(), which respectively get, set, and check the ACLs on the passed vnode. o Introduce ufs_sync_acl_from_inode(), ufs_sync_inode_from_acl() to maintain access control information between inode permissions and extended attribute data. o Modify ufs_access() to load a file access ACL and invoke vaccess_acl_posix1e() if ACLs are available on the file system o Modify ufs_mkdir() and ufs_makeinode() to associate ACLs with newly created directories and files, inheriting from the parent directory's default ACL. o Enable these new vnode operations and conditionally compiled code paths if UFS_ACL is defined. A few notes: o This implementation is fairly widely tested, but still should be considered experimental. o Currently, ACLs are not exported via NFS, instead, the summarizing file mode/etc from the inode is. This results in conservative protection behavior, similar to the behavior of ACL-nonaware programs acting locally. o It is possible that underlying binary data formats associated with this implementation may change. Consumers of the implementation should expect to find their local configuration obsoleted in the next few months, resulting in possible loss of ACL data during an upgrade. o The extended attributes interface and implementation is still undergoing modification to address portable interface concerns, as well as performance. o Many applications do not yet correctly handle ACLs. In general, due to the POSIX.1e ACL model, behavior of ACL-unaware applications will be conservative with respects to file protection; some caution is recommended. o Instructions for configuring and maintaining ACLs on UFS will be committed in the near future; in the mean time it is possible to reference the README included in the last UFS ACL distribution placed in the TrustedBSD web site: http://www.TrustedBSD.org/downloads/ Substantial debugging, hardware, travel, or connectivity support for this project was provided by: BSDi, Safeport Network Services, and NAI Labs. Significant coding contributions were made by Chris Faulhaber. Additional support was provided by Brian Feldman, Thomas Moestl, and Ilmar Habibulin. Reviewed by: jedgar, keichii, mckusick, trustedbsd-discuss, freebsd-fs Obtained from: TrustedBSD Project	2001-03-26 17:53:19 +00:00
phk	c47745e977	Send the remains (such as I have located) of "block major numbers" to the bit-bucket.	2001-03-26 12:41:29 +00:00
asmodai	05c87e82c2	Fix typo ); -> ,	2001-03-24 15:25:04 +00:00
mckusick	c6fdb61aa7	Check that background fsck operation is being done on a ufs filesystem. Obtained from: Robert Watson <rwatson@FreeBSD.org>	2001-03-23 20:58:25 +00:00
rwatson	cae18aa0fd	o Remove an unnecessary debugging printf from ufs_extattr_lookup(), which resulted in the output of warning messages at boot if UFS_EXTATTR_AUTOSTART was enabled but ".attribute" and possible sub-directories weren't in a mounted MFS or UFS file systems. Pointed out by: dcs Obtained from: TrustedBSD Project	2001-03-21 23:00:39 +00:00
mckusick	69603157de	Add kernel support for running fsck on active filesystems.	2001-03-21 04:09:01 +00:00
mckusick	39275d892c	Clear the fs_clean flag only when the FS_UNCLEAN flag is not set (as is done in unmount). Remove a snapshot inode from the superblock list when its last name goes away rather than when its last reference goes away. That way it will be properly reclaimed by fsck after a crash rather than reenabled when the filesystem is mounted.	2001-03-21 04:05:20 +00:00
mckusick	d22815bec3	Report the correct inode number when panicing with freeing free inode. Report the correct block number when panicing with freeing free block.	2001-03-21 04:01:02 +00:00
rwatson	b777dece07	o Enable UFS-based extended attribute support on MFS. Note that this change is under-tested, and that MFS appears to be in the process of being deprecated in favor of FFS over md. Note also that UFS_EXTATTR_AUTOSTART doesn't make much sense on MFS unless the MFSROOT is compiled in, so manual configuration is generally required. Obtained from: TrustedBSD Project	2001-03-19 06:44:18 +00:00
rwatson	0012887962	o Rename "namespace" argument to "attrnamespace" as namespace is a C++ reserved word. Submitted by: jkh Obtained from: TrustedBSD Project	2001-03-19 05:44:15 +00:00
rwatson	8a937bbc3a	o Change options FFS_EXTATTR and options FFS_EXTATTR_AUTOSTART to options UFS_EXTATTR and UFS_EXTATTR_AUTOSTART respectively. This change reflects the fact that our EA support is implemented entirely at the UFS layer (modulo FFS start/stop/autostart hooks for mount and unmount events). This also better reflects the fact that [shortly] MFS will also support EAs, as well as possibly IFS. o Consumers of the EA support in FFS are reminded that as a result, they must change kernel config files to reflect the new option names. Obtained from: TrustedBSD Project	2001-03-19 04:35:40 +00:00
rwatson	90215b05ec	o Caused FFS_EXTATTR_AUTOSTART to scan two sub-directories of ".attribute" off of the file system root: "user" for user attributes, and "system" for system attributes. When the scan occurs, attribute backing files discovered in those directories will be started in the respective namespaces. This re-introduces support for auto-starting of user attributes, which was removed when the "$" prefix for system attributes was replaced with explicit namespacing. For users of the TrustedBSD UFS POSIX.1e ACL code, you'll need to: mv ${FSROOT}/'$posix1e.acl_access' ${FSROOT}/system/posix1e.acl_access mv ${FSROOT}/'$posix1e.acl_default' ${FSROOT}/system/posix1e.acl_default For users of the TrustedBSD POSIX.1e Capability code, you'll need to: mv ${FSROOT}/'$posix1e.cap' ${FSROOT}/system/posix1e.cap For users of the TrustedBSD MAC code, you'll need to: mv ${FSROOT}/'$freebsd.mac' ${FSROOT}/system/freebsd.mac Updated versions of relevant patches will be released in the near future. Obtained from: TrustedBSD Project	2001-03-18 04:04:23 +00:00
rwatson	f773ff5a87	o Change the API and ABI of the Extended Attribute kernel interfaces to introduce a new argument, "namespace", rather than relying on a first- character namespace indicator. This is in line with more recent thinking on EA interfaces on various mailing lists, including the posix1e, Linux acl-devel, and trustedbsd-discuss forums. Two namespaces are defined by default, EXTATTR_NAMESPACE_SYSTEM and EXTATTR_NAMESPACE_USER, where the primary distinction lies in the access control model: user EAs are accessible based on the normal MAC and DAC file/directory protections, and system attributes are limited to kernel-originated or appropriately privileged userland requests. o These API changes occur at several levels: the namespace argument is introduced in the extattr_{get,set}_file() system call interfaces, at the vnode operation level in the vop_{get,set}extattr() interfaces, and in the UFS extended attribute implementation. Changes are also introduced in the VFS extattrctl() interface (system call, VFS, and UFS implementation), where the arguments are modified to include a namespace field, as well as modified to advoid direct access to userspace variables from below the VFS layer (in the style of recent changes to mount by adrian@FreeBSD.org). This required some cleanup and bug fixing regarding VFS locks and the VFS interface, as a vnode pointer may now be optionally submitted to the VFS_EXTATTRCTL() call. Updated documentation for the VFS interface will be committed shortly. o In the near future, the auto-starting feature will be updated to search two sub-directories to the ".attribute" directory in appropriate file systems: "user" and "system" to locate attributes intended for those namespaces, as the single filename is no longer sufficient to indicate what namespace the attribute is intended for. Until this is committed, all attributes auto-started by UFS will be placed in the EXTATTR_NAMESPACE_SYSTEM namespace. o The default POSIX.1e attribute names for ACLs and Capabilities have been updated to no longer include the '$' in their filename. As such, if you're using these features, you'll need to rename the attribute backing files to the same names without '$' symbols in front. o Note that these changes will require changes in userland, which will be committed shortly. These include modifications to the extended attribute utilities, as well as to libutil for new namespace string conversion routines. Once the matching userland changes are committed, a buildworld is recommended to update all the necessary include files and verify that the kernel and userland environments are in sync. Note: If you do not use extended attributes (most people won't), upgrading is not imperative although since the system call API has changed, the new userland extended attribute code will no longer compile with old include files. o Couple of minor cleanups while I'm there: make more code compilation conditional on FFS_EXTATTR, which should recover a bit of space on kernels running without EA's, as well as update copyright dates. Obtained from: TrustedBSD Project	2001-03-15 02:54:29 +00:00
rwatson	1ffbfa2634	o In my merge, missed the one-line patch to ufs_vnops.c that removed the static prototype for ufs_readdir(). Note that ufs_readdir() was actually already non-static, the prototype was incorrect. Submitted by: jedgar	2001-03-14 18:27:04 +00:00
rwatson	3c831c500f	o Implement "options FFS_EXTATTR_AUTOSTART", which depends on "options FFS_EXTATTR". When extended attribute auto-starting is enabled, FFS will scan the .attribute directory off of the root of each file system, as it is mounted. If .attribute exists, EA support will be started for the file system. If there are files in the directory, FFS will attempt to start them as attribute backing files for attributes baring the same name. All attributes are started before access to the file system is permitted, so this permits race-free enabling of attributes. For attributes backing support for security features, such as ACLs, MAC, Capabilities, this is vital, as it prevents the file system attributes from getting out of sync as a result of file system operations between mount-time and the enabling of the extended attribute. The userland extattrctl tool will still function exactly as previously. Files must be placed directly in .attribute, which must be directly off of the file system root: symbolic links are not permitted. FFS_EXTATTR will continue to be able to function without FFS_EXTATTR_AUTOSTART for sites that do not want/require auto-starting. If you're using the UFS_ACL code available from www.TrustedBSD.org, using FFS_EXTATTR_AUTOSTART is recommended. o This support is implemented by adding an invocation of ufs_extattr_autostart() to ffs_mountfs(). In addition, several new supporting calls are introduced in ufs_extattr.c: ufs_extattr_autostart(): start EAs on the specified mount ufs_extattr_lookup(): given a directory and filename, return the vnode for the file. ufs_extattr_enable_with_open(): invoke ufs_extattr_enable() after doing the equililent of vn_open() on the passed file. ufs_extattr_iterate_directory(): iterate over a directory, invoking ufs_extattr_lookup() and ufs_extattr_enable_with_open() on each entry. o This feature is not widely tested, and therefore may contain bugs, caution is advised. Several changes are in the pipeline for this feature, including breaking out of EA namespaces into subdirectories of .attribute (this is waiting on the updated EA API), as well as a per-filesystem flag indicating whether or not EAs should be auto-started. This is required because administrators may not want .attribute auto-started on all file systems, especially if non-administrators have write access to the root of a file system. Obtained from: TrustedBSD Project	2001-03-14 05:32:31 +00:00
mckusick	61db3f4296	Fixes to track snapshot copy-on-write checking in the specinfo structure rather than assuming that the device vnode would reside in the FFS filesystem (which is obviously a broken assumption with the device filesystem).	2001-03-07 07:09:55 +00:00
jhb	9cd254601b	Grab the process lock while calling psignal and before calling psignal.	2001-03-07 03:37:06 +00:00
jhb	ace71d59bf	Protect SIGDELSET of p_siglist with the proc lock.	2001-03-07 03:34:55 +00:00
mckusick	6e8fd9ef89	Free lock before returning from process_worklist_item. Obtained from: Constantine Sapuntzakis <csapuntz@stanford.edu>	2001-03-01 21:43:46 +00:00
adrian	4018955334	Reviewed by: jlemon An initial tidyup of the mount() syscall and VFS mount code. This code replaces the earlier work done by jlemon in an attempt to make linux_mount() work. * the guts of the mount work has been moved into vfs_mount(). * move `type', `path' and `flags' from being userland variables into being kernel variables in vfs_mount(). `data' remains a pointer into userspace. * Attempt to verify the `type' and `path' strings passed to vfs_mount() aren't too long. * rework mount() and linux_mount() to take the userland parameters (besides data, as mentioned) and pass kernel variables to vfs_mount(). (linux_mount() already did this, I've just tidied it up a little more.) * remove the copyin() stuff for `path'. `data' still requires copyin() since its a pointer into userland. * set `mount->mnt_statf_mntonname' in vfs_mount() rather than in each filesystem. This variable is generally initialised with `path', and each filesystem can override it if they want to. * NOTE: f_mntonname is intiailised with "/" in the case of a root mount.	2001-03-01 21:00:17 +00:00
jlemon	58f9dcd6ce	Add a NOTE_REVOKE flag for vnodes, which is triggered from within vclean(). Use this to tell a filter attached to a vnode that the underlying vnode is no longer valid, by returning EV_EOF. PR: kern/25309, kern/25206	2001-02-23 20:06:01 +00:00
jlemon	36e83fc67d	Use correct list pointer when detaching knote from list.	2001-02-23 19:20:21 +00:00
mckusick	bb8c09c678	Free lock before calling panic so that subsequent attempt to write out buffers does not re-panic with `locking against myself'. This change should not affect normal operations of soft updates in any way.	2001-02-23 09:01:31 +00:00
mckusick	b6410fb7dc	When cleaning up excess inode dependencies, check for being done. Reviewed by: Jan Koum <jkb@yahoo-inc.com>	2001-02-22 10:17:57 +00:00
mckusick	d6b473bae1	This patch corrects two problems with the rate limiting code that was introduced in revision 1.80. The problem manifested itself with a `locking against myself' panic and could also result in soft updates inconsistences associated with inodedeps. The two problems are: 1) One of the background operations could manipulate the bitmap while holding it locked with intent to create. This held lock results in a `locking against myself' panic, when the background processing that we have been coopted to do tries to lock the bitmap which we are already holding locked. To understand how to fix this problem, first, observe that we can do the background cleanups in inodedep_lookup only when allocating inodedeps (DEPALLOC is set in the call to inodedep_lookup). Second observe that calls to inodedep_lookup with DEPALLOC set can only happen from the following calls into the softdep code: softdep_setup_inomapdep softdep_setup_allocdirect softdep_setup_remove softdep_setup_freeblocks softdep_setup_directory_change softdep_setup_directory_add softdep_change_linkcnt Only the first two of these can come from ffs_alloc.c while holding a bitmap locked. Thus, inodedep_lookup must not go off to do request_cleanups when being called from these functions. This change adds a flag, NODELAY, that can be passed to inodedep_lookup to let it know that it should not do background processing in those cases. 2) The return value from request_cleanup when helping out with the cleanup was 0 instead of 1. This meant that despite the fact that we may have slept while doing the cleanups, the code did not recheck for the appearance of an inodedep (e.g., goto top in inodedep_lookup). This lead to the softdep inconsistency in which we ended up with two inodedep's for the same inode. Reviewed by: Peter Wemm <peter@yahoo-inc.com>, Matt Dillon <dillon@earth.backplane.com>	2001-02-20 11:14:38 +00:00
asmodai	3065478332	Preceed/preceeding are not english words. Use precede and preceding.	2001-02-18 10:43:53 +00:00
jlemon	11781a7431	Extend kqueue down to the device layer. Backwards compatible approach suggested by: peter	2001-02-15 16:34:11 +00:00
jake	55d5108ac5	Implement a unified run queue and adjust priority levels accordingly. - All processes go into the same array of queues, with different scheduling classes using different portions of the array. This allows user processes to have their priorities propogated up into interrupt thread range if need be. - I chose 64 run queues as an arbitrary number that is greater than 32. We used to have 4 separate arrays of 32 queues each, so this may not be optimal. The new run queue code was written with this in mind; changing the number of run queues only requires changing constants in runq.h and adjusting the priority levels. - The new run queue code takes the run queue as a parameter. This is intended to be used to create per-cpu run queues. Implement wrappers for compatibility with the old interface which pass in the global run queue structure. - Group the priority level, user priority, native priority (before propogation) and the scheduling class into a struct priority. - Change any hard coded priority levels that I found to use symbolic constants (TTIPRI and TTOPRI). - Remove the curpriority global variable and use that of curproc. This was used to detect when a process' priority had lowered and it should yield. We now effectively yield on every interrupt. - Activate propogate_priority(). It should now have the desired effect without needing to also propogate the scheduling class. - Temporarily comment out the call to vm_page_zero_idle() in the idle loop. It interfered with propogate_priority() because the idle process needed to do a non-blocking acquire of Giant and then other processes would try to propogate their priority onto it. The idle process should not do anything except idle. vm_page_zero_idle() will return in the form of an idle priority kernel thread which is woken up at apprioriate times by the vm system. - Update struct kinfo_proc to the new priority interface. Deliberately change its size by adjusting the spare fields. It remained the same size, but the layout has changed, so userland processes that use it would parse the data incorrectly. The size constraint should really be changed to an arbitrary version number. Also add a debug.sizeof sysctl node for struct kinfo_proc.	2001-02-12 00:20:08 +00:00
bmilekic	f364d4ac36	Change and clean the mutex lock interface. mtx_enter(lock, type) becomes: mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks) mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized) similarily, for releasing a lock, we now have: mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN. We change the caller interface for the two different types of locks because the semantics are entirely different for each case, and this makes it explicitly clear and, at the same time, it rids us of the extra `type' argument. The enter->lock and exit->unlock change has been made with the idea that we're "locking data" and not "entering locked code" in mind. Further, remove all additional "flags" previously passed to the lock acquire/release routines with the exception of two: MTX_QUIET and MTX_NOSWITCH The functionality of these flags is preserved and they can be passed to the lock/unlock routines by calling the corresponding wrappers: mtx_{lock, unlock}_flags(lock, flag(s)) and mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN locks, respectively. Re-inline some lock acq/rel code; in the sleep lock case, we only inline the _obtain_lock()s in order to ensure that the inlined code fits into a cache line. In the spin lock case, we inline recursion and actually only perform a function call if we need to spin. This change has been made with the idea that we generally tend to avoid spin locks and that also the spin locks that we do have and are heavily used (i.e. sched_lock) do recurse, and therefore in an effort to reduce function call overhead for some architectures (such as alpha), we inline recursion for this case. Create a new malloc type for the witness code and retire from using the M_DEV type. The new type is called M_WITNESS and is only declared if WITNESS is enabled. Begin cleaning up some machdep/mutex.h code - specifically updated the "optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently need those. Finally, caught up to the interface changes in all sys code. Contributors: jake, jhb, jasone (in no particular order)	2001-02-09 06:11:45 +00:00
phk	709379c1ae	Another round of the <sys/queue.h> FOREACH transmogriffer. Created with: sed(1) Reviewed by: md5(1)	2001-02-04 16:08:18 +00:00
phk	e87f7a15ad	Mechanical change to use <sys/queue.h> macro API instead of fondling implementation details. Created with: sed(1) Reviewed by: md5(1)	2001-02-04 13:13:25 +00:00
phk	f3b4fbe35f	Use <sys/queue.h> macro API.	2001-02-04 12:37:48 +00:00
phk	236808f33a	Remove a DIAGNOSTIC check which belongs in <sys/queue.h> if anyplace at all.	2001-02-04 11:53:51 +00:00
iedowse	be2876f24f	Extend the sanity checks in ufs_lookup to ensure that each directory entry fits within its DIRBLKSIZ block. The surrounding code is extremely fragile with respect to corruption of the directory entry 'd_reclen' field; if directory corruption occurs, it can blindly scan forward beyond the end of the filesystem block. Usually this results in a 'fault on nofault entry' panic. Directory corruption is now much more likely to be detected, resulting in a 'ufs_dirbad' panic. If the filesystem is read-only, it will simply print a warning message, and skip the corrupted block. Reviewed by: mckusick	2001-02-04 01:52:11 +00:00
iedowse	e061532c92	Use the correct flags field when checking for a read-only filesystem in ufs_dirbad(). The mnt_stat.f_flags field is only updated by the syscalls *statfs and getfsstat, so mnt_flag should be used instead. This only affects whether or not a panic is generated on detection of certain types of directory corruption. Reviewed by: mckusick	2001-02-03 21:25:32 +00:00
dillon	1f3366270d	Fix a race between the syncer and umount. When you umount a softupdates filesystem softdep_process_worklist() is called in a loop until it indicates that no dependancies remain, but the determination of that fact depends on there only being one softdep_process_worklist() instance running. It was possible for the syncer to also be running softdep_process_worklist() and the pre-existing checks in the code to prevent this were not sufficient to prevent the race. This patch solves the problem. Approved-by: mckusick	2001-01-30 06:31:59 +00:00
jasone	8d2ec1ebc4	Convert all simplelocks to mutexes and remove the simplelock implementations.	2001-01-24 12:35:55 +00:00
iedowse	5cc8ff22fa	The ffs superblock includes a 128-byte region for use by temporary in-core pointers to summary information. An array in this region (fs_csp) could overflow on filesystems with a very large number of cylinder groups (~16000 on i386 with 8k blocks). When this happens, other fields in the superblock get corrupted, and fsck refuses to check the filesystem. Solve this problem by replacing the fs_csp array in 'struct fs' with a single pointer, and add padding to keep the length of the 128-byte region fixed. Update the kernel and userland utilities to use just this single pointer. With this change, the kernel no longer makes use of the superblock fields 'fs_csshift' and 'fs_csmask'. Add a comment to newfs/mkfs.c to indicate that these fields must be calculated for compatibility with older kernels. Reviewed by: mckusick	2001-01-15 18:30:40 +00:00
mckusick	8ef2c7028e	Properly compute the size of the final block of superblock summary information. Submitted by: Ian Dowse <iedowse@maths.tcd.ie>	2001-01-12 21:56:55 +00:00
rwatson	8e64b8803d	o Commit reems of style(9) changes, whitespace improvements, and comment cleanups. Obtained from: TrustedBSD Project	2001-01-07 23:45:56 +00:00
rwatson	34031fcb8d	o Zero the ufs_extattr_header length field (not necessary, but not a bad idea either) in ufs_extattr_rm. o More completely fill out the local_aio structure when writing out the zero'd extended attribute in ufs_extattr_rm -- previoulsy, this worked fine, but probably should not have. This corrects extraneous warnings about inconsistent inodes following file deletion. Reviewed by: jedgar	2001-01-07 23:31:51 +00:00
rwatson	d0fa22cb5d	o Add an additional EA inconsistency reporting opportunity in ufs_extattr_rm. o Make both reporting locations report the function name where the inconsistency is discovered, as well as the inode number in question. Reviewed by: jedgar	2001-01-07 23:27:58 +00:00
rwatson	17469eca96	o Make call to ufs_extattr_rm() in ufs_extattr_vnode_inactive() use NULL as the credential, not 0, so as to make it more clear what's going on. Obtained from: TrustedBSD Project	2001-01-07 21:38:26 +00:00
rwatson	e32240b99a	o Remove unnecessary sanity check involving requested offset of extended attribute read--the offset is required to be 0 by an earlier check, meaning that it will always be within the scope of the attribute data. This change should have no impact on executed code paths other than removing the unnecessary check: please report if any new failures start to occur as a result. Obtained from: TrustedBSD Project	2001-01-07 21:07:22 +00:00
dillon	fd223545d4	This implements a better launder limiting solution. There was a solution in 4.2-REL which I ripped out in -stable and -current when implementing the low-memory handling solution. However, maxlaunder turns out to be the saving grace in certain very heavily loaded systems (e.g. newsreader box). The new algorithm limits the number of pages laundered in the first pageout daemon pass. If that is not sufficient then suceessive will be run without any limit. Write I/O is now pipelined using two sysctls, vfs.lorunningspace and vfs.hirunningspace. This prevents excessive buffered writes in the disk queues which cause long (multi-second) delays for reads. It leads to more stable (less jerky) and generally faster I/O streaming to disk by allowing required read ops (e.g. for indirect blocks and such) to occur without interrupting the write stream, amoung other things. NOTE: eventually, filesystem write I/O pipelining needs to be done on a per-device basis. At the moment it is globalized.	2000-12-26 19:41:38 +00:00
mckusick	d1a83511c9	Several small but important fixes for snapshots: 1) Be more tolerant of missing snapshot files by only trying to decrement their reference count if they are registered as active. 2) Fix for snapshots of filesystems with block sizes larger than 8K (from Ollivier Robert <roberto@eurocontrol.fr>). 3) Fix to avoid losing last block in snapshot file when calculating blocks that need to be copied (from Don Coleman <coleman@coleman.org>).	2000-12-19 04:41:09 +00:00
mckusick	1198265231	Get rid of spurious check in ffs_truncate for i_size == length which fails to set the modification time on the file. The same check a few lines later takes the correct action. Submitted by: Ian Dowse <iedowse@maths.tcd.ie>	2000-12-19 04:20:13 +00:00
assar	4d01c1e206	add a stub for softdep_slowdown so that it's possible to build the kernel without SOFTUPDATES	2000-12-17 23:59:56 +00:00
dillon	69242be380	Avoid a data-consistency race between write() and mmap() by ensuring that newly allocated blocks are zerod. The race can occur even in the case where the write covers the entire block. Reported by: Sven Berkvens <sven@berkvens.net>, Marc Olzheim <zlo@zlo.nu>	2000-12-17 23:57:05 +00:00
tanimura	016329311e	- Move ifs_init() so that it can initialize ifs_inode_hash_mtx. - s/ffs_inode_hash_lock/ifs_inode_hash_lock/	2000-12-14 09:15:27 +00:00
tanimura	635b424e75	Do not race for the lock of an inode hash. Reviewed by: jhb	2000-12-13 10:04:01 +00:00
mckusick	f52f7b670a	Preventing runaway kernel soft updates memory, take three. Previously, the syncer process was the only process in the system that could process the soft updates background work list. If enough other processes were adding requests to that list, it would eventually grow without bound. Because some of the work list requests require vnodes to be locked, it was not generally safe to let random processes process the work list while they already held vnodes locked. By adding a flag to the work list queue processing function to indicate whether the calling process could safely lock vnodes, it becomes possible to co-opt other processes into helping out with the work list. Now when the worklist gets too large, other processes can safely help out by picking off those work requests that can be handled without locking a vnode, leaving only the small number of requests requiring a vnode lock for the syncer process. With this change, it appears possible to keep even the nastiest workloads under control. Submitted by: Paul Saab <ps@yahoo-inc.com>	2000-12-13 08:30:35 +00:00
dwmalone	dd75d1d73b	Convert more malloc+bzero to malloc+M_ZERO. Submitted by: josh@zipperup.org Submitted by: Robert Drehmel <robd@gmx.net>	2000-12-08 21:51:06 +00:00
phk	c3f2ee9700	Staticize some malloc M_ instances.	2000-12-08 20:09:00 +00:00
dillon	978bf0288d	Add necessary bwillwrite() in writev() entry point. Deal with excessive dirty buffers when msync() syncs non-contiguous dirty buffers by checking for the case in UFS before checking for clusterability.	2000-12-06 20:55:09 +00:00
mckusick	690af6322b	More aggressively rate limit the growth of soft dependency structures in the face of multiple processes doing massive numbers of filesystem operations. While this patch will work in nearly all situations, there are still some perverse workloads that can overwhelm the system. Detecting and handling these perverse workloads will be the subject of another patch. Reviewed by: Paul Saab <ps@yahoo-inc.com> Obtained from: Ethan Solomita <ethan@geocast.com>	2000-11-20 06:22:39 +00:00
dillon	2ace352085	Implement a low-memory deadlock solution. Removed most of the hacks that were trying to deal with low-memory situations prior to now. The new code is based on the concept that I/O must be able to function in a low memory situation. All major modules related to I/O (except networking) have been adjusted to allow allocation out of the system reserve memory pool. These modules now detect a low memory situation but rather then block they instead continue to operate, then return resources to the memory pool instead of cache them or leave them wired. Code has been added to stall in a low-memory situation prior to a vnode being locked. Thus situations where a process blocks in a low-memory condition while holding a locked vnode have been reduced to near nothing. Not only will I/O continue to operate, but many prior deadlock conditions simply no longer exist. Implement a number of VFS/BIO fixes (found by Ian): in biodone(), bogus-page replacement code, the loop was not properly incrementing loop variables prior to a continue statement. We do not believe this code can be hit anyway but we aren't taking any chances. We'll turn the whole section into a panic (as it already is in brelse()) after the release is rolled. In biodone(), the foff calculation was incorrectly clamped to the iosize, causing the wrong foff to be calculated for pages in the case of an I/O error or biodone() called without initiating I/O. The problem always caused a panic before. Now it doesn't. The problem is mainly an issue with NFS. Fixed casts for ~PAGE_MASK. This code worked properly before only because the calculations use signed arithmatic. Better to properly extend PAGE_MASK first before inverting it for the 64 bit masking op. In brelse(), the bogus_page fixup code was improperly throwing away the original contents of 'm' when it did the j-loop to fix the bogus pages. The result was that it would potentially invalidate parts of the WRONG page(!), leading to corruption. There may still be cases where a background bitmap write is being duplicated, causing potential corruption. We have identified a potentially serious bug related to this but the fix is still TBD. So instead this patch contains a KASSERT to detect the problem and panic the machine rather then continue to corrupt the filesystem. The problem does not occur very often.. it is very hard to reproduce, and it may or may not be the cause of the corruption people have reported. Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>) Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>	2000-11-18 23:06:26 +00:00
mckusick	56cb617a9d	When deleting a file, the ordering of events imposed by soft updates is to first write the deleted directory entry to disk, second write the zero'ed inode to disk, and finally to release the freed blocks and the inode back to the cylinder-group map. As this ordering requires two disk writes to occur which are normally spaced about 30 seconds apart (except when memory is under duress), it takes about a minute from the time that a file is deleted until its inode and data blocks show up in the cylinder-group map for reallocation. If a file has had only a brief lifetime (less than 30 seconds from creation to deletion), neither its inode nor its directory entry may have been written to disk. If its directory entry has not been written to disk, then we need not wait for that directory block to be written as the on-disk directory block does not reference the inode. Similarly, if the allocated inode has never been written to disk, we do not have to wait for it to be written back either as its on-disk representation is still zero'ed out. Thus, in the case of a short lived file, we can simply release the blocks and inode to the cylinder-group map immediately. As the inode and its blocks are released immediately, they are immediately available for other uses. If they are not released for a minute, then other inodes and blocks must be allocated for short lived files, cluttering up the vnode and buffer caches. The previous code was a bit too aggressive in trying to release the blocks and inode back to the cylinder-group map resulting in their being made available when in fact the inode on disk had not yet been zero'ed. This patch takes a more conservative approach to doing the release which avoids doing the release prematurely.	2000-11-14 09:00:25 +00:00
bde	fee8e1a16f	Fixed breakage of mknod() in rev.1.48 of ext2_vnops.c and rev.1.126 of ufs_vnops.c: 1) i_ino was confused with i_number, so the inode number passed to VFS_VGET() was usually wrong (usually 0U). 2) ip was dereferenced after vgone() freed it, so the inode number passed to VFS_VGET() was sometimes not even wrong. Bug (1) was usually fatal in ext2_mknod(), since ext2fs doesn't have space for inode 0 on the disk; ino_to_fsba() subtracts 1 from the inode number, so inode number 0U gives a way out of bounds array index. Bug(1) was usually harmless in ufs_mknod(); ino_to_fsba() doesn't subtract 1, and VFS_VGET() reads suitable garbage (all 0's?) from the disk for the invalid inode number 0U; ufs_mknod() returns a wrong vnode, but most callers just vput() it; the correct vnode is eventually obtained by an implicit VFS_VGET() just like it used to be. Bug (2) usually doesn't happen.	2000-11-04 08:10:56 +00:00
eivind	1afa7eea27	Give vop_mmap an untimely death. The opportunity to give it a timely death timed out in 1996.	2000-11-01 17:57:24 +00:00
phk	cbbd2b08c2	Add a missing <sys/systm.h>	2000-10-30 20:37:19 +00:00
phk	ff5cdfae2d	Move suser() and suser_xxx() prototypes and a related #define from <sys/proc.h> to <sys/systm.h>. Correctly document the #includes needed in the manpage. Add one now needed #include of <sys/systm.h>. Remove the consequent 48 unused #includes of <sys/proc.h>.	2000-10-29 16:06:56 +00:00
phk	f82e4ca62c	Weaken a bogus dependency on <sys/proc.h> in <sys/buf.h> by #ifdef'ing the offending inline function (BUF_KERNPROC) on it being #included already. I'm not sure BUF_KERNPROC() is even the right thing to do or in the right place or implemented the right way (inline vs normal function). Remove consequently unneeded #includes of <sys/proc.h>	2000-10-29 14:54:55 +00:00
phk	94a5006c9a	Remove unneeded #include <sys/proc.h> lines.	2000-10-29 13:57:19 +00:00
rwatson	9c993b44d0	o Introduce new VOP_ACCESS() flag VADMIN, allowing file systems to perform "administrative" authorization checks. In most cases, the VADMIN test checks to make sure the credential effective uid is the same as the file owner. o Modify vaccess() to set VADMIN as an available right if the uid is appropriate. o Modify references to uid-based access control operations such that they now always invoke VOP_ACCESS() instead of using hard-coded policy checks. o This allows alternative UFS policies to be implemented by replacing only ufs_access() (such as mandatory system policies). o VOP_ACCESS() requires the caller to hold an exclusive vnode lock on the vnode: I believe that new invocations of VOP_ACCESS() are always called with the lock held. o Some direct checks of the uid remain, largely associated with the QUOTA and SUIDDIR code. Reviewed by: eivind Obtained from: TrustedBSD Project	2000-10-19 07:53:59 +00:00
adrian	0458054c4e	Initial commit of IFS - a inode-namespaced FFS. Here is a short description: How it works: -- Basically ifs is a copy of ffs, overriding some vfs/vnops. (Yes, hack.) I didn't see the need in duplicating all of sys/ufs/ffs to get this off the ground. File creation is done through a special file - 'newfile' . When newfile is called, the system allocates and returns an inode. Note that newfile is done in a cloning fashion: fd = open("newfile", O_CREAT\|O_RDWR, 0644); fstat(fd, &st); printf("new file is %d\n", (int)st.st_ino); Once you have created a file, you can open() and unlink() it by its returned inode number retrieved from the stat call, ie: fd = open("5", O_RDWR); The creation permissions depend entirely if you have write access to the root directory of the filesystem. To get the list of currently allocated inodes, VOP_READDIR has been added which returns a directory listing of those currently allocated. -- What this entails: * patching conf/files and conf/options to include IFS as a new compile option (and since ifs depends upon FFS, include the FFS routines) * An entry in i386/conf/NOTES indicating IFS exists and where to go for an explanation * Unstaticize a couple of routines in src/sys/ufs/ffs/ which the IFS routines require (ffs_mount() and ffs_reload()) * a new bunch of routines in src/sys/ufs/ifs/ which implement the IFS routines. IFS replaces some of the vfsops, and a handful of vnops - most notably are VFS_VGET(), VOP_LOOKUP(), VOP_UNLINK() and VOP_READDIR(). Any other directory operation is marked as invalid. What this results in: * an IFS partition's create permissions are controlled by the perm/ownership of the root mount point, just like a normal directory * Each inode has perm and ownership too * IFS does NOT mean an FFS partition can be opened per inode. This is a completely seperate filesystem here * Softupdates doesn't work with IFS, and really I don't think it needs it. Besides, fsck's are FAST. (Try it :-) * Inodes 0 and 1 aren't allocatable because they are special (dump/swap IIRC). Inode 2 isn't allocatable since UFS/FFS locks all inodes in the system against this particular inode, and unravelling THAT code isn't trivial. Therefore, useful inodes start at 3. Enjoy, and feedback is definitely appreciated!	2000-10-14 03:02:30 +00:00
rwatson	113a9e1b35	o Sanity check was inverted, resulting in a possible spurious panic during unmount if extended attributes were in use. Correct by removing an unneeded (and undesirable) '!'.	2000-10-09 20:04:39 +00:00
eivind	4a39f454a0	Blow away the v_specmountpoint define, replacing it with what it was defined as (rdev->si_mountpoint)	2000-10-09 17:31:39 +00:00
rwatson	8d202cd4d0	o Move initialization of ump from mp to the top of the function so that it is defined whenm used in ufs_extattr_uepm_destroy(), fixing a panic due to a NULL pointer dereference. Submitted by: Wesley Morgan <morganw@chemicals.tacorp.com>	2000-10-06 15:31:28 +00:00
rwatson	469c6c2bf3	o Add call to ufs_extattr_uepm_destroy() in ffs_unmount() so as to clean up lock on extattrs. o Get for free a comment indicating where auto-starting of extended attributes will eventually occur, as it was in my commit tree also. No implementation change here, only a comment.	2000-10-04 04:44:51 +00:00
rwatson	4e14377293	o Correct use of lockdestroy() by adding a new ufs_extattr_uepm_destroy() call, which should be the last thing down to a per-mount extattr management structure, after ufs_extattr_stop() on the file system. This currently has the effect only of destroying the per-mount lock on extended attributes, and clearing appropriate flags. o Remove inappropriate invocation in ufs_extattr_vnode_inactive().	2000-10-04 04:41:33 +00:00
jasone	4e290e67b7	Convert lockmgr locks from using simple locks to using mutexes. Add lockdestroy() and appropriate invocations, which corresponds to lockinit() and must be called to clean up after a lockmgr lock is no longer needed.	2000-10-04 01:29:17 +00:00
bp	6110b03d24	Add a lock structure to vnode structure. Previously it was either allocated separately (nfs, cd9660 etc) or keept as a first element of structure referenced by v_data pointer(ffs). Such organization leads to known problems with stacked filesystems. From this point vop_nolock() functions maintain only interlock lock. vop_stdlock() functions maintain built-in v_lock structure using lockmgr(). vop_sharedlock() is compatible with vop_stdunlock(), but maintains a shared lock on vnode. If filesystem wishes to export lockmgr compatible lock, it can put an address of this lock to v_vnlock field. This indicates that the upper filesystem can take advantage of it and use single lock structure for entire (or part) of stack of vnodes. This field shouldn't be examined or modified by VFS code except for initialization purposes. Reviewed in general by: mckusick	2000-09-25 15:24:04 +00:00
rwatson	40aced8438	o Permit UFS Extended Attributes to be associated with special devices and FIFOs. Obtained from: TrustedBSD Project	2000-09-21 19:06:02 +00:00
rwatson	07ac219faf	o Disallow privileged processes in jail() from directly accessing system namespace extended attributes. o Document privilege/jail() interaction relating to extended attributes. Obtained from: TrustedBSD Project	2000-09-18 18:10:13 +00:00
rwatson	3546d27e15	o Allow privileged processes in jail() to override sticky bit behavior on directories. o Allow privileged processes in jail() to create inodes with the setgid bit set even if they are not a member of the group denoted by the file creation gid. This occurs due to inherited gid's from parent directories on file creation, allowing a user to create a file with a gid that is not in the creating process's credentials. Obtained from: TrustedBSD Project	2000-09-18 18:03:49 +00:00
rwatson	b324dcbd3d	o Add a comment clarifying interaction between jail(), privileged processes, and UFS file flags. Here's what the comment says, for reference: Privileged processes in jail() are permitted to modify arbitrary user flags on files, but are not permitted to modify system flags. In other words, privilege does allow a process in jail to modify user flags for objects that the process does not own, but privilege will not permit the setting of system flags on the file. Obtained from: TrustedBSD Project	2000-09-18 17:58:15 +00:00
rwatson	f193def48e	o Add missing PRISON_ROOT allowing a privileged process in a jail() to not remove the setuid/setgid bits by virtue of a change to a file with those bits set, even if the process doesn't own the file, or isn't a group member of the file's gid. Obtained from: TrustedBSD Project	2000-09-18 17:53:22 +00:00
rwatson	4ba86892be	o Substitute suser() calls for direct credential checks, which is now safe as suser() no longer sets ASU. o Note that in some cases, the PRISON_ROOT flag is used even though no process structure is passed, to indicate that if a process structure (and hence jail) was available, it would be ok. In the long run, the jail identifier should probably be moved to ucred, as the uidinfo information was. o Some uid 0 checks remain relating to the quota code, which I'll leave for another day. Reviewed by: phk, eivind Obtained from: TrustedBSD Project	2000-09-18 16:13:02 +00:00
des	86bd96948b	Silence a warning.	2000-09-17 19:41:26 +00:00
bp	02544af7d4	Add new flag PDIRUNLOCK to the component.cn_flags which should be set by filesystem lookup() routine if it unlocks parent directory. This flag should be carefully tracked by filesystems if they want to work properly with nullfs and other stacked filesystems. VFS takes advantage of this flag to perform symantically correct usage of vrele() instead of vput() if parent directory already unlocked. If filesystem fails to track this flag then previous codepath in VFS left unchanged. Convert UFS code to set PDIRUNLOCK flag if necessary. Other filesystmes will be changed after some period of testing. Reviewed in general by: mckusick, dillon, adrian Obtained from: NetBSD	2000-09-17 07:26:42 +00:00
phk	f2b4e59044	Remove a pointless casting of a gid_t to a gid_t.	2000-09-16 18:20:27 +00:00
bp	8437d5b6f4	Add VOP_*VOBJECT vops, because MFS requires explicit vop specification. Noted by: knu	2000-09-12 16:21:16 +00:00
rwatson	d12caa21f3	o Variety of extended attribute fixes - In ufs_extattr_enable(), return EEXIST instead of EOPNOTSUPP if the caller tries to configure an attribute name that is already configured - Throughout, add IO_NODELOCKED to VOP_{READ,WRITE} calls to indicate lock status of passed vnode. Apparently not a problem, but worth fixing. - For all writes, make use of IO_SYNC consistent. Really, IO_UNIT and combining of VOP_WRITE's should happen, but I don't have that tested. At least with this, it's consistent usage. (pointed out by: bde) - In ufs_extattr_get(), fixed nested locking of backing vnode (fine due to recursive lock support, but make it more consistent with other code) - In ufs_extattr_get(), clean up return code to set uio_resid more consistently with other pieces of code (worked fine, this is just a cleanup) - Fix ufs_extattr_rm(), which was broken--effectively a nop. - Minor comment and whitespace fixes. Obtained from: TrustedBSD Project	2000-09-12 05:35:47 +00:00
jhb	e467813373	Fix a 64-bitism. Use size_t instead of int for 4th argument to copyinstr. Approved by: rwatson	2000-09-11 05:43:02 +00:00
mckusick	7438b4ca6f	Cannot do MALLOC with M_WAITOK while holding ACQUIRE_LOCK Obtained from: Ethan Solomita <ethan@geocast.com>	2000-09-07 23:02:55 +00:00
jasone	769e0f974d	Major update to the way synchronization is done in the kernel. Highlights include: * Mutual exclusion is used instead of spl(). See mutex(9). (Note: The alpha port is still in transition and currently uses both.) Per-CPU idle processes. * Interrupts are run in their own separate kernel threads and can be preempted (i386 only). Partially contributed by: BSDi (BSD/OS) Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh	2000-09-07 01:33:02 +00:00
rwatson	e6a536221c	Modify extended attribute protection model to authorize based on attribute namespace and DAC protection on file: - Attribute names beginning with '$' are in the system namespace - The attribute name "$" is reserved - System namespace attributes may only be read/set by suser() or by kernel (cred == NULL) - Other attribute names are in the application namespace - The attribute name "" is reserved - Application namespace attributes are protected in the manner of the target file permission o Kernel changes - Add ufs_extattr_valid_attrname() to check whether the requested attribute "set" or "enable" is appropriate (i.e., non-reserved) - Modify ufs_extattr_credcheck() to accept target file vnode, not to take inode uid - Modify ufs_extattr_credcheck() to check namespace, then enforce either kernel/suser for system namespace, or vaccess() for application namespace o EA backing file format changes - Remove permission fields from extended attribute backing file header - Bump extended attribute backing file header version to 3 o Update extattrctl.c and extattrctl.8 - Remove now deprecated -r and -w arguments to initattr, as permissions are now implicit - (unrelated) fix error reporting and unlinking during failed initattr to remove duplicate/inaccurate error messages, and to only unlink if the failure wasn't in the backing file open() Obtained from: TrustedBSD Project	2000-09-02 20:31:26 +00:00
rwatson	e54ea574fa	o Restructure vaccess() so as to check for DAC permission to modify the object before falling back on privilege. Make vaccess() accept an additional optional argument, privused, to determine whether privilege was required for vaccess() to return 0. Add commented out capability checks for reference. Rename some variables to make it more clear which modes/uids/etc are associated with the object, and which with the access mode. o Update file system use of vaccess() to pass NULL as the optional privused argument. Once additional patches are applied, suser() will no longer set ASU, so privused will permit passing of privilege information up the stack to the caller. Reviewed by: bde, green, phk, -security, others Obtained from: TrustedBSD Project	2000-08-29 14:45:49 +00:00
rwatson	251e663e8a	o Correct spelling of ufs_exttatr_find_attr -> ufs_extattr_find_attr o Add "const" qualifier to attrname argument of various calls to remove warnings Obtained from: TrustedBSD Project	2000-08-26 22:00:58 +00:00
phk	b648921acc	Remove all traces of Julians DEVFS (incl from kern/subr_diskslice.c) Remove old DEVFS support fields from dev_t. Make uid, gid & mode members of dev_t and set them in make_dev(). Use correct uid, gid & mode in make_dev in disk minilayer. Add support for registering alias names for a dev_t using the new function make_dev_alias(). These will show up as symlinks in DEVFS. Use makedev() rather than make_dev() for MFSs magic devices to prevent DEVFS from noticing this abuse. Add a field for DEVFS inode number in dev_t. Add new DEVFS in fs/devfs. Add devfs cloning to: disk minilayer (ie: ad(4), sd(4), cd(4) etc etc) md(4), tun(4), bpf(4), fd(4) If DEVFS add -d flag to /sbin/inits args to make it mount devfs. Add commented out DEVFS to GENERIC	2000-08-20 21:34:39 +00:00
phk	3d2aecdc81	Centralize the canonical vop_access user/group/other check in vaccess(). Discussed with: bde	2000-08-20 08:36:26 +00:00
tegge	6dac8645b8	Initialize *countp to 0 in stub for softdep_flushworklist(). This allows ffs_fsync() to break out of a loop that might otherwise be infinite on kernels compiled without the SOFTUPDATES option. The observed symptom was a system hang at the first unmount attempt.	2000-08-09 00:41:54 +00:00
roberto	3d4cf3c369	Fix the lockmgr panic everyone is seeing at shutdown time. vput assumes curproc is the lock holder, but it's not true in this case. Thanks a lot Luoqi ! Submitted by: luoqi Tested by: phk	2000-08-01 14:15:07 +00:00
peter	3f9fc32ece	Minor tweak - removed unused variable 'struct mount *mp';	2000-07-28 22:28:05 +00:00
peter	cfc0cd38b7	Minor change: fix warning - move a 'struct vnode *vp' declaration inside a #ifdef DIAGNOSTIC to match its corresponding usage.	2000-07-28 22:27:00 +00:00
mckusick	b86877bef0	Clean up the snapshot code so that it no longer depends on the use of the SF_IMMUTABLE flag to prevent writing. Instead put in explicit checking for the SF_SNAPSHOT flag in the appropriate places. With this change, it is now possible to rename and link to snapshot files. It is also possible to set or clear any of the owner, group, or other read bits on the file, though none of the write or execute bits can be set. There is also an explicit test to prevent the setting or clearing of the SF_SNAPSHOT flag via chflags() or fchflags(). Note also that the modify time cannot be changed as it needs to accurately reflect the time that the snapshot was taken. Submitted by: Robert Watson <rwatson@FreeBSD.org>	2000-07-26 23:07:01 +00:00
phk	9aed458325	Fix the "mfs_badop[vop_getwritemount] = 45" messages.	2000-07-26 17:53:04 +00:00
mckusick	4223e4856e	Add stub for softdep_flushworklist() so that kernels compiled without the SOFTUPDATES option will load correctly. Obtained from: John Baldwin <jhb@bsdi.com>	2000-07-25 05:28:59 +00:00
mckusick	7d9ff6c133	Eliminate periodic 'mfs_badop[vop_getwritemount] = 45' messages. Submitted by: Sheldon Hearn <sheldonh@uunet.co.za>	2000-07-25 05:11:57 +00:00
mckusick	acc66855bf	This patch corrects the first round of panics and hangs reported with the new snapshot code. Update addaliasu to correctly implement the semantics of the old checkalias function. When a device vnode first comes into existence, check to see if an anonymous vnode for the same device was created at boot time by bdevvp(). If so, adopt the bdevvp vnode rather than creating a new vnode for the device. This corrects a problem which caused the kernel to panic when taking a snapshot of the root filesystem. Change the calling convention of vn_write_suspend_wait() to be the same as vn_start_write(). Split out softdep_flushworklist() from softdep_flushfiles() so that it can be used to clear the work queue when suspending filesystem operations. Access to buffers becomes recursive so that snapshots can recursively traverse their indirect blocks using ffs_copyonwrite() when checking for the need for copy on write when flushing one of their own indirect blocks. This eliminates a deadlock between the syncer daemon and a process taking a snapshot. Ensure that softdep_process_worklist() can never block because of a snapshot being taken. This eliminates a problem with buffer starvation. Cleanup change in ffs_sync() which did not synchronously wait when MNT_WAIT was specified. The result was an unclean filesystem panic when doing forcible unmount with heavy filesystem I/O in progress. Return a zero'ed block when reading a block that was not in use at the time that a snapshot was taken. Normally, these blocks should never be read. However, the readahead code will occationally read them which can cause unexpected behavior. Clean up the debugging code that ensures that no blocks be written on a filesystem while it is suspended. Snapshots must explicitly label the blocks that they are writing during the suspension so that they do not cause a `write on suspended filesystem' panic. Reorganize ffs_copyonwrite() to eliminate a deadlock and also to prevent a race condition that would permit the same block to be copied twice. This change eliminates an unexpected soft updates inconsistency in fsck caused by the double allocation. Use bqrelse rather than brelse for buffers that will be needed soon again by the snapshot code. This improves snapshot performance.	2000-07-24 05:28:33 +00:00
rwatson	1293199940	o Marius pointed out an unusually inconvenient upper bound on extended attribute data size. o Fortunately it turned out to be an unused constant left over from an earlier implementation, and is therefore being removed so as not to confuse casual observers. Submitted by: mbendiks@eunet.no	2000-07-14 03:30:52 +00:00
bp	c956469da1	Prevent possible dereference of NULL pointer. Submitted by: Marius Bendiksen <mbendiks@eunet.no>	2000-07-13 02:17:14 +00:00
mckusick	6e81eafe20	Brain fault, forgot to update ffs_snapshot.c with the new calling convention for vn_start_write.	2000-07-12 00:27:27 +00:00
mckusick	a3d0c189ea	Add snapshots to the fast filesystem. Most of the changes support the gating of system calls that cause modifications to the underlying filesystem. The gating can be enabled by any filesystem that needs to consistently suspend operations by adding the vop_stdgetwritemount to their set of vnops. Once gating is enabled, the function vfs_write_suspend stops all new write operations to a filesystem, allows any filesystem modifying system calls already in progress to complete, then sync's the filesystem to disk and returns. The function vfs_write_resume allows the suspended write operations to begin again. Gating is not added by default for all filesystems as for SMP systems it adds two extra locks to such critical kernel paths as the write system call. Thus, gating should only be added as needed. Details on the use and current status of snapshots in FFS can be found in /sys/ufs/ffs/README.snapshot so for brevity and timelyness is not included here. Unless and until you create a snapshot file, these changes should have no effect on your system (famous last words).	2000-07-11 22:07:57 +00:00
mckusick	b2ed023a03	Clean up warning about undeclared function by declaring softdep_fsync in mount.h instead of ffs_extern.h. The correct solution is to use an indirect function pointer so that the kernel does not have to be built with options FFS, but that will be left for another day.	2000-07-11 19:28:26 +00:00
phk	9241ff9fc6	Finish repo-copy: Move ufs/ufs/ufs_disksubr.c to kern/subr_disklabel.c. These functions are not UFS specific and are in fact used all over the place.	2000-07-10 13:48:06 +00:00
mckusick	92dfcced5b	Delete README as it is now obsolete. Relevant information is in README.softupdates.	2000-07-08 02:32:49 +00:00
mckusick	0ab089c771	Update to reflect current status.	2000-07-08 02:31:21 +00:00
mckusick	5e6b00a0a7	Get userland visible flags added for snapshots to give a few days advance preparation for them to get migrated into place so that subsequent changes in utilities will not fail to compile for lack of up-to-date header files in /usr/include.	2000-07-04 04:58:34 +00:00

... 3 4 5 6 7 ...

1075 Commits