freebsd-dev

Author	SHA1	Message	Date
Kirk McKusick	6524dddcd5	This patch fixes a size problem with the stat structure for 64-bit architectures that was introduced in the UFS2 code merge two days ago. The stat structure change that caused the problem was the addition of the file create time. Submitted by: Bruce Evans <bde@zeta.org.au> Sponsored by: DARPA & NAI Labs.	2002-06-22 22:01:13 +00:00
Kirk McKusick	1c85e6a35d	This commit adds basic support for the UFS2 filesystem. The UFS2 filesystem expands the inode to 256 bytes to make space for 64-bit block pointers. It also adds a file-creation time field, an ability to use jumbo blocks per inode to allow extent like pointer density, and space for extended attributes (up to twice the filesystem block size worth of attributes, e.g., on a 16K filesystem, there is space for 32K of attributes). UFS2 fully supports and runs existing UFS1 filesystems. New filesystems built using newfs can be built in either UFS1 or UFS2 format using the -O option. In this commit UFS1 is the default format, so if you want to build UFS2 format filesystems, you must specify -O 2. This default will be changed to UFS2 when UFS2 proves itself to be stable. In this commit the boot code for reading UFS2 filesystems is not compiled (see /sys/boot/common/ufsread.c) as there is insufficient space in the boot block. Once the size of the boot block is increased, this code can be defined. Things to note: the definition of SBSIZE has changed to SBLOCKSIZE. The header file <ufs/ufs/dinode.h> must be included before <ufs/ffs/fs.h> so as to get the definitions of ufs2_daddr_t and ufs_lbn_t. Still TODO: Verify that the first level bootstraps work for all the architectures. Convert the utility ffsinfo to understand UFS2 and test growfs. Add support for the extended attribute storage. Update soft updates to ensure integrity of extended attribute storage. Switch the current extended attribute interfaces to use the extended attribute storage. Add the extent like functionality (framework is there, but is currently never used). Sponsored by: DARPA & NAI Labs. Reviewed by: Poul-Henning Kamp <phk@freebsd.org>	2002-06-21 06:18:05 +00:00
Jeff Roberson	0e2d6cc899	Disable the shared locking namei() code for now. It breaks several stacking filesystems. This is on hold until the rest of VFS Locking is reviewed and deemed safe. It can be enabled with 'options LOOKUP_SHARED'.	2002-05-14 21:59:49 +00:00
John Baldwin	ba626c1db2	Lock proctree_lock instead of pgrpsess_lock.	2002-04-16 17:11:34 +00:00
Jeff Roberson	79a3e97054	Use VOP_GETVOBJECT instead of accessing the member directly. This fixed an issue with nullfs and NAMEI shared. Submitted by: Alexander Kabaev	2002-04-14 10:18:48 +00:00
Jeff Roberson	a59f8b9e6c	Turn #ifdef LOOKUP_SHARED into #ifndef LOOKUP_EXCLUSIVE to enable this behavior by default. Also, change the options line to reflect this. If there are no problems reported this will become the only behavior and the knob will be removed in a month or so. Demanded by: obrien	2002-04-09 05:14:17 +00:00
John Baldwin	44731cab3b	Change the suser() API to take advantage of td_ucred as well as do a general cleanup of the API. The entire API now consists of two functions similar to the pre-KSE API. The suser() function takes a thread pointer as its only argument. The td_ucred member of this thread must be valid so the only valid thread pointers are curthread and a few kernel threads such as thread0. The suser_cred() function takes a pointer to a struct ucred as its first argument and an integer flag as its second argument. The flag is currently only used for the PRISON_ROOT flag. Discussed on: smp@	2002-04-01 21:31:13 +00:00
Bruce Evans	237e41fc58	Added used include of <sys/sx.h>. Don't depend on namespace pollution in <sys/file.h>.	2002-03-26 01:09:51 +00:00
Alfred Perlstein	4d77a549fe	Remove __P.	2002-03-19 21:25:46 +00:00
Alfred Perlstein	628abf6c69	Giant pushdown for read/write/pread/pwrite syscalls. kern/kern_descrip.c: Aquire Giant in fdrop_locked when file refcount hits zero, this removes the requirement for the caller to own Giant for the most part. kern/kern_ktrace.c: Aquire Giant in ktrgenio, simplifies locking in upper read/write syscalls. kern/vfs_bio.c: Aquire Giant in bwillwrite if needed. kern/sys_generic.c Giant pushdown, remove Giant for: read, pread, write and pwrite. readv and writev aren't done yet because of the possible malloc calls for iov to uio processing. kern/sys_socket.c Grab giant in the socket fo_read/write functions. kern/vfs_vnops.c Grab giant in the vnode fo_read/write functions.	2002-03-15 08:03:46 +00:00
Jeff Roberson	8de00f4a87	This patch adds the "LOCKSHARED" option to namei which causes it to only acquire shared locks on leafs. The stat() and open() calls have been changed to make use of this new functionality. Using shared locks in these cases is sufficient and can significantly reduce their latency if IO is pending to these vnodes. Also, this reduces the number of exclusive locks that are floating around in the system, which helps reduce the number of deadlocks that occur. A new kernel option "LOOKUP_SHARED" has been added. It defaults to off so this patch can be turned on for testing, and should eventually go away once it is proven to be stable. I have personally been running this patch for over a year now, so it is believed to be fully stable. Reviewed by: jake, obrien Approved by: jake	2002-03-12 04:00:11 +00:00
Seigo Tanimura	183ccde6c6	Stop abusing the pgrpsess_lock.	2002-03-11 07:53:13 +00:00
Eivind Eklund	eb8e6d5276	Document all functions, global and static variables, and sysctls. Includes some minor whitespace changes, and re-ordering to be able to document properly (e.g, grouping of variables and the SYSCTL macro calls for them, where the documentation has been added.) Reviewed by: phk (but all errors are mine)	2002-03-05 15:38:49 +00:00
John Baldwin	a854ed9893	Simple p_ucred -> td_ucred changes to start using the per-thread ucred reference.	2002-02-27 18:32:23 +00:00
Seigo Tanimura	f591779bb5	Lock struct pgrp, session and sigio. New locks are: - pgrpsess_lock which locks the whole pgrps and sessions, - pg_mtx which protects the pgrp members, and - s_mtx which protects the session members. Please refer to sys/proc.h for the coverage of these locks. Changes on the pgrp/session interface: - pgfind() needs the pgrpsess_lock held. - The caller of enterpgrp() is responsible to allocate a new pgrp and session. - Call enterthispgrp() in order to enter an existing pgrp. - pgsignal() requires a pgrp lock held. Reviewed by: jhb, alfred Tested on: cvsup.jp.FreeBSD.org (which is a quad-CPU machine running -current)	2002-02-23 11:12:57 +00:00
Robert Watson	ec20f901a2	More cleanups relating to vm object allocation failure: make sure we call VOP_CLOSE() with vp unlocked; clean up the return path a little, in as much as our namei/vnode operation return paths can be cleared up. For a return case that was apparently never taken, this sure is ugly. Reviewed by: jeffr	2002-02-20 00:11:57 +00:00
Ian Dowse	b01bcf4c74	Add the braces missed by revision 1.131. Pointy hat to: rwatson	2002-02-18 12:46:18 +00:00
Robert Watson	4729fbd85f	When vn_open() is failing because it cannot allocate a vm object, call VOP_CLOSE() on the vnode, so that VOP_OPEN() and VOP_CLOSE() calls are symmetric in all failure cases. This prevents an 'open' reference from being leaked in that unlikely failure scenario.	2002-02-18 00:26:10 +00:00
Robert Watson	1ea030d8fe	Make sure to hold vnode lock when calling into VOP_GETATTR(). Discussed with: mckusick, phk	2002-02-10 21:44:30 +00:00
Robert Watson	74237f55b0	Part I: Update extended attribute API and ABI: o Modify the system call syntax for extattr_{get,set}_{fd,file}() so as not to use the scatter gather API (which appeared not to be used by any consumers, and be less portable), rather, accepts 'data' and 'nbytes' in the style of other simple read/write interfaces. This changes the API and ABI. o Modify system call semantics so that extattr_get_{fd,file}() return a size_t. When performing a read, the number of bytes read will be returned, unless the data pointer is NULL, in which case the number of bytes of data are returned. This changes the API only. o Modify the VOP_GETEXTATTR() vnode operation to accept a *size_t argument so as to return the size, if desirable. If set to NULL, the size will not be returned. o Update various filesystems (pseodofs, ufs) to DTRT. These changes should make extended attributes more useful and more portable. More commits to rebuild the system call files, as well as update userland utilities to follow. Obtained from: TrustedBSD Project Sponsored by: DARPA, NAI Labs	2002-02-10 04:43:22 +00:00
Poul-Henning Kamp	3e72822404	Make st_blksize default to PAGE_SIZE instead of zero.	2002-01-25 16:39:57 +00:00
Matthew Dillon	c73df808a0	Remove 'VXLOCK: interlock avoided' warnings. This can now occur in normal operation. The vgonel() code has always called vclean() but until we started proactively freeing vnodes it would never actually be called with a dirty vnode, so this situation did not occur prior to the vnlru() code. Now that we proactively free vnodes when kern.maxvnodes is hit, however, vclean() winds up with work to do and improperly generates the warnings. Reviewed by: peter Approved by: re (for MFC) MFC after: 1 day	2002-01-19 02:14:45 +00:00
Alfred Perlstein	426da3bcfb	SMP Lock struct file, filedesc and the global file list. Seigo Tanimura (tanimura) posted the initial delta. I've polished it quite a bit reducing the need for locking and adapting it for KSE. Locks: 1 mutex in each filedesc protects all the fields. protects "struct file" initialization, while a struct file is being changed from &badfileops -> &pipeops or something the filedesc should be locked. 1 mutex in each struct file protects the refcount fields. doesn't protect anything else. the flags used for garbage collection have been moved to f_gcflag which was the FILLER short, this doesn't need locking because the garbage collection is a single threaded container. could likely be made to use a pool mutex. 1 sx lock for the global filelist. struct file * fhold(struct file fp); / increments reference count on a file / struct file fhold_locked(struct file fp); / like fhold but expects file to locked / struct file ffind_hold(struct thread , int fd); / finds the struct file in thread, adds one reference and returns it unlocked / struct file ffind_lock(struct thread , int fd); / ffind_hold, but returns file locked */ I still have to smp-safe the fget cruft, I'll get to that asap.	2002-01-13 11:58:06 +00:00
Matthew Dillon	fdb33f08ef	This is a forward port of Peter's vlrureclaim() fix, with some minor mods by me to make it more efficient. The original code had serious balancing problems and could also deadlock easily. This code relegates the vnode reclamation to its own kproc and relaxes the vnode reclamation requirements to better maintain kern.maxvnodes. This code still doesn't balance as well as it could, but it does a much better job then the original code. Approved by: re@freebsd.org Obtained from: ps, peter, dillon MFS Assuming: Assuming no problems crop up in Yahoo testing MFC after: 7 days	2001-12-18 20:48:54 +00:00
Alfred Perlstein	f03e89de68	turn vn_open() into a wrapper around vn_open_cred() which allows one to perform a vn_open using temporary/other/fake credentials. Modify the nfs client side locking code to use vn_open_cred() passing proc0's ucred instead of the old way which was to temporary raise privs while running vn_open(). This should close the race hopefully.	2001-11-11 22:39:07 +00:00
Robert Watson	fc2749a40c	o vn_open() fails to call VOP_CLOSE() if vfs_object_create fails. Ideally all successful calls to VOP_OPEN() might be reflected in a call to VOP_CLOSE(). For now, simply add a comment reflecting this problem; this should be fixed at some point.	2001-10-23 19:09:01 +00:00
John Baldwin	7106ca0d1a	Add missing includes of sys/lock.h.	2001-10-11 17:52:20 +00:00
Matthew Dillon	3418ebebfe	Make uio_yield() a global. Call uio_yield() between chunks in vn_rdwr_inchunks(), allowing other processes to gain an exclusive lock on the vnode. Specifically: directory scanning, to avoid a race to the root directory, and multiple child processes coring simultaniously so they can figure out that some other core'ing child has an exclusive adv lock and just exit instead. This completely fixes performance problems when large programs core. You can have hundreds of copies (forked children) of the same binary core all at once and not notice. MFC after: 3 days	2001-09-26 06:54:32 +00:00
Julian Elischer	b40ce4165d	KSE Milestone 2 Note ALL MODULES MUST BE RECOMPILED make the kernel aware that there are smaller units of scheduling than the process. (but only allow one thread per process at this time). This is functionally equivalent to teh previousl -current except that there is a thread associated with each process. Sorry john! (your next MFC will be a doosie!) Reviewed by: peter@freebsd.org, dillon@freebsd.org X-MFC after: ha ha ha ha	2001-09-12 08:38:13 +00:00
Matthew Dillon	06ae1e91c4	This brings in a Yahoo coredump patch from Paul, with additional mods by me (addition of vn_rdwr_inchunks). The problem Yahoo is solving is that if you have large process images core dumping, or you have a large number of forked processes all core dumping at the same time, the original coredump code would leave the vnode locked throughout. This can cause the directory vnode to get locked up, which can cause the parent directory vnode to get locked up, and so on all the way to the root node, locking the entire machine up for extremely long periods of time. This patch solves the problem in two ways. First it uses an advisory non-blocking lock to abort multiple processes trying to core to the same file. Second (my contribution) it chunks up the writes and uses bwillwrite() to avoid holding the vnode locked while blocking in the buffer cache. Submitted by: ps Reviewed by: dillon MFC after: 2 weeks	2001-09-08 20:02:33 +00:00
Andrey A. Chernov	5d97bedb22	vn_stat(): if va_size (u_quad_t) > OFF_MAX, return EOVERFLOW, don't copy it blindly to st_size	2001-08-23 17:56:48 +00:00
Matthew Dillon	ac8f990bde	This patch implements O_DIRECT about 80% of the way. It takes a patchset Tor created a while ago, removes the raw I/O piece (that has cache coherency problems), and adds a buffer cache / VM freeing piece. Essentially this patch causes O_DIRECT I/O to not be left in the cache, but does not prevent it from going through the cache, hence the 80%. For the last 20% we need a method by which the I/O can be issued directly to buffer supplied by the user process and bypass the buffer cache entirely, but still maintain cache coherency. I also have the code working under -stable but the changes made to sys/file.h may not be MFCable, so an MFC is not on the table yet. Submitted by: tegge, dillon	2001-05-24 07:22:27 +00:00
Greg Lehey	60fb0ce365	Revert consequences of changes to mount.h, part 2. Requested by: bde	2001-04-29 02:45:39 +00:00
Kirk McKusick	112f737245	When closing the last reference to an unlinked file, it is freed by the inactive routine. Because the freeing causes the filesystem to be modified, the close must be held up during periods when the filesystem is suspended. For snapshots to be consistent across crashes, they must write blocks that they copy and claim those written blocks in their on-disk block pointers before the old blocks that they referenced can be allowed to be written. Close a loophole that allowed unwritten blocks to be skipped when doing ffs_sync with a request to wait for all I/O activity to be completed.	2001-04-25 08:11:18 +00:00
Greg Lehey	d98dc34f52	Correct #includes to work with fixed sys/mount.h.	2001-04-23 09:05:15 +00:00
Boris Popov	602ef63172	Previous commit broke interlock locking for !LK_RETRY case.	2001-03-26 12:45:35 +00:00
Boris Popov	71d8277b51	Prevent race condition by using msleep() instead of mtx_unlock()/tsleep(). Reviewed by: alfred	2001-03-26 03:10:07 +00:00
Robert Watson	3063207147	o Rename "namespace" argument to "attrnamespace" as namespace is a C++ reserved word. Submitted by: jkh Obtained from: TrustedBSD Project	2001-03-19 05:44:15 +00:00
Robert Watson	70f3685105	o Change the API and ABI of the Extended Attribute kernel interfaces to introduce a new argument, "namespace", rather than relying on a first- character namespace indicator. This is in line with more recent thinking on EA interfaces on various mailing lists, including the posix1e, Linux acl-devel, and trustedbsd-discuss forums. Two namespaces are defined by default, EXTATTR_NAMESPACE_SYSTEM and EXTATTR_NAMESPACE_USER, where the primary distinction lies in the access control model: user EAs are accessible based on the normal MAC and DAC file/directory protections, and system attributes are limited to kernel-originated or appropriately privileged userland requests. o These API changes occur at several levels: the namespace argument is introduced in the extattr_{get,set}_file() system call interfaces, at the vnode operation level in the vop_{get,set}extattr() interfaces, and in the UFS extended attribute implementation. Changes are also introduced in the VFS extattrctl() interface (system call, VFS, and UFS implementation), where the arguments are modified to include a namespace field, as well as modified to advoid direct access to userspace variables from below the VFS layer (in the style of recent changes to mount by adrian@FreeBSD.org). This required some cleanup and bug fixing regarding VFS locks and the VFS interface, as a vnode pointer may now be optionally submitted to the VFS_EXTATTRCTL() call. Updated documentation for the VFS interface will be committed shortly. o In the near future, the auto-starting feature will be updated to search two sub-directories to the ".attribute" directory in appropriate file systems: "user" and "system" to locate attributes intended for those namespaces, as the single filename is no longer sufficient to indicate what namespace the attribute is intended for. Until this is committed, all attributes auto-started by UFS will be placed in the EXTATTR_NAMESPACE_SYSTEM namespace. o The default POSIX.1e attribute names for ACLs and Capabilities have been updated to no longer include the '$' in their filename. As such, if you're using these features, you'll need to rename the attribute backing files to the same names without '$' symbols in front. o Note that these changes will require changes in userland, which will be committed shortly. These include modifications to the extended attribute utilities, as well as to libutil for new namespace string conversion routines. Once the matching userland changes are committed, a buildworld is recommended to update all the necessary include files and verify that the kernel and userland environments are in sync. Note: If you do not use extended attributes (most people won't), upgrading is not imperative although since the system call API has changed, the new userland extended attribute code will no longer compile with old include files. o Couple of minor cleanups while I'm there: make more code compilation conditional on FFS_EXTATTR, which should recover a bit of space on kernels running without EA's, as well as update copyright dates. Obtained from: TrustedBSD Project	2001-03-15 02:54:29 +00:00
Jonathan Lemon	608a3ce62a	Extend kqueue down to the device layer. Backwards compatible approach suggested by: peter	2001-02-15 16:34:11 +00:00
Bosko Milekic	9ed346bab0	Change and clean the mutex lock interface. mtx_enter(lock, type) becomes: mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks) mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized) similarily, for releasing a lock, we now have: mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN. We change the caller interface for the two different types of locks because the semantics are entirely different for each case, and this makes it explicitly clear and, at the same time, it rids us of the extra `type' argument. The enter->lock and exit->unlock change has been made with the idea that we're "locking data" and not "entering locked code" in mind. Further, remove all additional "flags" previously passed to the lock acquire/release routines with the exception of two: MTX_QUIET and MTX_NOSWITCH The functionality of these flags is preserved and they can be passed to the lock/unlock routines by calling the corresponding wrappers: mtx_{lock, unlock}_flags(lock, flag(s)) and mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN locks, respectively. Re-inline some lock acq/rel code; in the sleep lock case, we only inline the _obtain_lock()s in order to ensure that the inlined code fits into a cache line. In the spin lock case, we inline recursion and actually only perform a function call if we need to spin. This change has been made with the idea that we generally tend to avoid spin locks and that also the spin locks that we do have and are heavily used (i.e. sched_lock) do recurse, and therefore in an effort to reduce function call overhead for some architectures (such as alpha), we inline recursion for this case. Create a new malloc type for the witness code and retire from using the M_DEV type. The new type is called M_WITNESS and is only declared if WITNESS is enabled. Begin cleaning up some machdep/mutex.h code - specifically updated the "optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently need those. Finally, caught up to the interface changes in all sys code. Contributors: jake, jhb, jasone (in no particular order)	2001-02-09 06:11:45 +00:00
Jason Evans	1b367556b5	Convert all simplelocks to mutexes and remove the simplelock implementations.	2001-01-24 12:35:55 +00:00
Matthew Dillon	936524aa02	Implement a low-memory deadlock solution. Removed most of the hacks that were trying to deal with low-memory situations prior to now. The new code is based on the concept that I/O must be able to function in a low memory situation. All major modules related to I/O (except networking) have been adjusted to allow allocation out of the system reserve memory pool. These modules now detect a low memory situation but rather then block they instead continue to operate, then return resources to the memory pool instead of cache them or leave them wired. Code has been added to stall in a low-memory situation prior to a vnode being locked. Thus situations where a process blocks in a low-memory condition while holding a locked vnode have been reduced to near nothing. Not only will I/O continue to operate, but many prior deadlock conditions simply no longer exist. Implement a number of VFS/BIO fixes (found by Ian): in biodone(), bogus-page replacement code, the loop was not properly incrementing loop variables prior to a continue statement. We do not believe this code can be hit anyway but we aren't taking any chances. We'll turn the whole section into a panic (as it already is in brelse()) after the release is rolled. In biodone(), the foff calculation was incorrectly clamped to the iosize, causing the wrong foff to be calculated for pages in the case of an I/O error or biodone() called without initiating I/O. The problem always caused a panic before. Now it doesn't. The problem is mainly an issue with NFS. Fixed casts for ~PAGE_MASK. This code worked properly before only because the calculations use signed arithmatic. Better to properly extend PAGE_MASK first before inverting it for the 64 bit masking op. In brelse(), the bogus_page fixup code was improperly throwing away the original contents of 'm' when it did the j-loop to fix the bogus pages. The result was that it would potentially invalidate parts of the WRONG page(!), leading to corruption. There may still be cases where a background bitmap write is being duplicated, causing potential corruption. We have identified a potentially serious bug related to this but the fix is still TBD. So instead this patch contains a KASSERT to detect the problem and panic the machine rather then continue to corrupt the filesystem. The problem does not occur very often.. it is very hard to reproduce, and it may or may not be the cause of the corruption people have reported. Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>) Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>	2000-11-18 23:06:26 +00:00
Poul-Henning Kamp	1d7e3e42e7	Take VBLK devices further out of their missery. This should fix the panic I introduced in my previous commit on this topic.	2000-11-02 21:14:13 +00:00
John Baldwin	35e0e5b311	Catch up to moving headers: - machine/ipl.h -> sys/ipl.h - machine/mutex.h -> sys/mutex.h	2000-10-20 07:58:15 +00:00
Jason Evans	a18b1f1d4d	Convert lockmgr locks from using simple locks to using mutexes. Add lockdestroy() and appropriate invocations, which corresponds to lockinit() and must be called to clean up after a lockmgr lock is no longer needed.	2000-10-04 01:29:17 +00:00
Robert Watson	100d2c187c	o Introduce vn_extattr_rm(), a helper function in the style of vn_extattr_get() and vn_extattr_set(). vn_extattr_rm() removes the specified extended attribute from a vnode, authorizing the change as the kernel (NULL cred). Obtained from: TrustedBSD Project	2000-09-22 22:33:13 +00:00
Robert Watson	e81c5f4307	o vn_extattr_set() will now call appropriate vn_start_write() and vn_finished_write() if IO_NODELOCKED is not set. Obtained from: TrustedBSD Project	2000-09-05 03:15:02 +00:00
Robert Watson	e6a9ab52db	o Introduce vn_extattr_{get,set}, wrapper routines for VOP_GETEXTATTR and VOP_SETEXTATTR to simplify calling from in-kernel consumers, such as capability code. Both accept a vnode (optionally locked, with ioflg to indicate that), attribute name, and a buffer + buffer length in UIO_SYSSPACE. Both authorize the call as a kernel request, with cred set to NULL for the actual VOP_ calls. Obtained from: TrustedBSD Project	2000-08-08 17:15:32 +00:00
Kirk McKusick	9b97113391	This patch corrects the first round of panics and hangs reported with the new snapshot code. Update addaliasu to correctly implement the semantics of the old checkalias function. When a device vnode first comes into existence, check to see if an anonymous vnode for the same device was created at boot time by bdevvp(). If so, adopt the bdevvp vnode rather than creating a new vnode for the device. This corrects a problem which caused the kernel to panic when taking a snapshot of the root filesystem. Change the calling convention of vn_write_suspend_wait() to be the same as vn_start_write(). Split out softdep_flushworklist() from softdep_flushfiles() so that it can be used to clear the work queue when suspending filesystem operations. Access to buffers becomes recursive so that snapshots can recursively traverse their indirect blocks using ffs_copyonwrite() when checking for the need for copy on write when flushing one of their own indirect blocks. This eliminates a deadlock between the syncer daemon and a process taking a snapshot. Ensure that softdep_process_worklist() can never block because of a snapshot being taken. This eliminates a problem with buffer starvation. Cleanup change in ffs_sync() which did not synchronously wait when MNT_WAIT was specified. The result was an unclean filesystem panic when doing forcible unmount with heavy filesystem I/O in progress. Return a zero'ed block when reading a block that was not in use at the time that a snapshot was taken. Normally, these blocks should never be read. However, the readahead code will occationally read them which can cause unexpected behavior. Clean up the debugging code that ensures that no blocks be written on a filesystem while it is suspended. Snapshots must explicitly label the blocks that they are writing during the suspension so that they do not cause a `write on suspended filesystem' panic. Reorganize ffs_copyonwrite() to eliminate a deadlock and also to prevent a race condition that would permit the same block to be copied twice. This change eliminates an unexpected soft updates inconsistency in fsck caused by the double allocation. Use bqrelse rather than brelse for buffers that will be needed soon again by the snapshot code. This improves snapshot performance.	2000-07-24 05:28:33 +00:00
Kirk McKusick	f2a2857bb3	Add snapshots to the fast filesystem. Most of the changes support the gating of system calls that cause modifications to the underlying filesystem. The gating can be enabled by any filesystem that needs to consistently suspend operations by adding the vop_stdgetwritemount to their set of vnops. Once gating is enabled, the function vfs_write_suspend stops all new write operations to a filesystem, allows any filesystem modifying system calls already in progress to complete, then sync's the filesystem to disk and returns. The function vfs_write_resume allows the suspended write operations to begin again. Gating is not added by default for all filesystems as for SMP systems it adds two extra locks to such critical kernel paths as the write system call. Thus, gating should only be added as needed. Details on the use and current status of snapshots in FFS can be found in /sys/ufs/ffs/README.snapshot so for brevity and timelyness is not included here. Unless and until you create a snapshot file, these changes should have no effect on your system (famous last words).	2000-07-11 22:07:57 +00:00
Kirk McKusick	e6796b67d9	Move the truncation code out of vn_open and into the open system call after the acquisition of any advisory locks. This fix corrects a case in which a process tries to open a file with a non-blocking exclusive lock. Even if it fails to get the lock it would still truncate the file even though its open failed. With this change, the truncation is done only after the lock is successfully acquired. Obtained from: BSD/OS	2000-07-04 03:34:11 +00:00
Jonathan Lemon	cb5ad9d362	Fix stupid braino in last commit, initialize `vp' before we test vp->v_tag. Spotted by: dillon	2000-06-25 18:10:45 +00:00
Jonathan Lemon	c8bea19ee3	Add a hack to fail registration of kq events on a non-ufs filesystem, as support for those is non-existent at the moment.	2000-06-22 18:41:07 +00:00
Jake Burkholder	e39756439c	Back out the previous change to the queue(3) interface. It was not discussed and should probably not happen. Requested by: msmith and others	2000-05-26 02:09:24 +00:00
Jake Burkholder	740a1973a6	Change the way that the queue(3) structures are declared; don't assume that the type argument to _HEAD and _ENTRY is a struct. Suggested by: phk Reviewed by: phk Approved by: mdodd	2000-05-23 20:41:01 +00:00
Jeroen Ruigrok van der Werven	37d90a44af	Fix comment typo. Submitted by: nrahlstr	2000-05-12 16:06:49 +00:00
Poul-Henning Kamp	9626b608de	Separate the struct bio related stuff out of <sys/buf.h> into <sys/bio.h>. <sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall not be made a nested include according to bdes teachings on the subject of nested includes. Diskdrivers and similar stuff below specfs::strategy() should no longer need to include <sys/buf.> unless they need caching of data. Still a few bogus uses of struct buf to track down. Repocopy by: peter	2000-05-05 09:59:14 +00:00
Poul-Henning Kamp	2c9b67a8df	Remove unneeded #include <vm/vm_zone.h> Generated by: src/tools/tools/kerninclude	2000-04-30 18:52:11 +00:00
Jonathan Lemon	cb679c385e	Introduce kqueue() and kevent(), a kernel event notification facility.	2000-04-16 18:53:38 +00:00
Matthew Dillon	e4649cfac3	Change the write-behind code to take more care when starting async I/O's. The sequential read heuristic has been extended to cover writes as well. We continue to call cluster_write() normally, thus blocks in the file will still be reallocated for large (but still random) I/O's, but I/O will only be initiated for truely sequential writes. This solves a number of annoying situations, especially with DBM (hash method) writes, and also has the side effect of fixing a number of (stupid) benchmarks. Reviewed-by: mckusick	2000-04-02 00:55:28 +00:00
Poul-Henning Kamp	ba4ad1fcea	Give vn_isdisk() a second argument where it can return a suitable errno. Suggested by: bde	2000-01-10 12:04:27 +00:00
Kirk McKusick	bd5f5da94d	Add bwillwrite to all system calls that create things in the filesystem. Benchmarks that create huge trees of empty files overwhelm the buffer cache.	2000-01-10 00:08:53 +00:00
Eivind Eklund	762e6b856c	Introduce NDFREE (and remove VOP_ABORTOP)	1999-12-15 23:02:35 +00:00
Matthew Dillon	91921bd597	Ensure that garbage from the kernel stack does not wind up being returned to user mode in the spare fields of the stat structure. PR: kern/14966 Reviewed by: dillon@freebsd.org Submitted by: Kelly Yancey kbyanc@posi.net	1999-11-18 08:14:20 +00:00
Peter Wemm	b127fae405	Add a vnode fo_stat() entry point.	1999-11-08 03:32:15 +00:00
Brian Feldman	13ccadd4b0	This is what was "fdfix2.patch," a fix for fd sharing. It's pretty far-reaching in fd-land, so you'll want to consult the code for changes. The biggest change is that now, you don't use fp->f_ops->fo_foo(fp, bar) but instead fo_foo(fp, bar), which increments and decrements the fp refcount upon entry and exit. Two new calls, fhold() and fdrop(), are provided. Each does what it seems like it should, and if fdrop() brings the refcount to zero, the fd is freed as well. Thanks to peter ("to hell with it, it looks ok to me.") for his review. Thanks to msmith for keeping me from putting locks everywhere :) Reviewed by: peter	1999-09-19 17:00:25 +00:00
Julian Elischer	85a219d201	Changes to centralise the default blocksize behaviour. More likely to follow. Submitted by: phk@freebsd.org	1999-09-09 19:08:44 +00:00
Julian Elischer	7012bab988	Revert a bunch of contraversial changes by PHK. After a quick think and discussion among various people some form of some of these changes will probably be recommitted. The reversion requested was requested by dg while discussions proceed. PHK has indicated that he can live with this, and it has been agreed that some form of some of these changes may return shortly after further discussion.	1999-09-03 05:16:59 +00:00
Poul-Henning Kamp	de5f40afa6	Improve the returned values in st_blksize a little bit, avoid accessing union fields not valid for dev_t type.	1999-09-01 05:36:55 +00:00
Poul-Henning Kamp	02e1576966	Make bdev userland access work like cdev userland access unless the highly non-recommended option ALLOW_BDEV_ACCESS is used. (bdev access is evil because you don't get write errors reported.) Kill si_bsize_best before it kills Matt :-) Use the specfs routines rather having cloned copies in devfs.	1999-08-30 07:56:23 +00:00
Peter Wemm	c3aac50f28	$Id$ -> $FreeBSD$	1999-08-28 01:08:13 +00:00
Brian Feldman	b5fca1cb2a	Add FIODTYPE ioctl for getting d_flags (type) info on a device. Okayed by: phk	1999-08-27 16:35:37 +00:00
Poul-Henning Kamp	a431597b25	Add a couple of missing but unimportant break; statements.	1999-08-25 11:44:11 +00:00
Poul-Henning Kamp	0232a25188	oops: Add missing include.	1999-08-13 11:22:48 +00:00
Poul-Henning Kamp	3a965c0db0	Move the special-casing of stat(2)->st_blksize for device files from UFS to the generic level. For chr/blk devices we don't care about the blocksize of the filesystem, we want what the device asked for.	1999-08-13 10:56:07 +00:00
Brian Feldman	e32c66c539	Fix fd race conditions (during shared fd table usage.) Badfileops is now used in f_ops in place of NULL, and modifications to the files are more carefully ordered. f_ops should also be set to &badfileops upon "close" of a file. This does not fix other problems mentioned in this PR than the first one. PR: 11629 Reviewed by: peter	1999-08-04 18:53:50 +00:00
Alan Cox	6745299365	Add sysctl and support code to allow directories to be VMIO'd. The default setting for the sysctl is OFF, which is the historical operation. Submitted by: dillon	1999-07-26 06:25:53 +00:00
Kirk McKusick	ad8ac923fa	These changes appear to give us benefits with both small (32MB) and large (1G) memory machine configurations. I was able to run 'dbench 32' on a 32MB system without bring the machine to a grinding halt. * buffer cache hash table now dynamically allocated. This will have no effect on memory consumption for smaller systems and will help scale the buffer cache for larger systems. * minor enhancement to pmap_clearbit(). I noticed that all the calls to it used constant arguments. Making it an inline allows the constants to propogate to deeper inlines and should produce better code. * removal of inherent vfs_ioopt support through the emplacement of appropriate #ifdef's, with John's permission. If we do not find a use for it by the end of the year we will remove it entirely. * removal of getnewbufloops* counters & sysctl's - no longer necessary for debugging, getnewbuf() is now optimal. * buffer hash table functions removed from sys/buf.h and localized to vfs_bio.c * VFS_BIO_NEED_DIRTYFLUSH flag and support code added ( bwillwrite() ), allowing processes to block when too many dirty buffers are present in the system. * removal of a softdep test in bdwrite() that is no longer necessary now that bdwrite() no longer attempts to flush dirty buffers. * slight optimization added to bqrelse() - there is no reason to test for available buffer space on B_DELWRI buffers. * addition of reverse-scanning code to vfs_bio_awrite(). vfs_bio_awrite() will attempt to locate clusterable areas in both the forward and reverse direction relative to the offset of the buffer passed to it. This will probably not make much of a difference now, but I believe we will start to rely on it heavily in the future if we decide to shift some of the burden of the clustering closer to the actual I/O initiation. * Removal of the newbufcnt and lastnewbuf counters that Kirk added. They do not fix any race conditions that haven't already been fixed by the gbincore() test done after the only call to getnewbuf(). getnewbuf() is a static, so there is no chance of it being misused by other modules. ( Unless Kirk can think of a specific thing that this code fixes. I went through it very carefully and didn't see anything ). * removal of VOP_ISLOCKED() check in flushbufqueues(). I do not think this check is necessary, the buffer should flush properly whether the vnode is locked or not. ( yes? ). * removal of extra arguments passed to getnewbuf() that are not necessary. * missed cluster_wbuild() that had to be a cluster_wbuild_wb() in vfs_cluster.c * vn_write() now calls bwillwrite() PRIOR to locking the vnode, which should greatly aid flushing operations in heavy load situations - both the pageout and update daemons will be able to operate more efficiently. * removal of b_usecount. We may add it back in later but for now it is useless. Prior implementations of the buffer cache never had enough buffers for it to be useful, and current implementations which make more buffers available might not benefit relative to the amount of sophistication required to implement a b_usecount. Straight LRU should work just as well, especially when most things are VMIO backed. I expect that (even though John will not like this assumption) directories will become VMIO backed some point soon. Submitted by: Matthew Dillon <dillon@backplane.com> Reviewed by: Kirk McKusick <mckusick@mckusick.com>	1999-07-08 06:06:00 +00:00
Poul-Henning Kamp	8947a90a90	Make sure that stat(2) and friends always return a valid st_dev field. Pseudo-FS need not fill in the va_fsid anymore, the syscall code will use the first half of the fsid, which now looks like a udev_t with major 255.	1999-07-02 16:29:47 +00:00
Poul-Henning Kamp	75c1354190	This Implements the mumbled about "Jail" feature. This is a seriously beefed up chroot kind of thing. The process is jailed along the same lines as a chroot does it, but with additional tough restrictions imposed on what the superuser can do. For all I know, it is safe to hand over the root bit inside a prison to the customer living in that prison, this is what it was developed for in fact: "real virtual servers". Each prison has an ip number associated with it, which all IP communications will be coerced to use and each prison has its own hostname. Needless to say, you need more RAM this way, but the advantage is that each customer can run their own particular version of apache and not stomp on the toes of their neighbors. It generally does what one would expect, but setting up a jail still takes a little knowledge. A few notes: I have no scripts for setting up a jail, don't ask me for them. The IP number should be an alias on one of the interfaces. mount a /proc in each jail, it will make ps more useable. /proc/<pid>/status tells the hostname of the prison for jailed processes. Quotas are only sensible if you have a mountpoint per prison. There are no privisions for stopping resource-hogging. Some "#ifdef INET" and similar may be missing (send patches!) If somebody wants to take it from here and develop it into more of a "virtual machine" they should be most welcome! Tools, comments, patches & documentation most welcome. Have fun... Sponsored by: http://www.rndassociates.com/ Run for almost a year by: http://www.servetheweb.com/	1999-04-28 11:38:52 +00:00
Poul-Henning Kamp	f711d546d2	Suser() simplification: 1: s/suser/suser_xxx/ 2: Add new function: suser(struct proc ), prototyped in <sys/proc.h>. 3: s/suser_xxx($[a-zA-Z0-9_]$->p_ucred, \&\1->p_acflag)/suser(\1)/ The remaining suser_xxx() calls will be scrutinized and dealt with later. There may be some unneeded #include <sys/cred.h>, but they are left as an exercise for Bruce. More changes to the suser() API will come along with the "jail" code.	1999-04-27 11:18:52 +00:00
Alan Cox	f78fd73fa6	Address several problems in vn_read and vn_write: 1. Make read-ahead work for pread and aio_read. 2. Fix one place where a comparison of uio_offset with -1 wasn't updated to use FOF_OFFSET. 3. Honor O_APPEND in the FOF_OFFSET case. In addition, use the variable name "ioflag" in both vn_read and vn_write to avoid possible confusion between the variable "flag" and the parameter "flags". Submitted by: Bruce Evans <bde@zeta.org.au> and me	1999-04-21 05:56:45 +00:00
Dmitrij Tejblum	8fe387ab84	Add standard padding argument to pread and pwrite syscall. That should make them NetBSD compatible. Add parameter to fo_read and fo_write. (The only flag FOF_OFFSET mean that the offset is set in the struct uio). Factor out some common code from read/pread/write/pwrite syscalls.	1999-04-04 21:41:28 +00:00
Alan Cox	cde9bc877b	Changed vn_read/write such that fp->f_offset isn't touched if uio->uio_offset != -1. This fixes a problem with aio_read/write and permits a straightforward implementation of pread/pwrite. PR: kern/8669 Submitted by: John Plevyak <jplevyak@inktomi.com> Reviewed by: Matthew Dillon <dillon@apollo.backplane.com>	1999-03-26 20:25:21 +00:00
Poul-Henning Kamp	57c90d6fcd	Use suser() to determine super-user-ness, don't examine cr_uid directly.	1999-01-30 12:21:49 +00:00
Eivind Eklund	15a1057c46	Add 'options DEBUG_LOCKS', which stores extra information in struct lock, and add some macros and function parameters to make sure that the information get to the point where it can be put in the lock structure. While I'm here, add DEBUG_VFS_LOCKS to LINT.	1999-01-20 14:49:12 +00:00
Eivind Eklund	fb1167777a	Remove the 'waslocked' parameter to vfs_object_create().	1999-01-05 18:50:03 +00:00
Peter Wemm	f3d6ee090e	Only do one VOP_ACCESS() per open() instead of two. This should reduce the NFSv3 ACCESS RPC problems a little for busy clients that do a lot of open/close. The nfs code could probably cache the results, but I'm not sure whether this would be legal or useful. The problem is that with a CPU farm, on each open there would be a lookup, getattr then access RPC then the read/write RPC activity. Caching the access results probably isn't going to help much if the clients access lots of files. Having the nfs_access() routine interpret the getattr results is a bit of a hack, but it's how NFSv2 is done and it might be OK for a mount attribute for v3.	1998-11-02 02:36:16 +00:00
Poul-Henning Kamp	c259b8dd2b	Report the mode as the result of the VOP_GETATTR rather than the vnodes type, they may not correspond.	1998-06-27 06:43:09 +00:00
Doug Rabson	ecbb00a262	This commit fixes various 64bit portability problems required for FreeBSD/alpha. The most significant item is to change the command argument to ioctl functions from int to u_long. This change brings us inline with various other BSD versions. Driver writers may like to use (__FreeBSD_version == 300003) to detect this change. The prototype FreeBSD/alpha machdep will follow in a couple of days time.	1998-06-07 17:13:14 +00:00
Mike Smith	7be2d30077	In the words of the submitter: --------- Make callers of namei() responsible for releasing references or locks instead of having the underlying filesystems do it. This eliminates redundancy in all terminal filesystems and makes it possible for stacked transport layers such as umapfs or nullfs to operate correctly. Quality testing was done with testvn, and lat_fs from the lmbench suite. Some NFS client testing courtesy of Patrik Kudo. vop_mknod and vop_symlink still release the returned vpp. vop_rename still releases 4 vnode arguments before it returns. These remaining cases will be corrected in the next set of patches. --------- Submitted by: Michael Hancock <michaelh@cet.co.jp>	1998-05-07 04:58:58 +00:00
Alexander Langer	7c2e3d329a	Grammar police.	1998-04-10 00:09:04 +00:00
Wolfram Schneider	5ddc8ded1d	New mount option nosymfollow. If enabled, the kernel lookup() function will not follow symbolic links on the mounted file system and return EACCES (Permission denied).	1998-04-08 18:31:59 +00:00
Peter Wemm	100ceca222	Today is not my lucky day. Fix missing brace and I got a request to use EMLINK instead.	1998-04-06 19:32:37 +00:00
Peter Wemm	193afe0189	Use a different errno (ELOOP (as sef mentioned) since the text that goes with the error sounds ok for the condition) if O_NOFOLLOW gets a link.	1998-04-06 18:43:28 +00:00
Peter Wemm	0fdc628b41	Rather than let users get fd's to symlink files, make O_NOFOLLOW cause an error if it gets a link (like it does if it gets a socket). The implications of letting users try and do file operations on symlinks themselves were too worrying.	1998-04-06 18:25:21 +00:00
Peter Wemm	7e3426aa1f	Implement a new open(2) flag: O_NOFOLLOW. This will instruct open to not follow symlinks, but to open a handle on the link itself(!). As strange as this might sound, it has several useful applications safe race-free ways of opening files in hostile areas (eg: /tmp, a mode 1777 /var/mail, etc). It also would allow things like fchown() to work on the link rather than having to implement a new syscall specifically for that task. Reviewed by: phk	1998-04-06 17:38:43 +00:00
Bruce Evans	6b16931c00	Removed unused #includes.	1998-02-25 05:58:50 +00:00
Eivind Eklund	0b08f5f737	Back out DIAGNOSTIC changes.	1998-02-06 12:14:30 +00:00
Eivind Eklund	47cfdb166d	Turn DIAGNOSTIC into a new-style option.	1998-02-04 22:34:03 +00:00
John Dyson	925a3a419a	Fix some vnode management problems, and better mgmt of vnode free list. Fix the UIO optimization code. Fix an assumption in vm_map_insert regarding allocation of swap pagers. Fix an spl problem in the collapse handling in vm_object_deallocate. When pages are freed from vnode objects, and the criteria for putting the associated vnode onto the free list is reached, either put the vnode onto the list, or put it onto an interrupt safe version of the list, for further transfer onto the actual free list. Some minor syntax changes changing pre-decs, pre-incs to post versions. Remove a bogus timeout (that I added for debugging) from vn_lock. PHK will likely still have problems with the vnode list management, and so do I, but it is better than it was.	1998-01-12 01:46:33 +00:00
John Dyson	95e5e988e0	Make our v_usecount vnode reference count work identically to the original BSD code. The association between the vnode and the vm_object no longer includes reference counts. The major difference is that vm_object's are no longer freed gratuitiously from the vnode, and so once an object is created for the vnode, it will last as long as the vnode does. When a vnode object reference count is incremented, then the underlying vnode reference count is incremented also. The two "objects" are now more intimately related, and so the interactions are now much less complex. When vnodes are now normally placed onto the free queue with an object still attached. The rundown of the object happens at vnode rundown time, and happens with exactly the same filesystem semantics of the original VFS code. There is absolutely no need for vnode_pager_uncache and other travesties like that anymore. A side-effect of these changes is that SMP locking should be much simpler, the I/O copyin/copyout optimizations work, NFS should be more ponderable, and further work on layered filesystems should be less frustrating, because of the totally coherent management of the vnode objects and vnodes. Please be careful with your system while running this code, but I would greatly appreciate feedback as soon a reasonably possible.	1998-01-06 05:26:17 +00:00
John Dyson	60f8d46448	Fix the decl of vfs_ioopt, allow LFS to compile again, fix a minor problem with the object cache removal.	1997-12-29 01:03:55 +00:00
John Dyson	2be70f79f6	Lots of improvements, including restructring the caching and management of vnodes and objects. There are some metadata performance improvements that come along with this. There are also a few prototypes added when the need is noticed. Changes include: 1) Cleaning up vref, vget. 2) Removal of the object cache. 3) Nuke vnode_pager_uncache and friends, because they aren't needed anymore. 4) Correct some missing LK_RETRY's in vn_lock. 5) Correct the page range in the code for msync. Be gentle, and please give me feedback asap.	1997-12-29 00:25:11 +00:00
Sean Eric Fagan	2a024a2b05	Changes to allow event-based process monitoring and control.	1997-12-06 04:11:14 +00:00
John Dyson	fd3bf77574	Fix and complete the AIO syscalls. There are some performance enhancements coming up soon, but the code is functional. Docs will be forthcoming.	1997-11-29 01:33:10 +00:00
Poul-Henning Kamp	4a11ca4e29	Remove a bunch of variables which were unused both in GENERIC and LINT. Found by: -Wunused	1997-11-07 08:53:44 +00:00
Bruce Evans	32e4d4c5e6	Use 127 instead of CHAR_MAX for the limit on the sequence count. The limit doesn't have anything to do with characters. The count mainly needs to fit in the VOP_READ() ioflag after being left shifted by 16. Moved vn_lock() before vn_closefile(). vn_lock() was mismerged from Lite2. Removed some gratuitous braces.	1997-10-27 15:26:23 +00:00
John Dyson	e7b0208f61	Relax the vnode locking for read only operations.	1997-10-06 02:38:30 +00:00
Peter Wemm	a2f9bc72c1	vn_select -> vn_poll	1997-09-14 02:51:16 +00:00
Bruce Evans	e4ba6a82b0	Removed unused #includes.	1997-09-02 20:06:59 +00:00
Doug Rabson	42146e3747	[Previous comment was incorrect for these files] Added calls to VFS lock debugging macros to make fixing filesystems' locking easier.	1997-04-04 17:47:43 +00:00
Doug Rabson	de15ef6aef	Add a function vop_sharedlock which a copy of vop_nolock without the implementation #ifdef out. This can be used for now by NFS. As soon as all the other filesystems' locking is fixed, this can go away. Print the vnode address in vprint for easier debugging.	1997-04-04 17:46:21 +00:00
Bruce Evans	2098241054	Don't include <sys/ioctl.h> in the kernel. Stage 4: include <sys/ttycom.h> and sometimes <sys/filio.h> instead of <sys/ioctl.h> in miscellaneous files. Most of these files have nothing to do with ttys but need to include <sys/ttycom.h> to get the definitions of TIOC[SG]PGRP which are (ab)used to convert F[SG]ETOWN fcntls into ioctls.	1997-03-24 11:52:29 +00:00
Bruce Evans	3ac4d1ef0c	Don't #include <sys/fcntl.h> in <sys/file.h> if KERNEL is defined. Fixed everything that depended on getting fcntl.h stuff from the wrong place. Most things don't depend on file.h stuff at all.	1997-03-23 03:37:54 +00:00
Guido van Rooij	dfd0621acc	Fix style bugs and other bugs in the NFS fix.	1997-03-08 15:14:30 +00:00
Gary Palmer	324d42ad57	Fix (I hope) the NFS hole. This is only compile tested. Submitted by: (partly) davids@SECNET.COM via BUGTRAQ	1997-03-07 07:42:41 +00:00
Peter Wemm	6875d25465	Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not ready for it yet.	1997-02-22 09:48:43 +00:00
John Dyson	996c772f58	This is the kernel Lite/2 commit. There are some requisite userland changes, so don't expect to be able to run the kernel as-is (very well) without the appropriate Lite/2 userland changes. The system boots and can mount UFS filesystems. Untested: ext2fs, msdosfs, NFS Known problems: Incorrect Berkeley ID strings in some files. Mount_std mounts will not work until the getfsent library routine is changed. Reviewed by: various people Submitted by: Jeffery Hsu <hsu@freebsd.org>	1997-02-10 02:22:35 +00:00
Jordan K. Hubbard	1130b656e5	Make the long-awaited change from $Id$ to $FreeBSD$ This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so long. Boy, I'm glad we're not using sup anymore. This update would have been insane otherwise.	1997-01-14 07:20:47 +00:00
John Dyson	8b612c4b4a	This commit is the embodiment of some VFS read clustering improvements. Firstly, now our read-ahead clustering is on a file descriptor basis and not on a per-vnode basis. This will allow multiple processes reading the same file to take advantage of read-ahead clustering. Secondly, there previously was a problem with large reads still using the ramp-up algorithm. Of course, that was bogus, and now we read the entire "chunk" off of the disk in one operation. The read-ahead clustering algorithm should use less CPU than the previous also (I hope :-)). NOTE: THAT LKMS MUST BE REBUILT!!!	1996-12-29 02:45:28 +00:00
John Dyson	6476c0d204	Even though this looks like it, this is not a complex code change. The interface into the "VMIO" system has changed to be more consistant and robust. Essentially, it is now no longer necessary to call vn_open to get merged VM/Buffer cache operation, and exceptional conditions such as merged operation of VBLK devices is simpler and more correct. This code corrects a potentially large set of problems including the problems with ktrace output and loaded systems, file create/deletes, etc. Most of the changes to NFS are cosmetic and name changes, eliminating a layer of subroutine calls. The direct calls to vput/vrele have been re-instituted for better cross platform compatibility. Reviewed by: davidg	1996-08-21 21:56:23 +00:00
John Dyson	0f20dc9443	Remove a now unnecessary function prototype.	1996-03-09 06:42:15 +00:00
John Dyson	91477adc6e	Enable VMIO for non-VDIR metadata and block device.	1996-03-02 03:45:12 +00:00
John Dyson	bd7e5f992e	Eliminated many redundant vm_map_lookup operations for vm_mmap. Speed up for vfs_bio -- addition of a routine bqrelse to greatly diminish overhead for merged cache. Efficiency improvement for vfs_cluster. It used to do alot of redundant calls to cluster_rbuild. Correct the ordering for vrele of .text and release of credentials. Use the selective tlb update for 486/586/P6. Numerous fixes to the size of objects allocated for files. Additionally, fixes in the various pagers. Fixes for proper positioning of vnode_pager_setsize in msdosfs and ext2fs. Fixes in the swap pager for exhausted resources. The pageout code will not as readily thrash. Change the page queue flags (PG_ACTIVE, PG_INACTIVE, PG_FREE, PG_CACHE) into page queue indices (PQ_ACTIVE, PQ_INACTIVE, PQ_FREE, PQ_CACHE), thereby improving efficiency of several routines. Eliminate even more unnecessary vm_page_protect operations. Significantly speed up process forks. Make vm_object_page_clean more efficient, thereby eliminating the pause that happens every 30seconds. Make sequential clustered writes B_ASYNC instead of B_DELWRI even in the case of filesystems mounted async. Fix a panic with busy pages when write clustering is done for non-VMIO buffers.	1996-01-19 04:00:31 +00:00
Poul-Henning Kamp	27a0b398a7	Staticize. Unstaticize a function in scsi/scsi_base that was used, with an undocumented option. My last count on the LINT kernel shows: Total symbols: 3647 unref symbols: 463 undef symbols: 4 1 ref symbols: 1751 2 ref symbols: 485 Approaching the pain threshold now.	1995-12-17 21:23:44 +00:00
John Dyson	a316d390bd	Changes to support 1Tb filesizes. Pages are now named by an (object,index) pair instead of (object,offset) pair.	1995-12-11 04:58:34 +00:00
David Greenman	efeaf95a41	Untangled the vm.h include file spaghetti.	1995-12-07 12:48:31 +00:00
David Greenman	d68a41903e	Moved the filesystem read-only check out of the syscalls and into the filesystem layer, as was done in lite-2. Merged in some other cosmetic changes while I was at it. Rewrote most of msdosfs_access() to be more like ufs_access() and to include the FS read-only check. Obtained from: partially from 4.4BSD-lite2	1995-10-22 09:32:48 +00:00
Poul-Henning Kamp	e83e1865c0	A little hack to avoid a 64bit divide. Can go away if Gcc ever learns to optimise 64bit stuff...	1995-10-06 09:43:32 +00:00
David Greenman	24aa09cd4f	vnode_pager_alloc() never returns NULL, so don't check for it.	1995-07-20 09:43:12 +00:00
Bruce Evans	97e156674d	Don't include <sys/tty.h> in drivers that aren't tty drivers or in general files that don't depend on the internals of <sys/tty.h>	1995-07-16 10:13:08 +00:00
David Greenman	24a1cce34f	NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct proc or any VM system structure will have to be rebuilt!!! Much needed overhaul of the VM system. Included in this first round of changes: 1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages, haspage, and sync operations are supported. The haspage interface now provides information about clusterability. All pager routines now take struct vm_object's instead of "pagers". 2) Improved data structures. In the previous paradigm, there is constant confusion caused by pagers being both a data structure ("allocate a pager") and a collection of routines. The idea of a pager structure has escentially been eliminated. Objects now have types, and this type is used to index the appropriate pager. In most cases, items in the pager structure were duplicated in the object data structure and thus were unnecessary. In the few cases that remained, a un_pager structure union was created in the object to contain these items. 3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now be removed. For instance, vm_object_enter(), vm_object_lookup(), vm_object_remove(), and the associated object hash list were some of the things that were removed. 4) simple_lock's removed. Discussion with several people reveals that the SMP locking primitives used in the VM system aren't likely the mechanism that we'll be adopting. Even if it were, the locking that was in the code was very inadequate and would have to be mostly re-done anyway. The locking in a uni-processor kernel was a no-op but went a long way toward making the code difficult to read and debug. 5) Places that attempted to kludge-up the fact that we don't have kernel thread support have been fixed to reflect the reality that we are really dealing with processes, not threads. The VM system didn't have complete thread support, so the comments and mis-named routines were just wrong. We now use tsleep and wakeup directly in the lock routines, for instance. 6) Where appropriate, the pagers have been improved, especially in the pager_alloc routines. Most of the pager_allocs have been rewritten and are now faster and easier to maintain. 7) The pagedaemon pageout clustering algorithm has been rewritten and now tries harder to output an even number of pages before and after the requested page. This is sort of the reverse of the ideal pagein algorithm and should provide better overall performance. 8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup have been removed. Some other unnecessary casts have also been removed. 9) Some almost useless debugging code removed. 10) Terminology of shadow objects vs. backing objects straightened out. The fact that the vm_object data structure escentially had this backwards really confused things. The use of "shadow" and "backing object" throughout the code is now internally consistent and correct in the Mach terminology. 11) Several minor bug fixes, including one in the vm daemon that caused 0 RSS objects to not get purged as intended. 12) A "default pager" has now been created which cleans up the transition of objects to the "swap" type. The previous checks throughout the code for swp->pg_data != NULL were really ugly. This change also provides the rudiments for future backing of "anonymous" memory by something other than the swap pager (via the vnode pager, for example), and it allows the decision about which of these pagers to use to be made dynamically (although will need some additional decision code to do this, of course). 13) (dyson) MAP_COPY has been deprecated and the corresponding "copy object" code has been removed. MAP_COPY was undocumented and non- standard. It was furthermore broken in several ways which caused its behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will continue to work correctly, but via the slightly different semantics of MAP_PRIVATE. 14) (dyson) Sharing maps have been removed. It's marginal usefulness in a threads design can be worked around in other ways. Both #12 and #13 were done to simplify the code and improve readability and maintain- ability. (As were most all of these changes) TODO: 1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing this will reduce the vnode pager to a mere fraction of its current size. 2) Rewrite vm_fault and the swap/vnode pagers to use the clustering information provided by the new haspage pager interface. This will substantially reduce the overhead by eliminating a large number of VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be improved to provide both a "behind" and "ahead" indication of contiguousness. 3) Implement the extended features of pager_haspage in swap_pager_haspage(). It currently just says 0 pages ahead/behind. 4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps via a much more general mechanism that could also be used for disk striping of regular filesystems. 5) Do something to improve the architecture of vm_object_collapse(). The fact that it makes calls into the swap pager and knows too much about how the swap pager operates really bothers me. It also doesn't allow for collapsing of non-swap pager objects ("unnamed" objects backed by other pagers).	1995-07-13 08:48:48 +00:00
David Greenman	06cb725951	Moved call to VOP_GETATTR() out of vnode_pager_alloc() and into the places that call vnode_pager_alloc() so that a failure return can be dealt with. This fixes a panic seen on NFS clients when a file being opened is deleted on the server before the open completes.	1995-07-09 06:58:03 +00:00
David Greenman	6663c3d522	Removed extra semicolon.	1995-06-28 12:32:47 +00:00
David Greenman	aa2cabb958	1) Converted v_vmdata to v_object. 2) Removed unnecessary vm_object_lookup()/pager_cache(object, TRUE) pairs after vnode_pager_alloc() calls - the object is already guaranteed to be persistent. 3) Removed some gratuitous casts.	1995-06-28 12:01:13 +00:00
Rodney W. Grimes	9b2e535452	Remove trailing whitespace.	1995-05-30 08:16:23 +00:00
David Greenman	54ea0c000f	Unlock the vnode before sleeping on an OBJ_DEAD object. Should fix Bruce's hang. Fixed some formatting anomolies and removed some unneeded casts.	1995-05-10 18:59:11 +00:00
David Greenman	50475e8bd3	Removed unnecessary call to vnode_pager_uncache(). We automatically clear the VTEXT flag after all mappers have finished with the object.	1995-03-19 12:08:03 +00:00
Poul-Henning Kamp	9977034cd2	YFfix	1995-02-14 06:31:13 +00:00
David Greenman	0d94caffca	These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a seperate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman	1995-01-09 16:06:02 +00:00
David Greenman	8e58bf6875	Stuff object into v_vmdata rather than pager. Not important which at the moment, but will be in the future. Other changes mostly cosmetic, but are made for future VMIO considerations. Submitted by: John Dyson	1994-10-05 09:48:45 +00:00
Poul-Henning Kamp	797f2d22f0	All of this is cosmetic. prototypes, #includes, printfs and so on. Makes GCC a lot more silent.	1994-10-02 17:35:40 +00:00
David Greenman	605f11c8f2	Moved over my fix for vnode lossage when multiple TIOCSCTTY ioctls are done. This patch was extended to also include a suggested change by Kirk McKusick which allows the control tty to be reasigned to a different tty without losing a vnode.	1994-08-18 03:53:38 +00:00
David Greenman	3c4dd3568f	Added $Id$	1994-08-02 07:55:43 +00:00
Rodney W. Grimes	26f9a76710	The big 4.4BSD Lite to FreeBSD 2.0.0 (Development) patch. Reviewed by: Rodney W. Grimes Submitted by: John Dyson and David Greenman	1994-05-25 09:21:21 +00:00
Rodney W. Grimes	df8bae1de4	BSD 4.4 Lite Kernel Sources	1994-05-24 10:09:53 +00:00

1 2 3 4 5

248 Commits