freebsd-skq

Author	SHA1	Message	Date
alc	a221c3d422	Hold the page queues lock when calling vm_page_flag_clear().	2002-12-27 06:52:32 +00:00
alc	f786588f12	- Hold the kernel_object's lock around vm_page_alloc(kernel_object,...). - Hold the page queues lock around vm_page_wakeup().	2002-12-23 20:10:47 +00:00
mckusick	89e2af024f	The buffer daemon cannot skip over buffers owned by locked inodes as they may be the only viable ones to flush. Thus it will now wait for an inode lock if the other alternatives will result in rollbacks (and immediate redirtying of the buffer). If only buffers with rollbacks are available, one will be flushed, but then the buffer daemon will wait briefly before proceeding. Failing to wait briefly effectively deadlocks a uniprocessor since every other process writing to that filesystem will wait for the buffer daemon to clean up which takes close enough to forever to feel like a deadlock. Reported by: Archie Cobbs <archie@dellroad.org> Sponsored by: DARPA & NAI Labs. Approved by: re	2002-12-14 01:35:30 +00:00
alc	cd298c7231	Hold the page queues/flags lock when calling vm_page_set_validclean(). Approved by: re	2002-11-23 19:10:31 +00:00
alc	5e336b1d19	Now that pmap_remove_all() is exported by our pmap implementations use it directly.	2002-11-16 07:44:25 +00:00
alc	fc8a5bc419	When prot is VM_PROT_NONE, call pmap_page_protect() directly rather than indirectly through vm_page_protect(). The one remaining page flag that is updated by vm_page_protect() is already being updated by our various pmap implementations. Note: A later commit will similarly change the VM_PROT_READ case and eliminate vm_page_protect().	2002-11-10 07:12:04 +00:00
mckusick	be6aa51161	When the number of dirty buffers rises too high, the buf_daemon runs to help clean up. After selecting a potential buffer to write, this patch has it acquire a lock on the vnode that owns the buffer before trying to write it. The vnode lock is necessary to avoid a race with some other process holding the vnode locked and trying to flush its dirty buffers. In particular, if the vnode in question is a snapshot file, then the race can lead to a deadlock. To avoid slowing down the buf_daemon, it does a non-blocking lock request when trying to lock the vnode. If it fails to get the lock it skips over the buffer and continues down its queue looking for buffers to flush. Sponsored by: DARPA & NAI Labs.	2002-10-18 01:29:59 +00:00
phk	c5be3924c3	Remove unused includes. Clarify the intention of a while(); Move a local variable to avoid potential name-confusion.	2002-09-28 17:46:30 +00:00
phk	1dfc2c167f	Be consistent about "static" functions: if the function is marked static in its prototype, mark it static at the definition too. Inspired by: FlexeLint warning #512	2002-09-28 17:15:38 +00:00
phk	107a67d351	Correctly order VI_UNLOCK(), local variables and block comment.	2002-09-28 12:15:44 +00:00
phk	9eadc24323	Make biowait() check bio_error before the BIO_ERROR flag, to propery catch internal GEOM use of bio_error. Sponsored by: DARPA & NAI Labs.	2002-09-26 16:32:14 +00:00
jeff	54956e8ea4	- Lock accesses to v_numoutput. - Lock calls to gbincore.	2002-09-25 02:11:37 +00:00
phk	1ff0331915	s/Danglish/English/ Some style issues. Change the timeout to be hz/10 instead of hz. Brucification by: bde.	2002-09-15 17:52:35 +00:00
phk	cff87abbac	Un-inline the non-trivial "trivial" bio* functions. Untangle devstat_end_transaction_bio()	2002-09-14 19:34:11 +00:00
njl	0590c43070	Remove all use of vnode->v_tag, replacing with appropriate substitutes. v_tag is now const char * and should only be used for debugging. Additionally: 1. All users of VT_NTS now check vfsconf->vf_type VFCF_NETWORK 2. The user of VT_PROCFS now checks for the new flag VV_PROCDEP, which is propagated by pseudofs to all child vnodes if the fs sets PFS_PROCDEP. Suggested by: phk Reviewed by: bde, rwatson (earlier version)	2002-09-14 09:02:28 +00:00
phk	725daf0b2d	Oops, broke the build there. Uninline biodone() now that it is non-trivial. Introduce biowait() function. Currently there is a race condition and the mitigation is a timeout/retry. It is not obvious what kind of locking (if any) is suitable for BIO_DONE, since the majority of users take are of this themselves, and only a few places actually rely on the wakeup. Sponsored by: DARPA & NAI Labs.	2002-09-13 11:28:31 +00:00
peter	c3bdd669c3	Change hw.physmem and hw.usermem to unsigned long like they used to be in the original hardwired sysctl implementation. The buf size calculator still overflows an integer on machines with large KVA (eg: ia64) where the number of pages does not fit into an int. Use 'long' there. Change Maxmem and physmem and related variables to 'long', mostly for completeness. Machines are not likely to overflow 'int' pages in the near term, but then again, 640K ought to be enough for anybody. This comes for free on 32 bit machines, so why not?	2002-08-30 04:04:37 +00:00
charnier	7dd9d47059	Replace various spelling with FALLTHROUGH which is lint()able	2002-08-25 13:23:09 +00:00
jeff	02517b6731	- Replace v_flag with v_iflag and v_vflag - v_vflag is protected by the vnode lock and is used when synchronization with VOP calls is needed. - v_iflag is protected by interlock and is used for dealing with vnode management issues. These flags include X/O LOCK, FREE, DOOMED, etc. - All accesses to v_iflag and v_vflag have either been locked or marked with mp_fixme's. - Many ASSERT_VOP_LOCKED calls have been added where the locking was not clear. - Many functions in vfs_subr.c were restructured to provide for stronger locking. Idea stolen from: BSD/OS	2002-08-04 10:29:36 +00:00
alc	5674611fb4	o Convert two instances of vm_page_sleep_busy() to vm_page_sleep_if_busy() with appropriate page queue locking.	2002-08-03 18:59:19 +00:00
alc	88e0310b1e	o Acquire the page queues lock before calling vm_page_io_finish(). o Assert that the page queues lock is held in vm_page_io_finish().	2002-08-01 17:57:42 +00:00
alc	14df04dfaf	o Replace vm_page_sleep_busy() with vm_page_sleep_if_busy() in vfs_busy_pages().	2002-07-30 20:41:10 +00:00
alc	1926ab155f	o Use vm_page_alloc(... \| VM_ALLOC_WIRED) in place of vm_page_wire().	2002-07-19 19:35:06 +00:00
mckusick	b44cb5787c	Add support to UFS2 to provide storage for extended attributes. As this code is not actually used by any of the existing interfaces, it seems unlikely to break anything (famous last words). The internal kernel interface to manipulate these attributes is invoked using two new IO_ flags: IO_NORMAL and IO_EXT. These flags may be specified in the ioflags word of VOP_READ, VOP_WRITE, and VOP_TRUNCATE. Specifying IO_NORMAL means that you want to do I/O to the normal data part of the file and IO_EXT means that you want to do I/O to the extended attributes part of the file. IO_NORMAL and IO_EXT are mutually exclusive for VOP_READ and VOP_WRITE, but may be specified individually or together in the case of VOP_TRUNCATE. For example, when removing a file, VOP_TRUNCATE is called with both IO_NORMAL and IO_EXT set. For backward compatibility, if neither IO_NORMAL nor IO_EXT is set, then IO_NORMAL is assumed. Note that the BA_ and IO_ flags have been `merged' so that they may both be used in the same flags word. This merger is possible by assigning the IO_ flags to the low sixteen bits and the BA_ flags the high sixteen bits. This works because the high sixteen bits of the IO_ word is reserved for read-ahead and help with write clustering so will never be used for flags. This merge lets us get away from code of the form: if (ioflags & IO_SYNC) flags \|= BA_SYNC; For the future, I have considered adding a new field to the vattr structure, va_extsize. This addition could then be exported through the stat structure to allow applications to find out the size of the extended attribute storage and also would provide a more standard interface for truncating them (via VOP_SETATTR rather than VOP_TRUNCATE). I am also contemplating adding a pathconf parameter (for concreteness, lets call it _PC_MAX_EXTSIZE) which would let an application determine the maximum size of the extended atribute storage. Sponsored by: DARPA & NAI Labs.	2002-07-19 07:29:39 +00:00
alc	e650551ab9	o Lock page queue accesses by vm_page_wire().	2002-07-14 19:45:46 +00:00
alc	a23831a538	o Lock some page queue accesses, in particular, those by vm_page_unwire().	2002-07-13 20:13:34 +00:00
dillon	da4e111a55	Replace the global buffer hash table with per-vnode splay trees using a methodology similar to the vm_map_entry splay and the VM splay that Alan Cox is working on. Extensive testing has appeared to have shown no increase in overhead. Disadvantages Dirties more cache lines during lookups. Not as fast as a hash table lookup (but still N log N and optimal when there is locality of reference). Advantages vnode->v_dirtyblkhd is now perfectly sorted, making fsync/sync/filesystem syncer operate more efficiently. I get to rip out all the old hacks (some of which were mine) that tried to keep the v_dirtyblkhd tailq sorted. The per-vnode splay tree should be easier to lock / SMPng pushdown on vnodes will be easier. This commit along with another that Alan is working on for the VM page global hash table will allow me to implement ranged fsync(), optimize server-side nfs commit rpcs, and implement partial syncs by the filesystem syncer (aka filesystem syncer would detect that someone is trying to get the vnode lock, remembers its place, and skip to the next vnode). Note that the buffer cache splay is somewhat more complex then other splays due to special handling of background bitmap writes (multiple buffers with the same lblkno in the same vnode), and B_INVAL discontinuities between the old hash table and the existence of the buffer on the v_cleanblkhd list. Suggested by: alc	2002-07-10 17:02:32 +00:00
bde	401e8fb129	Fixed some printf format errors (one new one reported by gcc and 3 nearby old ones not reported by gcc). This helps unbreak LINT.	2002-07-08 12:21:11 +00:00
jeff	5d7805963e	Add two asserts that prove & document getblk and geteblk's behavior of returning locked bufs.	2002-07-07 05:27:08 +00:00
jeff	f1b0400267	Fix a mistake in my last commit. Don't grab an extra reference to the object in bp->b_object.	2002-07-06 21:27:20 +00:00
jeff	0dd7645264	Fixup uses of GETVOBJECT. - Cache a pointer to the vnode's object in the buf. - Hold a reference to that object in addition to the vnode's reference just to be consistent. - Cleanup code that got the object indirectly through the vp and VOP calls. This fixes at least one case where we were calling GETVOBJECT without a lock. It also avoids an expensive layered call at the cost of another pointer in struct buf.	2002-07-06 08:59:52 +00:00
mux	abb73c48ca	More 64 bits platforms warning fixes. Reviewed by: rwatson	2002-06-23 18:32:39 +00:00
dillon	0af4a7d8da	Fix a bug in vfs_bio_clrbuf(). The single-page-clrbuf optimization was improperly clearing more then just the invalid portions of the page. (This bug is not known to have been triggered by anything). Submitted by: tegge MFC after: 7 days	2002-06-22 19:09:35 +00:00
mckusick	88d85c15ef	This commit adds basic support for the UFS2 filesystem. The UFS2 filesystem expands the inode to 256 bytes to make space for 64-bit block pointers. It also adds a file-creation time field, an ability to use jumbo blocks per inode to allow extent like pointer density, and space for extended attributes (up to twice the filesystem block size worth of attributes, e.g., on a 16K filesystem, there is space for 32K of attributes). UFS2 fully supports and runs existing UFS1 filesystems. New filesystems built using newfs can be built in either UFS1 or UFS2 format using the -O option. In this commit UFS1 is the default format, so if you want to build UFS2 format filesystems, you must specify -O 2. This default will be changed to UFS2 when UFS2 proves itself to be stable. In this commit the boot code for reading UFS2 filesystems is not compiled (see /sys/boot/common/ufsread.c) as there is insufficient space in the boot block. Once the size of the boot block is increased, this code can be defined. Things to note: the definition of SBSIZE has changed to SBLOCKSIZE. The header file <ufs/ufs/dinode.h> must be included before <ufs/ffs/fs.h> so as to get the definitions of ufs2_daddr_t and ufs_lbn_t. Still TODO: Verify that the first level bootstraps work for all the architectures. Convert the utility ffsinfo to understand UFS2 and test growfs. Add support for the extended attribute storage. Update soft updates to ensure integrity of extended attribute storage. Switch the current extended attribute interfaces to use the extended attribute storage. Add the extent like functionality (framework is there, but is currently never used). Sponsored by: DARPA & NAI Labs. Reviewed by: Poul-Henning Kamp <phk@freebsd.org>	2002-06-21 06:18:05 +00:00
phk	b98dc1ffce	Use "bwrbg" as description when we sleep for background writing, "biord" was misleading in every possible way.	2002-06-06 08:56:10 +00:00
phk	536c2f0f78	Remove a six year old undocumented #ifdef : NO_B_MALLOC.	2002-05-04 19:24:55 +00:00
jhb	db9aa81e23	Change callers of mtx_init() to pass in an appropriate lock type name. In most cases NULL is passed, but in some cases such as network driver locks (which use the MTX_NETWORK_LOCK macro) and UMA zone locks, a name is used. Tested on: i386, alpha, sparc64	2002-04-04 21:03:38 +00:00
dillon	9a85737b15	brelse() was improperly clearing B_DELWRI in the B_DELWRI\|B_INVAL case without removing the buffer from the vnode's dirty buffer list, which can result in a panic in NFS. Replaced the code with a call to bundirty() which deals with it properly. PR: kern/36108, kern/36174 Submitted by: various people Special mention: to Danny Schales <dan@coes.LaTech.edu> for providing a core dump that helped me track this down. MFC after: 1 day	2002-04-03 00:17:36 +00:00
alfred	357e37e023	Remove __P.	2002-03-19 21:25:46 +00:00
bde	df8144b98e	Fixed some printf format errors (hopefully all of the remaining daddr64_t ones for GENERIC, and all others on the same line as those). Reformat the printfs if necessary to avoid new long lones or old format printf errors.	2002-03-19 04:09:21 +00:00
jake	34dcf8975d	Convert all pmap_kenter/pmap_kremove pairs in MI code to use pmap_qenter/ pmap_qremove. pmap_kenter is not safe to use in MI code because it is not guaranteed to flush the mapping from the tlb on all cpus. If the process in question is preempted and migrates cpus between the call to pmap_kenter and pmap_kremove, the original cpu will be left with stale mappings in its tlb. This is currently not a problem for i386 because we do not use PG_G on SMP, and thus all mappings are flushed from the tlb on context switches, not just user mappings. This is not the case on all architectures, and if PG_G is to be used with SMP on i386 it will be a problem. This was committed by peter earlier as part of his fine grained tlb shootdown work for i386, which was backed out for other reasons. Reviewed by: peter	2002-03-17 00:56:41 +00:00
mckusick	e929f2e4f0	Introduce the new 64-bit size disk block, daddr64_t. Change the bio and buffer structures to have daddr64_t bio_pblkno, b_blkno, and b_lblkno fields which allows access to disks larger than a Terabyte in size. This change also requires that the VOP_BMAP vnode operation accept and return daddr64_t blocks. This delta should not affect system operation in any way. It merely sets up the necessary interfaces to allow the development of disk drivers that work with these larger disk block addresses. It also allows for the development of UFS2 which will use 64-bit block addresses.	2002-03-15 18:49:47 +00:00
alfred	b0fd50345a	Giant pushdown for read/write/pread/pwrite syscalls. kern/kern_descrip.c: Aquire Giant in fdrop_locked when file refcount hits zero, this removes the requirement for the caller to own Giant for the most part. kern/kern_ktrace.c: Aquire Giant in ktrgenio, simplifies locking in upper read/write syscalls. kern/vfs_bio.c: Aquire Giant in bwillwrite if needed. kern/sys_generic.c Giant pushdown, remove Giant for: read, pread, write and pwrite. readv and writev aren't done yet because of the possible malloc calls for iov to uio processing. kern/sys_socket.c Grab giant in the socket fo_read/write functions. kern/vfs_vnops.c Grab giant in the vnode fo_read/write functions.	2002-03-15 08:03:46 +00:00
eivind	e97f2e798c	* Move bswlist declaration and initialization from kern/vfs_bio.c to vm/vm_pager.c, which is the only place it is used. * Make the QUEUE_* definitions and bufqueues local to vfs_bio.c. * constify buf_wmesg.	2002-03-05 18:20:58 +00:00
eivind	05c20bbea2	Document all functions, global and static variables, and sysctls. Includes some minor whitespace changes, and re-ordering to be able to document properly (e.g, grouping of variables and the SYSCTL macro calls for them, where the documentation has been added.) Reviewed by: phk (but all errors are mine)	2002-03-05 15:38:49 +00:00
peter	f2dee2e96f	Back out all the pmap related stuff I've touched over the last few days. There is some unresolved badness that has been eluding me, particularly affecting uniprocessor kernels. Turning off PG_G helped (which is a bad sign) but didn't solve it entirely. Userland programs still crashed.	2002-02-27 09:51:33 +00:00
peter	2051cd9c72	Jake further reduced IPI shootdowns on sparc64 in loops by using ranged shootdowns in a couple of key places. Do the same for i386. This also hides some physical addresses from higher levels and has it use the generic vm_page_t's instead. This will help for PAE down the road. Obtained from: jake (MI code, suggestions for MD part)	2002-02-27 02:14:58 +00:00
phk	b9b775cf13	GC: BIO_ORDERED, various infrastructure dealing with BIO_ORDERED.	2002-02-22 09:26:35 +00:00
phk	62d248fb9e	Replace bowrite() with BUF_WRITE in ufs. Remove bowrite(), it is now unused. This is the first step in getting entirely rid of BIO_ORDERED which is a generally accepted evil thing. Approved by: mckusick	2002-02-22 09:03:00 +00:00
dillon	8abd6168f2	GC P_BUFEXHAUST leftovers, we've had a new mechanism to avoid buffer cache lockups for over a year now. MFC after: 0 days	2002-01-31 18:39:44 +00:00
dillon	cd4d323ad3	This fixes a large number of bugs in our NFS client side code. A recent commit by Kirk also fixed a softupdates bug that could easily be triggered by server side NFS. * An edge case with shared R+W mmap()'s and truncate whereby the system would inappropriately clear the dirty bits on still-dirty data. (applicable to all filesystems) THIS FIX TEMPORARILY DISABLED PENDING FURTHER TESTING. see vm/vm_page.c line 1641 * The straddle case for VM pages and buffer cache buffers when truncating. (applicable to NFS client side) * Possible SMP database corruption due to vm_pager_unmap_page() not clearing the TLB for the other cpu's. (applicable to NFS client side but could effect all filesystems). Note: not considered serious since the corruption occurs beyond the file EOF. * When flusing a dirty buffer due to B_CACHE getting cleared, we were accidently setting B_CACHE again (that is, bwrite() sets B_CACHE), when we really want it to stay clear after the write is complete. This resulted in a corrupt buffer. (applicable to all filesystems but probably only triggered by NFS) * We have to call vtruncbuf() when ftruncate()ing to remove any buffer cache buffers. This is still tentitive, I may be able to remove it due to the second bug fix. (applicable to NFS client side) * vnode_pager_setsize() race against nfs_vinvalbuf()... we have to set n_size before calling nfs_vinvalbuf or the NFS code may recursively vnode_pager_setsize() to the original value before the truncate. This is what was causing the user mmap bus faults in the nfs tester program. (applicable to NFS client side) * Fix to softupdates (see ufs/ffs/ffs_inode.c 1.73, commit made by Kirk). Testing program written by: Avadis Tevanian, Jr. Testing program supplied by: jkh / Apple (see Dec2001 posting to freebsd-hackers with Subject 'NFS: How to make FreeBS fall on its face in one easy step') MFC after: 1 week	2001-12-14 01:16:57 +00:00
dillon	6e9238ff3f	The nbuf calculation was assuming that PAGE_SIZE = 4096 bytes, which is bogus. The calculation has been adjusted to use units of kilobytes. Noticed by: Chad David <davidc@acns.ab.ca> MFC after: 1 week	2001-12-08 20:37:08 +00:00
dillon	08792e81f7	Placemark an interrupt race in -current which is currently protected by Giant. -stable will get spl*() fixes for the race. Reported by: Rob Anderson <rob@isilon.com> MFC after: 0 days	2001-11-08 18:09:18 +00:00
dillon	1147eaf58a	Implement IO_NOWDRAIN and B_NOWDRAIN - prevents the buffer cache from blocking in wdrain during a write. This flag needs to be used in devices whos strategy routines turn-around and issue another high level I/O, such as when MD turns around and issues a VOP_WRITE to vnode backing store, in order to avoid deadlocking the dirty buffer draining code. Remove a vprintf() warning from MD when the backing vnode is found to be in-use. The syncer of buf_daemon could be flushing the backing vnode at the time of an MD operation so the warning is not correct. MFC after: 1 week	2001-11-05 18:48:54 +00:00
dillon	c6710f54b0	Documentation MFC after: 1 day	2001-10-21 06:26:55 +00:00
jhb	4806d88677	Change the kernel's ucred API as follows: - crhold() returns a reference to the ucred whose refcount it bumps. - crcopy() now simply copies the credentials from one credential to another and has no return value. - a new crshared() primitive is added which returns true if a ucred's refcount is > 1 and false (0) otherwise.	2001-10-11 23:38:17 +00:00
dillon	c77ea66d88	Enable vmiodirenable by default. Remove incorrect comment from sysctl.conf. MFC after: 1 week	2001-09-26 19:35:04 +00:00
julian	5596676e6c	KSE Milestone 2 Note ALL MODULES MUST BE RECOMPILED make the kernel aware that there are smaller units of scheduling than the process. (but only allow one thread per process at this time). This is functionally equivalent to teh previousl -current except that there is a thread associated with each process. Sorry john! (your next MFC will be a doosie!) Reviewed by: peter@freebsd.org, dillon@freebsd.org X-MFC after: ha ha ha ha	2001-09-12 08:38:13 +00:00
dillon	f658586bd8	Remove the code that limited the buffer_map to 1/2 the size of the kernel_map. maxbcache takes care of this now and the 1/2 limit can interfere with testing. Suggested by: bde	2001-08-22 18:10:37 +00:00
dillon	abe30f58d8	Move most of the kernel submap initialization code, including the timeout callwheel and buffer cache, out of the platform specific areas and into the machine independant area. i386 and alpha adjusted here. Other cpus can be fixed piecemeal. Reviewed by: freebsd-smp, jake	2001-08-22 04:07:27 +00:00
peter	6ca5d5c5c5	Revert previous accidental commit. FWIW, it was part of enabling VM caching of disks through mmap() and stopping syncing of open files that had their last reference in the fs removed (ie: their unsync'ed pages get discarded on close already, so I made it stop syncing too).	2001-07-27 15:57:17 +00:00
peter	18bc463cb6	Fix cut/paste blunder. Serves me right for doing a last minute tweak to what I had for some time. Submitted by: bde	2001-07-27 15:52:49 +00:00
dillon	e028603b7e	With Alfred's permission, remove vm_mtx in favor of a fine-grained approach (this commit is just the first stage). Also add various GIANT_ macros to formalize the removal of Giant, making it easy to test in a more piecemeal fashion. These macros will allow us to test fine-grained locks to a degree before removing Giant, and also after, and to remove Giant in a piecemeal fashion via sysctl's on those subsystems which the authors believe can operate without Giant.	2001-07-04 16:20:28 +00:00
dillon	a179ee09ab	This patch implements O_DIRECT about 80% of the way. It takes a patchset Tor created a while ago, removes the raw I/O piece (that has cache coherency problems), and adds a buffer cache / VM freeing piece. Essentially this patch causes O_DIRECT I/O to not be left in the cache, but does not prevent it from going through the cache, hence the 80%. For the last 20% we need a method by which the I/O can be issued directly to buffer supplied by the user process and bypass the buffer cache entirely, but still maintain cache coherency. I also have the code working under -stable but the changes made to sys/file.h may not be MFCable, so an MFC is not on the table yet. Submitted by: tegge, dillon	2001-05-24 07:22:27 +00:00
jhb	7de84bf1d3	- Always call bfreekva() w/o vm_mtx held. - Always call vfs_setdirty() with vm_mtx held. - Fix an old comment: vm_hold_unload_pages is called vm_hold_free_pages() nowadays. - Always call vm_hold_free_pages() w/o vm_mtx held.	2001-05-23 22:24:49 +00:00
alfred	a3f0842419	Introduce a global lock for the vm subsystem (vm_mtx). vm_mtx does not recurse and is required for most low level vm operations. faults can not be taken without holding Giant. Memory subsystems can now call the base page allocators safely. Almost all atomic ops were removed as they are covered under the vm mutex. Alpha and ia64 now need to catch up to i386's trap handlers. FFS and NFS have been tested, other filesystems will need minor changes (grabbing the vm lock when twiddling page properties). Reviewed (partially) by: jake, jhb	2001-05-19 01:28:09 +00:00
grog	4b9d9cbaac	Revert consequences of changes to mount.h, part 2. Requested by: bde	2001-04-29 02:45:39 +00:00
grog	1f5de30718	Correct #includes to work with fixed sys/mount.h.	2001-04-23 09:05:15 +00:00
phk	11bb4116b3	bread() is a special case of breadn(), so don't replicate code.	2001-04-18 07:16:07 +00:00
phk	676302e684	Write a switch statement as less obscure if statements.	2001-04-17 20:22:07 +00:00
phk	378e561228	This patch removes the VOP_BWRITE() vector. VOP_BWRITE() was a hack which made it possible for NFS client side to use struct buf with non-bio backing. This patch takes a more general approach and adds a bp->b_op vector where more methods can be added. The success of this patch depends on bp->b_op being initialized all relevant places for some value of "relevant" which is not easy to determine. For now the buffers have grown a b_magic element which will make such issues a tiny bit easier to debug.	2001-04-17 08:56:39 +00:00
mckusick	ba66879022	Add debugging option to always read/write cylinder groups as full sized blocks. To enable this option, use: `sysctl -w debug.bigcgs=1'. Add debugging option to disable background writes of cylinder groups. To enable this option, use: `sysctl -w debug.dobkgrdwrite=0'. These debugging options should be tried on systems that are panicing with corrupted cylinder group maps to see if it makes the problem go away. The set of panics in question are: ffs_clusteralloc: map mismatch ffs_nodealloccg: map corrupted ffs_nodealloccg: block not in map ffs_alloccg: map corrupted ffs_alloccg: block not in map ffs_alloccgblk: cyl groups corrupted ffs_alloccgblk: can't find blk in cyl ffs_checkblk: partially free fragment The following panics are less likely to be related to this problem, but might be helped by these debugging options: ffs_valloc: dup alloc ffs_blkfree: freeing free block ffs_blkfree: freeing free frag ffs_vfree: freeing free inode If you try these options, please report whether they helped reduce your bitmap corruption panics to Kirk McKusick at <mckusick@mckusick.com> and to Matt Dillon <dillon@earth.backplane.com>.	2001-04-17 05:37:51 +00:00
dillon	f1c1ec2bab	Fix lockup for loopback NFS mounts. The pipelined I/O limitations could be hit on the client side and prevent the server side from retiring writes. Pipeline operations turned off for all READs (no big loss since reads are usually synchronous) and for NFS writes, and left on for the default bwrite(). (MFC expected prior to 4.3 freeze) Testing by: mjacob, dillon	2001-02-28 04:13:11 +00:00
bmilekic	f364d4ac36	Change and clean the mutex lock interface. mtx_enter(lock, type) becomes: mtx_lock(lock) for sleep locks (MTX_DEF-initialized locks) mtx_lock_spin(lock) for spin locks (MTX_SPIN-initialized) similarily, for releasing a lock, we now have: mtx_unlock(lock) for MTX_DEF and mtx_unlock_spin(lock) for MTX_SPIN. We change the caller interface for the two different types of locks because the semantics are entirely different for each case, and this makes it explicitly clear and, at the same time, it rids us of the extra `type' argument. The enter->lock and exit->unlock change has been made with the idea that we're "locking data" and not "entering locked code" in mind. Further, remove all additional "flags" previously passed to the lock acquire/release routines with the exception of two: MTX_QUIET and MTX_NOSWITCH The functionality of these flags is preserved and they can be passed to the lock/unlock routines by calling the corresponding wrappers: mtx_{lock, unlock}_flags(lock, flag(s)) and mtx_{lock, unlock}_spin_flags(lock, flag(s)) for MTX_DEF and MTX_SPIN locks, respectively. Re-inline some lock acq/rel code; in the sleep lock case, we only inline the _obtain_lock()s in order to ensure that the inlined code fits into a cache line. In the spin lock case, we inline recursion and actually only perform a function call if we need to spin. This change has been made with the idea that we generally tend to avoid spin locks and that also the spin locks that we do have and are heavily used (i.e. sched_lock) do recurse, and therefore in an effort to reduce function call overhead for some architectures (such as alpha), we inline recursion for this case. Create a new malloc type for the witness code and retire from using the M_DEV type. The new type is called M_WITNESS and is only declared if WITNESS is enabled. Begin cleaning up some machdep/mutex.h code - specifically updated the "optimized" inlined code in alpha/mutex.h and wrote MTX_LOCK_SPIN and MTX_UNLOCK_SPIN asm macros for the i386/mutex.h as we presently need those. Finally, caught up to the interface changes in all sys code. Contributors: jake, jhb, jasone (in no particular order)	2001-02-09 06:11:45 +00:00
dillon	c8a95a285d	This commit represents work mainly submitted by Tor and slightly modified by myself. It solves a serious vm_map corruption problem that can occur with the buffer cache when block sizes > 64K are used. This code has been heavily tested in -stable but only tested somewhat on -current. An MFC will occur in a few days. My additions include the vm_map_simplify_entry() and minor buffer cache boundry case fix. Make the buffer cache use a system map for buffer cache KVM rather then a normal map. Ensure that VM objects are not allocated for system maps. There were cases where a buffer map could wind up with a backing VM object -- normally harmless, but this could also result in the buffer cache blocking in places where it assumes no blocking will occur, possibly resulting in corrupted maps. Fix a minor boundry case in the buffer cache size limit is reached that could result in non-optimal code. Add vm_map_simplify_entry() calls to prevent 'creeping proliferation' of vm_map_entry's in the buffer cache's vm_map. Previously only a simple linear optimization was made. (The buffer vm_map typically has only a handful of vm_map_entry's. This stabilizes it at that level permanently). PR: 20609 Submitted by: (Tor Egge) tegge	2001-02-04 06:19:28 +00:00
jake	4f5d8ed825	Use PCPU_GET, PCPU_PTR and PCPU_SET to access all per-cpu variables other then curproc.	2001-01-10 04:43:51 +00:00
dillon	fd223545d4	This implements a better launder limiting solution. There was a solution in 4.2-REL which I ripped out in -stable and -current when implementing the low-memory handling solution. However, maxlaunder turns out to be the saving grace in certain very heavily loaded systems (e.g. newsreader box). The new algorithm limits the number of pages laundered in the first pageout daemon pass. If that is not sufficient then suceessive will be run without any limit. Write I/O is now pipelined using two sysctls, vfs.lorunningspace and vfs.hirunningspace. This prevents excessive buffered writes in the disk queues which cause long (multi-second) delays for reads. It leads to more stable (less jerky) and generally faster I/O streaming to disk by allowing required read ops (e.g. for indirect blocks and such) to occur without interrupting the write stream, amoung other things. NOTE: eventually, filesystem write I/O pipelining needs to be done on a per-device basis. At the moment it is globalized.	2000-12-26 19:41:38 +00:00
jhb	cdfe59aac6	Stick the kthread API in a kthread_* namespace, and the specialized kproc functions in a kproc_* namespace. Reviewed by: -arch	2000-12-15 20:08:20 +00:00
dillon	2ace352085	Implement a low-memory deadlock solution. Removed most of the hacks that were trying to deal with low-memory situations prior to now. The new code is based on the concept that I/O must be able to function in a low memory situation. All major modules related to I/O (except networking) have been adjusted to allow allocation out of the system reserve memory pool. These modules now detect a low memory situation but rather then block they instead continue to operate, then return resources to the memory pool instead of cache them or leave them wired. Code has been added to stall in a low-memory situation prior to a vnode being locked. Thus situations where a process blocks in a low-memory condition while holding a locked vnode have been reduced to near nothing. Not only will I/O continue to operate, but many prior deadlock conditions simply no longer exist. Implement a number of VFS/BIO fixes (found by Ian): in biodone(), bogus-page replacement code, the loop was not properly incrementing loop variables prior to a continue statement. We do not believe this code can be hit anyway but we aren't taking any chances. We'll turn the whole section into a panic (as it already is in brelse()) after the release is rolled. In biodone(), the foff calculation was incorrectly clamped to the iosize, causing the wrong foff to be calculated for pages in the case of an I/O error or biodone() called without initiating I/O. The problem always caused a panic before. Now it doesn't. The problem is mainly an issue with NFS. Fixed casts for ~PAGE_MASK. This code worked properly before only because the calculations use signed arithmatic. Better to properly extend PAGE_MASK first before inverting it for the 64 bit masking op. In brelse(), the bogus_page fixup code was improperly throwing away the original contents of 'm' when it did the j-loop to fix the bogus pages. The result was that it would potentially invalidate parts of the WRONG page(!), leading to corruption. There may still be cases where a background bitmap write is being duplicated, causing potential corruption. We have identified a potentially serious bug related to this but the fix is still TBD. So instead this patch contains a KASSERT to detect the problem and panic the machine rather then continue to corrupt the filesystem. The problem does not occur very often.. it is very hard to reproduce, and it may or may not be the cause of the corruption people have reported. Review by: (VFS/BIO: mckusick, Ian Dowse <iedowse@maths.tcd.ie>) Testing by: (VM/Deadlock) Paul Saab <ps@yahoo-inc.com>	2000-11-18 23:06:26 +00:00
phk	4e063f5534	Take VBLK devices further out of their missery. This should fix the panic I introduced in my previous commit on this topic.	2000-11-02 21:14:13 +00:00
jhb	d944886e4d	Catch up to moving headers: - machine/ipl.h -> sys/ipl.h - machine/mutex.h -> sys/mutex.h	2000-10-20 07:58:15 +00:00
jasone	4e290e67b7	Convert lockmgr locks from using simple locks to using mutexes. Add lockdestroy() and appropriate invocations, which corresponds to lockinit() and must be called to clean up after a lockmgr lock is no longer needed.	2000-10-04 01:29:17 +00:00
bp	a7bc78c86d	Add three new VOPs: VOP_CREATEVOBJECT, VOP_DESTROYVOBJECT and VOP_GETVOBJECT. They will be used by nullfs and other stacked filesystems to support full cache coherency. Reviewed in general by: mckusick, dillon	2000-09-12 09:49:08 +00:00
jasone	769e0f974d	Major update to the way synchronization is done in the kernel. Highlights include: * Mutual exclusion is used instead of spl(). See mutex(9). (Note: The alpha port is still in transition and currently uses both.) Per-CPU idle processes. * Interrupts are run in their own separate kernel threads and can be preempted (i386 only). Partially contributed by: BSDi (BSD/OS) Submissions by (at least): cp, dfr, dillon, grog, jake, jhb, sheldonh	2000-09-07 01:33:02 +00:00
mckusick	b4eb939d51	Now that buffer locks can be recursive, we need to delete the panics that complain about them. Obtained from: Brian Fundakowski Feldman <green@FreeBSD.org>	2000-07-25 18:28:46 +00:00
mckusick	a3d0c189ea	Add snapshots to the fast filesystem. Most of the changes support the gating of system calls that cause modifications to the underlying filesystem. The gating can be enabled by any filesystem that needs to consistently suspend operations by adding the vop_stdgetwritemount to their set of vnops. Once gating is enabled, the function vfs_write_suspend stops all new write operations to a filesystem, allows any filesystem modifying system calls already in progress to complete, then sync's the filesystem to disk and returns. The function vfs_write_resume allows the suspended write operations to begin again. Gating is not added by default for all filesystems as for SMP systems it adds two extra locks to such critical kernel paths as the write system call. Thus, gating should only be added as needed. Details on the use and current status of snapshots in FFS can be found in /sys/ufs/ffs/README.snapshot so for brevity and timelyness is not included here. Unless and until you create a snapshot file, these changes should have no effect on your system (famous last words).	2000-07-11 22:07:57 +00:00
phk	4ec91666fa	Virtualizes & untangles the bioops operations vector. Ref: Message-ID: <18317.961014572@critter.freebsd.dk> To: current@	2000-06-16 08:48:51 +00:00
jake	961b97d434	Back out the previous change to the queue(3) interface. It was not discussed and should probably not happen. Requested by: msmith and others	2000-05-26 02:09:24 +00:00
jake	d93fbc9916	Change the way that the queue(3) structures are declared; don't assume that the type argument to _HEAD and _ENTRY is a struct. Suggested by: phk Reviewed by: phk Approved by: mdodd	2000-05-23 20:41:01 +00:00
phk	36c3965ff9	Separate the struct bio related stuff out of <sys/buf.h> into <sys/bio.h>. <sys/bio.h> is now a prerequisite for <sys/buf.h> but it shall not be made a nested include according to bdes teachings on the subject of nested includes. Diskdrivers and similar stuff below specfs::strategy() should no longer need to include <sys/buf.> unless they need caching of data. Still a few bogus uses of struct buf to track down. Repocopy by: peter	2000-05-05 09:59:14 +00:00
phk	8f6a76b4dd	Give struct bio it's own call back mechanism.	2000-05-01 13:36:25 +00:00
phk	0fa9118e2c	Hmm, diff/patch still doesn't like me. Missed one s/biowait/bufwait/g	2000-04-30 06:16:03 +00:00
phk	1931990da0	s/biowait/bufwait/g Prodded by: several.	2000-04-29 16:25:22 +00:00
phk	24992f67a9	Remove a leftover dysonism.	2000-04-29 16:14:10 +00:00
phk	6be1308ad1	Remove ~25 unneeded #include <sys/conf.h> Remove ~60 unneeded #include <sys/malloc.h>	2000-04-19 14:58:28 +00:00
phk	d6ecb7a574	Don't declare common variables in include files: move buftimelock til vfs_bio.c where it is initialized.	2000-04-18 11:21:28 +00:00
phk	aaaef0b54e	Complete the bio/buf divorce for all code below devfs::strategy Exceptions: Vinum untouched. This means that it cannot be compiled. Greg Lehey is on the case. CCD not converted yet, casts to struct buf (still safe) atapi-cd casts to struct buf to examine B_PHYS	2000-04-15 05:54:02 +00:00
phk	8ee11d587f	Move B_ERROR flag to b_ioflags and call it BIO_ERROR. (Much of this done by script) Move B_ORDERED flag to b_ioflags and call it BIO_ORDERED. Move b_pblkno and b_iodone_chain to struct bio while we transition, they will be obsoleted once bio structs chain/stack. Add bio_queue field for struct bio aware disksort. Address a lot of stylistic issues brought up by bde.	2000-04-02 15:24:56 +00:00
phk	be39d0d9f1	Draw the outline of "struct bio". Struct bio is the future carrier of I/O requests for "struct buf".	2000-04-02 09:26:51 +00:00
dillon	8fb4c6b599	Commit the buffer cache cleanup patch to 4.x and 5.x. This patch fixes a fragmentation problem due to geteblk() reserving too much space for the buffer and imposes a larger granularity (16K) on KVA reservations for the buffer cache to avoid fragmentation issues. The buffer cache size calculations have been redone to simplify them (fewer defines, better comments, less chance of running out of KVA). The geteblk() fix solves a performance problem that DG was able reproduce. This patch does not completely fix the KVA fragmentation problems, but it goes a long way Mostly Reviewed by: bde and others Approved by: jkh	2000-03-27 21:29:33 +00:00
phk	5df766a0f8	Rename the existing BUF_STRATEGY() to DEV_STRATEGY() substitute BUF_WRITE(foo) for VOP_BWRITE(foo->b_vp, foo) substitute BUF_STRATEGY(foo) for VOP_STRATEGY(foo->b_vp, foo) This patch is machine generated except for the ccd.c and buf.h parts.	2000-03-20 11:29:10 +00:00
phk	a246e10f55	Remove B_READ, B_WRITE and B_FREEBUF and replace them with a new field in struct buf: b_iocmd. The b_iocmd is enforced to have exactly one bit set. B_WRITE was bogusly defined as zero giving rise to obvious coding mistakes. Also eliminate the redundant struct buf flag B_CALL, it can just as efficiently be done by comparing b_iodone to NULL. Should you get a panic or drop into the debugger, complaining about "b_iocmd", don't continue. It is likely to write on your disk where it should have been reading. This change is a step in the direction towards a stackable BIO capability. A lot of this patch were machine generated (Thanks to style(9) compliance!) Vinum users: Greg has not had time to test this yet, be careful.	2000-03-20 10:44:49 +00:00
phk	6b3385b773	Eliminate the undocumented, experimental, non-delivering and highly dangerous MAX_PERF option.	2000-03-16 08:51:55 +00:00
mckusick	41c200930c	Need to reset the buffer pointer to avoid reconsidering the same buffer again (without this the rollback analysis was being lost). Should reduce the write count for most workloads. Submitted by: Craig A Soules <soules+@andrew.cmu.edu>	2000-01-18 02:13:26 +00:00
phk	ae0c1ec8f7	Give vn_isdisk() a second argument where it can return a suitable errno. Suggested by: bde	2000-01-10 12:04:27 +00:00
mckusick	d4409da210	Several performance improvements for soft updates have been added: 1) Fastpath deletions. When a file is being deleted, check to see if it was so recently created that its inode has not yet been written to disk. If so, the delete can proceed to immediately free the inode. 2) Background writes: No file or block allocations can be done while the bitmap is being written to disk. To avoid these stalls, the bitmap is copied to another buffer which is written thus leaving the original available for futher allocations. 3) Link count tracking. Constantly track the difference in i_effnlink and i_nlink so that inodes that have had no change other than i_effnlink need not be written. 4) Identify buffers with rollback dependencies so that the buffer flushing daemon can choose to skip over them.	2000-01-10 00:24:24 +00:00
luoqi	e100d44d55	Introduce a mechanism to suspend/resume system processes. Suspend syncer and bufdaemon prior to disk sync during system shutdown.	2000-01-07 08:36:44 +00:00
dillon	885c27b720	Reimplement buf_daemon / getnewbuf() interaction for dealing with stressful situations. buf_daemon now makes a distinction between being woken up and its sleep timing out, and as a consequence is now much better able to dynamically tune itself to its environment. Reviewed by: Alfred Perlstein <bright@wintelcom.net>	1999-12-20 20:28:40 +00:00
mckusick	a7a8ed1423	Collect read and write counts for filesystems. This new code drops the counting in bwrite and puts it all in spec_strategy. I did some tests and verified that the counts collected for writes in spec_strategy is identical to the counts that we previously collected in bwrite. We now also get read counts (async reads come from requests for read-ahead blocks). Note that you need to compile a new version of mount to get the read counts printed out. The old mount binary is completely compatible, the only reason to install a new mount is to get the read counts printed. Submitted by: Craig A Soules <soules+@andrew.cmu.edu> Reviewed by: Kirk McKusick <mckusick@mckusick.com>	1999-12-01 02:09:30 +00:00
phk	1848d96439	Convert various pieces of code to use vn_isdisk() rather than checking for vp->v_type == VBLK. In ccd: we don't need to call VOP_GETATTR to find the type of a vnode. Reviewed by: sos	1999-11-22 10:33:55 +00:00
phk	8fca18de89	This is a partial commit of the patch from PR 14914: Alot of the code in sys/kern directly accesses the Q_HEAD and Q_ENTRY structures for list operations. This patch makes all list operations in sys/kern use the queue(3) macros, rather than directly accessing the *Q_{HEAD,ENTRY} structures. This batch of changes compile to the same object files. Reviewed by: phk Submitted by: Jake Burkholder <jake@checker.org> PR: 14914	1999-11-16 10:56:05 +00:00
phk	91b84ac6af	Remove a #define which doesn't do miracles anymore.	1999-10-30 09:31:52 +00:00
phk	8e3c3eafed	useracc() the prequel: Merge the contents (less some trivial bordering the silly comments) of <vm/vm_prot.h> and <vm/vm_inherit.h> into <vm/vm.h>. This puts the #defines for the vm_inherit_t and vm_prot_t types next to their typedefs. This paves the road for the commit to follow shortly: change useracc() to use VM_PROT_{READ\|WRITE} rather than B_{READ\|WRITE} as argument.	1999-10-29 18:09:36 +00:00
dillon	16d1a2d1fd	Adjust the buffer cache to better handle small-memory machines. A slightly older version of this code was tested by BDE and I. Also fixes a lockup situation when kva gets too fragmented. Remove the maxvmiobufspace variable and sysctl, they are no longer used. Also cleanup (remove) #if 0 sections from prior commits. This code is more of a hack, but presumably the whole buffer cache implementation is going to be rewritten in the next year so it's no big deal.	1999-10-24 03:27:28 +00:00
peter	787140aa42	Trim unused options (or #ifdef for undoc options). Submitted by: phk	1999-10-11 15:19:12 +00:00
dt	9b2e6d0ad7	Count bogus_page as wired.	1999-09-30 07:39:20 +00:00
dillon	5007145b49	Fix bug in brelse() regarding redirtying buffers on B_ERROR. brelse() improperly ignored the B_INVAL flag when acting on the B_ERROR. If both B_INVAL and B_ERROR are set the buffer is typically out of the underlying device's block range and must be destroyed. If only B_ERROR is set (for a write), a write error occured and operation remains as it was before: the buffer must be redirtied to avoid corrupting the filesystem state. Reviewed by: David Greenman <dg@root.com> Submitted by: Tor.Egge@fast.no	1999-09-20 16:19:24 +00:00
luoqi	dfe06dcb7d	Allow getblk() to be called from an idle context (by panic() inside an interrupt handler). Reviewed by: dillon	1999-09-03 17:49:25 +00:00
peter	3b842d34e8	$Id$ -> $FreeBSD$	1999-08-28 01:08:13 +00:00
bde	faf95163e9	Cast pointers to uintptr_t instead of casting them to u_long, and/or vice versa. Cosmetic.	1999-08-24 00:56:50 +00:00
phk	e938d317d5	Decommision miscfs/specfs/specdev.h. Most of it goes into <sys/conf.h>, a few lines into <sys/vnode.h>. Add a few fields to struct specinfo, paving the way for the fun part.	1999-08-08 18:43:05 +00:00
alc	743eeab638	Add sysctl and support code to allow directories to be VMIO'd. The default setting for the sysctl is OFF, which is the historical operation. Submitted by: dillon	1999-07-26 06:25:53 +00:00
peter	ffa1a6b0d3	bufhashinit() is called with a caddr_t and is expected to return the same in both the alpha and i386 ports.	1999-07-09 16:41:19 +00:00
mckusick	aa7ff8691d	Condition in KASSERT was reversed.	1999-07-08 17:58:55 +00:00
mckusick	52ea4270f3	These changes appear to give us benefits with both small (32MB) and large (1G) memory machine configurations. I was able to run 'dbench 32' on a 32MB system without bring the machine to a grinding halt. * buffer cache hash table now dynamically allocated. This will have no effect on memory consumption for smaller systems and will help scale the buffer cache for larger systems. * minor enhancement to pmap_clearbit(). I noticed that all the calls to it used constant arguments. Making it an inline allows the constants to propogate to deeper inlines and should produce better code. * removal of inherent vfs_ioopt support through the emplacement of appropriate #ifdef's, with John's permission. If we do not find a use for it by the end of the year we will remove it entirely. * removal of getnewbufloops* counters & sysctl's - no longer necessary for debugging, getnewbuf() is now optimal. * buffer hash table functions removed from sys/buf.h and localized to vfs_bio.c * VFS_BIO_NEED_DIRTYFLUSH flag and support code added ( bwillwrite() ), allowing processes to block when too many dirty buffers are present in the system. * removal of a softdep test in bdwrite() that is no longer necessary now that bdwrite() no longer attempts to flush dirty buffers. * slight optimization added to bqrelse() - there is no reason to test for available buffer space on B_DELWRI buffers. * addition of reverse-scanning code to vfs_bio_awrite(). vfs_bio_awrite() will attempt to locate clusterable areas in both the forward and reverse direction relative to the offset of the buffer passed to it. This will probably not make much of a difference now, but I believe we will start to rely on it heavily in the future if we decide to shift some of the burden of the clustering closer to the actual I/O initiation. * Removal of the newbufcnt and lastnewbuf counters that Kirk added. They do not fix any race conditions that haven't already been fixed by the gbincore() test done after the only call to getnewbuf(). getnewbuf() is a static, so there is no chance of it being misused by other modules. ( Unless Kirk can think of a specific thing that this code fixes. I went through it very carefully and didn't see anything ). * removal of VOP_ISLOCKED() check in flushbufqueues(). I do not think this check is necessary, the buffer should flush properly whether the vnode is locked or not. ( yes? ). * removal of extra arguments passed to getnewbuf() that are not necessary. * missed cluster_wbuild() that had to be a cluster_wbuild_wb() in vfs_cluster.c * vn_write() now calls bwillwrite() PRIOR to locking the vnode, which should greatly aid flushing operations in heavy load situations - both the pageout and update daemons will be able to operate more efficiently. * removal of b_usecount. We may add it back in later but for now it is useless. Prior implementations of the buffer cache never had enough buffers for it to be useful, and current implementations which make more buffers available might not benefit relative to the amount of sophistication required to implement a b_usecount. Straight LRU should work just as well, especially when most things are VMIO backed. I expect that (even though John will not like this assumption) directories will become VMIO backed some point soon. Submitted by: Matthew Dillon <dillon@backplane.com> Reviewed by: Kirk McKusick <mckusick@mckusick.com>	1999-07-08 06:06:00 +00:00
mckusick	9d4f0d78fa	The buffer queue mechanism has been reformulated. Instead of having QUEUE_AGE, QUEUE_LRU, and QUEUE_EMPTY we instead have QUEUE_CLEAN, QUEUE_DIRTY, QUEUE_EMPTY, and QUEUE_EMPTYKVA. With this patch clean and dirty buffers have been separated. Empty buffers with KVM assignments have been separated from truely empty buffers. getnewbuf() has been rewritten and now operates in a 100% optimal fashion. That is, it is able to find precisely the right kind of buffer it needs to allocate a new buffer, defragment KVM, or to free-up an existing buffer when the buffer cache is full (which is a steady-state situation for the buffer cache). Buffer flushing has been reorganized. Previously buffers were flushed in the context of whatever process hit the conditions forcing buffer flushing to occur. This resulted in processes blocking on conditions unrelated to what they were doing. This also resulted in inappropriate VFS stacking chains due to multiple processes getting stuck trying to flush dirty buffers or due to a single process getting into a situation where it might attempt to flush buffers recursively - a situation that was only partially fixed in prior commits. We have added a new daemon called the buf_daemon which is responsible for flushing dirty buffers when the number of dirty buffers exceeds the vfs.hidirtybuffers limit. This daemon attempts to dynamically adjust the rate at which dirty buffers are flushed such that getnewbuf() calls (almost) never block. The number of nbufs and amount of buffer space is now scaled past the 8MB limit that was previously imposed for systems with over 64MB of memory, and the vfs.{lo,hi}dirtybuffers limits have been relaxed somewhat. The number of physical buffers has been increased with the intention that we will manage physical I/O differently in the future. reassignbuf previously attempted to keep the dirtyblkhd list sorted which could result in non-deterministic operation under certain conditions, such as when a large number of dirty buffers are being managed. This algorithm has been changed. reassignbuf now keeps buffers locally sorted if it can do so cheaply, and otherwise gives up and adds buffers to the head of the dirtyblkhd list. The new algorithm is deterministic but not perfect. The new algorithm greatly reduces problems that previously occured when write_behind was turned off in the system. The P_FLSINPROG proc->p_flag bit has been replaced by the more descriptive P_BUFEXHAUST bit. This bit allows processes working with filesystem buffers to use available emergency reserves. Normal processes do not set this bit and are not allowed to dig into emergency reserves. The purpose of this bit is to avoid low-memory deadlocks. A small race condition was fixed in getpbuf() in vm/vm_pager.c. Submitted by: Matthew Dillon <dillon@apollo.backplane.com> Reviewed by: Kirk McKusick <mckusick@mckusick.com>	1999-07-04 00:25:38 +00:00
peter	5f8ebc1b91	Hopefully fix the remaining glitches with the BUF_*() changes. This should (really this time) fix pageout to swap and a couple of clustering cases. This simplifies BUF_KERNPROC() so that it unconditionally reassigns the lock owner rather than testing B_ASYNC and having the caller decide when to do the reassign. At present this is required because some places use B_CALL/b_iodone to free the buffers without B_ASYNC being set. Also, vfs_cluster.c explicitly calls BUF_KERNPROC() when attaching the buffers rather than the parent walking the cluster_head tailq. Reviewed by: Kirk McKusick <mckusick@mckusick.com>	1999-06-29 05:59:47 +00:00
peter	bbe9a3f06f	Fix a bug that was almost certainly making breadn() fail. BUF_KERNPROC() was being called on the wrong bp - it should be called on the one that's just about to be fed to VOP_STRATEGY().	1999-06-28 15:32:10 +00:00
peter	f27334b347	GC the remnants of the old pre-softupdates update daemon. It's been #if 0'd for a fair while now.	1999-06-26 14:46:35 +00:00
mckusick	5b58f2f951	Convert buffer locking from using the B_BUSY and B_WANTED flags to using lockmgr locks. This commit should be functionally equivalent to the old semantics. That is, all buffer locking is done with LK_EXCLUSIVE requests. Changes to take advantage of LK_SHARED and LK_RECURSIVE will be done in future commits.	1999-06-26 02:47:16 +00:00
mckusick	269bde5a40	When allocating new buffers in getnewbuf, there are several points at which we may sleep. So, after completing our buffer allocation we must ensure that another process has not come along and allocated a different buffer with the same identity. We do this by keeping a global counter of the number of buffers that getnewbuf has allocated. We save this count when we enter getnewbuf and check it when we are about to return. If it has changed, then other buffers were allocated while we were in getnewbuf, so we must return NULL to let our parent know that it must recheck to see if it still needs the new buffer. Hopefully this fix will eliminate the creation of duplicate buffers with the same identity and the obscure corruptions that they cause.	1999-06-22 01:39:53 +00:00
mckusick	88e39a63db	Add a vnode argument to VOP_BWRITE to get rid of the last vnode operator special case. Delete special case code from vnode_if.sh, vnode_if.src, umap_vnops.c, and null_vnops.c.	1999-06-16 23:27:55 +00:00
tegge	f57c7820cd	If we still haven't got a sufficient number of free buffers after the call to flushdirtybuffers() then sleep in waitfreebuffers(). PR: 11697 Reviewed by: David Greenman, Matt Dillon	1999-06-16 03:19:04 +00:00
mckusick	02e5fe8035	Get rid of the global variable rushjob and replace it with a function in kern/vfs_subr.c named speedup_syncer() which handles the speedup request. Change the various clients of rushjob to use the new function.	1999-06-15 23:37:29 +00:00
peter	29a5087ba2	Try an fix a couple of dev_t/major/minor etc nits.	1999-05-12 22:30:50 +00:00
phk	f57a01ebfc	remove b_proc from struct buf, it's (now) unused. Reviewed by: dillon, bde	1999-05-06 20:00:34 +00:00
phk	45443ee5d4	Remove unused fields from struct buf: b_savekva b_validoff b_validend Reviewed by: dillon, bde	1999-05-06 17:06:41 +00:00
alc	5cb08a2652	The VFS/BIO subsystem contained a number of hacks in order to optimize piecemeal, middle-of-file writes for NFS. These hacks have caused no end of trouble, especially when combined with mmap(). I've removed them. Instead, NFS will issue a read-before-write to fully instantiate the struct buf containing the write. NFS does, however, optimize piecemeal appends to files. For most common file operations, you will not notice the difference. The sole remaining fragment in the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache coherency issues with read-merge-write style operations. NFS also optimizes the write-covers-entire-buffer case by avoiding the read-before-write. There is quite a bit of room for further optimization in these areas. The VM system marks pages fully-valid (AKA vm_page_t->valid = VM_PAGE_BITS_ALL) in several places, most noteably in vm_fault. This is not correct operation. The vm_pager_get_pages() code is now responsible for marking VM pages all-valid. A number of VM helper routines have been added to aid in zeroing-out the invalid portions of a VM page prior to the page being marked all-valid. This operation is necessary to properly support mmap(). The zeroing occurs most often when dealing with file-EOF situations. Several bugs have been fixed in the NFS subsystem, including bits handling file and directory EOF situations and buf->b_flags consistancy issues relating to clearing B_ERROR & B_INVAL, and handling B_DONE. getblk() and allocbuf() have been rewritten. B_CACHE operation is now formally defined in comments and more straightforward in implementation. B_CACHE for VMIO buffers is based on the validity of the backing store. B_CACHE for non-VMIO buffers is based simply on whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear, and vise-versa). biodone() is now responsible for setting B_CACHE when a successful read completes. B_CACHE is also set when a bdwrite() is initiated and when a bwrite() is initiated. VFS VOP_BWRITE routines (there are only two - nfs_bwrite() and bwrite()) are now expected to set B_CACHE. This means that bowrite() and bawrite() also set B_CACHE indirectly. There are a number of places in the code which were previously using buf->b_bufsize (which is DEV_BSIZE aligned) when they should have been using buf->b_bcount. These have been fixed. getblk() now clears B_DONE on return because the rest of the system is so bad about dealing with B_DONE. Major fixes to NFS/TCP have been made. A server-side bug could cause requests to be lost by the server due to nfs_realign() overwriting other rpc's in the same TCP mbuf chain. The server's kernel must be recompiled to get the benefit of the fixes. Submitted by: Matthew Dillon <dillon@apollo.backplane.com>	1999-05-02 23:57:16 +00:00
alc	9f01d1b1e7	Address a performance problem in getnewbuf: In heavy-writing situations, QUEUE_LRU can contain a large number of DELWRI buffers at its head. These buffers must be moved to the tail if they cannot be written async in order to reduce the scanning time required to skip past these buffers in later getnewbuf() calls. Submitted by: Matthew Dillon <dillon@apollo.backplane.com>	1999-04-29 18:15:25 +00:00
dt	490abdf54d	getnewbuf(): check return value from tsleep(). Interruptible NFS may pass PCATCH to slpflag.	1999-04-14 18:51:52 +00:00
alc	7c9bf17e25	Fix a performance problem with the new getnewbuf() code: in an outofspace condition ( bufspace > hibufspace ), an inappropriate scan of the empty queue was performed looking for buffer space to free up. Submitted by: Matthew Dillon <dillon@apollo.backplane.com>	1999-04-07 02:41:54 +00:00
julian	0ed09d2ad5	Catch a case spotted by Tor where files mmapped could leave garbage in the unallocated parts of the last page when the file ended on a frag but not a page boundary. Delimitted by tags PRE_MATT_MMAP_EOF and POST_MATT_MMAP_EOF, in files alpha/alpha/pmap.c i386/i386/pmap.c nfs/nfs_bio.c vm/pmap.h vm/vm_page.c vm/vm_page.h vm/vnode_pager.c miscfs/specfs/spec_vnops.c ufs/ufs/ufs_readwrite.c kern/vfs_bio.c Submitted by: Matt Dillon <dillon@freebsd.org> Reviewed by: Alan Cox <alc@freebsd.org>	1999-04-05 19:38:30 +00:00
bde	9dfb67dcc5	Fixed a serious bug in rev.1.202. getnewbuf() sometimes didn't initialise bp->b_data. This tended to cause panics for file systems whose block size is smaller than one page.	1999-03-19 10:17:44 +00:00
julian	f27c95753f	Reviewed by: Many at differnt times in differnt parts, including alan, john, me, luoqi, and kirk Submitted by: Matt Dillon <dillon@frebsd.org> This change implements a relatively sophisticated fix to getnewbuf(). There were two problems with getnewbuf(). First, the writerecursion can lead to a system stack overflow when you have NFS and/or VN devices in the system. Second, the free/dirty buffer accounting was completely broken. Not only did the nfs routines blow it trying to manually account for the buffer state, but the accounting that was done did not work well with the purpose of their existance: figuring out when getnewbuf() needs to sleep. The meat of the change is to kern/vfs_bio.c. The remaining diffs are all minor except for NFS, which includes both the fixes for bp interaction AND fixes for a 'biodone(): buffer already done' lockup. Sys/buf.h also contains a chaining structure which is not used by this patchset but is used by other patches that are coming soon. This patch deliniated by tags PRE_MAT_GETBUF and POST_MAT_GETBUF. (sorry for the missing T matt)	1999-03-12 02:24:58 +00:00
julian	5014b30868	Make comment match code.	1999-03-02 21:23:38 +00:00
julian	57bcf97676	Remove inapropriate use of VOP_ISLOCKED() This produced races resulting in panics and filesystem corruptions under some circumstances. Reviewed by: luoqi chen <luoqi@freebsd.org> Reviewed by: Kirk McKusick <mckusick@mckusick.com> Submitted by: Matt Dillon <dillon@freebsd.org>	1999-03-02 20:26:39 +00:00
dillon	a40e0249d4	Fix warnings in preparation for adding -Wall -Wcast-qual to the kernel compile	1999-01-27 21:50:00 +00:00
dillon	6db82357ca	Don't try to calculate B_CACHE for an NFS related bp that has a > 0 b_validend. This will screw up small-writes, causing lots of little writes out the network. We will assume that NFS handles B_CACHE properly.	1999-01-24 00:51:11 +00:00
dillon	fcfddd9dee	Fix an expression parenthesization typo in a conditional. It should not have any operational effects other then to make the code in question a little faster. Also added a more involved comment.	1999-01-23 06:36:15 +00:00
dg	e4e8fec98a	Don't throw away the buffer contents on a fatal write error; just mark the buffer as still being dirty. This isn't a perfect solution, but throwing away the buffer contents will often result in filesystem corruption and this solution will at least correctly deal with transient errors. Submitted by: Kirk McKusick <mckusick@mckusick.com>	1999-01-22 08:59:05 +00:00
dillon	6bd52e6f4b	The main operational changes are in getblk()'s handling of the B_DELWRI and B_CACHE flags, fixing a bug that showed up with NFS. Also, a number of cases where manually inserted code has been removed and replaced with an inline function call giving us better functional isolation in the source.	1999-01-21 09:19:33 +00:00
dillon	df24433bbe	This is a rather large commit that encompasses the new swapper, changes to the VM system to support the new swapper, VM bug fixes, several VM optimizations, and some additional revamping of the VM code. The specific bug fixes will be documented with additional forced commits. This commit is somewhat rough in regards to code cleanup issues. Reviewed by: "John S. Dyson" <root@dyson.iquest.net>, "David Greenman" <dg@root.com>	1999-01-21 08:29:12 +00:00
dillon	475c80e993	Obtained from: Luoqi Fix NFS file corruption problem introduced in 1.188. The valid range was not being set properly, causing a later reference to the buffer to clear the B_CACHE bit.	1999-01-19 08:00:51 +00:00
eivind	b00b77dac7	Silence warnings.	1999-01-12 11:59:34 +00:00
eivind	89e1199534	KNFize, by bde.	1999-01-10 01:58:29 +00:00
eivind	a8dc66f457	Split DIAGNOSTIC -> DIAGNOSTIC, INVARIANTS, and INVARIANT_SUPPORT as discussed on -hackers. Introduce 'KASSERT(assertion, ("panic message", args))' for simple check + panic. Reviewed by: msmith	1999-01-08 17:31:30 +00:00
dillon	56451ce788	Adjust some comments to prevent future confusion on the implementation. Also add a reference to the buf(9) manual page.	1998-12-22 18:57:30 +00:00
luoqi	fa77afb3a9	Correctly handle misaligned VMIO buffer (whose start or end offset in the VM object are not page aligned). This should fix the mount_msdos panic after a failed attemp to mount as ffs. Reviewed By: Matthew Dillon <dillon@apollo.backplane.com> Archie Cobbs <archie@whistle.com> Dmitrij Tejblum <dima@tejblum.dnttm.rssi.ru>	1998-12-22 14:43:58 +00:00
dillon	f93fd49220	fix intermediate overflow in 'quad = int * int' situation by casting the arguments to the multiply to a quad equivalent. In this case, vm_ooffset_t. Reviewed by: Archie Cobbs <archie@whistle.com>	1998-12-14 21:17:37 +00:00
eivind	2d81fe5347	Fix grouping of statements. This remove a potential panic in the soft updates code. While I'm here, remove an unintended trigraph. Reviewed by: Kirk McKusick <kirk@freebsd.org>	1998-12-07 17:23:45 +00:00
dg	939b432d02	Closed a very narrow and rare race condition that involved net interrupts, bio interrupts, and a truncated file that along with the precise alignment of the planets could result in a page being freed multiple times or a just-freed page being put onto the inactive queue.	1998-11-18 09:00:47 +00:00
peter	8ef35acf90	Use TAILQ macros for clean/dirty block list processing. Set b_xflags rather than abusing the list next pointer with a magic number.	1998-10-31 15:31:29 +00:00
dg	8de37805c0	Unwire everything to the inactive queue in order to preserve LRU ordering.	1998-10-30 14:53:54 +00:00
dg	dce7995d53	Fixed editing error. Pointed out by bde.	1998-10-29 11:04:22 +00:00
dg	20b2c33d9a	Added a second argument, "activate" to the vm_page_unwire() call so that the caller can select either inactive or active queue to put the page on.	1998-10-28 13:37:02 +00:00
phk	13c66194f4	Nitpicking and dusting performed on a train. Removes trivial warnings about unused variables, labels and other lint.	1998-10-25 17:44:59 +00:00
dg	3defb6d13f	Fixed two potentially serious classes of bugs: 1) The vnode pager wasn't properly tracking the file size due to "size" being page rounded in some cases and not in others. This sometimes resulted in corrupted files. First noticed by Terry Lambert. Fixed by changing the "size" pager_alloc parameter to be a 64bit byte value (as opposed to a 32bit page index) and changing the pagers and their callers to deal with this properly. 2) Fixed a bogus type cast in round_page() and trunc_page() that caused some 64bit offsets and sizes to be scrambled. Removing the cast required adding casts at a few dozen callers. There may be problems with other bogus casts in close-by macros. A quick check seemed to indicate that those were okay, however.	1998-10-13 08:24:45 +00:00
dillon	9cb29011f9	PR: kern/7418 Reviewed by: Luoqi Chen <luoqi@watermarkgroup.com> Fixed problem where write()s can get lost due to buffers flagged B_DELWRI being improperly released in brelse().	1998-09-26 00:12:35 +00:00
peter	a2aeaf4564	Goodbye BOUNCE_BUFFERS, for a hack it has served us well. The last consumer of this code (the old SCSI system) has left us and the CAM code does it's own bouncing. The isa dma system has been doing it's own bouncing for a while too. Reviewed by: core	1998-09-25 17:34:49 +00:00
gibbs	0a7eb834b7	kern_clock.c: Remove old disk statistics variables. vfs_bio.c: Enable bowrite now that B_ORDERED works for all buffer devices.	1998-09-15 10:05:18 +00:00
phk	4630814c8b	Add a new vnode op, VOP_FREEBLKS(), which filesystems can use to inform device drivers about sectors no longer in use. Device-drivers receive the call through d_strategy, if they have D_CANFREE in d_flags. This allows flash based devices to erase the sectors and avoid pointlessly carrying them around in compactions. Reviewed by: Kirk Mckusick, bde Sponsored by: M-Systems (www.m-sys.com)	1998-09-05 14:13:12 +00:00
dfr	e2df972eb1	Cosmetic changes to the PAGE_XXX macros to make them consistent with the other objects in vm.	1998-09-04 08:06:57 +00:00
luoqi	71cc3d284f	Close a race window for getnewbuf() between shared lock holders of the vnode. Reviewed by: Mike Smith	1998-08-28 20:07:13 +00:00
phk	9279ee9f4d	Fix DDBs printing of buf-flags after I changed them yesterday.	1998-08-25 14:41:42 +00:00
phk	84bd0e1571	Remove the last remaining evidence of B_TAPE. Reclaim 3 unused bits in b_flags	1998-08-24 17:47:25 +00:00
dfr	5fdaeb281d	Change various syscalls to use size_t arguments instead of u_int. Add some overflow checks to read/write (from bde). Change all modifications to vm_page::flags, vm_page::busy, vm_object::flags and vm_object::paging_in_progress to use operations which are not interruptable. Reviewed by: Bruce Evans <bde@zeta.org.au>	1998-08-24 08:39:39 +00:00
dfr	cb85cf3e66	Protect all modifications to v_numoutput with splbio().	1998-08-13 08:09:08 +00:00
dfr	0864bef679	Protect all modifications to paging_in_progress with splvm(). The i386 managed to avoid corruption of this variable by luck (the compiler used a memory read-modify-write instruction which wasn't interruptable) but other architectures cannot. With this change, I am now able to 'make buildworld' on the alpha (sfx: the crowd goes wild...)	1998-08-06 08:33:19 +00:00
bde	1805fa7aa8	Fixed printf format errors.	1998-07-13 07:05:55 +00:00
julian	78a155ddaf	Catch a few corner cases where FreeBSD differs enough from BSD 4.4 to confuse Soft updates.. Should solve several "dangling deps" panics.	1998-07-08 01:04:33 +00:00
julian	4363221ba2	VOP_STRATEGY grows an (struct vnode *) argument as the value in b_vp is often not really what you want. (and needs to be frobbed). more cleanups will follow this. Reviewed by: Bruce Evans <bde@freebsd.org>	1998-07-04 20:45:42 +00:00
peter	dfa49715ca	vm_page_is_valid() wasn't expecting a large offset argument, it's expecting a sub-page offset. We were passing the file position, and vm_page_bits() could do some interesting things when base was larger PAGE_SIZE. if (size > PAGE_SIZE - base) size = PAGE_SIZE - base; is interesting when (PAGE_SIZE - base) is negative. I could imagine that this could have interesting consequences for memory page -> device block bit validation.	1998-05-01 15:10:59 +00:00
peter	e5ab5108e0	Fix one problem with NFSv3 > 2GB file support. Submitted by: bde	1998-05-01 15:04:35 +00:00
des	396b114475	Seventy-odd "its" / "it's" typos in comments fixed as per kern/6108.	1998-04-17 22:37:19 +00:00
bde	b598f559b2	Support compiling with `gcc -ansi'.	1998-04-15 17:47:40 +00:00
dyson	1eaa978a47	Correct a problem where buffers might not be zeroed when needed. The B_MALLOC buffers might not have been properly zeroed.	1998-03-27 06:48:24 +00:00
dyson	aad85d5a04	In kern_physio.c fix tsleep priority messup. In vfs_bio.c, remove b_generation count usage, remove redundant reassignbuf, remove redundant spl(s), manage page PG_ZERO flags more correctly, utilize in invalid value for b_offset until it is properly initialized. Add asserts for #ifdef DIAGNOSTIC, when b_offset is improperly used. when a process is not performing I/O, and just waiting on a buffer generally, make the sleep priority low. only check page validity in getblk for B_VMIO buffers. In vfs_cluster, add b_offset asserts, correct pointer calculation for clustered reads. Improve readability of certain parts of the code. Remove redundant spl(s). In vfs_subr, correct usage of vfs_bio_awrite (From Andrew Gallatin <gallatin@cs.duke.edu>). More vtruncbuf problems fixed.	1998-03-19 22:48:16 +00:00
dyson	9470d0a235	Correct a problem where data OR metadata could be thrown away if a buffer is grown.	1998-03-17 17:36:05 +00:00
kato	7fa5494b44	Deleted PC-98 code because (1) machine dependent code should not be in here, and (2) the flag used in PC-98 code has been assigned to another purpose.	1998-03-17 08:41:28 +00:00
dyson	6e92f5716b	Some VM improvements, including elimination of alot of Sig-11 problems. Tor Egge and others have helped with various VM bugs lately, but don't blame him -- blame me!!! pmap.c: 1) Create an object for kernel page table allocations. This fixes a bogus allocation method previously used for such, by grabbing pages from the kernel object, using bogus pindexes. (This was a code cleanup, and perhaps a minor system stability issue.) pmap.c: 2) Pre-set the modify and accessed bits when prudent. This will decrease bus traffic under certain circumstances. vfs_bio.c, vfs_cluster.c: 3) Rather than calculating the beginning virtual byte offset multiple times, stick the offset into the buffer header, so that the calculated offset can be reused. (Long long multiplies are often expensive, and this is a probably unmeasurable performance improvement, and code cleanup.) vfs_bio.c: 4) Handle write recursion more intelligently (but not perfectly) so that it is less likely to cause a system panic, and is also much more robust. vfs_bio.c: 5) getblk incorrectly wrote out blocks that are incorrectly sized. The problem is fixed, and writes blocks out ONLY when B_DELWRI is true. vfs_bio.c: 6) Check that already constituted buffers have fully valid pages. If not, then make sure that the B_CACHE bit is not set. (This was a major source of Sig-11 type problems.) vfs_bio.c: 7) Fix a potential system deadlock due to an incorrectly specified sleep priority while waiting for a buffer write operation. The change that I made opens the system up to serious problems, and we need to examine the issue of process sleep priorities. vfs_cluster.c, vfs_bio.c: 8) Make clustered reads work more correctly (and more completely) when buffers are already constituted, but not fully valid. (This was another system reliability issue.) vfs_subr.c, ffs_inode.c: 9) Create a vtruncbuf function, which is used by filesystems that can truncate files. The vinvalbuf forced a file sync type operation, while vtruncbuf only invalidates the buffers past the new end of file, and also invalidates the appropriate pages. (This was a system reliabiliy and performance issue.) 10) Modify FFS to use vtruncbuf. vm_object.c: 11) Make the object rundown mechanism for OBJT_VNODE type objects work more correctly. Included in that fix, create pager entries for the OBJT_DEAD pager type, so that paging requests that might slip in during race conditions are properly handled. (This was a system reliability issue.) vm_page.c: 12) Make some of the page validation routines be a little less picky about arguments passed to them. Also, support page invalidation change the object generation count so that we handle generation counts a little more robustly. vm_pageout.c: 13) Further reduce pageout daemon activity when the system doesn't need help from it. There should be no additional performance decrease even when the pageout daemon is running. (This was a significant performance issue.) vnode_pager.c: 14) Teach the vnode pager to handle race conditions during vnode deallocations.	1998-03-16 01:56:03 +00:00
julian	10c5ccc30a	Reviewed by: dyson@freebsd.org (john Dyson), dg@root.com (david greenman) Submitted by: Kirk McKusick (mcKusick@mckusick.com) Obtained from: WHistle development tree	1998-03-08 09:59:44 +00:00
dyson	8ceb6160f4	This mega-commit is meant to fix numerous interrelated problems. There has been some bitrot and incorrect assumptions in the vfs_bio code. These problems have manifest themselves worse on NFS type filesystems, but can still affect local filesystems under certain circumstances. Most of the problems have involved mmap consistancy, and as a side-effect broke the vfs.ioopt code. This code might have been committed seperately, but almost everything is interrelated. 1) Allow (pmap_object_init_pt) prefaulting of buffer-busy pages that are fully valid. 2) Rather than deactivating erroneously read initial (header) pages in kern_exec, we now free them. 3) Fix the rundown of non-VMIO buffers that are in an inconsistent (missing vp) state. 4) Fix the disassociation of pages from buffers in brelse. The previous code had rotted and was faulty in a couple of important circumstances. 5) Remove a gratuitious buffer wakeup in vfs_vmio_release. 6) Remove a crufty and currently unused cluster mechanism for VBLK files in vfs_bio_awrite. When the code is functional, I'll add back a cleaner version. 7) The page busy count wakeups assocated with the buffer cache usage were incorrectly cleaned up in a previous commit by me. Revert to the original, correct version, but with a cleaner implementation. 8) The cluster read code now tries to keep data associated with buffers more aggressively (without breaking the heuristics) when it is presumed that the read data (buffers) will be soon needed. 9) Change to filesystem lockmgr locks so that they use LK_NOPAUSE. The delay loop waiting is not useful for filesystem locks, due to the length of the time intervals. 10) Correct and clean-up spec_getpages. 11) Implement a fully functional nfs_getpages, nfs_putpages. 12) Fix nfs_write so that modifications are coherent with the NFS data on the server disk (at least as well as NFS seems to allow.) 13) Properly support MS_INVALIDATE on NFS. 14) Properly pass down MS_INVALIDATE to lower levels of the VM code from vm_map_clean. 15) Better support the notion of pages being busy but valid, so that fewer in-transit waits occur. (use p->busy more for pageouts instead of PG_BUSY.) Since the page is fully valid, it is still usable for reads. 16) It is possible (in error) for cached pages to be busy. Make the page allocation code handle that case correctly. (It should probably be a printf or panic, but I want the system to handle coding errors robustly. I'll probably add a printf.) 17) Correct the design and usage of vm_page_sleep. It didn't handle consistancy problems very well, so make the design a little less lofty. After vm_page_sleep, if it ever blocked, it is still important to relookup the page (if the object generation count changed), and verify it's status (always.) 18) In vm_pageout.c, vm_pageout_clean had rotted, so clean that up. 19) Push the page busy for writes and VM_PROT_READ into vm_pageout_flush. 20) Fix vm_pager_put_pages and it's descendents to support an int flag instead of a boolean, so that we can pass down the invalidate bit.	1998-03-07 21:37:31 +00:00
dyson	1ae42d49a8	Fix a rounding error for the NFS buffer validend. Submitted by: John W. De Boskey <jwd@unx.sas.com>	1998-03-04 03:17:30 +00:00
dyson	69e5a1e9f5	1) Use a more consistent page wait methodology. 2) Do not unnecessarily force page blocking when paging pages out. 3) Further improve swap pager performance and correctness, including fixing the paging in progress deadlock (except in severe I/O error conditions.) 4) Enable vfs_ioopt=1 as a default. 5) Fix and enable the page prezeroing in SMP mode. All in all, SMP systems especially should show a significant improvement in "snappyness."	1998-03-01 04:18:54 +00:00
dg	fa872f3d14	Fix a && that should be an &. Reviewed by: "John S. Dyson" <dyson@FreeBSD.ORG> Submitted by: jwd@unx.sas.com (John W. DeBoskey)	1998-02-11 20:06:48 +00:00
eivind	d7a6ab2803	Staticize.	1998-02-09 06:11:36 +00:00
eivind	4547a09753	Back out DIAGNOSTIC changes.	1998-02-06 12:14:30 +00:00
eivind	c552a9a1c3	Turn DIAGNOSTIC into a new-style option.	1998-02-04 22:34:03 +00:00
dyson	2aacd1ab4f	Change the busy page mgmt, so that when pages are freed, they MUST be PG_BUSY. It is bogus to free a page that isn't busy, because it is in a state of being "unavailable" when being freed. The additional advantage is that the page_remove code has a better cross-check that the page should be busy and unavailable for other use. There were some minor problems with the collapse code, and this plugs those subtile "holes." Also, the vfs_bio code wasn't checking correctly for PG_BUSY pages. I am going to develop a more consistant scheme for grabbing pages, busy or otherwise. For now, we are stuck with the current morass.	1998-01-31 11:56:53 +00:00
dyson	548a436486	Various NFS fixes: Make vfs_bio buffer mgmt work better. Buffers were being used after brelse. Make nfs_getpages work independently of other NFS interfaces. This eliminates some difficult recursion problems and decreases pagefault overhead. Remove an erroneous vfs_unbusy_pages. Fix a reentrancy problem, with nfs_vinvalbuf when vnode is already being rundown. Reassignbuf wasn't being called when needed under certain circumstances. (Thanks to Bill Paul for help.)	1998-01-25 06:24:09 +00:00
dyson	8726294764	Add better support for larger I/O clusters, including larger physical I/O. The support is not mature yet, and some of the underlying implementation needs help. However, support does exist for IDE devices now.	1998-01-24 02:01:46 +00:00
dyson	197bd655c4	VM level code cleanups. 1) Start using TSM. Struct procs continue to point to upages structure, after being freed. Struct vmspace continues to point to pte object and kva space for kstack. u_map is now superfluous. 2) vm_map's don't need to be reference counted. They always exist either in the kernel or in a vmspace. The vmspaces are managed by reference counts. 3) Remove the "wired" vm_map nonsense. 4) No need to keep a cache of kernel stack kva's. 5) Get rid of strange looking ++var, and change to var++. 6) Change more data structures to use our "zone" allocator. Added struct proc, struct vmspace and struct vnode. This saves a significant amount of kva space and physical memory. Additionally, this enables TSM for the zone managed memory. 7) Keep ioopt disabled for now. 8) Remove the now bogus "single use" map concept. 9) Use generation counts or id's for data structures residing in TSM, where it allows us to avoid unneeded restart overhead during traversals, where blocking might occur. 10) Account better for memory deficits, so the pageout daemon will be able to make enough memory available (experimental.) 11) Fix some vnode locking problems. (From Tor, I think.) 12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp. (experimental.) 13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c code. Use generation counts, get rid of unneded collpase operations, and clean up the cluster code. 14) Make vm_zone more suitable for TSM. This commit is partially as a result of discussions and contributions from other people, including DG, Tor Egge, PHK, and probably others that I have forgotten to attribute (so let me know, if I forgot.) This is not the infamous, final cleanup of the vnode stuff, but a necessary step. Vnode mgmt should be correct, but things might still change, and there is still some missing stuff (like ioopt, and physical backing of non-merged cache files, debugging of layering concepts.)	1998-01-22 17:30:44 +00:00
dyson	b130b30c96	Tie up some loose ends in vnode/object management. Remove an unneeded config option in pmap. Fix a problem with faulting in pages. Clean-up some loose ends in swap pager memory management. The system should be much more stable, but all subtile bugs aren't fixed yet.	1998-01-17 09:17:02 +00:00
dyson	d9d8bf6d30	Fix some vnode management problems, and better mgmt of vnode free list. Fix the UIO optimization code. Fix an assumption in vm_map_insert regarding allocation of swap pagers. Fix an spl problem in the collapse handling in vm_object_deallocate. When pages are freed from vnode objects, and the criteria for putting the associated vnode onto the free list is reached, either put the vnode onto the list, or put it onto an interrupt safe version of the list, for further transfer onto the actual free list. Some minor syntax changes changing pre-decs, pre-incs to post versions. Remove a bogus timeout (that I added for debugging) from vn_lock. PHK will likely still have problems with the vnode list management, and so do I, but it is better than it was.	1998-01-12 01:46:33 +00:00
dyson	cb2800cd94	Make our v_usecount vnode reference count work identically to the original BSD code. The association between the vnode and the vm_object no longer includes reference counts. The major difference is that vm_object's are no longer freed gratuitiously from the vnode, and so once an object is created for the vnode, it will last as long as the vnode does. When a vnode object reference count is incremented, then the underlying vnode reference count is incremented also. The two "objects" are now more intimately related, and so the interactions are now much less complex. When vnodes are now normally placed onto the free queue with an object still attached. The rundown of the object happens at vnode rundown time, and happens with exactly the same filesystem semantics of the original VFS code. There is absolutely no need for vnode_pager_uncache and other travesties like that anymore. A side-effect of these changes is that SMP locking should be much simpler, the I/O copyin/copyout optimizations work, NFS should be more ponderable, and further work on layered filesystems should be less frustrating, because of the totally coherent management of the vnode objects and vnodes. Please be careful with your system while running this code, but I would greatly appreciate feedback as soon a reasonably possible.	1998-01-06 05:26:17 +00:00
dyson	b6fbb09bc1	Improve my copyright.	1997-12-22 11:54:00 +00:00
dyson	331974b30d	Slight performance improvement, removal of unneeded SPLs.	1997-12-07 04:06:41 +00:00
bde	efd51d84cf	Don't include <sys/lock.h> in headers when only `struct simplelock' is required. Fixed everything that depended on the pollution.	1997-12-05 19:55:52 +00:00
phk	a1bfb618d9	In all such uses of struct buf: 's/b_un.b_addr/b_data/g'	1997-12-02 21:07:20 +00:00
dyson	ded4973e92	Fix a serious problem during resizing buffers where old buffers address space wasn't being properly reclaimed. Submitted by: Bruce Evans <bde@freebsd.org>	1997-12-01 19:04:00 +00:00
dyson	0c60b939c7	Avoid manipulating the buffer map at interrupt time by deferring bfreekva to getnewbuf, and remove from brelse. Reviewed by: dg@root.com	1997-11-24 06:18:27 +00:00
phk	4d26888936	Remove a bunch of variables which were unused both in GENERIC and LINT. Found by: -Wunused	1997-11-07 08:53:44 +00:00
phk	4c8218a5c7	Move the "retval" (3rd) parameter from all syscall functions and put it in struct proc instead. This fixes a boatload of compiler warning, and removes a lot of cruft from the sources. I have not removed the /ARGSUSED/, they will require some looking at. libkvm, ps and other userland struct proc frobbing programs will need recompiled.	1997-11-06 19:29:57 +00:00
bde	fb826377ff	Removed unused #includes.	1997-10-28 15:59:26 +00:00
phk	e3cdaf12b2	VFS interior redecoration. Rename vn_default_error to vop_defaultop all over the place. Move vn_bwrite from vfs_bio.c to vfs_default.c and call it vop_stdbwrite. Use vop_null instead of nullop. Move vop_nopoll from vfs_subr.c to vfs_default.c Move vop_sharedlock from vfs_subr.c to vfs_default.c Move vop_nolock from vfs_subr.c to vfs_default.c Move vop_nounlock from vfs_subr.c to vfs_default.c Move vop_noislocked from vfs_subr.c to vfs_default.c Use vop_ebadf instead of *_ebadf. Add vop_defaultop for getpages on master vnode in MFS.	1997-10-26 20:55:39 +00:00
phk	36e7a51ea1	Last major round (Unless Bruce thinks of somthing :-) of malloc changes. Distribute all but the most fundamental malloc types. This time I also remembered the trick to making things static: Put "static" in front of them. A couple of finer points by: bde	1997-10-12 20:26:33 +00:00
phk	645e7b2ab6	Distribute and statizice a lot of the malloc M_* types. Substantial input from: bde	1997-10-11 18:31:40 +00:00
gibbs	52ace446d2	init_main.c subr_autoconf.c: Add support for "interrupt driven configuration hooks". A component of the kernel can register a hook, most likely during auto-configuration, and receive a callback once interrupt services are available. This callback will occur before the root and dump devices are configured, so the configuration task can affect the selection of those two devices or complete any tasks that need to be performed prior to launching init. System boot is posponed so long as a hook is registered. The hook owner is responsible for removing the hook once their task is complete or the system boot can continue. kern_acct.c kern_clock.c kern_exit.c kern_synch.c kern_time.c: Change the interface and implementation for the kernel callout service. The new implemntaion is based on the work of Adam M. Costello and George Varghese, published in a technical report entitled "Redesigning the BSD Callout and Timer Facilities". The interface used in FreeBSD is a little different than the one outlined in the paper. The new function prototypes are: struct callout_handle timeout(void (func)(void ), void arg, int ticks); void untimeout(void (func)(void ), void arg, struct callout_handle handle); If a client wishes to remove a timeout, it must store the callout_handle returned by timeout and pass it to untimeout. The new implementation gives 0(1) insert and removal of callouts making this interface scale well even for applications that keep 100s of callouts outstanding. See the updated timeout.9 man page for more details.	1997-09-21 22:00:25 +00:00
dyson	fe4d489758	Re-institute a bugfix in allocation of anonymous buffer memory.	1997-09-21 04:49:30 +00:00
phk	1a50603a24	The patch is needed in order to not throw away unmodified local filesystem metadata at the first brelse call when the block device vnode has v_tag set to VT_NFS. Reviewed by: phk Submitted by: Tor Egge <tegge@idi.ntnu.no>	1997-09-10 20:09:22 +00:00
bde	be65adf888	Some staticized variables were still declared to be extern.	1997-09-07 16:56:34 +00:00
dyson	b90433b1a9	Back out some incorrect changes that was worse than the original bug.	1997-08-26 04:36:27 +00:00
dyson	02d84824b3	Some corrections to the anonymous page managment. Submitted by: Peter Chen <pmchen@eecs.umich.edu>	1997-08-21 01:35:37 +00:00
dyson	c38957d22b	Modify the scheduling policy to take into account disk I/O waits as chargeable CPU usage. This should mitigate the problem of processes doing disk I/O hogging the CPU. Various users have reported the problem, and test code shows that the problem should now be gone.	1997-08-09 10:13:32 +00:00
dyson	eb8e1f5e4e	Fix a problem with the VN device. Specifically, the VN device can cause a problem of spiraling death due to buffer resource limitations. The vfs_bio code in general had little ability to handle buffer resource management, and now it does. Also, there are a lot more knobs for tuning the vfs_bio code now. The knobs came free because of the need that there always be some immediately available buffers (non-delayed or locked) for use. Note that the buffer cache code is much less likely to get bogged down with lots of delayed writes, even more so than before.	1997-06-15 17:56:53 +00:00
bde	09d13c3c83	Fixed livelock in getnewbuf(). It is possible for multiple process to sleep concurrently waiting for a buffer. When the buffer shortage is a shortage of space but not a shortage of buffer headers, the processes took turns creating empty buffers and waking each other to advertise the brelse() of the empties; progress was never made because tsleep() always found another high-priority process to run and everything was done at splbio(), so vfs_update never had a chance to flush delayed writes, not to mention that i/o never had a chance to complete. The problem seems to be rare in practice, but it can easily be reproduced by misusing block devices, at least for sufficently slow devices on machines with a sufficiently small buffer cache. E.g., `tar cvf /dev/fd0 /kernel' on an 8MB system with no disk in fd0 causes the problem quickly; the same command with a disk in fd0 causes the problem not quite as quickly; and people have reported problems newfs'ing file systems on block devices. Block devices only cause this problem indirectly. They are pessimized for time and space, and the space pessimization causes the shortage (it manifests as internal fragmentation in buffer_map). This should be fixed in 2.2.	1997-06-13 08:30:40 +00:00
dfr	041a050592	Don't throw NFS B_DELWRI buffers back to the vm system in brelse. Make sure that b_validoff..b_validend is at least as big as b_dirtyoff..b_dirtyend.	1997-06-06 09:04:28 +00:00
dfr	77f763b0e4	Fix some performance problems with the NFS mmap fixes.	1997-06-03 09:42:43 +00:00
dfr	654c037a9c	The previous fix didn't work properly for small block size filesystems, which caused very slow file access for cd9660 and some ext2fs filesystems. Reviewed by: bde	1997-05-30 22:25:35 +00:00
dfr	d7e320b30e	Fix a few bugs with NFS and mmap caused by NFS' use of b_validoff and b_validend. The changes to vfs_bio.c are a bit ugly but hopefully can be tidied up later by a slight redesign. PR: kern/2573, kern/2754, kern/3046 (possibly) Reviewed by: dyson	1997-05-19 14:36:56 +00:00
joerg	f84e48e335	Add a DDB command `show buffer', to display a struct buf. It's impossible to display everything, so i've chosen a small subset. Add more to this as you think seems useful.	1997-05-10 09:09:42 +00:00
dyson	786ce6b2bc	Improve the buffer cache memory policy by moving pages over to the cache queue more often. The pageout daemon had to be waken up more often than necessary since pages were not put on the cache queue, when they should have been. Submitted by: David Greenman <dg@freebsd.org>	1997-04-13 03:33:25 +00:00
bde	278256e73a	Removed potentially harmful garbage <vm/lock.h> and fixed bogus use of it. It was actually harmless because the use was null due to fortuitous include orders and identical (wrong) idempotency macros.	1997-04-01 08:39:07 +00:00
peter	94b6d72794	Back out part 1 of the MCFH that changed $Id$ to $FreeBSD$. We are not ready for it yet.	1997-02-22 09:48:43 +00:00
bde	dd18dffcc8	Removed redundant spl0()'s from kernel processes. They were work-arounds for a bug in fork().	1997-01-15 19:05:08 +00:00
jkh	808a36ef65	Make the long-awaited change from $Id$ to $FreeBSD$ This will make a number of things easier in the future, as well as (finally!) avoiding the Id-smashing problem which has plagued developers for so long. Boy, I'm glad we're not using sup anymore. This update would have been insane otherwise.	1997-01-14 07:20:47 +00:00
dyson	e198c2084e	Clean-up of the new buffer kva allocation code. Also, there was an error in the !BOUNCE_BUFFERS case.	1996-12-05 04:28:52 +00:00
dyson	e30058b4a4	Fix a problem with the new buffer_map management code. Additionally, decrease the size of buffer_map to approx 2/3 of what it used to be (buffer_map can be smaller now.) The original commit of these changes increased the size of buffer_map to the point where the system would not boot on large systems -- now large systems with large caches will have even less problems than before.	1996-12-01 15:46:40 +00:00
dyson	7a58275f33	Implement a new totally dynamic (up to MAXPHYS) buffer kva allocation scheme. Additionally, add the capability for checking for unexpected kernel page faults. The maximum amount of kva space for buffers hasn't been decreased from where it is, but it will now be possible to do so. This scheme manages the kva space similar to the buffers themselves. If there isn't enough kva space because of usage or fragementation, buffers will be reclaimed until a buffer allocation is successful. This scheme should be very resistant to fragmentation problems until/if the LFS code is fixed and uses the bogus buffer locking scheme -- but a 'fixed' LFS is not likely to use such a scheme. Now there should be NO problem allocating buffers up to MAXPHYS.	1996-11-30 22:41:49 +00:00
dyson	f7092d9364	Potentially fix a problem, whereby MSDOSFS can request buffers larger than the vfs layer can provide. We now automatically support 32K clusters if MSDOSFS is installed, and panic if a filesystem tries to allocate a buffer larger than MAXBSIZE. This commit is a result of some "prodding" by BDE.	1996-11-28 04:26:04 +00:00
dyson	20b553ba68	Improve the caching of small files like directories, while not substantially increasing buffer space. Specifically, we double the number of buffers, but allocate only half the amount of memory per buffer. Note that VDIR files aren't cached unless instantiated in a buffer. This will significantly improve caching.	1996-11-17 02:11:01 +00:00
dyson	b5f32c0701	Fix a problem that could cause msync (or many other things) to deadlock. The heuristic for managment of memory backing the buffer cache was nice, but didn't work due to some architectural problems. Simplify and improve the algorithm.	1996-10-17 03:04:43 +00:00
dyson	25ad3f2d7c	Fix 4 problems: Major: When blocking occurs in allocbuf() for VMIO files, excess wire counts could accumulate. Major: Pages are incorrectly accumulated into the physical buffer for clustered reads. This happens when bogus page is needed. Minor: When reclaiming buffers, the async flag on the buffer needs to be zero, or the reclaim is not optimal. Minor: The age flag should be cleared, if a buffer is wanted.	1996-10-06 07:50:05 +00:00
dyson	53c86aac70	Fix an spl window, a page manipulation at interrupt time that was incorrect, and correct the support for B_ORDERED. The spl window fix was from Peter Wemm, and his questions led me to find the problem with the interrupt time page manipulation.	1996-09-20 02:26:35 +00:00
dyson	778e4b0932	Add needed spl protection, and some minor cleanups in vfs_vmio_release. Submitted by: Peter Wemm <peter@spinner.dialix.com> and me.	1996-09-18 15:57:41 +00:00
dyson	e775010c52	Clean up some more problems with freeing busy or wired pages. The vfs_bio code was not waiting properly for page state until manipulating it.	1996-09-14 04:40:33 +00:00
dyson	22eb6874d5	A modification that allows the driver strategy to modify the B_ASYNC flag broke things pretty bad (freeing buffer already on queue or other wierd buffer queue errors.) The broken code is left in commented out, but this makes the problem go away for now.	1996-09-13 03:15:45 +00:00
dyson	62b009f8b1	Addition of page coloring support. Various levels of coloring are afforded. The default level works with minimal overhead, but one can also enable full, efficient use of a 512K cache. (Parameters can be generated to support arbitrary cache sizes also.)	1996-09-08 20:44:49 +00:00
gibbs	458938509f	Add bowrite. Bowrite guarantees that buffers queued after a call to bowrite will be written after the specified buffer (on a particular device). Bowrite does this either by taking advantage of hardware ordering support (e.g. tagged queueing on SCSI devices) or resorting to a synchronous write.	1996-09-06 05:37:53 +00:00
dyson	966cbc5d29	Even though this looks like it, this is not a complex code change. The interface into the "VMIO" system has changed to be more consistant and robust. Essentially, it is now no longer necessary to call vn_open to get merged VM/Buffer cache operation, and exceptional conditions such as merged operation of VBLK devices is simpler and more correct. This code corrects a potentially large set of problems including the problems with ktrace output and loaded systems, file create/deletes, etc. Most of the changes to NFS are cosmetic and name changes, eliminating a layer of subroutine calls. The direct calls to vput/vrele have been re-instituted for better cross platform compatibility. Reviewed by: davidg	1996-08-21 21:56:23 +00:00

... 3 4 5 6 7 ...

545 Commits