freebsd-dev

Author	SHA1	Message	Date
Alan Cox	10c447fac2	Swap in can occur safely without Giant. Release Giant on entry to scheduler().	2005-05-22 21:06:07 +00:00
Alan Cox	35cf2323f8	Remove GIANT_REQUIRED from swapout_procs().	2005-05-22 00:30:50 +00:00
Alan Cox	071a1710d1	Reduce the number of times that we acquire and release locks in swap_pager_getpages(). MFC after: 1 week	2005-05-20 21:26:05 +00:00
Alan Cox	1aececdb5a	Remove calls to spl*().	2005-05-19 06:11:13 +00:00
Alan Cox	d2d9c9aca7	Remove a stale comment concerning spl* usage.	2005-05-19 03:53:07 +00:00
Alan Cox	cf51adc0a1	Update some comments to reflect the change from spl-based to lock-based synchronization.	2005-05-18 22:08:52 +00:00
Alan Cox	ab973359d5	Remove calls to spl*().	2005-05-18 20:45:33 +00:00
Alan Cox	2e2a6fa28a	Revert revision 1.270: swp_pager_async_iodone() need not perform VM_LOCK_GIANT(). Discussed with: jeff	2005-05-18 17:48:04 +00:00
Bjoern A. Zeeb	f3aad9a6bb	Correct 32 vs 64 bit signedness issues. Approved by: pjd (mentor) MFC after: 2 weeks	2005-05-18 08:57:31 +00:00
Peter Grehan	10b00dd4f3	The final test in unlock_and_deallocate() to determine if GIANT needs to be unlocked wasn't updated to check for OBJ_NEEDGIANT. This caused a WITNESS panic when debug_mpsafevm was set to 0. Approved by: jeffr	2005-05-12 04:09:41 +00:00
Marcel Moolenaar	23110bedcd	Enable debug_mpsafevm on ia64 due to the severe functional regression caused by recent locking changes when it's off. Revert the logic to trim down the conditional. Clued-in by: alc@	2005-05-08 23:56:16 +00:00
Jeff Roberson	b8a0b997fd	- We need to inhert the OBJ_NEEDGIANT flag from the original object in vm_object_split(). Spotted by: alc	2005-05-04 20:54:16 +00:00
Jeff Roberson	ed4fe4f4f5	- Add a new object flag "OBJ_NEEDSGIANT". We set this flag if the underlying vnode requires Giant. - In vm_fault only acquire Giant if the underlying object has NEEDSGIANT set. - In vm_object_shadow inherit the NEEDSGIANT flag from the backing object.	2005-05-03 11:11:26 +00:00
Alan Cox	b7903e65fb	Remove GIANT_REQUIRED from vmspace_exec(). Prodded by: jeff	2005-05-02 07:05:20 +00:00
Jeff Roberson	382a601cd7	- VM_LOCK_GIANT in the swap pager's iodone routine as VFS will soon call it without Giant. Sponsored by: Isilon Systems, Inc.	2005-04-30 11:25:49 +00:00
Robert Watson	5d1ae027f0	Modify UMA to use critical sections to protect per-CPU caches, rather than mutexes, which offers lower overhead on both UP and SMP. When allocating from or freeing to the per-cpu cache, without INVARIANTS enabled, we now no longer perform any mutex operations, which offers a 1%-3% performance improvement in a variety of micro-benchmarks. We rely on critical sections to prevent (a) preemption resulting in reentrant access to UMA on a single CPU, and (b) migration of the thread during access. In the event we need to go back to the zone for a new bucket, we release the critical section to acquire the global zone mutex, and must re-acquire the critical section and re-evaluate which cache we are accessing in case migration has occured, or circumstances have changed in the current cache. Per-CPU cache statistics are now gathered lock-free by the sysctl, which can result in small races in statistics reporting for caches. Reviewed by: bmilekic, jeff (somewhat) Tested by: rwatson, kris, gnn, scottl, mike at sentex dot net, others	2005-04-29 18:56:36 +00:00
Jeff Roberson	7625cbf3cc	- Pass the ISOPEN flag to namei so filesystems will know we're about to open them or otherwise access the data.	2005-04-27 09:05:19 +00:00
Kris Kennaway	f5fca0d8be	Add the vm.exec_map_entries tunable and read-only sysctl, which controls the number of entries in exec_map (maximum number of simultaneous execs that can be handled by the kernel). The default value of 16 is insufficient on heavily loaded machines (particularly SMP machines), and if it is exceeded then executing further processes will generate a SIGABRT. This is a workaround until a better solution can be implemented. Reviewed by: alc MFC after: 3 days	2005-04-25 19:22:05 +00:00
Dag-Erling Smørgrav	02dcaf2fd1	Unbreak the build on 64-bit architectures.	2005-04-16 12:37:16 +00:00
John Baldwin	3c3edcb445	Add a vm.blacklist tunable which can hold a space or comma seperated list of physical addresses. The pages containing these physical addresses will not be added to the free list and thus will effectively be ignored by the VM system. This is mostly useful for the case when one knows of specific physical addresses that have bit errors (such as from a memtest run) so that one can blacklist the bad pages while waiting for the new sticks of RAM to arrive. The physical addresses of any ignored pages are listed in the message buffer as well.	2005-04-15 21:45:02 +00:00
Christian S.J. Peron	c92163dcad	Move MAC check_vnode_mmap entry point out from being exclusive to MAP_SHARED so that the entry point gets executed un-conditionally. This may be useful for security policies which want to perform access control checks around run-time linking. -add the mmap(2) flags argument to the check_vnode_mmap entry point so that we can make access control decisions based on the type of mapped object. -update any dependent API around this parameter addition such as function prototype modifications, entry point parameter additions and the inclusion of sys/mman.h header file. -Change the MLS, BIBA and LOMAC security policies so that subject domination routines are not executed unless the type of mapping is shared. This is done to maintain compatibility between the old vm_mmap_vnode(9) and these policies. Reviewed by: rwatson MFC after: 1 month	2005-04-14 16:03:30 +00:00
John Baldwin	9fd0669542	Tidy vcnt() by moving a duplicated line above #ifdef and removing a useless variable.	2005-04-12 23:15:28 +00:00
John Baldwin	a5c7ea5b9e	Flip the switch and turn mpsafevm on by default for sparc64. Approved by: alc	2005-04-04 20:59:02 +00:00
Jeff Roberson	6e4b282039	- Don't NULL the vnode's v_object pointer until after the object is torn down. If we have dirty pages, the putpages routine will need to know what the vnode's object is so that it may write out dirty pages. Pointy hat: phk Found by: obrien	2005-04-03 22:56:58 +00:00
John Baldwin	98df9218da	- Change the vm_mmap() function to accept an objtype_t parameter specifying the type of object represented by the handle argument. - Allow vm_mmap() to map device memory via cdev objects in addition to vnodes and anonymous memory. Note that mmaping a cdev directly does not currently perform any MAC checks like mapping a vnode does. - Unbreak the DRM getbufs ioctl by having it call vm_mmap() directly on the cdev the ioctl is acting on rather than trying to find a suitable vnode to map from. Reviewed by: alc, arch@	2005-04-01 20:00:11 +00:00
Jeff Roberson	f247a5240d	- LK_NOPAUSE is a nop now. Sponsored by: Isilon Systems, Inc.	2005-03-31 04:37:09 +00:00
Alan Cox	c6ec6a7cae	Eliminate (now) unnecessary acquisition and release of the global page queues lock in vm_object_backing_scan(). Updates to the page's PG_BUSY flag and busy field are synchronized by the containing object's lock. Testing the page's hold_count and wire_count in vm_object_backing_scan()'s OBSC_COLLAPSE_NOWAIT case is unnecessary. There is no reason why the held or wired pages cannot be migrated to the shadow object. Reviewed by: tegge	2005-03-30 05:40:02 +00:00
David Schultz	010b1ca16e	Move the swap_zone == NULL check earlier (i.e. before we dereference the pointer.) Found by: Coverity Prevent analysis tool	2005-03-18 21:22:48 +00:00
Jeff Roberson	ee39666a76	- Don't lock the vnode interlock in vm_object_set_writeable_dirty() if we've already set the object flags. Reviewed by: alc	2005-03-17 12:03:42 +00:00
Jeff Roberson	761dbeb66f	- In vm_page_insert() hold the backing vnode when the first page is inserted. - In vm_page_remove() drop the backing vnode when the last page is removed. - Don't check the vnode to see if it must be reclaimed on every call to vm_page_free_toq() as we only check it now when it is actually required. This saves us two lock operations per call. Sponsored by: Isilon Systems, Inc.	2005-03-15 14:14:09 +00:00
Jeff Roberson	7747c03884	- Don't directly adjust v_usecount, use vref() instead. Sponsored by: Isilon Systems, Inc.	2005-03-14 09:03:19 +00:00
Jeff Roberson	1d39df3fe9	- Retire OLOCK and OWANT. All callers hold the vnode lock when creating a vnode object. There has been an assert to prove this for some time. Sponsored by: Isilon Systems, Inc.	2005-03-14 07:29:40 +00:00
Jeff Roberson	493d78b3bd	- Don't acquire the vnode lock in destroy_vobject, assert that it has already been acquired by the caller. Sponsored by: Isilon Systems, Inc.	2005-03-13 12:05:05 +00:00
Alan Cox	b70458aec3	Revert the first part of revision 1.114 and modify the second part. On architectures implementing uma_small_alloc() pages do not necessarily belong to the kmem object.	2005-02-24 06:13:01 +00:00
Poul-Henning Kamp	dfd4be14bd	Try to unbreak the vnode locking around vop_reclaim() (based mostly on patch from kan@). Pull bufobj_invalbuf() out of vinvalbuf() and make g_vfs call it on close. This is not yet a generally safe function, but for this very specific use it is safe. This solves the problem with buffers not being flushed by unmount or after failed mount attempts.	2005-02-19 11:44:57 +00:00
Bosko Milekic	8076cb5289	Well, it seems that I pre-maturely removed the "All rights reserved" statement from some files, so re-add it for the moment, until the related legalese is sorted out. This change affects: sys/kern/kern_mbuf.c sys/vm/memguard.c sys/vm/memguard.h sys/vm/uma.h sys/vm/uma_core.c sys/vm/uma_dbg.c sys/vm/uma_dbg.h sys/vm/uma_int.h	2005-02-16 21:45:59 +00:00
Bosko Milekic	500f29d06e	Make UMA set the overloaded page->object back to kmem_object for UMA_ZONE_REFCNT and UMA_ZONE_MALLOC zones, as the page(s) undoubtedly came from kmem_map for those two. Previously it would set it back to NULL for UMA_ZONE_REFCNT zones and although this was probably not fatal, it added MORE code for no reason.	2005-02-16 20:06:11 +00:00
Bosko Milekic	7fae6a1116	Rather than overloading the page->object field like UMA does, use instead an unused pageq queue reference in the page structure to stash a pointer to the MemGuard FIFO. Using the page->object field caused problems because when vm_map_protect() was called the second time to set VM_PROT_DEFAULT back onto a set of pages in memguard_map, the protection in the VM would be changed but the PMAP code would lazily not restore the PG_RW bit on the underlying pages right away (see pmap_protect()). So when a page fault finally occured and the VM noticed the faulting address corresponds to a page that _does_ have write access now, it would then call into PMAP to set back PG_RW (i386 case being discussed here). However, before it got to do that, an assertion on the object lock not being owned would get triggered, as the object of the faulting page would need to be locked but was overloaded by MemGuard. This is precisely why MemGuard cannot overload page->object. Submitted by: Alan Cox (alc@)	2005-02-15 22:17:07 +00:00
Poul-Henning Kamp	7fbdc92113	sysctl node vm.stats can not be static (for ia64 reasons).	2005-02-11 16:34:14 +00:00
Bosko Milekic	0341256576	Implement support for buffers larger than PAGE_SIZE in MemGuard. Adds a little bit of complexity but performance requirements lacking (this is a debugging allocator after all), it's really not too bad (still only 317 lines). Also add an additional check to help catch really weird 3-threads-involved races: make memguard_free() write to the first page handed back, always, before it does anything else. Note that there is still a problem in VM+PMAP (specifically with vm_map_protect) w.r.t. MemGuard uses it, but this will be fixed shortly and this change stands on its own.	2005-02-10 22:36:05 +00:00
Poul-Henning Kamp	39a79f0c01	Make three SYSCTL_NODEs static	2005-02-10 12:18:36 +00:00
Poul-Henning Kamp	253de0a143	Make npages static and const.	2005-02-10 12:18:17 +00:00
Suleiman Souhlal	81ae703462	Set the scheduling class of the zeroidle thread to PRI_IDLE. Reviewed by: jhb Approved by: grehan (mentor) MFC after: 1 week	2005-02-04 06:18:31 +00:00
Alan Cox	8e99783b25	Update the text of an assertion to reflect changes made in revision 1.148. Submitted by: tegge Eliminate an unnecessary, temporary increment of the backing object's reference count in vm_object_qcollapse(). Reviewed by: tegge	2005-01-30 21:29:47 +00:00
Poul-Henning Kamp	7146d6cb3e	Move the contents of vop_stddestroyvobject() to the new vnode_pager function vnode_destroy_vobject(). Make the new function zero the vp->v_object pointer so we can tell if a call is missing.	2005-01-28 08:56:48 +00:00
Poul-Henning Kamp	8516dd18e1	Don't use VOP_GETVOBJECT, use vp->v_object directly.	2005-01-25 00:40:01 +00:00
Poul-Henning Kamp	d07a6d3f61	Move the body of vop_stdcreatevobject() over to the vnode_pager under the name Sande^H^H^H^H^Hvnode_create_vobject(). Make the new function take a size argument which removes the need for a VOP_STAT() or a very pessimistic guess for disks. Call that new function from vop_stdcreatevobject(). Make vnode_pager_alloc() private now that its only user came home.	2005-01-24 21:21:59 +00:00
Poul-Henning Kamp	35764be39e	Kill the VV_OBJBUF and test the v_object for NULL instead.	2005-01-24 13:13:57 +00:00
Jeff Roberson	ae51ff1127	- Remove GIANT_REQUIRED where giant is no longer required. - Use VFS_LOCK_GIANT() rather than directly acquiring giant in places where giant is only held because vfs requires it. Sponsored By: Isilon Systems, Inc.	2005-01-24 10:48:29 +00:00
Alan Cox	75337a5677	Guard against address wrap in kernacc(). Otherwise, a program accessing a bad address range through /dev/kmem can panic the machine. Submitted by: Mark W. Krentel Reported by: Kris Kennaway MFC after: 1 week	2005-01-22 19:21:29 +00:00
Bosko Milekic	eca64e79b5	s/round_page/trunc_page/g I meant trunc_page. It's only a coincidence this hasn't caused problems yet. Pointed out by: Antoine Brodin <antoine.brodin@laposte.net>	2005-01-22 00:09:34 +00:00
Bosko Milekic	e4eb384b47	Bring in MemGuard, a very simple and small replacement allocator designed to help detect tamper-after-free scenarios, a problem more and more common and likely with multithreaded kernels where race conditions are more prevalent. Currently MemGuard can only take over malloc()/realloc()/free() for particular (a) malloc type(s) and the code brought in with this change manually instruments it to take over M_SUBPROC allocations as an example. If you are planning to use it, for now you must: 1) Put "options DEBUG_MEMGUARD" in your kernel config. 2) Edit src/sys/kern/kern_malloc.c manually, look for "XXX CHANGEME" and replace the M_SUBPROC comparison with the appropriate malloc type (this might require additional but small/simple code modification if, say, the malloc type is declared out of scope). 3) Build and install your kernel. Tune vm.memguard_divisor boot-time tunable which is used to scale how much of kmem_map you want to allott for MemGuard's use. The default is 10, so kmem_size/10. ToDo: 1) Bring in a memguard(9) man page. 2) Better instrumentation (e.g., boot-time) of MemGuard taking over malloc types. 3) Teach UMA about MemGuard to allow MemGuard to override zone allocations too. 4) Improve MemGuard if necessary. This work is partly based on some old patches from Ian Dowse.	2005-01-21 18:09:17 +00:00
Alan Cox	986b43f845	Add checks to vm_map_findspace() to test for address wrap. The conditions where this could occur are very rare, but possible. Submitted by: Mark W. Krentel MFC after: 2 weeks	2005-01-18 19:50:09 +00:00
Alan Cox	d936694f09	Consider three objects, O, BO, and BBO, where BO is O's backing object and BBO is BO's backing object. Now, suppose that O and BO are being collapsed. Furthermore, suppose that BO has been marked dead (OBJ_DEAD) by vm_object_backing_scan() and that either vm_object_backing_scan() has been forced to sleep due to encountering a busy page or vm_object_collapse() has been forced to sleep due to memory allocation in the swap pager. If vm_object_deallocate() is then called on BBO and BO is BBO's only shadow object, vm_object_deallocate() will collapse BO and BBO. In doing so, it adds a necessary temporary reference to BO. If this collapse also sleeps and the prior collapse resumes first, the temporary reference will cause vm_object_collapse to panic with the message "backing_object %p was somehow re-referenced during collapse!" Resolve this race by changing vm_object_deallocate() such that it doesn't collapse BO and BBO if BO is marked dead. Once O and BO are collapsed, vm_object_collapse() will attempt to collapse O and BBO. So, vm_object_deallocate() on BBO need do nothing. Reported by: Peter Holm on 20050107 URL: http://www.holm.cc/stress/log/cons102.html In collaboration with: tegge@ Candidate for RELENG_4 and RELENG_5 MFC after: 2 weeks	2005-01-15 21:12:47 +00:00
Poul-Henning Kamp	7c0745eeae	Eliminate unused and unnecessary "cred" argument from vinvalbuf()	2005-01-14 07:33:51 +00:00
Poul-Henning Kamp	8df6bac4c7	Remove the unused credential argument from VOP_FSYNC() and VFS_SYNC(). I'm not sure why a credential was added to these in the first place, it is not used anywhere and it doesn't make much sense: The credentials for syncing a file (ability to write to the file) should be checked at the system call level. Credentials for syncing one or more filesystems ("none") should be checked at the system call level as well. If the filesystem implementation needs a particular credential to carry out the syncing it would logically have to the cached mount credential, or a credential cached along with any delayed write data. Discussed with: rwatson	2005-01-11 07:36:22 +00:00
Bosko Milekic	c5c1b16ec5	While we want the recursion protection for the bucket zones so that recursion from the VM is handled (and the calling code that allocates buckets knows how to deal with it), we do not want to prevent allocation from the slab header zones (slabzone and slabrefzone) if uk_recurse is not zero for them. The reason is that it could lead to NULL being returned for the slab header allocations even in the M_WAITOK case, and the caller can't handle that (this is also explained in a comment with this commit). The problem analysis is documented in our mailing lists: http://docs.freebsd.org/cgi/getmsg.cgi?fetch=153445+0+archive/2004/freebsd-current/20041231.freebsd-current (see entire thread for proper context). Crash dump data provided by: Peter Holm <peter@holm.cc>	2005-01-11 03:33:09 +00:00
Stefan Farfeleder	1e183df21e	ISO C requires at least one element in an initialiser list.	2005-01-10 20:30:04 +00:00
Alan Cox	5ba514bc89	Move the acquisition and release of the page queues lock outside of a loop in vm_object_split() to avoid repeated acquisition and release.	2005-01-08 23:41:11 +00:00
Alan Cox	46fbc58202	Transfer responsibility for freeing the page taken from the cache queue and (possibly) unlocking the containing object from vm_page_alloc() to vm_page_select_cache(). Recent optimizations to vm_map_pmap_enter() (see vm_map.c revisions 1.362 and 1.363) and pmap_enter_quick() have resulted in panic()s because vm_page_alloc() mistakenly unlocked objects that had not been locked by vm_page_select_cache(). Reported by: Peter Holm and Kris Kennaway	2005-01-07 05:02:19 +00:00
Warner Losh	60727d8b86	/* -> /*- for license, minor formatting changes	2005-01-07 02:29:27 +00:00
Alan Cox	df2e33bf42	Revise the part of vm_pageout_scan() that moves pages from the cache queue to the free queue. With this change, if a page from the cache queue belongs to a locked object, it is simply skipped over rather than moved to the inactive queue.	2005-01-06 20:22:36 +00:00
Poul-Henning Kamp	4f8205e5d1	When allocating bio's in the swap_pager use M_WAITOK since the alternative is much worse.	2005-01-03 13:28:56 +00:00
Alan Cox	0869d38ba6	Assert that page allocations during an interrupt specify VM_ALLOC_INTERRUPT. Assert that pages removed from the cache queue are not busy.	2004-12-31 19:50:45 +00:00
Alan Cox	7aa2190c8e	Access to the page's busy field is (now) synchronized by the containing object's lock. Therefore, the assertion that the page queues lock is held can be removed from vm_page_io_start().	2004-12-29 04:18:22 +00:00
Alan Cox	91f7a86064	Note that access to the page's busy count is synchronized by the containing object's lock.	2004-12-27 05:27:59 +00:00
Alan Cox	40198b3c04	Assert that the vm object is locked on entry to vm_page_sleep_if_busy(); remove some unneeded code.	2004-12-26 21:46:44 +00:00
Bosko Milekic	7b8712053c	Add my copyright and update Jeff's copyright on UMA source files, as per his request. Discussed with: Jeffrey Roberson	2004-12-26 00:35:12 +00:00
Poul-Henning Kamp	475e8cc394	fix comment	2004-12-25 21:30:41 +00:00
Alan Cox	a51b084059	Continue the transition from synchronizing access to the page's PG_BUSY flag and busy field with the global page queues lock to synchronizing their access with the containing object's lock. Specifically, acquire the containing object's lock before reading the page's PG_BUSY flag and busy field in vm_fault(). Reviewed by: tegge@	2004-12-24 19:31:54 +00:00
Alan Cox	1f70d62298	Modify pmap_enter_quick() so that it expects the page queues to be locked on entry and it assumes the responsibility for releasing the page queues lock if it must sleep. Remove a bogus comment from pmap_enter_quick(). Using the first change, modify vm_map_pmap_enter() so that the page queues lock is acquired and released once, rather than each time that a page is mapped.	2004-12-23 20:16:11 +00:00
Alan Cox	98fe9a0ddf	Eliminate another unnecessary call to vm_page_busy(). (See revision 1.333 for a detailed explanation.)	2004-12-17 18:54:51 +00:00
Alan Cox	06c98c5dcc	Enable debug.mpsafevm by default on alpha.	2004-12-17 17:17:36 +00:00
Alan Cox	85f5b24573	In the common case, pmap_enter_quick() completes without sleeping. In such cases, the busying of the page and the unlocking of the containing object by vm_map_pmap_enter() and vm_fault_prefault() is unnecessary overhead. To eliminate this overhead, this change modifies pmap_enter_quick() so that it expects the object to be locked on entry and it assumes the responsibility for busying the page and unlocking the object if it must sleep. Note: alpha, amd64, i386 and ia64 are the only implementations optimized by this change; arm, powerpc, and sparc64 still conservatively busy the page and unlock the object within every pmap_enter_quick() call. Additionally, this change is the first case where we synchronize access to the page's PG_BUSY flag and busy field using the containing object's lock rather than the global page queues lock. (Modifications to the page's PG_BUSY flag and busy field have asserted both locks for several weeks, enabling an incremental transition.)	2004-12-15 19:55:05 +00:00
Alan Cox	90688d137c	With the removal of kern/uipc_jumbo.c and sys/jumbo.h, vm_object_allocate_wait() is not used. Remove it.	2004-12-08 05:01:47 +00:00
Alan Cox	2ad036b657	Almost nine years ago, when support for 1TB files was introduced in revision 1.55, the address parameter to vnode_pager_addr() was changed from an unsigned 32-bit quantity to a signed 64-bit quantity. However, an out-of-range check on the address was not updated. Consequently, memory-mapped I/O on files greater than 2GB could cause a kernel panic. Since the address is now a signed 64-bit quantity, the problem resolution is simply to remove a cast. Reviewed by: bde@ and tegge@ PR: 73010 MFC after: 1 week	2004-12-07 22:05:38 +00:00
Alan Cox	d8fed1d050	Correct a sanity check in vnode_pager_generic_putpages(). The cast used to implement the sanity check should have been changed when we converted the implementation of vm_pindex_t from 32 to 64 bits. (Thus, RELENG_4 is not affected.) The consequence of this error would be a legimate write to an extremely large file being treated as an errant attempt to write meta- data. Discussed with: tegge@	2004-12-05 21:48:11 +00:00
David Schultz	6004362e66	Don't include sys/user.h merely for its side-effect of recursively including other headers.	2004-11-27 06:51:39 +00:00
Olivier Houchard	6fc96493ac	Remove useless casts.	2004-11-26 15:04:26 +00:00
Xin LI	8e33bced3c	Try to close a potential, but serious race in our VM subsystem. Historically, our contigmalloc1() and contigmalloc2() assumes that a page in PQ_CACHE can be unconditionally reused by busying and freeing it. Unfortunatelly, when object happens to be not NULL, the code will set m->object to NULL and disregard the fact that the page is actually in the VM page bucket, resulting in page bucket hash table corruption and finally, a filesystem corruption, or a 'page not in hash' panic. This commit has borrowed the idea taken from DragonFlyBSD's fix to the VM fix by Matthew Dillon[1]. This version of patch will do the following checks: - When scanning pages in PQ_CACHE, check hold_count and skip over pages that are held temporarily. - For pages in PQ_CACHE and selected as candidate of being freed, check if it is busy at that time. Note: It seems that this is might be unrelated to kern/72539. Obtained from: DragonFlyBSD, sys/vm/vm_contig.c,v 1.11 and 1.12 [1] Reminded by: Matt Dillon Reworked by: alc MFC After: 1 week	2004-11-24 18:56:13 +00:00
David Schultz	9799b417d5	Disable U area swapping and remove the routines that create, destroy, copy, and swap U areas. Reviewed by: arch@	2004-11-20 02:29:00 +00:00
Poul-Henning Kamp	9c83534dd8	Make VOP_BMAP return a struct bufobj for the underlying storage device instead of a vnode for it. The vnode_pager does not and should not have any interest in what the filesystem uses for backend. (vfs_cluster doesn't use the backing store argument.)	2004-11-15 09:18:27 +00:00
Poul-Henning Kamp	5c6e573ffb	Add pbgetbo()/pbrelbo() lighter weight versions of pbgetvp()/pbrelvp().	2004-11-15 08:47:18 +00:00
Poul-Henning Kamp	287013d287	More kasserts.	2004-11-15 08:33:09 +00:00
Poul-Henning Kamp	d7fe1f51ad	style polishing.	2004-11-15 08:22:38 +00:00
Poul-Henning Kamp	a752aa8f17	Move pbgetvp() and pbrelvp() to vm_pager.c with the rest of the pbuf stuff.	2004-11-15 08:12:50 +00:00
Poul-Henning Kamp	e8a7bef39e	expect the caller to have called pbrelvp() if necessary.	2004-11-15 08:07:26 +00:00
Poul-Henning Kamp	676f3ee26c	Explicitly call pbrelvp()	2004-11-15 08:06:05 +00:00
Poul-Henning Kamp	d20b2f76cc	Improve readability with a bunch of typedefs for the pager ops. These can also be used for prototypes in the pagers.	2004-11-09 13:43:20 +00:00
Dag-Erling Smørgrav	7419d1e25f	#include <vm/vm_param.h> instead of <machine/vmparam.h> (the former includes the latter, but also declares variables which are defined in kern/subr_param.c). Change som VM parameters from quad_t to unsigned long. They refer to quantities (size limits for text, heap and stack segments) which must necessarily be smaller than the size of the address space, so long is adequate on all platforms. MFC after: 1 week	2004-11-08 18:20:02 +00:00
Alan Cox	dad740e967	Eliminate an unnecessary atomic operation. Articulate the rationale in a comment.	2004-11-06 21:48:45 +00:00
Robert Watson	dc2c7965c0	Abstract the logic to look up the uma_bucket_zone given a desired number of entries into bucket_zone_lookup(), which helps make more clear the logic of consumers of bucket zones. Annotate the behavior of bucket_init() with a comment indicating how the various data structures, including the bucket lookup tables, are initialized.	2004-11-06 11:43:30 +00:00
Poul-Henning Kamp	a7f06e2bd4	Remove dangling variable	2004-11-06 11:33:11 +00:00
Robert Watson	f9d27e7524	Annotate what bucket_size[] array does; staticize since it's used only in uma_core.c.	2004-11-06 11:24:40 +00:00
David Schultz	8bc61209d4	Fix the last known race in swapoff(), which could lead to a spurious panic: swapoff: failed to locate %d swap blocks The race occurred because putpages() can block between the time it allocates swap space and the time it updates the swap metadata to associate that space with a vm_object, so swapoff() would complain about the temporary inconsistency. I hoped to fix this by making swp_pager_getswapspace() and swp_pager_meta_build() a single atomic operation, but that proved to be inconvenient. With this change, swapoff() simply doesn't attempt to be so clever about detecting when all the pageout activity to the target device should have drained.	2004-11-06 07:17:50 +00:00
Alan Cox	19187819b7	Move a call to wakeup() from vm_object_terminate() to vnode_pager_dealloc() because this call is only needed to wake threads that slept when they discovered a dead object connected to a vnode. To eliminate unnecessary calls to wakeup() by vnode_pager_dealloc(), introduce a new flag, OBJ_DISCONNECTWNT. Reviewed by: tegge@	2004-11-06 05:33:02 +00:00
John Baldwin	57ea1265bd	- Set the priority of the page zeroing thread using sched_prio() when the thread is created rather than adjusting the priority in the main function. (kthread_create() should probably take the initial priority as an argument.) - Only yield the CPU in the !PREEMPTION case if there are any other runnable threads. Yielding when there isn't anything else better to do just wastes time in pointless context switches (albeit while the system is idle.)	2004-11-05 19:14:02 +00:00
Alan Cox	34d9e6fdae	During traversal of the inactive queue, try locking the page's containing object before accessing the page's flags or the object's reference count.	2004-11-05 06:24:05 +00:00
Alan Cox	b546ac5490	Eliminate another unnecessary call to vm_page_busy() that immediately precedes a call to vm_page_rename(). (See the previous revision for a detailed explanation.)	2004-11-05 05:40:45 +00:00
David Schultz	b3fed13e9d	Close a race in swapoff(). Here are the gory details: In order to avoid livelock, swapoff() skips over objects with a nonzero pip count and makes another pass if necessary. Since it is impossible to know which objects we care about, it would choose an arbitrary object with a nonzero pip count and wait for it before making another pass, the theory being that this object would finish paging about as quickly as the ones we care about. Unfortunately, we may have slept since we acquired a reference to this object. Hack around this problem by tsleep()ing on the pointer anyway, but timeout after a fixed interval. More elegant solutions are possible, but the ones I considered unnecessarily complicate this rare case. Also, kill some nits that seem to have crept into the swapoff() code in the last 75 revisions or so: - Don't pass both sp and sp->sw_used to swap_pager_swapoff(), since the latter can be derived from the former. - Replace swp_pager_find_dev() with something simpler. There's no need to iterate over the entire list of swap devices just to determine if a given block is assigned to the one we're interested in. - Expand the scope of the swhash_mtx in a couple of places so that it isn't released and reacquired once for every hash bucket. - Don't drop the swhash_mtx while holding a reference to an object. We need to lock the object first. Unfortunately, doing so would violate the established lock order, so use VM_OBJECT_TRYLOCK() and try again on a subsequent pass if the object is already locked. - Refactor swp_pager_force_pagein() and swap_pager_swapoff() a bit.	2004-11-05 05:36:56 +00:00
Poul-Henning Kamp	6e67e2a710	Retire b_magic now, we have the bufobj containing the same hint.	2004-11-04 09:48:18 +00:00
Poul-Henning Kamp	c5d3d25e4f	De-couple our I/O bio request from the embedded bio in buf by explicitly copying the fields.	2004-11-04 08:38:07 +00:00
Poul-Henning Kamp	c569065139	Remove buf->b_dev field.	2004-11-04 07:59:57 +00:00
Alan Cox	d19ef81437	The synchronization provided by vm object locking has eliminated the need for most calls to vm_page_busy(). Specifically, most calls to vm_page_busy() occur immediately prior to a call to vm_page_remove(). In such cases, the containing vm object is locked across both calls. Consequently, the setting of the vm page's PG_BUSY flag is not even visible to other threads that are following the synchronization protocol. This change (1) eliminates the calls to vm_page_busy() that immediately precede a call to vm_page_remove() or functions, such as vm_page_free() and vm_page_rename(), that call it and (2) relaxes the requirement in vm_page_remove() that the vm page's PG_BUSY flag is set. Now, the vm page's PG_BUSY flag is set only when the vm object lock is released while the vm page is still in transition. Typically, this is when it is undergoing I/O.	2004-11-03 20:17:31 +00:00
Alan Cox	f790de29b9	Introduce a Boolean variable wakeup_needed to avoid repeated, unnecessary calls to wakeup() by vm_page_zero_idle_wakeup().	2004-10-31 19:32:57 +00:00
Alan Cox	b86e6ec007	During traversal of the active queue by vm_pageout_page_stats(), try locking the page's containing object before accessing the page's flags.	2004-10-30 23:30:53 +00:00
Alan Cox	af117d1278	Eliminate an unused but initialized variable.	2004-10-30 20:11:23 +00:00
Alan Cox	4b8a5c4095	Add an assignment statement that I omitted from the previous revision.	2004-10-30 07:09:46 +00:00
Alan Cox	f4d49654ae	Assert that the containing vm object is locked in vm_page_cache() and vm_page_try_to_cache().	2004-10-28 05:26:21 +00:00
Bosko Milekic	a5a262c6db	Fix a INVARIANTS-only bug introduced in Revision 1.104: IF INVARIANTS is defined, and in the rare case that we have allocated some objects from the slab and at least one initializer on at least one of those objects failed, and we need to fail the allocation and push the uninitialized items back into the slab caches -- in that scenario, we would fail to [re]set the bucket cache's ub_bucket item references to NULL, which would eventually trigger a KASSERT.	2004-10-27 21:19:35 +00:00
Alan Cox	b08abf6cc0	During traversal of the active queue, try locking the page's containing object before accessing the page's flags or the object's reference count. If the trylock fails, handle the page as though it is busy.	2004-10-27 18:29:17 +00:00
Poul-Henning Kamp	6229cc508b	Also check that the sectormask is bigger than zero. Wrap this overly long KASSERT and remove newline.	2004-10-26 19:51:57 +00:00
Poul-Henning Kamp	5d9d81e7ea	Put the I/O block size in bufobj->bo_bsize. We keep si_bsize_phys around for now as that is the simplest way to pull the number out of disk device drivers in devfs_open(). The correct solution would be to do an ioctl(DIOCGSECTORSIZE), but the point is probably mooth when filesystems sit on GEOM, so don't bother for now.	2004-10-26 07:39:12 +00:00
Poul-Henning Kamp	ff4782b57d	Don't clear flags we just checked were not set.	2004-10-26 05:57:29 +00:00
Alan Cox	63bb7041cc	Assert that the containing vm object is locked in vm_page_flash().	2004-10-25 19:52:44 +00:00
Alan Cox	75d0533847	Assert that the containing vm object is locked in vm_page_busy() and vm_page_wakeup().	2004-10-24 23:53:47 +00:00
Poul-Henning Kamp	b792bebeea	Move the buffer method vector (buf->b_op) to the bufobj. Extend it with a strategy method. Add bufstrategy() which do the usual VOP_SPECSTRATEGY/VOP_STRATEGY song and dance. Rename ibwrite to bufwrite(). Move the two NFS buf_ops to more sensible places, add bufstrategy to them. Add inlines for bwrite() and bstrategy() which calls through buf->b_bufobj->b_ops->b_{write,strategy}(). Replace almost all VOP_STRATEGY()/VOP_SPECSTRATEGY() calls with bstrategy().	2004-10-24 20:03:41 +00:00
Alan Cox	e5526e6aa5	Acquire the vm object lock before rather than after calling vm_page_sleep_if_busy(). (The motivation being to transition synchronization of the vm_page's PG_BUSY flag from the global page queues lock to the per-object lock.)	2004-10-24 19:32:19 +00:00
Alan Cox	ddf4bb37c8	Use VM_ALLOC_NOBUSY instead of calling vm_page_wakeup().	2004-10-24 18:46:32 +00:00
Alan Cox	0f9f9bcb53	Introduce VM_ALLOC_NOBUSY, an option to vm_page_alloc() and vm_page_grab() that indicates that the caller does not want a page with its busy flag set. In many places, the global page queues lock is acquired and released just to clear the busy flag on a just allocated page. Both the allocation of the page and the clearing of the busy flag occur while the containing vm object is locked. So, the busy flag might as well never be set.	2004-10-24 06:15:36 +00:00
Poul-Henning Kamp	494eb176e7	Add b_bufobj to struct buf which eventually will eliminate the need for b_vp. Initialize b_bufobj for all buffers. Make incore() and gbincore() take a bufobj instead of a vnode. Make inmem() local to vfs_bio.c Change a lot of VI_[UN]LOCK(bp->b_vp) to BO_[UN]LOCK(bp->b_bufobj) also VI_MTX() to BO_MTX(), Make buf_vlist_add() take a bufobj instead of a vnode. Eliminate other uses of bp->b_vp where bp->b_bufobj will do. Various minor polishing: remove "register", turn panic into KASSERT, use new function declarations, TAILQ_FOREACH_SAFE() etc.	2004-10-22 08:47:20 +00:00
Poul-Henning Kamp	a76d8f4ec9	Move the VI_BWAIT flag into no bo_flag element of bufobj and call it BO_WWAIT Add bufobj_wref(), bufobj_wdrop() and bufobj_wwait() to handle the write count on a bufobj. Bufobj_wdrop() replaces vwakeup(). Use these functions all relevant places except in ffs_softdep.c where the use if interlocked_sleep() makes this impossible. Rename b_vnbufs to b_bobufs now that we touch all the relevant files anyway.	2004-10-21 15:53:54 +00:00
Alan Cox	1e96d2a217	Correct two errors in PG_BUSY management by vm_page_cowfault(). Both errors are in rarely executed paths. 1. Each time the retry_alloc path is taken, the PG_BUSY must be set again. Otherwise vm_page_remove() panics. 2. There is no need to set PG_BUSY on the newly allocated page before freeing it. The page already has PG_BUSY set by vm_page_alloc(). Setting it again could cause an assertion failure. MFC after: 2 weeks	2004-10-18 08:11:59 +00:00
Alan Cox	36aeb90e34	Assert that the containing object is locked in vm_page_io_start() and vm_page_io_finish(). The motivation being to transition synchronization of the vm_page's busy field from the global page queues lock to the per-object lock.	2004-10-17 22:33:40 +00:00
Alan Cox	950d5f7a99	Remove unnecessary check for curthread == NULL.	2004-10-17 20:29:28 +00:00
Peter Wemm	a7bc3102c4	Put on my peril sensitive sunglasses and add a flags field to the internal sysctl routines and state. Add some code to use it for signalling the need to downconvert a data structure to 32 bits on a 64 bit OS when requested by a 32 bit app. I tried to do this in a generic abi wrapper that intercepted the sysctl oid's, or looked up the format string etc, but it was a real can of worms that turned into a fragile mess before I even got it partially working. With this, we can now run 'sysctl -a' on a 32 bit sysctl binary and have it not abort. Things like netstat, ps, etc have a long way to go. This also fixes a bug in the kern.ps_strings and kern.usrstack hacks. These do matter very much because they are used by libc_r and other things.	2004-10-11 22:04:16 +00:00
Brian Feldman	55fc8c1146	In the previous revision, I did not intend to change the default value of "nosleepwithlocks." Submitted by: ru	2004-10-09 18:51:32 +00:00
Brian Feldman	ab14a3f7aa	Fix critical stability problems that can cause UMA mbuf cluster state management corruption, mbuf leaks, general mbuf corruption, and at least on i386 a first level splash damage radius that encompasses up to about half a megabyte of the memory after an mbuf cluster's allocation slab. In short, this has caused instability nightmares anywhere the right kind of network traffic is present. When the polymorphic refcount slabs were added to UMA, the new types were not used pervasively. In particular, the slab management structure was turned into one for refcounts, and one for non-refcounts (supposed to be mostly like the old slab management structure), but the latter was almost always used through out. In general, every access to zones with UMA_ZONE_REFCNT turned on corrupted the "next free" slab offset offset and the refcount with each other and with other allocations (on i386, 2 mbuf clusters per 4096 byte slab). Fix things so that the right type is used to access refcounted zones where it was not before. There are additional errors in gross overestimation of padding, it seems, that would cause a large kegs (nee zones) to be allocated when small ones would do. Unless I have analyzed this incorrectly, it is not directly harmful.	2004-10-08 20:19:29 +00:00
David Schultz	f6bcadc4fc	Don't look for swap blocks in objects that aren't swap-backed. I expect that this will fix the following panic, reported by Jun: swap_pager_isswapped: failed to locate all swap meta blocks MT5 candidate	2004-09-24 16:04:20 +00:00
Poul-Henning Kamp	891822a853	XXX mark two places where we do not hold a threadcount on the dev when frobbing the cdevsw. In both cases we examine only the cdevsw and it is a good question if we weren't better off copying those properties into the cdev in the first place. This question will be revisited.	2004-09-24 08:32:36 +00:00
Poul-Henning Kamp	751fdd08fe	Use dev_re[fl]thread() to maintain a ref on the device driver while we call the ->d_mmap function.	2004-09-24 05:59:11 +00:00
David Schultz	8daa8c602a	The zone from which proc structures are allocated is marked UMA_ZONE_NOFREE to guarantee type stability, so proc_fini() should never be called. Move an assertion from proc_fini() to proc_dtor() and garbage-collect the rest of the unreachable code. I have retained vm_proc_dispose(), since I consider its disuse a bug.	2004-09-19 18:34:17 +00:00
Poul-Henning Kamp	7ce1979be6	Add new a function isa_dma_init() which returns an errno when it fails and which takes a M_WAITOK/M_NOWAIT flag argument. Add compatibility isa_dmainit() macro which whines loudly if isa_dma_init() fails. Problem uncovered by: tegge	2004-09-15 12:09:50 +00:00
Alan Cox	5e4bdb57cb	System maps are prohibited from mapping vnode-backed objects. Take advantage of this restriction to avoid acquiring and releasing Giant when wiring pages within a system map. In collaboration with: tegge@	2004-09-11 18:49:59 +00:00
Poul-Henning Kamp	1a31a6c3b2	add KASSERTS	2004-09-07 07:32:40 +00:00
Alan Cox	b102e653ad	Enable debug.mpsafevm by default on amd64 and i386. This enables copy-on- write and zero-fill faults to run without holding Giant. It is still possible to disable Giant-free operation by setting debug.mpsafevm to 0 in loader.conf.	2004-09-04 05:51:54 +00:00
Alan Cox	94ddc7076d	Push Giant deep into vm_forkproc(), acquiring it only if the process has mapped System V shared memory segments (see shmfork_myhook()) or requires the allocation of an ldt (see vm_fault_wire()).	2004-09-03 05:11:32 +00:00
Scott Long	9923b511ed	Turn PREEMPTION into a kernel option. Make sure that it's defined if FULL_PREEMPTION is defined. Add a runtime warning to ULE if PREEMPTION is enabled (code inspired by the PREEMPTION warning in kern_switch.c). This is a possible MT5 candidate.	2004-09-02 18:59:15 +00:00
Alan Cox	c2296e999c	Remove dead code.	2004-09-01 19:58:37 +00:00
Alan Cox	1a95d74419	In vm_fault_unwire() eliminate the acquisition and release of Giant in the case of non-kernel pmaps.	2004-09-01 19:18:59 +00:00
Julian Elischer	2630e4c90c	Give setrunqueue() and sched_add() more of a clue as to where they are coming from and what is expected from them. MFC after: 2 days	2004-09-01 02:11:28 +00:00
Alan Cox	9b98b79683	Move the acquisition and release of the lock on the object at the head of the shadow chain outside of the loop in vm_object_madvise(), reducing the number of times that this lock is acquired and released.	2004-08-29 20:14:10 +00:00
Ian Dowse	632ca524c3	Prevent vm_page_zero_idle_wakeup() from attempting to wake up the page zeroing thread before it has been created. It was possible for calls to free() very early in the boot process to panic here because the sleep queues were not yet initialised. Specifically, sysinit_add() running at SI_SUB_KLD would trigger this if the array of pointers became big enough to require uma_large_alloc() allocations. Submitted by: peter	2004-08-29 01:02:33 +00:00
Marcel Moolenaar	2bbb534b56	Move the cow field between wire_count and hold_count. This is the position that is 64-bit aligned and makes sure that the valid and dirty fields are also 64-bit aligned. This means that if PAGE_SIZE is 32K, the size of the vm_page structure is only increased by 8 bytes instead of 16 bytes. More importantly, the vm_page structure is either 120 or 128 bytes on ia64. These are "interesting" sizes.	2004-08-22 20:52:23 +00:00
Alan Cox	3268a1bf75	In the previous revision, I failed to condition an early release of Giant in vm_fault() on debug_mpsafevm. If debug_mpsafevm was not set, the result was an assertion failure early in the boot process. Reported by: green@	2004-08-22 00:08:43 +00:00
Alan Cox	b99e61353f	Further reduce the use of Giant by vm_fault(): Giant is held only when manipulating a vnode, e.g., calling vput(). This reduces contention for Giant during many copy-on-write faults, resulting in some additional speedup on SMPs. Note: debug_mpsafevm must be enabled for this optimization to take effect.	2004-08-21 19:20:21 +00:00
Alan Cox	0cb507cb20	Acquire and release Giant around a call to VOP_BMAP(). (This is a prerequisite to any further reduction in Giant's use by vm_fault().)	2004-08-19 02:37:12 +00:00
Alan Cox	c1fbc251cd	- Introduce and use a new tunable "debug.mpsafevm". At present, setting "debug.mpsafevm" results in (almost) Giant-free execution of zero-fill page faults. (Giant is held only briefly, just long enough to determine if there is a vnode backing the faulting address.) Also, condition the acquisition and release of Giant around calls to pmap_remove() on "debug.mpsafevm". The effect on performance is significant. On my dual Opteron, I see a 3.6% reduction in "buildworld" time. - Use atomic operations to update several counters in vm_fault().	2004-08-16 06:16:12 +00:00
Brian Feldman	7c938963a4	Rather than bringing back all of the changes to make VM map deletion wait for system wires to disappear, do so (much more trivially) by instead only checking for system wires of user maps and not kernel maps. Alternative by: tor Reviewed by: alc	2004-08-16 03:11:09 +00:00
Alan Cox	6eaee3fee4	Remove spl calls.	2004-08-14 18:57:41 +00:00
Alan Cox	0164e05781	Replace the linear search in vm_map_findspace() with an O(log n) algorithm built into the map entry splay tree. This replaces the first_free hint in struct vm_map with two fields in vm_map_entry: adj_free, the amount of free space following a map entry, and max_free, the maximum amount of free space in the entry's subtree. These fields make it possible to find a first-fit free region of a given size in one pass down the tree, so O(log n) amortized using splay trees. This significantly reduces the overhead in vm_map_findspace() for applications that mmap() many hundreds or thousands of regions, and has a negligible slowdown (0.1%) on buildworld. See, for example, the discussion of a micro-benchmark titled "Some mmap observations compared to Linux 2.6/OpenBSD" on -hackers in late October 2003. OpenBSD adopted this approach in March 2002, and NetBSD added it in November 2003, both with Red-Black trees. Submitted by: Mark W. Krentel	2004-08-13 08:06:34 +00:00
Tor Egge	19dc560756	The vm map lock is needed in vm_fault() after the page has been found, to avoid later changes before pmap_enter() and vm_fault_prefault() has completed. Simplify deadlock avoidance by not blocking on vm map relookup. In collaboration with: alc	2004-08-12 20:14:49 +00:00
Brian Feldman	c5f60ffccf	Re-delete the comment from r1.352.	2004-08-12 17:22:28 +00:00
Brian Feldman	0ada205ee6	Back out all behavioral chnages.	2004-08-10 14:42:48 +00:00
Brian Feldman	9689d5e5ee	Revamp VM map wiring. * Allow no-fault wiring/unwiring to succeed for consistency; however, the wired count remains at zero, so it's a special case. * Fix issues inside vm_map_wire() and vm_map_unwire() where the exact state of user wiring (one or zero) and system wiring (zero or more) could be confused; for example, system unwiring could succeed in removing a user wire, instead of being an error. * Require all mappings to be unwired before they are deleted. When VM space is still wired upon deletion, it will be waited upon for the following unwire. This makes vslock(9) work rather than allowing kernel-locked memory to be deleted out from underneath of its consumer as it would before.	2004-08-09 19:52:29 +00:00
Alan Cox	8673599662	Make two changes to vm_fault(). 1. Move a comment to its proper place, updating it. (Except for white- space, this comment had been unchanged since revision 1.1!) 2. Remove spl calls.	2004-08-09 18:46:39 +00:00
Alan Cox	91491c3566	Remove a stale comment from vm_map_lookup() that pertains to share maps. (The last vestiges of the share map code were removed in revisions 1.153 and 1.159.)	2004-08-09 18:15:46 +00:00
Alan Cox	eebf3286a6	Make two changes to vm_fault(). 1. Retain the map lock until after the calls to pmap_enter() and vm_fault_prefault(). 2. Remove a stale comment. Submitted by: tegge@	2004-08-09 06:01:46 +00:00
Poul-Henning Kamp	5721c9c76a	Tag all geom classes in the tree with a version number.	2004-08-08 07:57:53 +00:00
Alan Cox	e0b47a134b	Remove dead code. A vm_map's first_free is never NULL (even if the map is full). (This is preparation for an O(log n) implementation of vm_map_findspace().) Submitted by: Mark W. Krentel	2004-08-07 05:58:31 +00:00
Robert Watson	3659f747f1	Generate KTR trace records for uma_zalloc_arg() and uma_zfree_arg(). This doesn't trace every event of interest in UMA, but provides enough basic information to explain lock traces and sleep patterns.	2004-08-06 21:52:38 +00:00
Brian Feldman	28775a6130	Turn on the new contigmalloc(9) by default. There should not actually be a reason to use the old contigmalloc(9), but if desired, it the vm.old_contigmalloc setting can be tuned/sysctld back to 0 for now.	2004-08-05 21:54:11 +00:00
Poul-Henning Kamp	23fc1a90ea	Remove a product specific workaround for wrong modes when mmap(2)'ing devices. They have had plenty of time to adjust now.	2004-08-05 07:04:33 +00:00
Alan Cox	684a62b7bf	- Push down the acquisition and release of Giant into pmap_enter_quick() on those architectures without pmap locking. - Eliminate the acquisition and release of Giant in vm_map_pmap_enter().	2004-08-04 22:03:16 +00:00
Doug Rabson	c413d99c4e	In dev_pager_updatefake, m->valid is typically 0 on entry. It should be set to VM_PAGE_BITS_ALL before returning, to ensure that neither vm_pager_get_pages nor vm_fault calls vm_page_zero_invalid after dev_pager_getpages has returned. Submitted by: tegge	2004-08-04 08:58:58 +00:00
Alan Cox	21c125453c	Eliminate the acquisition and release of Giant around the call to pmap_mincore() in mincore(2). Either pmap locking exists (alpha, amd64, i386, ia64) or pmap_mincore() is unimplemented (arm, powerpc, sparc64).	2004-08-02 03:31:05 +00:00
Brian Feldman	b23f72e98a	* Add a "how" argument to uma_zone constructors and initialization functions so that they know whether the allocation is supposed to be able to sleep or not. * Allow uma_zone constructors and initialation functions to return either success or error. Almost all of the ones in the tree currently return success unconditionally, but mbuf is a notable exception: the packet zone constructor wants to be able to fail if it cannot suballocate an mbuf cluster, and the mbuf allocators want to be able to fail in general in a MAC kernel if the MAC mbuf initializer fails. This fixes the panics people are seeing when they run out of memory for mbuf clusters. * Allow debug.nosleepwithlocks on WITNESS to be disabled, without changing the default. Both bmilekic and jeff have reviewed the changes made to make failable zone allocations work.	2004-08-02 00:18:36 +00:00
Alan Cox	9bb0e06861	- Push down the acquisition and release of Giant into pmap_protect() on those architectures without pmap locking. - Eliminate the acquisition and release of Giant from vm_map_protect(). (Translation: mprotect(2) runs to completion without touching Giant on alpha, amd64, i386 and ia64.)	2004-07-30 20:38:30 +00:00
Alan Cox	9be60284a6	Giant is no longer required by vm_waitproc() and vmspace_exitfree(). Eliminate it acquisition and release around vm_waitproc() in kern_wait().	2004-07-30 20:31:02 +00:00
Doug Rabson	92bab635d3	Fix a memory leak in the device pager which is exposed by the NVIDIA OpenGL driver. Submitted by: nvidia (possibly also tegge)	2004-07-30 11:09:18 +00:00
Doug Rabson	874f013517	Fix handling of msync(2) for character special files. Submitted by: nvidia	2004-07-30 11:08:02 +00:00
Maxime Henrion	12c649749c	Get rid of another lockmgr(9) consumer by using sx locks for the user maps. We always acquire the sx lock exclusively here, but we can't use a mutex because we want to be able to sleep while holding the lock. This is completely equivalent to what we were doing with the lockmgr(9) locks before. Approved by: alc	2004-07-30 09:10:28 +00:00
Alan Cox	a087914310	Advance the state of pmap locking on alpha, amd64, and i386. - Enable recursion on the page queues lock. This allows calls to vm_page_alloc(VM_ALLOC_NORMAL) and UMA's obj_alloc() with the page queues lock held. Such calls are made to allocate page table pages and pv entries. - The previous change enables a partial reversion of vm/vm_page.c revision 1.216, i.e., the call to vm_page_alloc() by vm_page_cowfault() now specifies VM_ALLOC_NORMAL rather than VM_ALLOC_INTERRUPT. - Add partial locking to pmap_copy(). (As a side-effect, pmap_copy() should now be faster on i386 SMP because it no longer generates IPIs for TLB shootdown on the other processors.) - Complete the locking of pmap_enter() and pmap_enter_quick(). (As of now, all changes to a user-level pmap on alpha, amd64, and i386 are performed with appropriate locking.)	2004-07-29 18:56:31 +00:00
Bosko Milekic	244f45548a	Rework the way slab header storage space is calculated in UMA. - zone_large_init() stays pretty much the same. - zone_small_init() will try to stash the slab header in the slab page being allocated if the amount of calculated wasted space is less than UMA_MAX_WASTE (for both the UMA_ZONE_REFCNT case and regular case). If the amount of wasted space is >= UMA_MAX_WASTE, then UMA_ZONE_OFFPAGE will be set and the slab header will be allocated separately for better use of space. - uma_startup() calculates the maximum ipers required in offpage slabs (so that the offpage slab header zone(s) can be sized accordingly). The algorithm used to calculate this replaces the old calculation (which only happened to work coincidentally). We now iterate over possible object sizes, starting from the smallest one, until we determine that wastedspace calculated in zone_small_init() might end up being greater than UMA_MAX_WASTE, at which point we use the found object size to compute the maximum possible ipers. The reason this works is because: - wastedspace versus objectsize is a see-saw function with local minima all equal to zero and local maxima growing directly proportioned to objectsize. This implies that for objects up to or equal a certain objectsize, the see-saw remains entirely below UMA_MAX_WASTE, so for those objectsizes it is impossible to ever go OFFPAGE for slab headers. - ipers (items-per-slab) versus objectsize is an inversely proportional function which falls off very quickly (very large for small objectsizes). - To determine the maximum ipers we'll ever need from OFFPAGE slab headers we first find the largest objectsize for which we are guaranteed to not go offpage for and use it to compute ipers (as though we were offpage). Since the only objectsizes allowed to go offpage are bigger than the found objectsize, and since ipers vs objectsize is inversely proportional (and monotonically decreasing), then we are guaranteed that the ipers computed is always >= what we will ever need in offpage slab headers. - Define UMA_FRITM_SZ and UMA_FRITMREF_SZ to be the actual (possibly padded) size of each freelist index so that offset calculations are fixed. This might fix weird data corruption problems and certainly allows ARM to now boot to at least single-user (via simulator). Tested on i386 UP by me. Tested on sparc64 SMP by fenner. Tested on ARM simulator to single-user by cognet.	2004-07-29 15:25:40 +00:00
Alan Cox	56e0670fdc	Correct a very old error in both vm_object_madvise() (originating in vm/vm_object.c revision 1.88) and vm_object_sync() (originating in vm/vm_map.c revision 1.36): When descending a chain of backing objects, both use the wrong object's backing offset. Consequently, both may operate on the wrong pages. Quoting Matt, "This could be responsible for all of the sporatic madvise oddness that has been reported over the years." Reviewed by: Matt Dillon	2004-07-28 18:23:08 +00:00
Alan Cox	1a276a3f91	- Use atomic ops for updating the vmspace's refcnt and exitingcnt. - Push down Giant into shmexit(). (Giant is acquired only if the vmspace contains shm segments.) - Eliminate the acquisition of Giant from proc_rwmem(). - Reduce the scope of Giant in exit1(), uncovering the destruction of the address space.	2004-07-27 03:53:41 +00:00
Alan Cox	5122b74809	For years, kmem_alloc_pageable() has been misused. Now that the last of these misuses has been corrected, remove it before new ones appear, such as arm/arm/pmap.c revision 1.8.	2004-07-25 20:08:59 +00:00
Alan Cox	9b45f81502	Remove spl calls.	2004-07-25 19:28:10 +00:00
Alan Cox	57a21aba93	Make the code and comments for vm_object_coalesce() consistent.	2004-07-25 07:48:47 +00:00
Alan Cox	51ab6c2890	Simplify vmspace initialization. The bcopy() of fields from the old vmspace to the new vmspace in vmspace_exec() is mostly wasted effort. With one exception, vm_swrss, the copied fields are immediately overwritten. Instead, initialize these fields to zero in vmspace_alloc(), eliminating a bcopy() from vmspace_exec() and a bzero() from vmspace_fork().	2004-07-24 07:40:35 +00:00
Alan Cox	5285558ac2	- Change uma_zone_set_obj() to call kmem_alloc_nofault() instead of kmem_alloc_pageable(). The difference between these is that an errant memory access to the zone will be detected sooner with kmem_alloc_nofault(). The following changes serve to eliminate the following lock-order reversal reported by witness: 1st 0xc1a3c084 vm object (vm object) @ vm/swap_pager.c:1311 2nd 0xc07acb00 swap_pager swhash (swap_pager swhash) @ vm/swap_pager.c:1797 3rd 0xc1804bdc vm object (vm object) @ vm/uma_core.c:931 There is no potential deadlock in this case. However, witness is unable to recognize this because vm objects used by UMA have the same type as ordinary vm objects. To remedy this, we make the following changes: - Add a mutex type argument to VM_OBJECT_LOCK_INIT(). - Use the mutex type argument to assign distinct types to special vm objects such as the kernel object, kmem object, and UMA objects. - Define a static swap zone object for use by UMA. (Only static objects are assigned a special mutex type.)	2004-07-22 19:44:49 +00:00
Brian Feldman	d951b75210	Fix a race in vm_page_sleep_if_busy(). Due to vm_object locking being incomplete, it currently has to know how to drop and pick back up the vm_object's mutex if it has to sleep and drop the page queue mutex. The problem with this is that if the page is busy, while we are sleeping, the page can be freed and object disappear. When trying to lock m->object, we'd get a stale or NULL pointer and crash. The object is now cached, but this makes the assumption that the object is referenced in some manner and will not itself disappear while it is unlocked. Since this only happens if the object is locked, I had to remove an assumption earlier in contigmalloc() that reversed the order of locking the object and doing vm_page_sleep_if_busy(), not the normal order.	2004-07-21 23:56:09 +00:00
Peter Wemm	5476633aed	Semi-gratuitous change. Move two refcount operations to their own lines rather than be buried inside an if (expression). And now that the if expression is the same in both exit paths, use the same ordering.	2004-07-21 05:08:10 +00:00
Peter Wemm	3f25cbddc2	Move the initialization and teardown of pmaps to the vmspace zone's init and fini handlers. Our vm system removes all userland mappings at exit prior to calling pmap_release. It just so happens that we might as well reuse the pmap for the next process since the userland slate has already been wiped clean. However. There is a functional benefit to this as well. For platforms that share userland and kernel context in the same pmap, it means that the kernel portion of a pmap remains valid after the vmspace has been freed (process exit) and while it is in uma's cache. This is significant for i386 SMP systems with kernel context borrowing because it avoids a LOT of IPIs from the pmap_lazyfix() cleanup in the usual case. Tested on: amd64, i386, sparc64, alpha Glanced at by: alc	2004-07-21 00:29:21 +00:00
Brian Feldman	757cd67065	Remove extraneous locks on the VM free page queue mutex; it is not meant to be recursed upon, and could cauuse a deadlock inside the new contigmalloc (vm.old_contigmalloc=0) code. Submitted by: alc	2004-07-19 23:29:36 +00:00
Alan Cox	e832aafc51	- Eliminate the pte object from the pmap. Instead, page table pages are allocated as "no object" pages. Similar changes were made to the amd64 and i386 pmap last year. The primary reason being that maintaining a pte object leads to lock order violations. A secondary reason being that the pte object is redundant, i.e., the page table itself can be used to lookup page table pages. (Historical note: The pte object predates our ability to allocate "no object" pages. Thus, the pte object was a necessary evil.) - Unconditionally check the vm object lock's status in vm_page_remove(). Previously, this assertion could not be made on Alpha due to its use of a pte object.	2004-07-19 18:12:04 +00:00
Brian Feldman	0c3c862e21	Since breakage of malloc(9)/uma_zalloc(9) is totally non-optional in GENERIC/for WITNESS users, make sure the sysctl to disable the behavior is read-only and always enabled.	2004-07-19 15:05:24 +00:00
Brian Feldman	4362fada8f	Reimplement contigmalloc(9) with an algorithm which stands a greatly- improved chance of working despite pressure from running programs. Instead of trying to throw a bunch of pages out to swap and hope for the best, only a range that can potentially fulfill contigmalloc(9)'s request will have its contents paged out (potentially, not forcibly) at a time. The new contigmalloc operation still operates in three passes, but it could potentially be tuned to more or less. The first pass only looks at pages in the cache and free pages, so they would be thrown out without having to block. If this is not enough, the subsequent passes page out any unwired memory. To combat memory pressure refragmenting the section of memory being laundered, each page is removed from the systems' free memory queue once it has been freed so that blocking later doesn't cause the memory laundered so far to get reallocated. The page-out operations are now blocking, as it would make little sense to try to push out a page, then get its status immediately afterward to remove it from the available free pages queue, if it's unlikely to have been freed. Another change is that if KVA allocation fails, the allocated memory segment will be freed and not leaked. There is a sysctl/tunable, defaulting to on, which causes the old contigmalloc() algorithm to be used. Nonetheless, I have been using vm.old_contigmalloc=0 for over a month. It is safe to switch at run-time to see the difference it makes. A new interface has been used which does not require mapping the allocated pages into KVA: vm_page.h functions vm_page_alloc_contig() and vm_page_release_contig(). These are what vm.old_contigmalloc=0 uses internally, so the sysctl/tunable does not affect their operation. When using the contigmalloc(9) and contigfree(9) interfaces, memory is now tracked with malloc(9) stats. Several functions have been exported from kern_malloc.c to allow other subsystems to use these statistics, as well. This invalidates the BUGS section of the contigmalloc(9) manpage.	2004-07-19 06:21:27 +00:00
Alan Cox	3e36afbe27	Remove the GIANT_REQUIRED preceding pmap_remove() in vm_pageout_map_deactivate_pages().	2004-07-18 04:38:11 +00:00
Alan Cox	3d2e54c317	Push down the acquisition and release of the page queues lock into pmap_protect() and pmap_remove(). In general, they require the lock in order to modify a page's pv list or flags. In some cases, however, pmap_protect() can avoid acquiring the lock.	2004-07-15 18:00:43 +00:00
Alan Cox	26354d4c08	Remove an unused and unimplemented sysctl. (For the record, it was marked as unimplemented in revision 1.129 nearly six years ago.)	2004-07-12 17:45:37 +00:00
Alan Cox	790bdd0f2e	Increase the scope of the page queues lock in vm_page_alloc() to cover a diagnostic check that accesses the cache queue count.	2004-07-10 22:12:49 +00:00
Alan Cox	fd2d354908	Micro-optimize vmspace for 64-bit architectures: Colocate vm_refcnt and vm_exitingcnt so that alignment does not result in wasted space.	2004-07-06 17:35:10 +00:00
Bruce M Simpson	9bd86a9861	Properly brucify a string by outdenting it.	2004-07-06 02:27:30 +00:00
Bosko Milekic	0d0837ee6d	Introduce debug.nosleepwithlocks sysctl, 0 by default. If set to 1 and WITNESS is not built, then force all M_WAITOK allocations to M_NOWAIT behavior (transparently). This is to be used temporarily if wierd deadlocks are reported because we still have code paths that perform M_WAITOK allocations with lock(s) held, which can lead to deadlock. If WITNESS is compiled, then the sysctl is ignored and we ask witness to tell us wether we have locks held, converting to M_NOWAIT behavior only if it tells us that we do. Note this removes the previous mbuf.h inclusion as well (only needed by last revision), and cleans up unneeded [artificial] comparisons to just the mbuf zones. The problem described above has nothing to do with previous mbuf wait behavior; it is a general problem.	2004-07-04 16:07:44 +00:00
Brian Feldman	7a708c3626	Reextend the M_WAITOK-disabling-hack to all three of the mbuf-related zones, and do it by direct comparison of uma_zone_t instead of strcmp. The mbuf subsystem used to provide M_TRYWAIT/M_DONTWAIT semantics, but this is mostly no longer the case. M_WAITOK has taken over the spot M_TRYWAIT used to have, and for mbuf things, still may return NULL if the code path is incorrectly holding a mutex going into mbuf allocation functions. The M_WAITOK/M_NOWAIT semantics are absolute; though it may deadlock the system to try to malloc or uma_zalloc something with a mutex held and M_WAITOK specified, it is absolutely required to not return NULL and will result in instability and/or security breaches otherwise. There is still room to add the WITNESS_WARN() to all cases so that we are notified of the possibility of deadlocks, but it cannot change the value of the "badness" variable and allow allocation to actually fail except for the specialized cases which used to be M_TRYWAIT.	2004-07-04 15:59:25 +00:00
Brian Feldman	cf107c1d1a	Limit mbuma damage. Suddenly ALL allocations with M_WAITOK are subject to failing -- that is, allocations via malloc(M_WAITOK) that are required to never fail -- if WITNESS is not defined. While everyone should be running WITNESS, in any case, zone "Mbuf" allocations are really the only ones that should be screwed with by this hack. This hack is crashing people, and would continue to do so with or without WITNESS. Things shouldn't be allocating with M_WAITOK with locks held, but it's not okay just to always remove M_WAITOK when !WITNESS. Reported by: Bernd Walter <ticso@cicely5.cicely.de>	2004-07-03 18:11:41 +00:00
John Baldwin	0c0b25ae91	Implement preemption of kernel threads natively in the scheduler rather than as one-off hacks in various other parts of the kernel: - Add a function maybe_preempt() that is called from sched_add() to determine if a thread about to be added to a run queue should be preempted to directly. If it is not safe to preempt or if the new thread does not have a high enough priority, then the function returns false and sched_add() adds the thread to the run queue. If the thread should be preempted to but the current thread is in a nested critical section, then the flag TDF_OWEPREEMPT is set and the thread is added to the run queue. Otherwise, mi_switch() is called immediately and the thread is never added to the run queue since it is switch to directly. When exiting an outermost critical section, if TDF_OWEPREEMPT is set, then clear it and call mi_switch() to perform the deferred preemption. - Remove explicit preemption from ithread_schedule() as calling setrunqueue() now does all the correct work. This also removes the do_switch argument from ithread_schedule(). - Do not use the manual preemption code in mtx_unlock if the architecture supports native preemption. - Don't call mi_switch() in a loop during shutdown to give ithreads a chance to run if the architecture supports native preemption since the ithreads will just preempt DELAY(). - Don't call mi_switch() from the page zeroing idle thread for architectures that support native preemption as it is unnecessary. - Native preemption is enabled on the same archs that supported ithread preemption, namely alpha, i386, and amd64. This change should largely be a NOP for the default case as committed except that we will do fewer context switches in a few cases and will avoid the run queues completely when preempting. Approved by: scottl (with his re@ hat)	2004-07-02 20:21:44 +00:00
John Baldwin	bf0acc273a	- Change mi_switch() and sched_switch() to accept an optional thread to switch to. If a non-NULL thread pointer is passed in, then the CPU will switch to that thread directly rather than calling choosethread() to pick a thread to choose to. - Make sched_switch() aware of idle threads and know to do TD_SET_CAN_RUN() instead of sticking them on the run queue rather than requiring all callers of mi_switch() to know to do this if they can be called from an idlethread. - Move constants for arguments to mi_switch() and thread_single() out of the middle of the function prototypes and up above into their own section.	2004-07-02 19:09:50 +00:00
John Baldwin	d202e0cccc	- Don't use a variable to point to the user area that we only use once. Just use p2->p_uarea directly instead. - Remove an old and mostly bogus assertion regarding p2->p_sigacts. - Use RANGEOF macro ala fork1() to clean up bzero/bcopy of p_stats.	2004-07-02 03:45:07 +00:00

... 2 3 4 5 6 ...

2286 Commits