freebsd-dev

Author	SHA1	Message	Date
Konstantin Belousov	6195b24a79	Revert r173708's modifications to vm_object_page_remove(). Assume that a vnode is mapped shared and mlocked(), and then the vnode is truncated, or truncated and then again extended past the mapping point EOF. Truncation removes the pages past the truncation point, and if pages are later created at this range, they are not properly mapped into the mlocked region, and their wiring count is wrong. The revert leaves the invalidated but wired pages on the object queue, which means that the pages are found by vm_object_unwire() when the mapped range is munlock()ed, and reused by the buffer cache when the vnode is extended again. The changes in r173708 were required since then vm_map_unwire() looked at the page tables to find the page to unwire. This is no longer needed with the vm_object_unwire() introduction, which follows the objects shadow chain. Also eliminate OBJPR_NOTWIRED flag for vm_object_page_remove(), which is now redundand, we do not remove wired pages. Reported by: trasz, Dmitry Sivachenko <trtrmitya@gmail.com> Suggested and reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2015-07-25 18:29:06 +00:00
Gleb Smirnoff	093c7f396d	Make KPI of vm_pager_get_pages() more strict: if a pager changes a page in the requested array, then it is responsible for disposition of previous page and is responsible for updating the entry in the requested array. Now consumers of KPI do not need to re-lookup the pages after call to vm_pager_get_pages(). Reviewed by: kib Sponsored by: Netflix Sponsored by: Nginx, Inc.	2015-06-12 11:32:20 +00:00
Eric van Gyzen	63e4c6cdf9	Provide vnode in memory map info for files on tmpfs When providing memory map information to userland, populate the vnode pointer for tmpfs files. Set the memory mapping to appear as a vnode type, to match FreeBSD 9 behavior. This fixes the use of tmpfs files with the dtrace pid provider, procstat -v, procfs, linprocfs, pmc (pmcstat), and ptrace (PT_VM_ENTRY). Submitted by: Eric Badger <eric@badgerio.us> (initial revision) Obtained from: Dell Inc. PR: 198431 MFC after: 2 weeks Reviewed by: jhb Approved by: kib (mentor)	2015-06-02 18:37:04 +00:00
John Baldwin	ff87ae350e	Export a list of VM objects in the system via a sysctl. The list can be examined via 'vmstat -o'. It can be used to determine which files are using physical pages of memory and how much each is using. Differential Revision: https://reviews.freebsd.org/D2277 Reviewed by: alc, kib MFC after: 2 weeks Sponsored by: Norse Corp, Inc. (forward porting to HEAD/10)	2015-05-27 18:11:05 +00:00
John Baldwin	e735691b61	Place VM objects on the object list when created and never remove them. This is ok since objects come from a NOFREE zone and allows objects to be locked while traversing the object list without triggering a LOR. Ensure that objects on the list are marked DEAD while free or stillborn, and that they have a refcount of zero. This required updating most of the pagers to explicitly mark an object as dead when deallocating it. (Only the vnode pager did this previously.) Differential Revision: https://reviews.freebsd.org/D2423 Reviewed by: alc, kib (earlier version) MFC after: 2 weeks Sponsored by: Norse Corp, Inc.	2015-05-08 19:43:37 +00:00
Alan Cox	6a24058fab	Correct a typo in vm_object_backing_scan() that originated in r254141. Specifically, change a lock acquire into a lock release. MFC after: 3 days Sponsored by: EMC / Isilon Storage Division	2015-03-07 04:18:40 +00:00
Alan Cox	777a36c5e3	Use RW_NEW rather than calling bzero().	2015-03-01 05:18:02 +00:00
Konstantin Belousov	f40cb1c645	Update mtime for tmpfs files modified through memory mapping. Similar to UFS, perform updates during syncer scans, which in particular means that tmpfs now performs scan on sync. Also, this means that a mtime update may be delayed up to 30 seconds after the write. The vm_object' OBJ_TMPFS_DIRTY flag for tmpfs swap object is similar to the OBJ_MIGHTBEDIRTY flag for the vnode object, it indicates that object could have been dirtied. Adapt fast page fault handler and vm_object_set_writeable_dirty() to handle OBJ_TMPFS_NODE same as OBJT_VNODE. Reported by: Ronald Klop <ronald-lists@klop.ws> Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2015-01-28 10:37:23 +00:00
Konstantin Belousov	30d57414a0	When the last reference on the vnode' vm object is dropped, read the vp->v_vflag without taking vnode lock and without bypass. We do know that vp is the lowest level in the stack, since the pointer is obtained from the object' handle. Stale VV_TEXT flag read can only happen if parallel execve() is performed and not yet activated the image, since process takes reference for text mapping. In this case, the execve() code manages the VV_TEXT flag on its own already. It was observed that otherwise read-only sendfile(2) requires exclusive vnode lock and contending on it on some loads for VV_TEXT handling. Reported by: glebius, scottl Tested by: glebius, pho Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-12-05 15:02:30 +00:00
Alan Cox	0346250941	When unwiring a region of an address space, do not assume that the underlying physical pages are mapped by the pmap. If, for example, the application has performed an mprotect(..., PROT_NONE) on any part of the wired region, then those pages will no longer be mapped by the pmap. So, using the pmap to lookup the wired pages in order to unwire them doesn't always work, and when it doesn't work wired pages are leaked. To avoid the leak, introduce and use a new function vm_object_unwire() that locates the wired pages by traversing the object and its backing objects. At the same time, switch from using pmap_change_wiring() to the recently introduced function pmap_unwire() for unwiring the region's mappings. pmap_unwire() is faster, because it operates a range of virtual addresses rather than a single virtual page at a time. Moreover, by operating on a range, it is superpage friendly. It doesn't waste time performing unnecessary demotions. Reported by: markj Reviewed by: kib Tested by: pho, jmg (arm) Sponsored by: EMC / Isilon Storage Division	2014-07-26 18:10:18 +00:00
Konstantin Belousov	4bace8e721	Correct assertion. The shadowing object cannot be tmpfs vm object, and tmpfs object cannot shadow. In other words, tmpfs vm object is always at the bottom of the shadow chain. Reported and tested by: bdrewery Sponsored by: The FreeBSD Foundation MFC after: 1 week	2014-07-24 10:25:42 +00:00
Konstantin Belousov	f08f7dca40	The OBJ_TMPFS flag of vm_object means that there is unreclaimed tmpfs vnode for the tmpfs node owning this object. The flag is currently used for two purposes. First, it allows to correctly handle VV_TEXT for tmpfs vnode when the ref count on the object is decremented to 1, similar to vnode_pager_dealloc() for regular filesystems. Second, it prevents some operations, which are done on OBJT_SWAP vm objects backing user anonymous memory, but are incorrect for the object owned by tmpfs node. The second kind of use of the OBJ_TMPFS flag is incorrect, since the vnode might be reclaimed, which clears the flag, but vm object operations must still be disallowed. Introduce one more flag, OBJ_TMPFS_NODE, which is permanently set on the object for VREG tmpfs node, and used instead of OBJ_TMPFS to test whether vm object collapse and similar actions should be disabled. Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2014-07-14 09:30:37 +00:00
Bryan Drewery	44f1c91610	Rename global cnt to vm_cnt to avoid shadowing. To reduce the diff struct pcu.cnt field was not renamed, so PCPU_OP(cnt.field) is still used. pc_cnt and pcpu are also used in kvm(3) and vmstat(8). The goal was to not affect externally used KPI. Bump __FreeBSD_version_ in case some out-of-tree module/code relies on the the global cnt variable. Exp-run revealed no ports using it directly. No objection from: arch@ Sponsored by: EMC / Isilon Storage Division	2014-03-22 10:26:09 +00:00
Konstantin Belousov	2309fa9b92	Do not vdrop() the tmpfs vnode until it is unlocked. The hold reference might be the last, and then vdrop() would free the vnode. Reported and tested by: bdrewery MFC after: 1 week	2014-03-12 15:13:57 +00:00
Attilio Rao	14a5dc1780	Fix-up r254141: in the process of making a failing vm_page_rename() a call of pager_swap_freespace() was moved around, now leading to freeing the incorrect page because of the pindex changes after vm_page_rename(). Get back to use the correct pindex when destroying the swap space. Sponsored by: EMC / Isilon storage division Reported by: avg Tested by: pho MFC after: 7 days	2014-02-14 03:34:12 +00:00
Gleb Smirnoff	5f3563b0a5	Fix function name in KASSERT(). Submitted by: hiren	2014-02-12 20:11:20 +00:00
Konstantin Belousov	9ded9474d3	Do not coalesce if the swap object belongs to tmpfs vnode. The coalesce would extend the object to keep pages for the anonymous mapping created by the process. The pages has no relations to the tmpfs file content which could be written into the corresponding range, causing anonymous mapping and file content aliasing and subsequent corruption. Another lesser problem created by coalescing is over-accounting on the tmpfs node destruction, since the object size is substracted from the total count of the pages owned by the tmpfs mount. Reported and tested by: bdrewery Sponsored by: The FreeBSD Foundation MFC after: 1 week	2013-11-05 06:18:50 +00:00
Konstantin Belousov	3aaea6efd5	Drain for the xbusy state for two places which potentially do pmap_remove_all(). Not doing the drain allows the pmap_enter() to proceed in parallel, making the pmap_remove_all() effects void. The race results in an invalidated page mapped wired by usermode. Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation Approved by: re (glebius)	2013-09-08 17:51:22 +00:00
Konstantin Belousov	5944de8ecd	Remove the deprecated VM_ALLOC_RETRY flag for the vm_page_grab(9). The flag was mandatory since r209792, where vm_page_grab(9) was changed to only support the alloc retry semantic. Suggested and reviewed by: alc Sponsored by: The FreeBSD Foundation	2013-08-22 07:39:53 +00:00
Attilio Rao	e946b94934	On all the architectures, avoid to preallocate the physical memory for nodes used in vm_radix. On architectures supporting direct mapping, also avoid to pre-allocate the KVA for such nodes. In order to do so make the operations derived from vm_radix_insert() to fail and handle all the deriving failure of those. vm_radix-wise introduce a new function called vm_radix_replace(), which can replace a leaf node, already present, with a new one, and take into account the possibility, during vm_radix_insert() allocation, that the operations on the radix trie can recurse. This means that if operations in vm_radix_insert() recursed vm_radix_insert() will start from scratch again. Sponsored by: EMC / Isilon storage division Reviewed by: alc (older version) Reviewed by: jeff Tested by: pho, scottl	2013-08-09 11:28:55 +00:00
Attilio Rao	c7aebda8a1	The soft and hard busy mechanism rely on the vm object lock to work. Unify the 2 concept into a real, minimal, sxlock where the shared acquisition represent the soft busy and the exclusive acquisition represent the hard busy. The old VPO_WANTED mechanism becames the hard-path for this new lock and it becomes per-page rather than per-object. The vm_object lock becames an interlock for this functionality: it can be held in both read or write mode. However, if the vm_object lock is held in read mode while acquiring or releasing the busy state, the thread owner cannot make any assumption on the busy state unless it is also busying it. Also: - Add a new flag to directly shared busy pages while vm_page_alloc and vm_page_grab are being executed. This will be very helpful once these functions happen under a read object lock. - Move the swapping sleep into its own per-object flag The KPI is heavilly changed this is why the version is bumped. It is very likely that some VM ports users will need to change their own code. Sponsored by: EMC / Isilon storage division Discussed with: alc Reviewed by: jeff, kib Tested by: gavin, bapt (older version) Tested by: pho, scottl	2013-08-09 11:11:11 +00:00
Jeff Roberson	5df87b21d3	Replace kernel virtual address space allocation with vmem. This provides transparent layering and better fragmentation. - Normalize functions that allocate memory to use kmem_* - Those that allocate address space are named kva_* - Those that operate on maps are named kmap_* - Implement recursive allocation handling for kmem_arena in vmem. Reviewed by: alc Tested by: pho Sponsored by: EMC / Isilon Storage Division	2013-08-07 06:21:20 +00:00
Konstantin Belousov	ebf5d94e82	Never remove user-wired pages from an object when doing msync(MS_INVALIDATE). The vm_fault_copy_entry() requires that object range which corresponds to the user-wired vm_map_entry, is always fully populated. Add OBJPR_NOTWIRED flag for vm_object_page_remove() to request the preserving behaviour, use it when calling vm_object_page_remove() from vm_object_sync(). Reported and tested by: pho Reviewed by: alc Sponsored by: The FreeBSD Foundation MFC after: 2 weeks	2013-07-11 05:47:26 +00:00
Jeff Roberson	5f51836645	- Add a general purpose resource allocator, vmem, from NetBSD. It was originally inspired by the Solaris vmem detailed in the proceedings of usenix 2001. The NetBSD version was heavily refactored for bugs and simplicity. - Use this resource allocator to allocate the buffer and transient maps. Buffer cache defrags are reduced by 25% when used by filesystems with mixed block sizes. Ultimately this may permit dynamic buffer cache sizing on low KVA machines. Discussed with: alc, kib, attilio Tested by: pho Sponsored by: EMC / Isilon Storage Division	2013-06-28 03:51:20 +00:00
Alan Cox	2051980f97	Revise the interface between vm_object_madvise() and vm_page_dontneed() so that pointless calls to pmap_is_modified() can be easily avoided when performing madvise(..., MADV_FREE). Sponsored by: EMC / Isilon Storage Division	2013-06-10 01:48:21 +00:00
Attilio Rao	dfd55c0c7b	In vm_object_split(), busy and consequently unbusy the pages only when swap_pager_copy() is invoked, otherwise there is no reason to do so. This will eliminate the necessity to busy pages most of the times. Sponsored by: EMC / Isilon storage division Reviewed by: alc	2013-06-04 22:47:01 +00:00
Konstantin Belousov	7560005c41	After the object lock was dropped, the object' reference count could change. Retest the ref_count and return from the function to not execute the further code which assumes that ref_count == 1 if it is not. Also, do not leak vnode lock if other thread cleared OBJ_TMPFS flag meantime. Reported by: bdrewery Tested by: bdrewery, pho Sponsored by: The FreeBSD Foundation	2013-05-30 20:00:19 +00:00
Konstantin Belousov	782d4a636b	Remove the capitalization in the assertion message. Print the address of the object to get useful information from optimizated kernels dump.	2013-05-30 19:53:31 +00:00
Konstantin Belousov	6f2af3fcf3	Rework the handling of the tmpfs node backing swap object and tmpfs vnode v_object to avoid double-buffering. Use the same object both as the backing store for tmpfs node and as the v_object. Besides reducing memory use up to 2x times for situation of mapping files from tmpfs, it also makes tmpfs read and write operations copy twice bytes less. VM subsystem was already slightly adapted to tolerate OBJT_SWAP object as v_object. Now the vm_object_deallocate() is modified to not reinstantiate OBJ_ONEMAPPING flag and help the VFS to correctly handle VV_TEXT flag on the last dereference of the tmpfs backing object. Reviewed by: alc Tested by: pho, bf MFC after: 1 month	2013-04-28 19:38:59 +00:00
Konstantin Belousov	e5f299ff76	Make vm_object_page_clean() and vm_mmap_vnode() tolerate the vnode' v_object of non OBJT_VNODE type. For vm_object_page_clean(), simply do not assert that object type must be OBJT_VNODE, and add a comment explaining how the check for OBJ_MIGHTBEDIRTY prevents the rest of function from operating on such objects. For vm_mmap_vnode(), if the object type is not OBJT_VNODE, require it to be for swap pager (or default), handle the bypass filesystems, and correctly acquire the object reference in this case. Reviewed by: alc Tested by: pho, bf MFC after: 1 week	2013-04-28 19:25:09 +00:00
Attilio Rao	774d251d99	Sync back vmcontention branch into HEAD: Replace the per-object resident and cached pages splay tree with a path-compressed multi-digit radix trie. Along with this, switch also the x86-specific handling of idle page tables to using the radix trie. This change is supposed to do the following: - Allowing the acquisition of read locking for lookup operations of the resident/cached pages collections as the per-vm_page_t splay iterators are now removed. - Increase the scalability of the operations on the page collections. The radix trie does rely on the consumers locking to ensure atomicity of its operations. In order to avoid deadlocks the bisection nodes are pre-allocated in the UMA zone. This can be done safely because the algorithm needs at maximum one new node per insert which means the maximum number of the desired nodes is the number of available physical frames themselves. However, not all the times a new bisection node is really needed. The radix trie implements path-compression because UFS indirect blocks can lead to several objects with a very sparse trie, increasing the number of levels to usually scan. It also helps in the nodes pre-fetching by introducing the single node per-insert property. This code is not generalized (yet) because of the possible loss of performance by having much of the sizes in play configurable. However, efforts to make this code more general and then reusable in further different consumers might be really done. The only KPI change is the removal of the function vm_page_splay() which is now reaped. The only KBI change, instead, is the removal of the left/right iterators from struct vm_page, which are now reaped. Further technical notes broken into mealpieces can be retrieved from the svn branch: http://svn.freebsd.org/base/user/attilio/vmcontention/ Sponsored by: EMC / Isilon storage division In collaboration with: alc, jeff Tested by: flo, pho, jhb, davide Tested by: ian (arm) Tested by: andreast (powerpc)	2013-03-18 00:25:02 +00:00
Attilio Rao	89f6b8632c	Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho	2013-03-09 02:32:23 +00:00
Attilio Rao	c934116100	Merge from vmc-playground: Introduce a new KPI that verifies if the page cache is empty for a specified vm_object. This KPI does not make assumptions about the locking in order to be used also for building assertions at init and destroy time. It is mostly used to hide implementation details of the page cache. Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: alc (vm_radix based version) Tested by: flo, pho, jhb, davide	2013-03-09 02:05:29 +00:00
Attilio Rao	198da1b2fa	Merge from vmcontention: As vm objects are type-stable there is no need to initialize the resident splay tree pointer and the cache splay tree pointer in _vm_object_allocate() but this could be done in the init UMA zone handler. The destructor UMA zone handler, will further check if the condition is retained at every destruction and catch for bugs. Sponsored by: EMC / Isilon storage division Submitted by: alc	2013-03-04 13:10:59 +00:00
Alan Cox	55f33f2caf	The value held by the vm object's field pg_color is only considered valid if the flag OBJ_COLORED is set. Since _vm_object_allocate() doesn't set this flag, it needn't initialize pg_color. Sponsored by: EMC / Isilon Storage Division	2013-03-02 18:07:29 +00:00
Attilio Rao	a4915c21d9	Merge from vmc-playground branch: Replace the sub-optimal uma_zone_set_obj() primitive with more modern uma_zone_reserve_kva(). The new primitive reserves before hand the necessary KVA space to cater the zone allocations and allocates pages with ALLOC_NOOBJ. More specifically: - uma_zone_reserve_kva() does not need an object to cater the backend allocator. - uma_zone_reserve_kva() can cater M_WAITOK requests, in order to serve zones which need to do uma_prealloc() too. - When possible, uma_zone_reserve_kva() uses directly the direct-mapping by uma_small_alloc() rather than relying on the KVA / offset combination. The removal of the object attribute allows 2 further changes: 1) _vm_object_allocate() becomes static within vm_object.c 2) VM_OBJECT_LOCK_INIT() is removed. This function is replaced by direct calls to mtx_init() as there is no need to export it anymore and the calls aren't either homogeneous anymore: there are now small differences between arguments passed to mtx_init(). Sponsored by: EMC / Isilon storage division Reviewed by: alc (which also offered almost all the comments) Tested by: pho, jhb, davide	2013-02-26 23:35:27 +00:00
Attilio Rao	64a3476f0c	Remove white spaces. Sponsored by: EMC / Isilon storage division	2013-02-26 20:35:40 +00:00
Attilio Rao	0dde287b20	Wrap the sleeps synchronized by the vm_object lock into the specific macro VM_OBJECT_SLEEP(). This hides some implementation details like the usage of the msleep() primitive and the necessity to access to the lock address directly. For this reason VM_OBJECT_MTX() macro is now retired. Sponsored by: EMC / Isilon storage division Reviewed by: alc Tested by: pho	2013-02-26 17:22:08 +00:00
Alan Cox	2863482058	In the past four years, we've added two new vm object types. Each time, similar changes had to be made in various places throughout the machine- independent virtual memory layer to support the new vm object type. However, in most of these places, it's actually not the type of the vm object that matters to us but instead certain attributes of its pages. For example, OBJT_DEVICE, OBJT_MGTDEVICE, and OBJT_SG objects contain fictitious pages. In other words, in most of these places, we were testing the vm object's type to determine if it contained fictitious (or unmanaged) pages. To both simplify the code in these places and make the addition of future vm object types easier, this change introduces two new vm object flags that describe attributes of the vm object's pages, specifically, whether they are fictitious or unmanaged. Reviewed and tested by: kib	2012-12-09 00:32:38 +00:00
Alan Cox	96b0b92ac1	Add support for the (relatively) new object type OBJT_MGTDEVICE to vm_object_set_memattr(). Also, add a "safety belt" so that vm_object_set_memattr() doesn't silently modify undefined object types. Reviewed by: kib MFC after: 10 days	2012-11-28 18:29:34 +00:00
Konstantin Belousov	5050aa86cf	Remove the support for using non-mpsafe filesystem modules. In particular, do not lock Giant conditionally when calling into the filesystem module, remove the VFS_LOCK_GIANT() and related macros. Stop handling buffers belonging to non-mpsafe filesystems. The VFS_VERSION is bumped to indicate the interface change which does not result in the interface signatures changes. Conducted and reviewed by: attilio Tested by: pho	2012-10-22 17:50:54 +00:00
Alan Cox	9af47af64a	Eliminate the conditional for releasing the page queues lock in vm_page_sleep(). vm_page_sleep() is no longer called with this lock held. Eliminate assertions that the page queues lock is NOT held. These assertions won't translate well to having distinct locks on the active and inactive page queues, and they really aren't that useful. MFC after: 3 weeks	2012-10-13 18:46:46 +00:00
Konstantin Belousov	877d24ac8a	Fix the mis-handling of the VV_TEXT on the nullfs vnodes. If you have a binary on a filesystem which is also mounted over by nullfs, you could execute the binary from the lower filesystem, or from the nullfs mount. When executed from lower filesystem, the lower vnode gets VV_TEXT flag set, and the file cannot be modified while the binary is active. But, if executed as the nullfs alias, only the nullfs vnode gets VV_TEXT set, and you still can open the lower vnode for write. Add a set of VOPs for the VV_TEXT query, set and clear operations, which are correctly bypassed to lower vnode. Tested by: pho (previous version) MFC after: 2 weeks	2012-09-28 11:25:02 +00:00
Konstantin Belousov	5f9c767b19	Plug the accounting leak for the wired pages when msync(MS_INVALIDATE) is performed on the vnode mapping which is wired in other address space. While there, explicitely assert that the page is unwired and zero the wire_count instead of substract. The condition is rechecked later in vm_page_free(_toq) already. Reported and tested by: zont Reviewed by: alc (previous version) MFC after: 1 week	2012-09-20 09:52:57 +00:00
Attilio Rao	571a1e92aa	Document the object type movements, related to swp_pager_copy(), in vm_object_collapse() and vm_object_split(). In collabouration with: alc MFC after: 3 days	2012-07-11 01:04:59 +00:00
John Baldwin	92a5994685	Fix madvise(MADV_WILLNEED) to properly handle individual mappings larger than 4GB. Specifically, the inlined version of 'ptoa' of the the 'int' count of pages overflowed on 64-bit platforms. While here, change vm_object_madvise() to accept two vm_pindex_t parameters (start and end) rather than a (start, count) tuple to match other VM APIs as suggested by alc@.	2012-03-19 18:47:34 +00:00
Konstantin Belousov	126d60823a	In vm_object_page_clean(), do not clean OBJ_MIGHTBEDIRTY object flag if the filesystem performed short write and we are skipping the page due to this. Propogate write error from the pager back to the callers of vm_pageout_flush(). Report the failure to write a page from the requested range as the FALSE return value from vm_object_page_clean(), and propagate it back to msync(2) to return EIO to usermode. While there, convert the clearobjflags variable in the vm_object_page_clean() and arguments of the helper functions to boolean. PR: kern/165927 Reviewed by: alc MFC after: 2 weeks	2012-03-17 23:00:32 +00:00
Konstantin Belousov	e65919f9fc	Do not restart the scan in vm_object_page_clean() on the object generation change if requested mode is async. The object generation is only changed when the object is marked as OBJ_MIGHTBEDIRTY. For async mode it is enough to write each dirty page, not to make a guarantee that all pages are cleared after the vm_object_page_clean() returned. Diagnosed by: truckman Tested by: flo Reviewed by: alc, truckman MFC after: 2 weeks	2012-01-04 16:04:20 +00:00
Alan Cox	b5f359b7c3	Optimize vm_object_split()'s handling of reservations.	2011-12-28 20:27:18 +00:00
Konstantin Belousov	75ff604a78	Optimize the common case of msyncing the whole file mapping with MS_SYNC flag. The system must guarantee that all writes are finished before syscalls returned. Schedule the writes in async mode, which is much faster and allows the clustering to occur. Wait for writes using VOP_FSYNC(), since we are syncing the whole file mapping. Potentially, the restriction to only apply the optimization can be relaxed by not requiring that the mapping cover whole file, as it is done by other OSes. Reported and tested by: az Reviewed by: alc MFC after: 2 weeks	2011-12-23 09:09:42 +00:00
Ed Schouten	6472ac3d8a	Mark all SYSCTL_NODEs static that have no corresponding SYSCTL_DECLs. The SYSCTL_NODE macro defines a list that stores all child-elements of that node. If there's no SYSCTL_DECL macro anywhere else, there's no reason why it shouldn't be static.	2011-11-07 15:43:11 +00:00
John Baldwin	936c09ac0f	Add the posix_fadvise(2) system call. It is somewhat similar to madvise(2) except that it operates on a file descriptor instead of a memory region. It is currently only supported on regular files. Just as with madvise(2), the advice given to posix_fadvise(2) can be divided into two types. The first type provide hints about data access patterns and are used in the file read and write routines to modify the I/O flags passed down to VOP_READ() and VOP_WRITE(). These modes are thus filesystem independent. Note that to ease implementation (and since this API is only advisory anyway), only a single non-normal range is allowed per file descriptor. The second type of hints are used to hint to the OS that data will or will not be used. These hints are implemented via a new VOP_ADVISE(). A default implementation is provided which does nothing for the WILLNEED request and attempts to move any clean pages to the cache page queue for the DONTNEED request. This latter case required two other changes. First, a new V_CLEANONLY flag was added to vinvalbuf(). This requests vinvalbuf() to only flush clean buffers for the vnode from the buffer cache and to not remove any backing pages from the vnode. This is used to ensure clean pages are not wired into the buffer cache before attempting to move them to the cache page queue. The second change adds a new vm_object_page_cache() method. This method is somewhat similar to vm_object_page_remove() except that instead of freeing each page in the specified range, it attempts to move clean pages to the cache queue if possible. To preserve the ABI of struct file, the f_cdevpriv pointer is now reused in a union to point to the currently active advice region if one is present for regular files. Reviewed by: jilles, kib, arch@ Approved by: re (kib) MFC after: 1 month	2011-11-04 04:02:50 +00:00
Konstantin Belousov	3407fefef6	Split the vm_page flags PG_WRITEABLE and PG_REFERENCED into atomic flags field. Updates to the atomic flags are performed using the atomic ops on the containing word, do not require any vm lock to be held, and are non-blocking. The vm_page_aflag_set(9) and vm_page_aflag_clear(9) functions are provided to modify afalgs. Document the changes to flags field to only require the page lock. Introduce vm_page_reference(9) function to provide a stable KPI and KBI for filesystems like tmpfs and zfs which need to mark a page as referenced. Reviewed by: alc, attilio Tested by: marius, flo (sparc64); andreast (powerpc, powerpc64) Approved by: re (bz)	2011-09-06 10:30:11 +00:00
Konstantin Belousov	d98d0ce27a	- Move the PG_UNMANAGED flag from m->flags to m->oflags, renaming the flag to VPO_UNMANAGED (and also making the flag protected by the vm object lock, instead of vm page queue lock). - Mark the fake pages with both PG_FICTITIOUS (as it is now) and VPO_UNMANAGED. As a consequence, pmap code now can use use just VPO_UNMANAGED to decide whether the page is unmanaged. Reviewed by: alc Tested by: pho (x86, previous version), marius (sparc64), marcel (arm, ia64, powerpc), ray (mips) Sponsored by: The FreeBSD Foundation Approved by: re (bz)	2011-08-09 21:01:36 +00:00
Alan Cox	6bbee8e28a	Add a new option, OBJPR_NOTMAPPED, to vm_object_page_remove(). Passing this option to vm_object_page_remove() asserts that the specified range of pages is not mapped, or more precisely that none of these pages have any managed mappings. Thus, vm_object_page_remove() need not call pmap_remove_all() on the pages. This change not only saves time by eliminating pointless calls to pmap_remove_all(), but it also eliminates an inconsistency in the use of pmap_remove_all() versus related functions, like pmap_remove_write(). It eliminates harmless but pointless calls to pmap_remove_all() that were being performed on PG_UNMANAGED pages. Update all of the existing assertions on pmap_remove_all() to reflect this change. Reviewed by: kib	2011-06-29 16:40:41 +00:00
Konstantin Belousov	031ec8c10a	In the VOP_PUTPAGES() implementations, change the default error from VM_PAGER_AGAIN to VM_PAGER_ERROR for the uwritten pages. Return VM_PAGER_AGAIN for the partially written page. Always forward at least one page in the loop of vm_object_page_clean(). VM_PAGER_ERROR causes the page reactivation and does not clear the page dirty state, so the write is not lost. The change fixes an infinite loop in vm_object_page_clean() when the filesystem returns permanent errors for some page writes. Reported and tested by: gavin Reviewed by: alc, rmacklem MFC after: 1 week	2011-06-01 21:00:28 +00:00
Max Laier	e18cc7bf3e	Another long standing vm bug found at Isilon: Fix a race between vm_object_collapse and vm_fault. Reviewed by: alc@ MFC after: 3 days	2011-05-09 20:27:49 +00:00
Konstantin Belousov	86769ac0a4	Fix two bugs in r218670. Hold the vnode around the region where object lock is dropped, until vnode lock is acquired. Do not drop the vnode reference for a case when the object was deallocated during unlock. Note that in this case, VV_TEXT is cleared by vnode_pager_dealloc(). Reported and tested by: pho Reviewed by: alc MFC after: 3 days	2011-04-23 21:38:21 +00:00
Konstantin Belousov	03fa5b34a0	Lock the vnode around clearing of VV_TEXT flag. Remove mp_fixme() note mentioning that vnode lock is needed. Reviewed by: alc Tested by: pho MFC after: 1 week	2011-02-13 21:52:26 +00:00
Alan Cox	17f3095d1a	Unless "cnt" exceeds MAX_COMMIT_COUNT, nfsrv_commit() and nfsvno_fsync() are incorrectly calling vm_object_page_clean(). They are passing the length of the range rather than the ending offset of the range. Perform the OFF_TO_IDX() conversion in vm_object_page_clean() rather than the callers. Reviewed by: kib MFC after: 3 weeks	2011-02-05 21:21:27 +00:00
Alan Cox	0cc74f144e	Since the last parameter to vm_object_shadow() is a vm_size_t and not a vm_pindex_t, it makes no sense for its callers to perform atop(). Let vm_object_shadow() do that instead.	2011-02-04 21:49:24 +00:00
Konstantin Belousov	c6c9025b65	For consistency, use kernel_object instead of &kernel_object_store when initializing the object mutex. Do the same for kmem_object. Discussed with: alc MFC after: 1 week	2011-01-15 21:56:38 +00:00
Alan Cox	edf93b25d3	Make a couple refinements to r216799 and r216810. In particular, revise a comment and move it to its proper place. Reviewed by: kib	2011-01-01 17:39:38 +00:00
Konstantin Belousov	50cfe7fa50	Remove OBJ_CLEANING flag. The vfs_setdirty_locked_object() is the only consumer of the flag, and it used the flag because OBJ_MIGHTBEDIRTY was cleared early in vm_object_page_clean, before the cleaning pass was done. This is no longer true after r216799. Moreover, since OBJ_CLEANING is a flag, and not the counter, it could be reset too prematurely when parallel vm_object_page_clean() are performed. Reviewed by: alc (as a part of the bigger patch) MFC after: 1 month (after r216799 is merged)	2010-12-29 22:26:49 +00:00
Konstantin Belousov	3280870dca	Move the increment of vm object generation count into vm_object_set_writeable_dirty(). Fix an issue where restart of the scan in vm_object_page_clean() did not removed write permissions for newly added pages or, if the mapping for some already scanned page changed to writeable due to fault. Merge the two loops in vm_object_page_clean(), doing the remove of write permission and cleaning in the same loop. The restart of the loop then correctly downgrade writeable mappings. Fix an issue where a second caller to msync() might actually return before the first caller had actually completed flushing the pages. Clear the OBJ_MIGHTBEDIRTY flag after the cleaning loop, not before. Calls to pmap_is_modified() are not needed after pmap_remove_write() there. Proposed, reviewed and tested by: alc MFC after: 1 week	2010-12-29 12:53:53 +00:00
Edward Tomasz Napierala	ef694c1ac4	Replace pointer to "struct uidinfo" with pointer to "struct ucred" in "struct vm_object". This is required to make it possible to account for per-jail swap usage. Reviewed by: kib@ Tested by: pho@ Sponsored by: FreeBSD Foundation	2010-12-02 17:37:16 +00:00
Konstantin Belousov	780636b72a	After the sleep caused by encountering a busy page, relookup the page. Submitted and reviewed by: alc Reprted and tested by: pho MFC after: 5 days	2010-11-24 12:25:17 +00:00
Konstantin Belousov	3157c50313	Eliminate the mab, maf arrays and related variables. The change also fixes off-by-one error in the calculation of mreq. Suggested and reviewed by: alc Tested by: pho MFC after: 5 days	2010-11-21 10:18:28 +00:00
Alan Cox	17ea6f00d5	Optimize vm_object_terminate(). Reviewed by: kib MFC after: 1 week	2010-11-20 22:30:09 +00:00
Konstantin Belousov	4c7b9a2063	The runlen returned from vm_pageout_flush() might be zero legitimately, when mreq page has status VM_PAGER_AGAIN. MFC after: 5 days	2010-11-20 17:27:38 +00:00
Konstantin Belousov	1e8a675c73	vm_pageout_flush() might cache the pages that finished write to the backing storage. Such pages might be then reused, racing with the assert in vm_object_page_collect_flush() that verified that dirty pages from the run (most likely, pages with VM_PAGER_AGAIN status) are write-protected still. In fact, the page indexes for the pages that were removed from the object page list should be ignored by vm_object_page_clean(). Return the length of successfully written run from vm_pageout_flush(), that is, the count of pages between requested page and first page after requested with status VM_PAGER_AGAIN. Supply the requested page index in the array to vm_pageout_flush(). Use the returned run length to forward the index of next page to clean in vm_object_page_clean(). Reported by: avg Reviewed by: alc MFC after: 1 week	2010-11-18 21:09:02 +00:00
Konstantin Belousov	4166faaee0	Only increment object generation count when inserting the page into object page list. The only use of object generation count now is a restart of the scan in vm_object_page_clean(), which makes sense to do on the page addition. Page removals do not affect the dirtiness of the object, as well as manipulations with the shadow chain. Suggested and reviewed by: alc MFC after: 1 week	2010-11-18 20:46:28 +00:00
Konstantin Belousov	757216f3a5	Several cleanups for the r209686: - remove unused defines; - remove unused curgeneration argument for vm_object_page_collect_flush(); - always assert that vm_object_page_clean() is called for OBJT_VNODE; - move vm_page_find_least() into for() statement initial clause. Submitted by: alc	2010-07-04 19:02:32 +00:00
Konstantin Belousov	e239bb9730	Reimplement vm_object_page_clean(), using the fact that vm object memq is ordered by page index. This greatly simplifies the implementation, since we no longer need to mark the pages with VPO_CLEANCHK to denote the progress. It is enough to remember the current position by index before dropping the object lock. Remove VPO_CLEANCHK and VM_PAGER_IGNORE_CLEANCHK as unused. Garbage-collect vm.msync_flush_flags sysctl. Suggested and reviewed by: alc Tested by: pho	2010-07-04 11:26:56 +00:00
Konstantin Belousov	b382c10a57	Introduce a helper function vm_page_find_least(). Use it in several places, which inline the function. Reviewed by: alc Tested by: pho MFC after: 1 week	2010-07-04 11:13:33 +00:00
Alan Cox	567e51e18c	Roughly half of a typical pmap_mincore() implementation is machine- independent code. Move this code into mincore(), and eliminate the page queues lock from pmap_mincore(). Push down the page queues lock into pmap_clear_modify(), pmap_clear_reference(), and pmap_is_modified(). Assert that these functions are never passed an unmanaged page. Eliminate an inaccurate comment from powerpc/powerpc/mmu_if.m: Contrary to what the comment says, pmap_mincore() is not simply an optimization. Without a complete pmap_mincore() implementation, mincore() cannot return either MINCORE_MODIFIED or MINCORE_REFERENCED because only the pmap can provide this information. Eliminate the page queues lock from vfs_setdirty_locked_object(), vm_pageout_clean(), vm_object_page_collect_flush(), and vm_object_page_clean(). Generally speaking, these are all accesses to the page's dirty field, which are synchronized by the containing vm object's lock. Reduce the scope of the page queues lock in vm_object_madvise() and vm_page_dontneed(). Reviewed by: kib (an earlier version)	2010-05-24 14:26:57 +00:00
Alan Cox	a1a95cd608	Add a comment about the proper use of vm_object_page_remove(). MFC after: 1 week	2010-05-16 16:54:05 +00:00
Alan Cox	3c4a24406b	Push down the page queues into vm_page_cache(), vm_page_try_to_cache(), and vm_page_try_to_free(). Consequently, push down the page queues lock into pmap_enter_quick(), pmap_page_wired_mapped(), pmap_remove_all(), and pmap_remove_write(). Push down the page queues lock into Xen's pmap_page_is_mapped(). (I overlooked the Xen pmap in r207702.) Switch to a per-processor counter for the total number of pages cached.	2010-05-08 20:34:01 +00:00
Alan Cox	7072188017	Eliminate acquisitions of the page queues lock that are no longer needed. Switch to a per-processor counter for the number of pages freed during process termination.	2010-05-07 05:23:15 +00:00
Alan Cox	eb00b276ab	Eliminate page queues locking around most calls to vm_page_free().	2010-05-06 18:58:32 +00:00
Alan Cox	5ac59343be	Acquire the page lock around all remaining calls to vm_page_free() on managed pages that didn't already have that lock held. (Freeing an unmanaged page, such as the various pmaps use, doesn't require the page lock.) This allows a change in vm_page_remove()'s locking requirements. It now expects the page lock to be held instead of the page queues lock. Consequently, the page queues lock is no longer required at all by callers to vm_page_rename(). Discussed with: kib	2010-05-05 18:16:06 +00:00
Alan Cox	ac800a8490	Correct an error in r207410: Remove an unlock of a lock that is no longer held.	2010-05-02 18:09:33 +00:00
Kip Macy	7bec141b12	push up dropping of the page queue lock to avoid holding it in vm_pageout_flush	2010-04-30 22:31:37 +00:00
Kip Macy	ad0c05daf9	don't call vm_pageout_flush with the page queue mutex held Reported by: Michael Butler	2010-04-30 21:21:21 +00:00
Kip Macy	2965a45315	On Alan's advice, rather than do a wholesale conversion on a single architecture from page queue lock to a hashed array of page locks (based on a patch by Jeff Roberson), I've implemented page lock support in the MI code and have only moved vm_page's hold_count out from under page queue mutex to page lock. This changes pmap_extract_and_hold on all pmaps. Supported by: Bitgravity Inc. Discussed with: alc, jeffr, and kib	2010-04-30 00:46:43 +00:00
Alan Cox	6a2a3d7338	Change vm_object_madvise() so that it checks whether the page is invalid or unmanaged before acquiring the page queues lock. Neither of these tests require that lock. Moreover, a better way of testing if the page is unmanaged is to test the type of vm object. This avoids a pointless vm_page_lookup(). MFC after: 3 weeks	2010-04-28 04:57:32 +00:00
Alan Cox	4b9dd5d537	There is no justification for vm_object_split() setting PG_REFERENCED on a page that it is going to sleep on. Eliminate it. MFC after: 3 weeks	2010-04-18 17:50:09 +00:00
Alan Cox	b11b56b55b	In vm_object_madvise() setting PG_REFERENCED on a page before sleeping on that page only makes sense if the advice is MADV_WILLNEED. In that case, the intention is to activate the page, so discouraging the page daemon from reclaiming the page makes sense. In contrast, in the other cases, MADV_DONTNEED and MADV_FREE, it makes no sense whatsoever to discourage the page daemon from reclaiming the page by setting PG_REFERENCED. Wrap a nearby line. Discussed with: kib MFC after: 3 weeks	2010-04-17 21:14:37 +00:00
Alan Cox	aefea7f519	In vm_object_backing_scan(), setting PG_REFERENCED on a page before sleeping on that page is nonsensical. Doing so reduces the likelihood that the page daemon will reclaim the page before the thread waiting in vm_object_backing_scan() is reawakened. However, it does not guarantee that the page is not reclaimed, so vm_object_backing_scan() restarts after reawakening. More importantly, this muddles the meaning of PG_REFERENCED. There is no reason to believe that the caller of vm_object_backing_scan() is going to use (i.e., access) the contents of the page. There is especially no reason to believe that an access is more likely because vm_object_backing_scan() had to sleep on the page. Discussed with: kib MFC after: 3 weeks	2010-04-17 18:35:07 +00:00
Konstantin Belousov	49e3050e6c	VI_OBJDIRTY vnode flag mirrors the state of OBJ_MIGHTBEDIRTY vm object flag. Besides providing the redundand information, need to update both vnode and object flags causes more acquisition of vnode interlock. OBJ_MIGHTBEDIRTY is only checked for vnode-backed vm objects. Remove VI_OBJDIRTY and make sure that OBJ_MIGHTBEDIRTY is set only for vnode-backed vm objects. Suggested and reviewed by: alc Tested by: pho MFC after: 3 weeks	2009-12-21 12:29:38 +00:00
John Baldwin	013818111a	Add a new type of VM object: OBJT_SG. An OBJT_SG object is very similar to a device pager (OBJT_DEVICE) object in that it uses fictitious pages to provide aliases to other memory addresses. The primary difference is that it uses an sglist(9) to determine the physical addresses for a given offset into the object instead of invoking the d_mmap() method in a device driver. Reviewed by: alc Approved by: re (kensmith) MFC after: 2 weeks	2009-07-24 13:50:29 +00:00
Alan Cox	3153e878dd	Add support to the virtual memory system for configuring machine- dependent memory attributes: Rename vm_cache_mode_t to vm_memattr_t. The new name reflects the fact that there are machine-dependent memory attributes that have nothing to do with controlling the cache's behavior. Introduce vm_object_set_memattr() for setting the default memory attributes that will be given to an object's pages. Introduce and use pmap_page_{get,set}_memattr() for getting and setting a page's machine-dependent memory attributes. Add full support for these functions on amd64 and i386 and stubs for them on the other architectures. The function pmap_page_set_memattr() is also responsible for any other machine-dependent aspects of changing a page's memory attributes, such as flushing the cache or updating the direct map. The uses include kmem_alloc_contig(), vm_page_alloc(), and the device pager: kmem_alloc_contig() can now be used to allocate kernel memory with non-default memory attributes on amd64 and i386. vm_page_alloc() and the device pager will set the memory attributes for the real or fictitious page according to the object's default memory attributes. Update the various pmap functions on amd64 and i386 that map pages to incorporate each page's memory attributes in the mapping. Notes: (1) Inherent to this design are safety features that prevent the specification of inconsistent memory attributes by different mappings on amd64 and i386. In addition, the device pager provides a warning when a device driver creates a fictitious page with memory attributes that are inconsistent with the real page that the fictitious page is an alias for. (2) Storing the machine-dependent memory attributes for amd64 and i386 as a dedicated "int" in "struct md_page" represents a compromise between space efficiency and the ease of MFCing these changes to RELENG_7. In collaboration with: jhb Approved by: re (kib)	2009-07-12 23:31:20 +00:00
Konstantin Belousov	9b4d473a6e	Eliminiate code duplication by calling vm_object_destroy() from vm_object_collapse(). Requested and reviewed by: alc Approved by: re (kensmith)	2009-06-28 08:42:17 +00:00
Alan Cox	26f4eea53f	The bits set in a page's dirty mask are a subset of the bits set in its valid mask. Consequently, there is no need to perform a bit-wise and of the page's dirty and valid masks in order to determine which parts of a page are dirty and valid. Eliminate an unnecessary #include.	2009-06-24 04:45:03 +00:00
Konstantin Belousov	3364c323e6	Implement global and per-uid accounting of the anonymous memory. Add rlimit RLIMIT_SWAP that limits the amount of swap that may be reserved for the uid. The accounting information (charge) is associated with either map entry, or vm object backing the entry, assuming the object is the first one in the shadow chain and entry does not require COW. Charge is moved from entry to object on allocation of the object, e.g. during the mmap, assuming the object is allocated, or on the first page fault on the entry. It moves back to the entry on forks due to COW setup. The per-entry granularity of accounting makes the charge process fair for processes that change uid during lifetime, and decrements charge for proper uid when region is unmapped. The interface of vm_pager_allocate(9) is extended by adding struct ucred *, that is used to charge appropriate uid when allocation if performed by kernel, e.g. md(4). Several syscalls, among them is fork(2), may now return ENOMEM when global or per-uid limits are enforced. In collaboration with: pho Reviewed by: alc Approved by: re (kensmith)	2009-06-23 20:45:22 +00:00
Alan Cox	387aabc513	Long, long ago in r27464 special case code for mapping device-backed memory with 4MB pages was added to pmap_object_init_pt(). This code assumes that the pages of a OBJT_DEVICE object are always physically contiguous. Unfortunately, this is not always the case. For example, jhb@ informs me that the recently introduced /dev/ksyms driver creates a OBJT_DEVICE object that violates this assumption. Thus, this revision modifies pmap_object_init_pt() to abort the mapping if the OBJT_DEVICE object's pages are not physically contiguous. This revision also changes some inconsistent if not buggy behavior. For example, the i386 version aborts if the first 4MB virtual page that would be mapped is already valid. However, it incorrectly replaces any subsequent 4MB virtual page mappings that it encounters, potentially leaking a page table page. The amd64 version has a bug of my own creation. It potentially busies the wrong page and always an insufficent number of pages if it blocks allocating a page table page. To my knowledge, there have been no reports of these bugs, hence, their persistance. I suspect that the existing restrictions that pmap_object_init_pt() placed on the OBJT_DEVICE objects that it would choose to map, for example, that the first page must be aligned on a 2 or 4MB physical boundary and that the size of the mapping must be a multiple of the large page size, were enough to avoid triggering the bug for drivers like ksyms. However, one side effect of testing the OBJT_DEVICE object's pages for physical contiguity is that a dubious difference between pmap_object_init_pt() and the standard path for mapping devices pages, i.e., vm_fault(), has been eliminated. Previously, pmap_object_init_pt() would only instantiate the first PG_FICTITOUS page being mapped because it never examined the rest. Now, however, pmap_object_init_pt() uses the new function vm_object_populate() to instantiate them all (in order to support testing their physical contiguity). These pages need to be instantiated for the mechanism that I have prototyped for automatically maintaining the consistency of the PAT settings across multiple mappings, particularly, amd64's direct mapping, to work. (Translation: This change is also being made to support jhb@'s work on the Nvidia feature requests.) Discussed with: jhb@	2009-06-14 19:51:43 +00:00
Alan Cox	a28042d1e3	Change vm_object_page_remove() such that it clears the page's dirty bits when it invalidates the page. Suggested by: tegge	2009-05-28 07:26:36 +00:00
Konstantin Belousov	bb2ac86f7d	Do not call vm_page_lookup() from the ddb routine, namely from "show vmopag" implementation. The vm_page_lookup() code modifies splay tree of the object pages, and asserts that object lock is taken. First issue could cause kernel data corruption, and second one instantly panics the INVARIANTS-enabled kernel. Take the advantage of the fact that object->memq is ordered by page index, and iterate over memq to calculate the runs. While there, make the code slightly more style-compliant by moving variables declarations to the right place. Discussed with: jhb, alc Reviewed by: alc MFC after: 2 weeks	2009-04-23 21:09:47 +00:00
Alan Cox	bfd9b137a0	Reduce the scope of the page queues lock in vm_object_page_remove(). MFC after: 1 week	2009-02-21 20:57:25 +00:00
Alan Cox	7b54b1a9f5	Eliminate OBJ_NEEDGIANT. After r188331, OBJ_NEEDGIANT's only use is by a redundant assertion in vm_fault(). Reviewed by: kib	2009-02-08 22:17:24 +00:00
Robert Noland	e9f541267d	Fix printing of KASSERT message missed in r163604. Approved by: kib	2008-12-21 16:56:13 +00:00
Attilio Rao	0d7935fd01	Remove the struct thread unuseful argument from bufobj interface. In particular following functions KPI results modified: - bufobj_invalbuf() - bufsync() and BO_SYNC() "virtual method" of the buffer objects set. Main consumers of bufobj functions are affected by this change too and, in particular, functions which changed their KPI are: - vinvalbuf() - g_vfs_close() Due to the KPI breakage, __FreeBSD_version will be bumped in a later commit. As a side note, please consider just temporary the 'curthread' argument passing to VOP_SYNC() (in bufsync()) as it will be axed out ASAP Reviewed by: kib Tested by: Giovanni Trematerra <giovanni dot trematerra at gmail dot com>	2008-10-10 21:23:50 +00:00
Tom Rhodes	6bd9cb1c81	Fill in a few sysctl descriptions. Reviewed by: alc, Matt Dillon <dillon@apollo.backplane.com> Approved by: alc	2008-08-03 14:26:15 +00:00
John Baldwin	2c3b410b3a	One more whitespace nit.	2008-07-30 21:23:32 +00:00
John Baldwin	3cca4b6fe8	A few more whitespace fixes.	2008-07-30 21:18:08 +00:00
Stephan Uphoff	2ac78f0e1a	Allow VM object creation in ufs_lookup. (If vfs.vmiodirenable is set) Directory IO without a VM object will store data in 'malloced' buffers severely limiting caching of the data. Without this change VM objects for directories are only created on an open() of the directory. TODO: Inline test if VM object already exists to avoid locking/function call overhead. Tested by: kris@ Reviewed by: jeff@ Reported by: David Filo	2008-05-20 19:05:43 +00:00
Jeff Roberson	52481a9a9d	- Use vm_object_reference_locked() directly from vm_object_reference(). This is intended to get rid of vget() consumers who don't wish to acquire a lock. This is functionally the same as calling vref(). vm_object_reference_locked() already uses vref. Discussed with: alc	2008-03-29 07:06:13 +00:00
Alan Cox	688559667f	Correct a long-standing error in vm_object_page_remove(). Specifically, pmap_remove_all() must not be called on fictitious pages. To date, fictitious pages have been allocated from zeroed memory, effectively hiding this problem because the fictitious pages appear to have an empty pv list. Submitted by: Kostik Belousov Rewrite the comments describing vm_object_page_remove() to better describe what it does. Add an assertion. Reviewed by: Kostik Belousov MFC after: 1 week	2008-02-26 17:16:48 +00:00
Alan Cox	4c8e0452e0	Correct a long-standing error in vm_object_deallocate(). Specifically, only anonymous default (OBJT_DEFAULT) and swap (OBJT_SWAP) objects should ever have OBJ_ONEMAPPING set. However, vm_object_deallocate() was setting it on device (OBJT_DEVICE) objects. As a result, vm_object_page_remove() could be called on a device object and if that occurred pmap_remove_all() would be called on the device object's pages. However, a device object's pages are fictitious, and fictitious pages do not have an initialized pv list (struct md_page). To date, fictitious pages have been allocated from zeroed memory, effectively hiding this problem. Now, however, the conversion of rotting diagnostics to invariants in the amd64 and i386 pmaps has revealed the problem. Specifically, assertion failures have occurred during the initialization phase of the X server on some hardware. MFC after: 1 week Discussed with: Kostik Belousov Reported by: Michiel Boland	2008-02-24 18:03:56 +00:00
Attilio Rao	22db15c06f	VOP_LOCK1() (and so VOP_LOCK()) and VOP_UNLOCK() are only used in conjuction with 'thread' argument passing which is always curthread. Remove the unuseful extra-argument and pass explicitly curthread to lower layer functions, when necessary. KPI results broken by this change, which should affect several ports, so version bumping and manpage update will be further committed. Tested by: kris, pho, Diego Sardina <siarodx at gmail dot com>	2008-01-13 14:44:15 +00:00
Attilio Rao	cb05b60a89	vn_lock() is currently only used with the 'curthread' passed as argument. Remove this argument and pass curthread directly to underlying VOP_LOCK1() VFS method. This modify makes the code cleaner and in particular remove an annoying dependence helping next lockmgr() cleanup. KPI results, obviously, changed. Manpage and FreeBSD_version will be updated through further commits. As a side note, would be valuable to say that next commits will address a similar cleanup about VFS methods, in particular vop_lock1 and vop_unlock. Tested by: Diego Sardina <siarodx at gmail dot com>, Andrea Di Pasquale <whyx dot it at gmail dot com>	2008-01-10 01:10:58 +00:00
Alan Cox	f8a47341fe	Add the superpage reservation system. This is "part 2 of 2" of the machine-independent support for superpages. (The earlier part was the rewrite of the physical memory allocator.) The remainder of the code required for superpages support is machine-dependent and will be added to the various pmap implementations at a later date. Initially, I am only supporting one large page size per architecture. Moreover, I am only enabling the reservation system on amd64. (In an emergency, it can be disabled by setting VM_NRESERVLEVELS to 0 in amd64/include/vmparam.h or your kernel configuration file.)	2007-12-29 19:53:04 +00:00
Alan Cox	59677d3c0e	Prevent the leakage of wired pages in the following circumstances: First, a file is mmap(2)ed and then mlock(2)ed. Later, it is truncated. Under "normal" circumstances, i.e., when the file is not mlock(2)ed, the pages beyond the EOF are unmapped and freed. However, when the file is mlock(2)ed, the pages beyond the EOF are unmapped but not freed because they have a non-zero wire count. This can be a mistake. Specifically, it is a mistake if the sole reason why the pages are wired is because of wired, managed mappings. Previously, unmapping the pages destroys these wired, managed mappings, but does not reduce the pages' wire count. Consequently, when the file is unmapped, the pages are not unwired because the wired mapping has been destroyed. Moreover, when the vm object is finally destroyed, the pages are leaked because they are still wired. The fix is to reduce the pages' wired count by the number of wired, managed mappings destroyed. To do this, I introduce a new pmap function pmap_page_wired_mappings() that returns the number of managed mappings to the given physical page that are wired, and I use this function in vm_object_page_remove(). Reviewed by: tegge MFC after: 6 weeks	2007-11-17 22:52:29 +00:00
Alan Cox	2573269111	The previous revision, updating vm_object_page_remove() for the new page cache, did not account for the case where the vm object has nothing but cached pages. Reported by: kris, tegge Reviewed by: tegge MFC after: 3 days	2007-10-18 23:02:18 +00:00
Alan Cox	c944491426	Correct an error of omission in the reimplementation of the page cache: vm_object_page_remove() should convert any cached pages that fall with the specified range to free pages. Otherwise, there could be a problem if a file is first truncated and then regrown. Specifically, some old data from prior to the truncation might reappear. Generalize vm_page_cache_free() to support the conversion of either a subset or the entirety of an object's cached pages. Reported by: tegge Reviewed by: tegge Approved by: re (kensmith)	2007-09-27 04:21:59 +00:00
Alan Cox	f3a2ed4bd9	Correct an error in the previous revision, specifically, vm_object_madvise() should request that the reactivated, cached page not be busied. Reported by: Rink Springer Approved by: re (kensmith)	2007-09-25 21:01:10 +00:00
Alan Cox	7bfda801a8	Change the management of cached pages (PQ_CACHE) in two fundamental ways: (1) Cached pages are no longer kept in the object's resident page splay tree and memq. Instead, they are kept in a separate per-object splay tree of cached pages. However, access to this new per-object splay tree is synchronized by the _free_ page queues lock, not to be confused with the heavily contended page queues lock. Consequently, a cached page can be reclaimed by vm_page_alloc(9) without acquiring the object's lock or the page queues lock. This solves a problem independently reported by tegge@ and Isilon. Specifically, they observed the page daemon consuming a great deal of CPU time because of pages bouncing back and forth between the cache queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of this problem turned out to be a deadlock avoidance strategy employed when selecting a cached page to reclaim in vm_page_select_cache(). However, the root cause was really that reclaiming a cached page required the acquisition of an object lock while the page queues lock was already held. Thus, this change addresses the problem at its root, by eliminating the need to acquire the object's lock. Moreover, keeping cached pages in the object's primary splay tree and memq was, in effect, optimizing for the uncommon case. Cached pages are reclaimed far, far more often than they are reactivated. Instead, this change makes reclamation cheaper, especially in terms of synchronization overhead, and reactivation more expensive, because reactivated pages will have to be reentered into the object's primary splay tree and memq. (2) Cached pages are now stored alongside free pages in the physical memory allocator's buddy queues, increasing the likelihood that large allocations of contiguous physical memory (i.e., superpages) will succeed. Finally, as a result of this change long-standing restrictions on when and where a cached page can be reclaimed and returned by vm_page_alloc(9) are eliminated. Specifically, calls to vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and return a formerly cached page. Consequently, a call to malloc(9) specifying M_NOWAIT is less likely to fail. Discussed with: many over the course of the summer, including jeff@, Justin Husted @ Isilon, peter@, tegge@ Tested by: an earlier version by kris@ Approved by: re (kensmith)	2007-09-25 06:25:06 +00:00
Alan Cox	2446e4f02c	Enable the new physical memory allocator. This allocator uses a binary buddy system with a twist. First and foremost, this allocator is required to support the implementation of superpages. As a side effect, it enables a more robust implementation of contigmalloc(9). Moreover, this reimplementation of contigmalloc(9) eliminates the acquisition of Giant by contigmalloc(..., M_NOWAIT, ...). The twist is that this allocator tries to reduce the number of TLB misses incurred by accesses through a direct map to small, UMA-managed objects and page table pages. Roughly speaking, the physical pages that are allocated for such purposes are clustered together in the physical address space. The performance benefits vary. In the most extreme case, a uniprocessor kernel running on an Opteron, I measured an 18% reduction in system time during a buildworld. This allocator does not implement page coloring. The reason is that superpages have much the same effect. The contiguous physical memory allocation necessary for a superpage is inherently colored. Finally, the one caveat is that this allocator does not effectively support prezeroed pages. I hope this is temporary. On i386, this is a slight pessimization. However, on amd64, the beneficial effects of the direct-map optimization outweigh the ill effects. I speculate that this is true in general of machines with a direct map. Approved by: re	2007-06-16 04:57:06 +00:00
Attilio Rao	393a081d42	Optimize vmmeter locking. In particular: - Add an explicative table for locking of struct vmmeter members - Apply new rules for some of those members - Remove some unuseful comments Heavily reviewed by: alc, bde, jeff Approved by: jeff (mentor)	2007-06-10 21:59:14 +00:00
Attilio Rao	b4b7081961	Do proper "locking" for missing vmmeters part. Now, we assume no more sched_lock protection for some of them and use the distribuited loads method for vmmeter (distribuited through CPUs). Reviewed by: alc, bde Approved by: jeff (mentor)	2007-06-04 21:45:18 +00:00
Attilio Rao	2feb50bf7d	Revert VMCNT_* operations introduction. Probabilly, a general approach is not the better solution here, so we should solve the sched_lock protection problems separately. Requested by: alc Approved by: jeff (mentor)	2007-05-31 22:52:15 +00:00
Jeff Roberson	222d01951f	- define and use VMCNT_{GET,SET,ADD,SUB,PTR} macros for manipulating vmcnts. This can be used to abstract away pcpu details but also changes to use atomics for all counters now. This means sched lock is no longer responsible for protecting counts in the switch routines. Contributed by: Attilio Rao <attilio@FreeBSD.org>	2007-05-18 07:10:50 +00:00
Alan Cox	19c244d064	Prevent a race between vm_object_collapse() and vm_object_split() from causing a crash. Suppose that we have two objects, obj and backing_obj, where backing_obj is obj's backing object. Further, suppose that backing_obj has a reference count of two. One being the reference held by obj and the other by a map entry. Now, suppose that the map entry is deallocated and its reference removed by vm_object_deallocate(). vm_object_deallocate() recognizes that the only remaining reference is from a shadow object, obj, and calls vm_object_collapse() on obj. vm_object_collapse() executes if (backing_object->ref_count == 1) { /* * If there is exactly one reference to the backing * object, we can collapse it into the parent. */ vm_object_backing_scan(object, OBSC_COLLAPSE_WAIT); vm_object_backing_scan(OBSC_COLLAPSE_WAIT) executes if (op & OBSC_COLLAPSE_WAIT) { vm_object_set_flag(backing_object, OBJ_DEAD); } Finally, suppose that either vm_object_backing_scan() or vm_object_collapse() sleeps releasing its locks. At this instant, another thread executes vm_object_split(). It crashes in vm_object_reference_locked() on the assertion that the object is not dead. If, however, assertions are not enabled, it crashes much later, after the object has been recycled, in vm_object_deallocate() because the shadow count and shadow list are inconsistent. Reviewed by: tegge Reported by: jhb MFC after: 1 week	2007-03-27 08:55:17 +00:00
Alan Cox	c5474b8f18	Change the order of lock reacquisition in vm_object_split() in order to simplify the code slightly. Add a comment concerning lock ordering.	2007-03-22 07:02:43 +00:00
John Baldwin	8db5fc58ff	Use pause() in vm_object_deallocate() to yield the CPU to the lock holder rather than a tsleep() on &proc0. The only wakeup on &proc0 is intended to awaken the swapper, not random threads blocked in vm_object_deallocate().	2007-02-27 19:40:26 +00:00
Alan Cox	9f5c801b94	Change the way that unmanaged pages are created. Specifically, immediately flag any page that is allocated to a OBJT_PHYS object as unmanaged in vm_page_alloc() rather than waiting for a later call to vm_page_unmanage(). This allows for the elimination of some uses of the page queues lock. Change the type of the kernel and kmem objects from OBJT_DEFAULT to OBJT_PHYS. This allows us to take advantage of the above change to simplify the allocation of unmanaged pages in kmem_alloc() and kmem_malloc(). Remove vm_page_unmanage(). It is no longer used.	2007-02-25 06:14:58 +00:00
Alan Cox	0cd31a0d75	Change the page's CLEANCHK flag from being a page queue mutex synchronized flag to a vm object mutex synchronized flag.	2007-02-22 06:15:52 +00:00
Xin LI	f67af5c918	Use FOREACH_PROC_IN_SYSTEM instead of using its unrolled form.	2007-01-17 15:05:52 +00:00
Alan Cox	73000556e8	Optimize vm_object_split(). Specifically, make the number of iterations equal to the number of physical pages that are renamed to the new object rather than the new object's virtual size.	2006-12-17 20:14:43 +00:00
Alan Cox	95442adf05	Simplify the computation of the new object's size in vm_object_split().	2006-12-16 08:17:07 +00:00
Alan Cox	2a53696fb8	The page queues lock is no longer required by vm_page_busy() or vm_page_wakeup(). Reduce or eliminate its use accordingly.	2006-10-22 21:18:48 +00:00
Alan Cox	9af80719db	Replace PG_BUSY with VPO_BUSY. In other words, changes to the page's busy flag, i.e., VPO_BUSY, are now synchronized by the per-vm object lock instead of the global page queues lock.	2006-10-22 04:28:14 +00:00
Alan Cox	b276ae6f6a	Add _vm_stats and _vm_stats_misc to the sysctl declarations in sysctl.h and eliminate their declarations from various source files.	2006-08-21 06:27:28 +00:00
Alan Cox	b146f9e5d2	Reimplement the page's NOSYNC flag as an object-synchronized instead of a page queues-synchronized flag. Reduce the scope of the page queues lock in vm_fault() accordingly. Move vm_fault()'s call to vm_object_set_writeable_dirty() outside of the scope of the page queues lock. Reviewed by: tegge Additionally, eliminate an unnecessary dereference in computing the argument that is passed to vm_object_set_writeable_dirty().	2006-08-13 00:11:09 +00:00
Alan Cox	5786be7cc7	Introduce a field to struct vm_page for storing flags that are synchronized by the lock on the object containing the page. Transition PG_WANTED and PG_SWAPINPROG to use the new field, eliminating the need for holding the page queues lock when setting or clearing these flags. Rename PG_WANTED and PG_SWAPINPROG to VPO_WANTED and VPO_SWAPINPROG, respectively. Eliminate the assertion that the page queues lock is held in vm_page_io_finish(). Eliminate the acquisition and release of the page queues lock around calls to vm_page_io_finish() in kern_sendfile() and vfs_unbusy_pages().	2006-08-09 17:43:27 +00:00
Alan Cox	91449ce98c	When sleeping on a busy page, use the lock from the containing object rather than the global page queues lock.	2006-08-03 23:56:11 +00:00
Alan Cox	78985e424a	Complete the transition from pmap_page_protect() to pmap_remove_write(). Originally, I had adopted sparc64's name, pmap_clear_write(), for the function that is now pmap_remove_write(). However, this function is more like pmap_remove_all() than like pmap_clear_modify() or pmap_clear_reference(), hence, the name change. The higher-level rationale behind this change is described in src/sys/amd64/amd64/pmap.c revision 1.567. The short version is that I'm trying to clean up and fix our support for execute access. Reviewed by: marcel@ (ia64)	2006-08-01 19:06:06 +00:00
Alan Cox	604c2bbc34	Export the number of object bypasses and collapses through sysctl.	2006-07-22 22:31:57 +00:00
Alan Cox	af51d7bf57	Eliminate OBJ_WRITEABLE. It hasn't been used in a long time.	2006-07-21 06:40:29 +00:00
Alan Cox	2e9f4a698d	Ensure that vm_object_deallocate() doesn't dereference a stale object pointer: When vm_object_deallocate() sleeps because of a non-zero paging in progress count on either object or object's shadow, vm_object_deallocate() must ensure that object is still the shadow's backing object when it reawakens. In fact, object may have been deallocated while vm_object_deallocate() slept. If so, reacquiring the lock on object can lead to a deadlock. Submitted by: ups@ MFC after: 3 weeks	2006-07-17 06:45:03 +00:00
Tor Egge	3b582b4e72	Eliminate a deadlock when creating snapshots. Blocking vn_start_write() must be called without any vnode locks held. Remove calls to vn_start_write() and vn_finished_write() in vnode_pager_putpages() and add these calls before the vnode lock is obtained to most of the callers that don't already have them.	2006-03-02 22:13:28 +00:00
John Baldwin	ca95b5146a	Lock the vm_object while checking its type to see if it is a vnode-backed object that requires Giant in vm_object_deallocate(). This is somewhat hairy in that if we can't obtain Giant directly, we have to drop the object lock, then lock Giant, then relock the object lock and verify that we still need Giant. If we don't (because the object changed to OBJT_DEAD for example), then we drop Giant before continuing. Reviewed by: alc Tested by: kris	2006-02-21 22:09:54 +00:00
Jeff Roberson	c05e22d44b	- Install a temporary bandaid in vm_object_reference() that will stop mtx_assert()s from triggering until I find a real long-term solution.	2006-02-01 09:47:02 +00:00
Alan Cox	997e1c252b	Use the new macros abstracting the page coloring/queues implementation. (There are no functional changes.)	2006-01-27 07:28:51 +00:00
Jeff Roberson	df59a0fee7	- Avoid calling vm_object_backing_scan() when collapsing an object when the resident page count matches the object size. We know it fully backs its parent in this case. Reviewed by: acl, tegge Sponsored by: Isilon Systems, Inc.	2006-01-25 08:42:58 +00:00
Alan Cox	02dd83311a	Make vm_object_vndeallocate() static. The external calls to it were eliminated in ufs/ffs/ffs_vnops.c's revision 1.125.	2006-01-22 23:56:20 +00:00
Alexander Leidinger	ef39c05baa	MI changes: - provide an interface (macros) to the page coloring part of the VM system, this allows to try different coloring algorithms without the need to touch every file [1] - make the page queue tuning values readable: sysctl vm.stats.pagequeue - autotuning of the page coloring values based upon the cache size instead of options in the kernel config (disabling of the page coloring as a kernel option is still possible) MD changes: - detection of the cache size: only IA32 and AMD64 (untested) contains cache size detection code, every other arch just comes with a dummy function (this results in the use of default values like it was the case without the autotuning of the page coloring) - print some more info on Intel CPU's (like we do on AMD and Transmeta CPU's) Note to AMD owners (IA32 and AMD64): please run "sysctl vm.stats.pagequeue" and report if the cache* values are zero (= bug in the cache detection code) or not. Based upon work by: Chad David <davidc@acns.ab.ca> [1] Reviewed by: alc, arch (in 2004) Discussed with: alc, Chad David, arch (in 2004)	2005-12-31 14:39:20 +00:00
Alan Cox	8215781ba2	Eliminate unneeded preallocation at initialization. Reviewed by: tegge	2005-12-03 19:37:29 +00:00
Alan Cox	f6d8983846	Use of the ZERO_COPY_SOCKETS options can result in an unusual state that vm_object_backing_scan() was not written to handle. Specifically, a wired page within a backing object that is shadowed by a page within the shadow object. Handle this state by removing the wired page from the backing object. The wired page will be freed by socow_iodone(). Stop masking errors: If a page is being freed by vm_object_backing_scan(), assert that it is no longer mapped rather than quietly destroying any mappings. Tested by: Harald Schmalzbauer	2005-10-22 18:46:38 +00:00
Tor Egge	8dbca793a9	Don't allow pagedaemon to skip pages while scanning PQ_ACTIVE or PQ_INACTIVE due to the vm object being locked. When a process writes large amounts of data to a file, the vm object associated with that file can contain most of the physical pages on the machine. If the process is preempted while holding the lock on the vm object, pagedaemon would be able to move very few pages from PQ_INACTIVE to PQ_CACHE or from PQ_ACTIVE to PQ_INACTIVE, resulting in unlimited cleaning of dirty pages belonging to other vm objects. Temporarily unlock the page queues lock while locking vm objects to avoid lock order violation. Detect and handle relevant page queue changes. This change depends on both the lock portion of struct vm_object and normal struct vm_page being type stable. Reviewed by: alc	2005-08-10 00:17:36 +00:00

1 2 3 4 5 ...

599 Commits