freebsd-skq/sys/vm/vm_fault.c

1749 lines
50 KiB
C
Raw Normal View History

/*-
1994-05-24 10:09:53 +00:00
* Copyright (c) 1991, 1993
* The Regents of the University of California. All rights reserved.
* Copyright (c) 1994 John S. Dyson
* All rights reserved.
* Copyright (c) 1994 David Greenman
* All rights reserved.
*
1994-05-24 10:09:53 +00:00
*
* This code is derived from software contributed to Berkeley by
* The Mach Operating System project at Carnegie-Mellon University.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions
* are met:
* 1. Redistributions of source code must retain the above copyright
* notice, this list of conditions and the following disclaimer.
* 2. Redistributions in binary form must reproduce the above copyright
* notice, this list of conditions and the following disclaimer in the
* documentation and/or other materials provided with the distribution.
* 3. All advertising materials mentioning features or use of this software
* must display the following acknowledgement:
1994-05-24 10:09:53 +00:00
* This product includes software developed by the University of
* California, Berkeley and its contributors.
* 4. Neither the name of the University nor the names of its contributors
* may be used to endorse or promote products derived from this software
* without specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
* ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
* ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
* OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
* HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
* SUCH DAMAGE.
*
1994-08-02 07:55:43 +00:00
* from: @(#)vm_fault.c 8.4 (Berkeley) 1/12/94
1994-05-24 10:09:53 +00:00
*
*
* Copyright (c) 1987, 1990 Carnegie-Mellon University.
* All rights reserved.
*
* Authors: Avadis Tevanian, Jr., Michael Wayne Young
These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a seperate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
*
1994-05-24 10:09:53 +00:00
* Permission to use, copy, modify and distribute this software and
* its documentation is hereby granted, provided that both the copyright
* notice and this permission notice appear in all copies of the
* software, derivative works or modified versions, and any portions
* thereof, and that both notices appear in supporting documentation.
These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a seperate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
*
* CARNEGIE MELLON ALLOWS FREE USE OF THIS SOFTWARE IN ITS "AS IS"
* CONDITION. CARNEGIE MELLON DISCLAIMS ANY LIABILITY OF ANY KIND
1994-05-24 10:09:53 +00:00
* FOR ANY DAMAGES WHATSOEVER RESULTING FROM THE USE OF THIS SOFTWARE.
These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a seperate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
*
1994-05-24 10:09:53 +00:00
* Carnegie Mellon requests users of this software to return to
*
* Software Distribution Coordinator or Software.Distribution@CS.CMU.EDU
* School of Computer Science
* Carnegie Mellon University
* Pittsburgh PA 15213-3890
*
* any improvements or extensions that they make and grant Carnegie the
* rights to redistribute these changes.
*/
/*
* Page fault handling module.
*/
2003-06-11 23:50:51 +00:00
#include <sys/cdefs.h>
__FBSDID("$FreeBSD$");
#include "opt_ktrace.h"
#include "opt_vm.h"
1994-05-24 10:09:53 +00:00
#include <sys/param.h>
#include <sys/systm.h>
2001-05-22 00:56:25 +00:00
#include <sys/kernel.h>
#include <sys/lock.h>
Replace vm_fault()'s heuristic for automatic cache behind with a heuristic that performs the equivalent of an automatic madvise(..., MADV_DONTNEED). The current heuristic, even with the improvements that I made a few years ago, is a good example of making the wrong trade-off, or optimizing for the infrequent case. The infrequent case being reading a single file that is much larger than memory using mmap(2). And, in this case, the page daemon isn't the bottleneck; it's the I/O. In all other cases, the current heuristic has too many false positives, i.e., it caches too many pages that are later reused. To give one example, thousands of pages are cached by the current heuristic during a buildworld and all of them are reactivated before the buildworld completes. In particular, clang reads source files using mmap(2) and there are some relatively large source files in our source tree, e.g., sqlite, that are read multiple times. With the new heuristic, I see fewer false positives and they have a much lower cost. I actually tried something like this more than two years ago and it didn't perform as well as the cache behind heuristic. However, that was before the changes to the page daemon in late summer of 2013 and the existence of pmap_advise(). In particular, with the page daemon doing its work more frequently and in smaller batches, it now completes its work while the application accessing the file is blocked on I/O. Whereas previously, the page daemon appeared to hog the CPU for so long that it caused "hiccups" in the application's execution. Finally, I'll add that the elimination of cache pages is a prerequisite for NUMA support. Reviewed by: jeff, kib Sponsored by: EMC / Isilon Storage Division
2015-04-04 19:10:22 +00:00
#include <sys/mman.h>
#include <sys/proc.h>
#include <sys/racct.h>
#include <sys/resourcevar.h>
Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho
2013-03-09 02:32:23 +00:00
#include <sys/rwlock.h>
#include <sys/sysctl.h>
2001-05-22 00:56:25 +00:00
#include <sys/vmmeter.h>
#include <sys/vnode.h>
#ifdef KTRACE
#include <sys/ktrace.h>
#endif
1994-05-24 10:09:53 +00:00
#include <vm/vm.h>
#include <vm/vm_param.h>
#include <vm/pmap.h>
#include <vm/vm_map.h>
#include <vm/vm_object.h>
1994-05-24 10:09:53 +00:00
#include <vm/vm_page.h>
#include <vm/vm_pageout.h>
#include <vm/vm_kern.h>
NOTE: libkvm, w, ps, 'top', and any other utility which depends on struct proc or any VM system structure will have to be rebuilt!!! Much needed overhaul of the VM system. Included in this first round of changes: 1) Improved pager interfaces: init, alloc, dealloc, getpages, putpages, haspage, and sync operations are supported. The haspage interface now provides information about clusterability. All pager routines now take struct vm_object's instead of "pagers". 2) Improved data structures. In the previous paradigm, there is constant confusion caused by pagers being both a data structure ("allocate a pager") and a collection of routines. The idea of a pager structure has escentially been eliminated. Objects now have types, and this type is used to index the appropriate pager. In most cases, items in the pager structure were duplicated in the object data structure and thus were unnecessary. In the few cases that remained, a un_pager structure union was created in the object to contain these items. 3) Because of the cleanup of #1 & #2, a lot of unnecessary layering can now be removed. For instance, vm_object_enter(), vm_object_lookup(), vm_object_remove(), and the associated object hash list were some of the things that were removed. 4) simple_lock's removed. Discussion with several people reveals that the SMP locking primitives used in the VM system aren't likely the mechanism that we'll be adopting. Even if it were, the locking that was in the code was very inadequate and would have to be mostly re-done anyway. The locking in a uni-processor kernel was a no-op but went a long way toward making the code difficult to read and debug. 5) Places that attempted to kludge-up the fact that we don't have kernel thread support have been fixed to reflect the reality that we are really dealing with processes, not threads. The VM system didn't have complete thread support, so the comments and mis-named routines were just wrong. We now use tsleep and wakeup directly in the lock routines, for instance. 6) Where appropriate, the pagers have been improved, especially in the pager_alloc routines. Most of the pager_allocs have been rewritten and are now faster and easier to maintain. 7) The pagedaemon pageout clustering algorithm has been rewritten and now tries harder to output an even number of pages before and after the requested page. This is sort of the reverse of the ideal pagein algorithm and should provide better overall performance. 8) Unnecessary (incorrect) casts to caddr_t in calls to tsleep & wakeup have been removed. Some other unnecessary casts have also been removed. 9) Some almost useless debugging code removed. 10) Terminology of shadow objects vs. backing objects straightened out. The fact that the vm_object data structure escentially had this backwards really confused things. The use of "shadow" and "backing object" throughout the code is now internally consistent and correct in the Mach terminology. 11) Several minor bug fixes, including one in the vm daemon that caused 0 RSS objects to not get purged as intended. 12) A "default pager" has now been created which cleans up the transition of objects to the "swap" type. The previous checks throughout the code for swp->pg_data != NULL were really ugly. This change also provides the rudiments for future backing of "anonymous" memory by something other than the swap pager (via the vnode pager, for example), and it allows the decision about which of these pagers to use to be made dynamically (although will need some additional decision code to do this, of course). 13) (dyson) MAP_COPY has been deprecated and the corresponding "copy object" code has been removed. MAP_COPY was undocumented and non- standard. It was furthermore broken in several ways which caused its behavior to degrade to MAP_PRIVATE. Binaries that use MAP_COPY will continue to work correctly, but via the slightly different semantics of MAP_PRIVATE. 14) (dyson) Sharing maps have been removed. It's marginal usefulness in a threads design can be worked around in other ways. Both #12 and #13 were done to simplify the code and improve readability and maintain- ability. (As were most all of these changes) TODO: 1) Rewrite most of the vnode pager to use VOP_GETPAGES/PUTPAGES. Doing this will reduce the vnode pager to a mere fraction of its current size. 2) Rewrite vm_fault and the swap/vnode pagers to use the clustering information provided by the new haspage pager interface. This will substantially reduce the overhead by eliminating a large number of VOP_BMAP() calls. The VOP_BMAP() filesystem interface should be improved to provide both a "behind" and "ahead" indication of contiguousness. 3) Implement the extended features of pager_haspage in swap_pager_haspage(). It currently just says 0 pages ahead/behind. 4) Re-implement the swap device (swstrategy) in a more elegant way, perhaps via a much more general mechanism that could also be used for disk striping of regular filesystems. 5) Do something to improve the architecture of vm_object_collapse(). The fact that it makes calls into the swap pager and knows too much about how the swap pager operates really bothers me. It also doesn't allow for collapsing of non-swap pager objects ("unnamed" objects backed by other pagers).
1995-07-13 08:48:48 +00:00
#include <vm/vm_pager.h>
#include <vm/vm_extern.h>
Fix the root cause of the "vm_reserv_populate: reserv <address> is already promoted" panics. The sequence of events that leads to a panic is rather long and circuitous. First, suppose that process P has a promoted superpage S within vm object O that it can write to. Then, suppose that P forks, which leads to S being write protected. Now, before P's child exits, suppose that P writes to another virtual page within O. Since the pages within O are copy on write, a shadow object for O is created to house the new physical copy of the faulted on virtual page. Then, before P can fault on S, P's child exists. Now, when P faults on S, it will follow the "optimized" path for copy-on-write faults in vm_fault(), wherein the underlying physical page is moved from O to its shadow object rather than allocating a new page and copying the new page's contents from the old page. Moreover, suppose that every 4 KB physical page making up S is moved to the shadow object in this way. However, the optimized path does not move the underlying superpage reservation, which is the root cause of the panics! Ultimately, P performs vm_object_collapse() on O's shadow object, which destroys O and in doing so breaks any reservations still belonging to O. This leaves the reservation underlying S in an inconsistent state: It's simultaneously not in use and promoted. Breaking a reservation does not demote it because I never intended for a promoted reservation to be broken. It makes little sense. Finally, this inconsistency leads to an assertion failure the next time that the reservation is used. The failing assertion does not (currently) exist in FreeBSD 10.x or earlier. There, we will quietly break the promoted reservation. While illogical and unintended, breaking the reservation is essentially harmless. PR: 198163 Reviewed by: kib Tested by: pho X-MFC after: r267213 Sponsored by: EMC / Isilon Storage Division
2015-03-19 01:40:43 +00:00
#include <vm/vm_reserv.h>
1994-05-24 10:09:53 +00:00
#define PFBAK 4
#define PFFOR 4
Revamp the default page clustering strategy that is used by the page fault handler. For roughly twenty years, the page fault handler has used the same basic strategy: Fetch a fixed number of non-resident pages both ahead and behind the virtual page that was faulted on. Over the years, alternative strategies have been implemented for optimizing the handling of random and sequential access patterns, but the only change to the default strategy has been to increase the number of pages read ahead to 7 and behind to 8. The problem with the default page clustering strategy becomes apparent when you look at how it behaves on the code section of an executable or shared library. (To simplify the following explanation, I'm going to ignore the read that is performed to obtain the header and assume that no pages are resident at the start of execution.) Suppose that we have a code section consisting of 32 pages. Further, suppose that we access pages 4, 28, and 16 in that order. Under the default page clustering strategy, we page fault three times and perform three I/O operations, because the first and second page faults only read a truncated cluster of 12 pages. In contrast, if we access pages 8, 24, and 16 in that order, we only fault twice and perform two I/O operations, because the first and second page faults read a full cluster of 16 pages. In general, truncated clusters are more common than full clusters. To address this problem, this revision changes the default page clustering strategy to align the start of the cluster to a page offset within the vm object that is a multiple of the cluster size. This results in many fewer truncated clusters. Returning to our example, if we now access pages 4, 28, and 16 in that order, the cluster that is read to satisfy the page fault on page 28 will now include page 16. So, the access to page 16 will no longer page fault and perform an I/O operation. Since the revised default page clustering strategy is typically reading more pages at a time, we are likely to read a few more pages that are never accessed. However, for the various programs that we looked at, including clang, emacs, firefox, and openjdk, the reduction in the number of page faults and I/O operations far outweighed the increase in the number of pages that are never accessed. Moreover, the extra resident pages allowed for many more superpage mappings. For example, if we look at the execution of clang during a buildworld, the number of (hard) page faults on the code section drops by 26%, the number of superpage mappings increases by about 29,000, but the number of never accessed pages only increases from 30.38% to 33.66%. Finally, this leads to a small but measureable reduction in execution time. In collaboration with: Emily Pettigrew <ejp1@rice.edu> Differential Revision: https://reviews.freebsd.org/D1500 Reviewed by: jhb, kib MFC after: 6 weeks
2015-01-16 18:17:09 +00:00
#define VM_FAULT_READ_DEFAULT (1 + VM_FAULT_READ_AHEAD_INIT)
#define VM_FAULT_READ_MAX (1 + VM_FAULT_READ_AHEAD_MAX)
Replace vm_fault()'s heuristic for automatic cache behind with a heuristic that performs the equivalent of an automatic madvise(..., MADV_DONTNEED). The current heuristic, even with the improvements that I made a few years ago, is a good example of making the wrong trade-off, or optimizing for the infrequent case. The infrequent case being reading a single file that is much larger than memory using mmap(2). And, in this case, the page daemon isn't the bottleneck; it's the I/O. In all other cases, the current heuristic has too many false positives, i.e., it caches too many pages that are later reused. To give one example, thousands of pages are cached by the current heuristic during a buildworld and all of them are reactivated before the buildworld completes. In particular, clang reads source files using mmap(2) and there are some relatively large source files in our source tree, e.g., sqlite, that are read multiple times. With the new heuristic, I see fewer false positives and they have a much lower cost. I actually tried something like this more than two years ago and it didn't perform as well as the cache behind heuristic. However, that was before the changes to the page daemon in late summer of 2013 and the existence of pmap_advise(). In particular, with the page daemon doing its work more frequently and in smaller batches, it now completes its work while the application accessing the file is blocked on I/O. Whereas previously, the page daemon appeared to hog the CPU for so long that it caused "hiccups" in the application's execution. Finally, I'll add that the elimination of cache pages is a prerequisite for NUMA support. Reviewed by: jeff, kib Sponsored by: EMC / Isilon Storage Division
2015-04-04 19:10:22 +00:00
#define VM_FAULT_DONTNEED_MIN 1048576
struct faultstate {
vm_page_t m;
vm_object_t object;
vm_pindex_t pindex;
vm_page_t first_m;
vm_object_t first_object;
vm_pindex_t first_pindex;
vm_map_t map;
vm_map_entry_t entry;
int map_generation;
bool lookup_still_valid;
struct vnode *vp;
};
Replace vm_fault()'s heuristic for automatic cache behind with a heuristic that performs the equivalent of an automatic madvise(..., MADV_DONTNEED). The current heuristic, even with the improvements that I made a few years ago, is a good example of making the wrong trade-off, or optimizing for the infrequent case. The infrequent case being reading a single file that is much larger than memory using mmap(2). And, in this case, the page daemon isn't the bottleneck; it's the I/O. In all other cases, the current heuristic has too many false positives, i.e., it caches too many pages that are later reused. To give one example, thousands of pages are cached by the current heuristic during a buildworld and all of them are reactivated before the buildworld completes. In particular, clang reads source files using mmap(2) and there are some relatively large source files in our source tree, e.g., sqlite, that are read multiple times. With the new heuristic, I see fewer false positives and they have a much lower cost. I actually tried something like this more than two years ago and it didn't perform as well as the cache behind heuristic. However, that was before the changes to the page daemon in late summer of 2013 and the existence of pmap_advise(). In particular, with the page daemon doing its work more frequently and in smaller batches, it now completes its work while the application accessing the file is blocked on I/O. Whereas previously, the page daemon appeared to hog the CPU for so long that it caused "hiccups" in the application's execution. Finally, I'll add that the elimination of cache pages is a prerequisite for NUMA support. Reviewed by: jeff, kib Sponsored by: EMC / Isilon Storage Division
2015-04-04 19:10:22 +00:00
static void vm_fault_dontneed(const struct faultstate *fs, vm_offset_t vaddr,
int ahead);
static void vm_fault_prefault(const struct faultstate *fs, vm_offset_t addra,
int backward, int forward);
static inline void
release_page(struct faultstate *fs)
{
2009-02-08 19:37:01 +00:00
vm_page_xunbusy(fs->m);
vm_page_lock(fs->m);
vm_page_deactivate(fs->m);
vm_page_unlock(fs->m);
fs->m = NULL;
}
static inline void
unlock_map(struct faultstate *fs)
{
2009-02-08 19:37:01 +00:00
if (fs->lookup_still_valid) {
vm_map_lookup_done(fs->map, fs->entry);
fs->lookup_still_valid = false;
}
}
static void
unlock_vp(struct faultstate *fs)
{
if (fs->vp != NULL) {
vput(fs->vp);
fs->vp = NULL;
}
}
static void
unlock_and_deallocate(struct faultstate *fs)
{
vm_object_pip_wakeup(fs->object);
Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho
2013-03-09 02:32:23 +00:00
VM_OBJECT_WUNLOCK(fs->object);
if (fs->object != fs->first_object) {
Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho
2013-03-09 02:32:23 +00:00
VM_OBJECT_WLOCK(fs->first_object);
vm_page_lock(fs->first_m);
vm_page_free(fs->first_m);
vm_page_unlock(fs->first_m);
vm_object_pip_wakeup(fs->first_object);
Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho
2013-03-09 02:32:23 +00:00
VM_OBJECT_WUNLOCK(fs->first_object);
fs->first_m = NULL;
}
vm_object_deallocate(fs->first_object);
unlock_map(fs);
unlock_vp(fs);
}
static void
vm_fault_dirty(vm_map_entry_t entry, vm_page_t m, vm_prot_t prot,
vm_prot_t fault_type, int fault_flags, bool set_wd)
{
bool need_dirty;
if (((prot & VM_PROT_WRITE) == 0 &&
(fault_flags & VM_FAULT_DIRTY) == 0) ||
(m->oflags & VPO_UNMANAGED) != 0)
return;
VM_OBJECT_ASSERT_LOCKED(m->object);
need_dirty = ((fault_type & VM_PROT_WRITE) != 0 &&
(fault_flags & VM_FAULT_WIRE) == 0) ||
(fault_flags & VM_FAULT_DIRTY) != 0;
if (set_wd)
vm_object_set_writeable_dirty(m->object);
else
/*
* If two callers of vm_fault_dirty() with set_wd ==
* FALSE, one for the map entry with MAP_ENTRY_NOSYNC
* flag set, other with flag clear, race, it is
* possible for the no-NOSYNC thread to see m->dirty
* != 0 and not clear VPO_NOSYNC. Take vm_page lock
* around manipulation of VPO_NOSYNC and
* vm_page_dirty() call, to avoid the race and keep
* m->oflags consistent.
*/
vm_page_lock(m);
/*
* If this is a NOSYNC mmap we do not want to set VPO_NOSYNC
* if the page is already dirty to prevent data written with
* the expectation of being synced from not being synced.
* Likewise if this entry does not request NOSYNC then make
* sure the page isn't marked NOSYNC. Applications sharing
* data should use the same flags to avoid ping ponging.
*/
if ((entry->eflags & MAP_ENTRY_NOSYNC) != 0) {
if (m->dirty == 0) {
m->oflags |= VPO_NOSYNC;
}
} else {
m->oflags &= ~VPO_NOSYNC;
}
/*
* If the fault is a write, we know that this page is being
* written NOW so dirty it explicitly to save on
* pmap_is_modified() calls later.
*
* Also tell the backing pager, if any, that it should remove
* any swap backing since the page is now dirty.
*/
if (need_dirty)
vm_page_dirty(m);
if (!set_wd)
vm_page_unlock(m);
if (need_dirty)
vm_pager_page_unswapped(m);
}
static void
vm_fault_fill_hold(vm_page_t *m_hold, vm_page_t m)
{
if (m_hold != NULL) {
*m_hold = m;
vm_page_lock(m);
vm_page_hold(m);
vm_page_unlock(m);
}
}
/*
* Unlocks fs.first_object and fs.map on success.
*/
static int
vm_fault_soft_fast(struct faultstate *fs, vm_offset_t vaddr, vm_prot_t prot,
int fault_type, int fault_flags, boolean_t wired, vm_page_t *m_hold)
{
vm_page_t m, m_map;
#if defined(__amd64__) && VM_NRESERVLEVEL > 0
vm_page_t m_super;
#endif
int flags, psind, rv;
MPASS(fs->vp == NULL);
m = vm_page_lookup(fs->first_object, fs->first_pindex);
/* A busy page can be mapped for read|execute access. */
if (m == NULL || ((prot & VM_PROT_WRITE) != 0 &&
vm_page_busied(m)) || m->valid != VM_PAGE_BITS_ALL)
return (KERN_FAILURE);
m_map = m;
psind = 0;
#if defined(__amd64__) && VM_NRESERVLEVEL > 0
if ((m->flags & PG_FICTITIOUS) == 0 &&
(m_super = vm_reserv_to_superpage(m)) != NULL &&
rounddown2(vaddr, pagesizes[m_super->psind]) >= fs->entry->start &&
roundup2(vaddr + 1, pagesizes[m_super->psind]) <= fs->entry->end &&
(vaddr & (pagesizes[m_super->psind] - 1)) == (VM_PAGE_TO_PHYS(m) &
(pagesizes[m_super->psind] - 1)) &&
pmap_ps_enabled(fs->map->pmap)) {
flags = PS_ALL_VALID;
if ((prot & VM_PROT_WRITE) != 0) {
/*
* Create a superpage mapping allowing write access
* only if none of the constituent pages are busy and
* all of them are already dirty (except possibly for
* the page that was faulted on).
*/
flags |= PS_NONE_BUSY;
if ((fs->first_object->flags & OBJ_UNMANAGED) == 0)
flags |= PS_ALL_DIRTY;
}
if (vm_page_ps_test(m_super, flags, m)) {
m_map = m_super;
psind = m_super->psind;
vaddr = rounddown2(vaddr, pagesizes[psind]);
/* Preset the modified bit for dirty superpages. */
if ((flags & PS_ALL_DIRTY) != 0)
fault_type |= VM_PROT_WRITE;
}
}
#endif
rv = pmap_enter(fs->map->pmap, vaddr, m_map, prot, fault_type |
PMAP_ENTER_NOSLEEP | (wired ? PMAP_ENTER_WIRED : 0), psind);
if (rv != KERN_SUCCESS)
return (rv);
vm_fault_fill_hold(m_hold, m);
vm_fault_dirty(fs->entry, m, prot, fault_type, fault_flags, false);
VM_OBJECT_RUNLOCK(fs->first_object);
if (psind == 0 && !wired)
vm_fault_prefault(fs, vaddr, PFBAK, PFFOR);
vm_map_lookup_done(fs->map, fs->entry);
curthread->td_ru.ru_minflt++;
return (KERN_SUCCESS);
}
Add a new populate() pager method and extend device pager ops vector with cdev_pg_populate() to provide device drivers access to it. It gives drivers fine control of the pages ownership and allows drivers to implement arbitrary prefault policies. The populate method is called on a page fault and is supposed to populate the vm object with the page at the fault location and some amount of pages around it, at pager's discretion. VM provides the pager with the hints about current range of the object mapping, to avoid instantiation of immediately unused pages, if pager decides so. Also, VM passes the fault type and map entry protection to the pager, allowing it to force the optimal required ownership of the mapped pages. Installed pages must contiguously fill the returned region, be fully valid and exclusively busied. Of course, the pages must be compatible with the object' type. After populate() successfully returned, VM fault handler installs as many instantiated pages into the process page tables as it sees reasonable, while still obeying the correct semantic for COW and vm map locking. The method is opt-in, pager sets OBJ_POPULATE flag to indicate that the method can be called. If pager' vm objects can be shadowed, pager must implement the traditional getpages() method in addition to the populate(). Populate() might fall back to the getpages() on per-call basis as well, by returning VM_PAGER_BAD error code. For now for device pagers, the populate() method is only allowed to be used by the managed device pagers, but the limitation is only made because there is no unmanaged fault handlers which could use it right now. KPI designed together with, and reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 weeks
2016-12-08 11:26:11 +00:00
static void
vm_fault_restore_map_lock(struct faultstate *fs)
{
VM_OBJECT_ASSERT_WLOCKED(fs->first_object);
MPASS(fs->first_object->paging_in_progress > 0);
if (!vm_map_trylock_read(fs->map)) {
VM_OBJECT_WUNLOCK(fs->first_object);
vm_map_lock_read(fs->map);
VM_OBJECT_WLOCK(fs->first_object);
}
fs->lookup_still_valid = true;
}
static void
vm_fault_populate_check_page(vm_page_t m)
{
/*
* Check each page to ensure that the pager is obeying the
* interface: the page must be installed in the object, fully
* valid, and exclusively busied.
*/
MPASS(m != NULL);
MPASS(m->valid == VM_PAGE_BITS_ALL);
MPASS(vm_page_xbusied(m));
}
static void
vm_fault_populate_cleanup(vm_object_t object, vm_pindex_t first,
vm_pindex_t last)
{
vm_page_t m;
vm_pindex_t pidx;
VM_OBJECT_ASSERT_WLOCKED(object);
MPASS(first <= last);
for (pidx = first, m = vm_page_lookup(object, pidx);
pidx <= last; pidx++, m = vm_page_next(m)) {
vm_fault_populate_check_page(m);
vm_page_lock(m);
vm_page_deactivate(m);
vm_page_unlock(m);
vm_page_xunbusy(m);
}
}
Add a new populate() pager method and extend device pager ops vector with cdev_pg_populate() to provide device drivers access to it. It gives drivers fine control of the pages ownership and allows drivers to implement arbitrary prefault policies. The populate method is called on a page fault and is supposed to populate the vm object with the page at the fault location and some amount of pages around it, at pager's discretion. VM provides the pager with the hints about current range of the object mapping, to avoid instantiation of immediately unused pages, if pager decides so. Also, VM passes the fault type and map entry protection to the pager, allowing it to force the optimal required ownership of the mapped pages. Installed pages must contiguously fill the returned region, be fully valid and exclusively busied. Of course, the pages must be compatible with the object' type. After populate() successfully returned, VM fault handler installs as many instantiated pages into the process page tables as it sees reasonable, while still obeying the correct semantic for COW and vm map locking. The method is opt-in, pager sets OBJ_POPULATE flag to indicate that the method can be called. If pager' vm objects can be shadowed, pager must implement the traditional getpages() method in addition to the populate(). Populate() might fall back to the getpages() on per-call basis as well, by returning VM_PAGER_BAD error code. For now for device pagers, the populate() method is only allowed to be used by the managed device pagers, but the limitation is only made because there is no unmanaged fault handlers which could use it right now. KPI designed together with, and reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 weeks
2016-12-08 11:26:11 +00:00
static int
vm_fault_populate(struct faultstate *fs, vm_offset_t vaddr, vm_prot_t prot,
int fault_type, int fault_flags, boolean_t wired, vm_page_t *m_hold)
{
vm_page_t m;
vm_pindex_t map_first, map_last, pager_first, pager_last, pidx;
Add a new populate() pager method and extend device pager ops vector with cdev_pg_populate() to provide device drivers access to it. It gives drivers fine control of the pages ownership and allows drivers to implement arbitrary prefault policies. The populate method is called on a page fault and is supposed to populate the vm object with the page at the fault location and some amount of pages around it, at pager's discretion. VM provides the pager with the hints about current range of the object mapping, to avoid instantiation of immediately unused pages, if pager decides so. Also, VM passes the fault type and map entry protection to the pager, allowing it to force the optimal required ownership of the mapped pages. Installed pages must contiguously fill the returned region, be fully valid and exclusively busied. Of course, the pages must be compatible with the object' type. After populate() successfully returned, VM fault handler installs as many instantiated pages into the process page tables as it sees reasonable, while still obeying the correct semantic for COW and vm map locking. The method is opt-in, pager sets OBJ_POPULATE flag to indicate that the method can be called. If pager' vm objects can be shadowed, pager must implement the traditional getpages() method in addition to the populate(). Populate() might fall back to the getpages() on per-call basis as well, by returning VM_PAGER_BAD error code. For now for device pagers, the populate() method is only allowed to be used by the managed device pagers, but the limitation is only made because there is no unmanaged fault handlers which could use it right now. KPI designed together with, and reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 weeks
2016-12-08 11:26:11 +00:00
int rv;
MPASS(fs->object == fs->first_object);
VM_OBJECT_ASSERT_WLOCKED(fs->first_object);
MPASS(fs->first_object->paging_in_progress > 0);
MPASS(fs->first_object->backing_object == NULL);
MPASS(fs->lookup_still_valid);
pager_first = OFF_TO_IDX(fs->entry->offset);
pager_last = pager_first + atop(fs->entry->end - fs->entry->start) - 1;
Add a new populate() pager method and extend device pager ops vector with cdev_pg_populate() to provide device drivers access to it. It gives drivers fine control of the pages ownership and allows drivers to implement arbitrary prefault policies. The populate method is called on a page fault and is supposed to populate the vm object with the page at the fault location and some amount of pages around it, at pager's discretion. VM provides the pager with the hints about current range of the object mapping, to avoid instantiation of immediately unused pages, if pager decides so. Also, VM passes the fault type and map entry protection to the pager, allowing it to force the optimal required ownership of the mapped pages. Installed pages must contiguously fill the returned region, be fully valid and exclusively busied. Of course, the pages must be compatible with the object' type. After populate() successfully returned, VM fault handler installs as many instantiated pages into the process page tables as it sees reasonable, while still obeying the correct semantic for COW and vm map locking. The method is opt-in, pager sets OBJ_POPULATE flag to indicate that the method can be called. If pager' vm objects can be shadowed, pager must implement the traditional getpages() method in addition to the populate(). Populate() might fall back to the getpages() on per-call basis as well, by returning VM_PAGER_BAD error code. For now for device pagers, the populate() method is only allowed to be used by the managed device pagers, but the limitation is only made because there is no unmanaged fault handlers which could use it right now. KPI designed together with, and reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 weeks
2016-12-08 11:26:11 +00:00
unlock_map(fs);
unlock_vp(fs);
/*
* Call the pager (driver) populate() method.
*
* There is no guarantee that the method will be called again
* if the current fault is for read, and a future fault is
* for write. Report the entry's maximum allowed protection
* to the driver.
*/
rv = vm_pager_populate(fs->first_object, fs->first_pindex,
fault_type, fs->entry->max_protection, &pager_first, &pager_last);
Add a new populate() pager method and extend device pager ops vector with cdev_pg_populate() to provide device drivers access to it. It gives drivers fine control of the pages ownership and allows drivers to implement arbitrary prefault policies. The populate method is called on a page fault and is supposed to populate the vm object with the page at the fault location and some amount of pages around it, at pager's discretion. VM provides the pager with the hints about current range of the object mapping, to avoid instantiation of immediately unused pages, if pager decides so. Also, VM passes the fault type and map entry protection to the pager, allowing it to force the optimal required ownership of the mapped pages. Installed pages must contiguously fill the returned region, be fully valid and exclusively busied. Of course, the pages must be compatible with the object' type. After populate() successfully returned, VM fault handler installs as many instantiated pages into the process page tables as it sees reasonable, while still obeying the correct semantic for COW and vm map locking. The method is opt-in, pager sets OBJ_POPULATE flag to indicate that the method can be called. If pager' vm objects can be shadowed, pager must implement the traditional getpages() method in addition to the populate(). Populate() might fall back to the getpages() on per-call basis as well, by returning VM_PAGER_BAD error code. For now for device pagers, the populate() method is only allowed to be used by the managed device pagers, but the limitation is only made because there is no unmanaged fault handlers which could use it right now. KPI designed together with, and reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 weeks
2016-12-08 11:26:11 +00:00
VM_OBJECT_ASSERT_WLOCKED(fs->first_object);
if (rv == VM_PAGER_BAD) {
/*
* VM_PAGER_BAD is the backdoor for a pager to request
* normal fault handling.
*/
vm_fault_restore_map_lock(fs);
if (fs->map->timestamp != fs->map_generation)
return (KERN_RESOURCE_SHORTAGE); /* RetryFault */
return (KERN_NOT_RECEIVER);
}
if (rv != VM_PAGER_OK)
return (KERN_FAILURE); /* AKA SIGSEGV */
/* Ensure that the driver is obeying the interface. */
MPASS(pager_first <= pager_last);
MPASS(fs->first_pindex <= pager_last);
MPASS(fs->first_pindex >= pager_first);
MPASS(pager_last < fs->first_object->size);
Add a new populate() pager method and extend device pager ops vector with cdev_pg_populate() to provide device drivers access to it. It gives drivers fine control of the pages ownership and allows drivers to implement arbitrary prefault policies. The populate method is called on a page fault and is supposed to populate the vm object with the page at the fault location and some amount of pages around it, at pager's discretion. VM provides the pager with the hints about current range of the object mapping, to avoid instantiation of immediately unused pages, if pager decides so. Also, VM passes the fault type and map entry protection to the pager, allowing it to force the optimal required ownership of the mapped pages. Installed pages must contiguously fill the returned region, be fully valid and exclusively busied. Of course, the pages must be compatible with the object' type. After populate() successfully returned, VM fault handler installs as many instantiated pages into the process page tables as it sees reasonable, while still obeying the correct semantic for COW and vm map locking. The method is opt-in, pager sets OBJ_POPULATE flag to indicate that the method can be called. If pager' vm objects can be shadowed, pager must implement the traditional getpages() method in addition to the populate(). Populate() might fall back to the getpages() on per-call basis as well, by returning VM_PAGER_BAD error code. For now for device pagers, the populate() method is only allowed to be used by the managed device pagers, but the limitation is only made because there is no unmanaged fault handlers which could use it right now. KPI designed together with, and reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 weeks
2016-12-08 11:26:11 +00:00
vm_fault_restore_map_lock(fs);
if (fs->map->timestamp != fs->map_generation) {
vm_fault_populate_cleanup(fs->first_object, pager_first,
pager_last);
Add a new populate() pager method and extend device pager ops vector with cdev_pg_populate() to provide device drivers access to it. It gives drivers fine control of the pages ownership and allows drivers to implement arbitrary prefault policies. The populate method is called on a page fault and is supposed to populate the vm object with the page at the fault location and some amount of pages around it, at pager's discretion. VM provides the pager with the hints about current range of the object mapping, to avoid instantiation of immediately unused pages, if pager decides so. Also, VM passes the fault type and map entry protection to the pager, allowing it to force the optimal required ownership of the mapped pages. Installed pages must contiguously fill the returned region, be fully valid and exclusively busied. Of course, the pages must be compatible with the object' type. After populate() successfully returned, VM fault handler installs as many instantiated pages into the process page tables as it sees reasonable, while still obeying the correct semantic for COW and vm map locking. The method is opt-in, pager sets OBJ_POPULATE flag to indicate that the method can be called. If pager' vm objects can be shadowed, pager must implement the traditional getpages() method in addition to the populate(). Populate() might fall back to the getpages() on per-call basis as well, by returning VM_PAGER_BAD error code. For now for device pagers, the populate() method is only allowed to be used by the managed device pagers, but the limitation is only made because there is no unmanaged fault handlers which could use it right now. KPI designed together with, and reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 weeks
2016-12-08 11:26:11 +00:00
return (KERN_RESOURCE_SHORTAGE); /* RetryFault */
}
Add a new populate() pager method and extend device pager ops vector with cdev_pg_populate() to provide device drivers access to it. It gives drivers fine control of the pages ownership and allows drivers to implement arbitrary prefault policies. The populate method is called on a page fault and is supposed to populate the vm object with the page at the fault location and some amount of pages around it, at pager's discretion. VM provides the pager with the hints about current range of the object mapping, to avoid instantiation of immediately unused pages, if pager decides so. Also, VM passes the fault type and map entry protection to the pager, allowing it to force the optimal required ownership of the mapped pages. Installed pages must contiguously fill the returned region, be fully valid and exclusively busied. Of course, the pages must be compatible with the object' type. After populate() successfully returned, VM fault handler installs as many instantiated pages into the process page tables as it sees reasonable, while still obeying the correct semantic for COW and vm map locking. The method is opt-in, pager sets OBJ_POPULATE flag to indicate that the method can be called. If pager' vm objects can be shadowed, pager must implement the traditional getpages() method in addition to the populate(). Populate() might fall back to the getpages() on per-call basis as well, by returning VM_PAGER_BAD error code. For now for device pagers, the populate() method is only allowed to be used by the managed device pagers, but the limitation is only made because there is no unmanaged fault handlers which could use it right now. KPI designed together with, and reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 weeks
2016-12-08 11:26:11 +00:00
/*
* The map is unchanged after our last unlock. Process the fault.
*
* The range [pager_first, pager_last] that is given to the
* pager is only a hint. The pager may populate any range
* within the object that includes the requested page index.
* In case the pager expanded the range, clip it to fit into
* the map entry.
*/
map_first = OFF_TO_IDX(fs->entry->offset);
if (map_first > pager_first) {
vm_fault_populate_cleanup(fs->first_object, pager_first,
map_first - 1);
pager_first = map_first;
}
map_last = map_first + atop(fs->entry->end - fs->entry->start) - 1;
if (map_last < pager_last) {
vm_fault_populate_cleanup(fs->first_object, map_last + 1,
pager_last);
pager_last = map_last;
}
for (pidx = pager_first, m = vm_page_lookup(fs->first_object, pidx);
pidx <= pager_last; pidx++, m = vm_page_next(m)) {
vm_fault_populate_check_page(m);
Add a new populate() pager method and extend device pager ops vector with cdev_pg_populate() to provide device drivers access to it. It gives drivers fine control of the pages ownership and allows drivers to implement arbitrary prefault policies. The populate method is called on a page fault and is supposed to populate the vm object with the page at the fault location and some amount of pages around it, at pager's discretion. VM provides the pager with the hints about current range of the object mapping, to avoid instantiation of immediately unused pages, if pager decides so. Also, VM passes the fault type and map entry protection to the pager, allowing it to force the optimal required ownership of the mapped pages. Installed pages must contiguously fill the returned region, be fully valid and exclusively busied. Of course, the pages must be compatible with the object' type. After populate() successfully returned, VM fault handler installs as many instantiated pages into the process page tables as it sees reasonable, while still obeying the correct semantic for COW and vm map locking. The method is opt-in, pager sets OBJ_POPULATE flag to indicate that the method can be called. If pager' vm objects can be shadowed, pager must implement the traditional getpages() method in addition to the populate(). Populate() might fall back to the getpages() on per-call basis as well, by returning VM_PAGER_BAD error code. For now for device pagers, the populate() method is only allowed to be used by the managed device pagers, but the limitation is only made because there is no unmanaged fault handlers which could use it right now. KPI designed together with, and reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 weeks
2016-12-08 11:26:11 +00:00
vm_fault_dirty(fs->entry, m, prot, fault_type, fault_flags,
true);
VM_OBJECT_WUNLOCK(fs->first_object);
pmap_enter(fs->map->pmap, fs->entry->start + IDX_TO_OFF(pidx) -
fs->entry->offset, m, prot, fault_type | (wired ?
PMAP_ENTER_WIRED : 0), 0);
VM_OBJECT_WLOCK(fs->first_object);
if (pidx == fs->first_pindex)
vm_fault_fill_hold(m_hold, m);
vm_page_lock(m);
if ((fault_flags & VM_FAULT_WIRE) != 0) {
KASSERT(wired, ("VM_FAULT_WIRE && !wired"));
vm_page_wire(m);
} else {
vm_page_activate(m);
}
vm_page_unlock(m);
vm_page_xunbusy(m);
}
curthread->td_ru.ru_majflt++;
return (KERN_SUCCESS);
}
1994-05-24 10:09:53 +00:00
/*
* vm_fault:
*
2000-03-26 15:20:23 +00:00
* Handle a page fault occurring at the given address,
1994-05-24 10:09:53 +00:00
* requiring the given permissions, in the map specified.
* If successful, the page is inserted into the
* associated physical map.
*
* NOTE: the given address should be truncated to the
* proper page address.
*
* KERN_SUCCESS is returned if the page fault is handled; otherwise,
* a standard error specifying why the fault is fatal is returned.
*
* The map in question must be referenced, and remains so.
* Caller may hold no locks.
1994-05-24 10:09:53 +00:00
*/
int
vm_fault(vm_map_t map, vm_offset_t vaddr, vm_prot_t fault_type,
int fault_flags)
{
struct thread *td;
int result;
td = curthread;
if ((td->td_pflags & TDP_NOFAULTING) != 0)
return (KERN_PROTECTION_FAILURE);
#ifdef KTRACE
if (map != kernel_map && KTRPOINT(td, KTR_FAULT))
ktrfault(vaddr, fault_type);
#endif
result = vm_fault_hold(map, trunc_page(vaddr), fault_type, fault_flags,
NULL);
#ifdef KTRACE
if (map != kernel_map && KTRPOINT(td, KTR_FAULTEND))
ktrfaultend(result);
#endif
return (result);
}
int
vm_fault_hold(vm_map_t map, vm_offset_t vaddr, vm_prot_t fault_type,
int fault_flags, vm_page_t *m_hold)
1994-05-24 10:09:53 +00:00
{
struct faultstate fs;
struct vnode *vp;
Introduce a new page queue, PQ_LAUNDRY, for storing unreferenced, dirty pages, specificially, dirty pages that have passed once through the inactive queue. A new, dedicated thread is responsible for both deciding when to launder pages and actually laundering them. The new policy uses the relative sizes of the inactive and laundry queues to determine whether to launder pages at a given point in time. In general, this leads to more intelligent swapping behavior, since the laundry thread will avoid pageouts when the marginal benefit of doing so is low. Previously, without a dedicated queue for dirty pages, the page daemon didn't have the information to determine whether pageout provides any benefit to the system. Thus, the previous policy often resulted in small but steadily increasing amounts of swap usage when the system is under memory pressure, even when the inactive queue consisted mostly of clean pages. This change addresses that issue, and also paves the way for some future virtual memory system improvements by removing the last source of object-cached clean pages, i.e., PG_CACHE pages. The new laundry thread sleeps while waiting for a request from the page daemon thread(s). A request is raised by setting the variable vm_laundry_request and waking the laundry thread. We request launderings for two reasons: to try and balance the inactive and laundry queue sizes ("background laundering"), and to quickly make up for a shortage of free pages and clean inactive pages ("shortfall laundering"). When background laundering is requested, the laundry thread computes the number of page daemon wakeups that have taken place since the last laundering. If this number is large enough relative to the ratio of the laundry and (global) inactive queue sizes, we will launder vm_background_launder_target pages at vm_background_launder_rate KB/s. Otherwise, the laundry thread goes back to sleep without doing any work. When scanning the laundry queue during background laundering, reactivated pages are counted towards the laundry thread's target. In contrast, shortfall laundering is requested when an inactive queue scan fails to meet its target. In this case, the laundry thread attempts to launder enough pages to meet v_free_target within 0.5s, which is the inactive queue scan period. A laundry request can be latched while another is currently being serviced. In particular, a shortfall request will immediately preempt a background laundering. This change also redefines the meaning of vm_cnt.v_reactivated and removes the functions vm_page_cache() and vm_page_try_to_cache(). The new meaning of vm_cnt.v_reactivated now better reflects its name. It represents the number of inactive or laundry pages that are returned to the active queue on account of a reference. In collaboration with: markj Reviewed by: kib Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8302
2016-11-09 18:48:37 +00:00
vm_object_t next_object, retry_object;
vm_offset_t e_end, e_start;
Introduce a new page queue, PQ_LAUNDRY, for storing unreferenced, dirty pages, specificially, dirty pages that have passed once through the inactive queue. A new, dedicated thread is responsible for both deciding when to launder pages and actually laundering them. The new policy uses the relative sizes of the inactive and laundry queues to determine whether to launder pages at a given point in time. In general, this leads to more intelligent swapping behavior, since the laundry thread will avoid pageouts when the marginal benefit of doing so is low. Previously, without a dedicated queue for dirty pages, the page daemon didn't have the information to determine whether pageout provides any benefit to the system. Thus, the previous policy often resulted in small but steadily increasing amounts of swap usage when the system is under memory pressure, even when the inactive queue consisted mostly of clean pages. This change addresses that issue, and also paves the way for some future virtual memory system improvements by removing the last source of object-cached clean pages, i.e., PG_CACHE pages. The new laundry thread sleeps while waiting for a request from the page daemon thread(s). A request is raised by setting the variable vm_laundry_request and waking the laundry thread. We request launderings for two reasons: to try and balance the inactive and laundry queue sizes ("background laundering"), and to quickly make up for a shortage of free pages and clean inactive pages ("shortfall laundering"). When background laundering is requested, the laundry thread computes the number of page daemon wakeups that have taken place since the last laundering. If this number is large enough relative to the ratio of the laundry and (global) inactive queue sizes, we will launder vm_background_launder_target pages at vm_background_launder_rate KB/s. Otherwise, the laundry thread goes back to sleep without doing any work. When scanning the laundry queue during background laundering, reactivated pages are counted towards the laundry thread's target. In contrast, shortfall laundering is requested when an inactive queue scan fails to meet its target. In this case, the laundry thread attempts to launder enough pages to meet v_free_target within 0.5s, which is the inactive queue scan period. A laundry request can be latched while another is currently being serviced. In particular, a shortfall request will immediately preempt a background laundering. This change also redefines the meaning of vm_cnt.v_reactivated and removes the functions vm_page_cache() and vm_page_try_to_cache(). The new meaning of vm_cnt.v_reactivated now better reflects its name. It represents the number of inactive or laundry pages that are returned to the active queue on account of a reference. In collaboration with: markj Reviewed by: kib Tested by: pho Sponsored by: Dell EMC Isilon Differential Revision: https://reviews.freebsd.org/D8302
2016-11-09 18:48:37 +00:00
vm_pindex_t retry_pindex;
vm_prot_t prot, retry_prot;
int ahead, alloc_req, behind, cluster_offset, error, era, faultcount;
int locked, nera, result, rv;
u_char behavior;
boolean_t wired; /* Passed by reference. */
bool dead, hardfault, is_first_object_locked;
These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a seperate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
VM_CNT_INC(v_vm_faults);
fs.vp = NULL;
faultcount = 0;
nera = -1;
hardfault = false;
1994-05-24 10:09:53 +00:00
These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a seperate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
RetryFault:;
1994-05-24 10:09:53 +00:00
/*
These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a seperate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
* Find the backing store object and offset into it to begin the
* search.
1994-05-24 10:09:53 +00:00
*/
fs.map = map;
result = vm_map_lookup(&fs.map, vaddr, fault_type |
VM_PROT_FAULT_LOOKUP, &fs.entry, &fs.first_object,
&fs.first_pindex, &prot, &wired);
if (result != KERN_SUCCESS) {
unlock_vp(&fs);
return (result);
1994-05-24 10:09:53 +00:00
}
fs.map_generation = fs.map->timestamp;
VM level code cleanups. 1) Start using TSM. Struct procs continue to point to upages structure, after being freed. Struct vmspace continues to point to pte object and kva space for kstack. u_map is now superfluous. 2) vm_map's don't need to be reference counted. They always exist either in the kernel or in a vmspace. The vmspaces are managed by reference counts. 3) Remove the "wired" vm_map nonsense. 4) No need to keep a cache of kernel stack kva's. 5) Get rid of strange looking ++var, and change to var++. 6) Change more data structures to use our "zone" allocator. Added struct proc, struct vmspace and struct vnode. This saves a significant amount of kva space and physical memory. Additionally, this enables TSM for the zone managed memory. 7) Keep ioopt disabled for now. 8) Remove the now bogus "single use" map concept. 9) Use generation counts or id's for data structures residing in TSM, where it allows us to avoid unneeded restart overhead during traversals, where blocking might occur. 10) Account better for memory deficits, so the pageout daemon will be able to make enough memory available (experimental.) 11) Fix some vnode locking problems. (From Tor, I think.) 12) Add a check in ufs_lookup, to avoid lots of unneeded calls to bcmp. (experimental.) 13) Significantly shrink, cleanup, and make slightly faster the vm_fault.c code. Use generation counts, get rid of unneded collpase operations, and clean up the cluster code. 14) Make vm_zone more suitable for TSM. This commit is partially as a result of discussions and contributions from other people, including DG, Tor Egge, PHK, and probably others that I have forgotten to attribute (so let me know, if I forgot.) This is not the infamous, final cleanup of the vnode stuff, but a necessary step. Vnode mgmt should be correct, but things might still change, and there is still some missing stuff (like ioopt, and physical backing of non-merged cache files, debugging of layering concepts.)
1998-01-22 17:30:44 +00:00
if (fs.entry->eflags & MAP_ENTRY_NOFAULT) {
panic("%s: fault on nofault entry, addr: %#lx",
__func__, (u_long)vaddr);
}
if (fs.entry->eflags & MAP_ENTRY_IN_TRANSITION &&
fs.entry->wiring_thread != curthread) {
vm_map_unlock_read(fs.map);
vm_map_lock(fs.map);
if (vm_map_lookup_entry(fs.map, vaddr, &fs.entry) &&
(fs.entry->eflags & MAP_ENTRY_IN_TRANSITION)) {
unlock_vp(&fs);
fs.entry->eflags |= MAP_ENTRY_NEEDS_WAKEUP;
vm_map_unlock_and_wait(fs.map, 0);
} else
vm_map_unlock(fs.map);
goto RetryFault;
}
MPASS((fs.entry->eflags & MAP_ENTRY_GUARD) == 0);
if (wired)
fault_type = prot | (fault_type & VM_PROT_COPY);
else
KASSERT((fault_flags & VM_FAULT_WIRE) == 0,
("!wired && VM_FAULT_WIRE"));
/*
* Try to avoid lock contention on the top-level object through
* special-case handling of some types of page faults, specifically,
* those that are both (1) mapping an existing page from the top-
* level object and (2) not having to mark that object as containing
* dirty pages. Under these conditions, a read lock on the top-level
* object suffices, allowing multiple page faults of a similar type to
* run in parallel on the same top-level object.
*/
if (fs.vp == NULL /* avoid locked vnode leak */ &&
(fault_flags & (VM_FAULT_WIRE | VM_FAULT_DIRTY)) == 0 &&
/* avoid calling vm_object_set_writeable_dirty() */
((prot & VM_PROT_WRITE) == 0 ||
(fs.first_object->type != OBJT_VNODE &&
(fs.first_object->flags & OBJ_TMPFS_NODE) == 0) ||
(fs.first_object->flags & OBJ_MIGHTBEDIRTY) != 0)) {
VM_OBJECT_RLOCK(fs.first_object);
if ((prot & VM_PROT_WRITE) == 0 ||
(fs.first_object->type != OBJT_VNODE &&
(fs.first_object->flags & OBJ_TMPFS_NODE) == 0) ||
(fs.first_object->flags & OBJ_MIGHTBEDIRTY) != 0) {
rv = vm_fault_soft_fast(&fs, vaddr, prot, fault_type,
fault_flags, wired, m_hold);
if (rv == KERN_SUCCESS)
return (rv);
}
if (!VM_OBJECT_TRYUPGRADE(fs.first_object)) {
VM_OBJECT_RUNLOCK(fs.first_object);
VM_OBJECT_WLOCK(fs.first_object);
}
} else {
VM_OBJECT_WLOCK(fs.first_object);
}
/*
* Make a reference to this object to prevent its disposal while we
* are messing with it. Once we have the reference, the map is free
* to be diddled. Since objects reference their shadows (and copies),
* they will stay around as well.
*
* Bump the paging-in-progress count to prevent size changes (e.g.
* truncation operations) during I/O.
*/
vm_object_reference_locked(fs.first_object);
vm_object_pip_add(fs.first_object, 1);
fs.lookup_still_valid = true;
1994-05-24 10:09:53 +00:00
fs.first_m = NULL;
1994-05-24 10:09:53 +00:00
/*
These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a seperate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
* Search for the page at object/offset.
1994-05-24 10:09:53 +00:00
*/
fs.object = fs.first_object;
fs.pindex = fs.first_pindex;
1994-05-24 10:09:53 +00:00
while (TRUE) {
/*
* If the object is marked for imminent termination,
* we retry here, since the collapse pass has raced
* with us. Otherwise, if we see terminally dead
* object, return fail.
*/
if ((fs.object->flags & OBJ_DEAD) != 0) {
dead = fs.object->type == OBJT_DEAD;
unlock_and_deallocate(&fs);
if (dead)
return (KERN_PROTECTION_FAILURE);
pause("vmf_de", 1);
goto RetryFault;
}
/*
* See if page is resident
*/
fs.m = vm_page_lookup(fs.object, fs.pindex);
if (fs.m != NULL) {
1994-05-24 10:09:53 +00:00
/*
* Wait/Retry if the page is busy. We have to do this
* if the page is either exclusive or shared busy
* because the vm_pager may be using read busy for
* pageouts (and even pageins if it is the vnode
* pager), and we could end up trying to pagein and
* pageout the same page simultaneously.
*
* We can theoretically allow the busy case on a read
* fault if the page is marked valid, but since such
* pages are typically already pmap'd, putting that
* special case in might be more effort then it is
* worth. We cannot under any circumstances mess
* around with a shared busied page except, perhaps,
* to pmap it.
1994-05-24 10:09:53 +00:00
*/
if (vm_page_busied(fs.m)) {
/*
* Reference the page before unlocking and
* sleeping so that the page daemon is less
* likely to reclaim it.
*/
vm_page_aflag_set(fs.m, PGA_REFERENCED);
if (fs.object != fs.first_object) {
Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho
2013-03-09 02:32:23 +00:00
if (!VM_OBJECT_TRYWLOCK(
fs.first_object)) {
Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho
2013-03-09 02:32:23 +00:00
VM_OBJECT_WUNLOCK(fs.object);
VM_OBJECT_WLOCK(fs.first_object);
VM_OBJECT_WLOCK(fs.object);
}
vm_page_lock(fs.first_m);
vm_page_free(fs.first_m);
vm_page_unlock(fs.first_m);
vm_object_pip_wakeup(fs.first_object);
Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho
2013-03-09 02:32:23 +00:00
VM_OBJECT_WUNLOCK(fs.first_object);
fs.first_m = NULL;
}
unlock_map(&fs);
if (fs.m == vm_page_lookup(fs.object,
fs.pindex)) {
vm_page_sleep_if_busy(fs.m, "vmpfw");
}
vm_object_pip_wakeup(fs.object);
Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho
2013-03-09 02:32:23 +00:00
VM_OBJECT_WUNLOCK(fs.object);
VM_CNT_INC(v_intrans);
vm_object_deallocate(fs.first_object);
1994-05-24 10:09:53 +00:00
goto RetryFault;
}
vm_page_lock(fs.m);
vm_page_remque(fs.m);
vm_page_unlock(fs.m);
/*
* Mark page busy for other processes, and the
* pagedaemon. If it still isn't completely valid
* (readable), jump to readrest, else break-out ( we
* found the page ).
*/
vm_page_xbusy(fs.m);
if (fs.m->valid != VM_PAGE_BITS_ALL)
These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a seperate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
goto readrest;
1994-05-24 10:09:53 +00:00
break;
}
KASSERT(fs.m == NULL, ("fs.m should be NULL, not %p", fs.m));
/*
* Page is not resident. If the pager might contain the page
* or this is the beginning of the search, allocate a new
* page. (Default objects are zero-fill, so there is no real
* pager for them.)
*/
if (fs.object->type != OBJT_DEFAULT ||
fs.object == fs.first_object) {
if (fs.pindex >= fs.object->size) {
unlock_and_deallocate(&fs);
return (KERN_PROTECTION_FAILURE);
}
Add a new populate() pager method and extend device pager ops vector with cdev_pg_populate() to provide device drivers access to it. It gives drivers fine control of the pages ownership and allows drivers to implement arbitrary prefault policies. The populate method is called on a page fault and is supposed to populate the vm object with the page at the fault location and some amount of pages around it, at pager's discretion. VM provides the pager with the hints about current range of the object mapping, to avoid instantiation of immediately unused pages, if pager decides so. Also, VM passes the fault type and map entry protection to the pager, allowing it to force the optimal required ownership of the mapped pages. Installed pages must contiguously fill the returned region, be fully valid and exclusively busied. Of course, the pages must be compatible with the object' type. After populate() successfully returned, VM fault handler installs as many instantiated pages into the process page tables as it sees reasonable, while still obeying the correct semantic for COW and vm map locking. The method is opt-in, pager sets OBJ_POPULATE flag to indicate that the method can be called. If pager' vm objects can be shadowed, pager must implement the traditional getpages() method in addition to the populate(). Populate() might fall back to the getpages() on per-call basis as well, by returning VM_PAGER_BAD error code. For now for device pagers, the populate() method is only allowed to be used by the managed device pagers, but the limitation is only made because there is no unmanaged fault handlers which could use it right now. KPI designed together with, and reviewed by: alc Tested by: pho Sponsored by: The FreeBSD Foundation MFC after: 3 weeks
2016-12-08 11:26:11 +00:00
if (fs.object == fs.first_object &&
(fs.first_object->flags & OBJ_POPULATE) != 0 &&
fs.first_object->shadow_count == 0) {
rv = vm_fault_populate(&fs, vaddr, prot,
fault_type, fault_flags, wired, m_hold);
switch (rv) {
case KERN_SUCCESS:
case KERN_FAILURE:
unlock_and_deallocate(&fs);
return (rv);
case KERN_RESOURCE_SHORTAGE:
unlock_and_deallocate(&fs);
goto RetryFault;
case KERN_NOT_RECEIVER:
/*
* Pager's populate() method
* returned VM_PAGER_BAD.
*/
break;
default:
panic("inconsistent return codes");
}
}
1994-05-24 10:09:53 +00:00
/*
These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a seperate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
* Allocate a new page for this object/offset pair.
*
* Unlocked read of the p_flag is harmless. At
* worst, the P_KILLED might be not observed
* there, and allocation can fail, causing
* restart and new reading of the p_flag.
1994-05-24 10:09:53 +00:00
*/
if (!vm_page_count_severe() || P_KILLED(curproc)) {
#if VM_NRESERVLEVEL > 0
vm_object_color(fs.object, atop(vaddr) -
fs.pindex);
#endif
alloc_req = P_KILLED(curproc) ?
VM_ALLOC_SYSTEM : VM_ALLOC_NORMAL;
if (fs.object->type != OBJT_VNODE &&
fs.object->backing_object == NULL)
alloc_req |= VM_ALLOC_ZERO;
fs.m = vm_page_alloc(fs.object, fs.pindex,
alloc_req);
}
if (fs.m == NULL) {
unlock_and_deallocate(&fs);
VM_WAITPFAULT;
1994-05-24 10:09:53 +00:00
goto RetryFault;
}
1994-05-24 10:09:53 +00:00
}
These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a seperate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
readrest:
/*
* At this point, we have either allocated a new page or found
* an existing page that is only partially valid.
*
* We hold a reference on the current object and the page is
* exclusive busied.
*/
/*
* If the pager for the current object might have the page,
* then determine the number of additional pages to read and
* potentially reprioritize previously read pages for earlier
* reclamation. These operations should only be performed
* once per page fault. Even if the current pager doesn't
* have the page, the number of additional pages to read will
* apply to subsequent objects in the shadow chain.
*/
if (fs.object->type != OBJT_DEFAULT && nera == -1 &&
!P_KILLED(curproc)) {
KASSERT(fs.lookup_still_valid, ("map unlocked"));
Revamp the default page clustering strategy that is used by the page fault handler. For roughly twenty years, the page fault handler has used the same basic strategy: Fetch a fixed number of non-resident pages both ahead and behind the virtual page that was faulted on. Over the years, alternative strategies have been implemented for optimizing the handling of random and sequential access patterns, but the only change to the default strategy has been to increase the number of pages read ahead to 7 and behind to 8. The problem with the default page clustering strategy becomes apparent when you look at how it behaves on the code section of an executable or shared library. (To simplify the following explanation, I'm going to ignore the read that is performed to obtain the header and assume that no pages are resident at the start of execution.) Suppose that we have a code section consisting of 32 pages. Further, suppose that we access pages 4, 28, and 16 in that order. Under the default page clustering strategy, we page fault three times and perform three I/O operations, because the first and second page faults only read a truncated cluster of 12 pages. In contrast, if we access pages 8, 24, and 16 in that order, we only fault twice and perform two I/O operations, because the first and second page faults read a full cluster of 16 pages. In general, truncated clusters are more common than full clusters. To address this problem, this revision changes the default page clustering strategy to align the start of the cluster to a page offset within the vm object that is a multiple of the cluster size. This results in many fewer truncated clusters. Returning to our example, if we now access pages 4, 28, and 16 in that order, the cluster that is read to satisfy the page fault on page 28 will now include page 16. So, the access to page 16 will no longer page fault and perform an I/O operation. Since the revised default page clustering strategy is typically reading more pages at a time, we are likely to read a few more pages that are never accessed. However, for the various programs that we looked at, including clang, emacs, firefox, and openjdk, the reduction in the number of page faults and I/O operations far outweighed the increase in the number of pages that are never accessed. Moreover, the extra resident pages allowed for many more superpage mappings. For example, if we look at the execution of clang during a buildworld, the number of (hard) page faults on the code section drops by 26%, the number of superpage mappings increases by about 29,000, but the number of never accessed pages only increases from 30.38% to 33.66%. Finally, this leads to a small but measureable reduction in execution time. In collaboration with: Emily Pettigrew <ejp1@rice.edu> Differential Revision: https://reviews.freebsd.org/D1500 Reviewed by: jhb, kib MFC after: 6 weeks
2015-01-16 18:17:09 +00:00
era = fs.entry->read_ahead;
behavior = vm_map_entry_behavior(fs.entry);
if (behavior == MAP_ENTRY_BEHAV_RANDOM) {
Revamp the default page clustering strategy that is used by the page fault handler. For roughly twenty years, the page fault handler has used the same basic strategy: Fetch a fixed number of non-resident pages both ahead and behind the virtual page that was faulted on. Over the years, alternative strategies have been implemented for optimizing the handling of random and sequential access patterns, but the only change to the default strategy has been to increase the number of pages read ahead to 7 and behind to 8. The problem with the default page clustering strategy becomes apparent when you look at how it behaves on the code section of an executable or shared library. (To simplify the following explanation, I'm going to ignore the read that is performed to obtain the header and assume that no pages are resident at the start of execution.) Suppose that we have a code section consisting of 32 pages. Further, suppose that we access pages 4, 28, and 16 in that order. Under the default page clustering strategy, we page fault three times and perform three I/O operations, because the first and second page faults only read a truncated cluster of 12 pages. In contrast, if we access pages 8, 24, and 16 in that order, we only fault twice and perform two I/O operations, because the first and second page faults read a full cluster of 16 pages. In general, truncated clusters are more common than full clusters. To address this problem, this revision changes the default page clustering strategy to align the start of the cluster to a page offset within the vm object that is a multiple of the cluster size. This results in many fewer truncated clusters. Returning to our example, if we now access pages 4, 28, and 16 in that order, the cluster that is read to satisfy the page fault on page 28 will now include page 16. So, the access to page 16 will no longer page fault and perform an I/O operation. Since the revised default page clustering strategy is typically reading more pages at a time, we are likely to read a few more pages that are never accessed. However, for the various programs that we looked at, including clang, emacs, firefox, and openjdk, the reduction in the number of page faults and I/O operations far outweighed the increase in the number of pages that are never accessed. Moreover, the extra resident pages allowed for many more superpage mappings. For example, if we look at the execution of clang during a buildworld, the number of (hard) page faults on the code section drops by 26%, the number of superpage mappings increases by about 29,000, but the number of never accessed pages only increases from 30.38% to 33.66%. Finally, this leads to a small but measureable reduction in execution time. In collaboration with: Emily Pettigrew <ejp1@rice.edu> Differential Revision: https://reviews.freebsd.org/D1500 Reviewed by: jhb, kib MFC after: 6 weeks
2015-01-16 18:17:09 +00:00
nera = 0;
} else if (behavior == MAP_ENTRY_BEHAV_SEQUENTIAL) {
Revamp the default page clustering strategy that is used by the page fault handler. For roughly twenty years, the page fault handler has used the same basic strategy: Fetch a fixed number of non-resident pages both ahead and behind the virtual page that was faulted on. Over the years, alternative strategies have been implemented for optimizing the handling of random and sequential access patterns, but the only change to the default strategy has been to increase the number of pages read ahead to 7 and behind to 8. The problem with the default page clustering strategy becomes apparent when you look at how it behaves on the code section of an executable or shared library. (To simplify the following explanation, I'm going to ignore the read that is performed to obtain the header and assume that no pages are resident at the start of execution.) Suppose that we have a code section consisting of 32 pages. Further, suppose that we access pages 4, 28, and 16 in that order. Under the default page clustering strategy, we page fault three times and perform three I/O operations, because the first and second page faults only read a truncated cluster of 12 pages. In contrast, if we access pages 8, 24, and 16 in that order, we only fault twice and perform two I/O operations, because the first and second page faults read a full cluster of 16 pages. In general, truncated clusters are more common than full clusters. To address this problem, this revision changes the default page clustering strategy to align the start of the cluster to a page offset within the vm object that is a multiple of the cluster size. This results in many fewer truncated clusters. Returning to our example, if we now access pages 4, 28, and 16 in that order, the cluster that is read to satisfy the page fault on page 28 will now include page 16. So, the access to page 16 will no longer page fault and perform an I/O operation. Since the revised default page clustering strategy is typically reading more pages at a time, we are likely to read a few more pages that are never accessed. However, for the various programs that we looked at, including clang, emacs, firefox, and openjdk, the reduction in the number of page faults and I/O operations far outweighed the increase in the number of pages that are never accessed. Moreover, the extra resident pages allowed for many more superpage mappings. For example, if we look at the execution of clang during a buildworld, the number of (hard) page faults on the code section drops by 26%, the number of superpage mappings increases by about 29,000, but the number of never accessed pages only increases from 30.38% to 33.66%. Finally, this leads to a small but measureable reduction in execution time. In collaboration with: Emily Pettigrew <ejp1@rice.edu> Differential Revision: https://reviews.freebsd.org/D1500 Reviewed by: jhb, kib MFC after: 6 weeks
2015-01-16 18:17:09 +00:00
nera = VM_FAULT_READ_AHEAD_MAX;
if (vaddr == fs.entry->next_read)
vm_fault_dontneed(&fs, vaddr, nera);
} else if (vaddr == fs.entry->next_read) {
The VFS/BIO subsystem contained a number of hacks in order to optimize piecemeal, middle-of-file writes for NFS. These hacks have caused no end of trouble, especially when combined with mmap(). I've removed them. Instead, NFS will issue a read-before-write to fully instantiate the struct buf containing the write. NFS does, however, optimize piecemeal appends to files. For most common file operations, you will not notice the difference. The sole remaining fragment in the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache coherency issues with read-merge-write style operations. NFS also optimizes the write-covers-entire-buffer case by avoiding the read-before-write. There is quite a bit of room for further optimization in these areas. The VM system marks pages fully-valid (AKA vm_page_t->valid = VM_PAGE_BITS_ALL) in several places, most noteably in vm_fault. This is not correct operation. The vm_pager_get_pages() code is now responsible for marking VM pages all-valid. A number of VM helper routines have been added to aid in zeroing-out the invalid portions of a VM page prior to the page being marked all-valid. This operation is necessary to properly support mmap(). The zeroing occurs most often when dealing with file-EOF situations. Several bugs have been fixed in the NFS subsystem, including bits handling file and directory EOF situations and buf->b_flags consistancy issues relating to clearing B_ERROR & B_INVAL, and handling B_DONE. getblk() and allocbuf() have been rewritten. B_CACHE operation is now formally defined in comments and more straightforward in implementation. B_CACHE for VMIO buffers is based on the validity of the backing store. B_CACHE for non-VMIO buffers is based simply on whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear, and vise-versa). biodone() is now responsible for setting B_CACHE when a successful read completes. B_CACHE is also set when a bdwrite() is initiated and when a bwrite() is initiated. VFS VOP_BWRITE routines (there are only two - nfs_bwrite() and bwrite()) are now expected to set B_CACHE. This means that bowrite() and bawrite() also set B_CACHE indirectly. There are a number of places in the code which were previously using buf->b_bufsize (which is DEV_BSIZE aligned) when they should have been using buf->b_bcount. These have been fixed. getblk() now clears B_DONE on return because the rest of the system is so bad about dealing with B_DONE. Major fixes to NFS/TCP have been made. A server-side bug could cause requests to be lost by the server due to nfs_realign() overwriting other rpc's in the same TCP mbuf chain. The server's kernel must be recompiled to get the benefit of the fixes. Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
1999-05-02 23:57:16 +00:00
/*
Revamp the default page clustering strategy that is used by the page fault handler. For roughly twenty years, the page fault handler has used the same basic strategy: Fetch a fixed number of non-resident pages both ahead and behind the virtual page that was faulted on. Over the years, alternative strategies have been implemented for optimizing the handling of random and sequential access patterns, but the only change to the default strategy has been to increase the number of pages read ahead to 7 and behind to 8. The problem with the default page clustering strategy becomes apparent when you look at how it behaves on the code section of an executable or shared library. (To simplify the following explanation, I'm going to ignore the read that is performed to obtain the header and assume that no pages are resident at the start of execution.) Suppose that we have a code section consisting of 32 pages. Further, suppose that we access pages 4, 28, and 16 in that order. Under the default page clustering strategy, we page fault three times and perform three I/O operations, because the first and second page faults only read a truncated cluster of 12 pages. In contrast, if we access pages 8, 24, and 16 in that order, we only fault twice and perform two I/O operations, because the first and second page faults read a full cluster of 16 pages. In general, truncated clusters are more common than full clusters. To address this problem, this revision changes the default page clustering strategy to align the start of the cluster to a page offset within the vm object that is a multiple of the cluster size. This results in many fewer truncated clusters. Returning to our example, if we now access pages 4, 28, and 16 in that order, the cluster that is read to satisfy the page fault on page 28 will now include page 16. So, the access to page 16 will no longer page fault and perform an I/O operation. Since the revised default page clustering strategy is typically reading more pages at a time, we are likely to read a few more pages that are never accessed. However, for the various programs that we looked at, including clang, emacs, firefox, and openjdk, the reduction in the number of page faults and I/O operations far outweighed the increase in the number of pages that are never accessed. Moreover, the extra resident pages allowed for many more superpage mappings. For example, if we look at the execution of clang during a buildworld, the number of (hard) page faults on the code section drops by 26%, the number of superpage mappings increases by about 29,000, but the number of never accessed pages only increases from 30.38% to 33.66%. Finally, this leads to a small but measureable reduction in execution time. In collaboration with: Emily Pettigrew <ejp1@rice.edu> Differential Revision: https://reviews.freebsd.org/D1500 Reviewed by: jhb, kib MFC after: 6 weeks
2015-01-16 18:17:09 +00:00
* This is a sequential fault. Arithmetically
* increase the requested number of pages in
* the read-ahead window. The requested
* number of pages is "# of sequential faults
* x (read ahead min + 1) + read ahead min"
The VFS/BIO subsystem contained a number of hacks in order to optimize piecemeal, middle-of-file writes for NFS. These hacks have caused no end of trouble, especially when combined with mmap(). I've removed them. Instead, NFS will issue a read-before-write to fully instantiate the struct buf containing the write. NFS does, however, optimize piecemeal appends to files. For most common file operations, you will not notice the difference. The sole remaining fragment in the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache coherency issues with read-merge-write style operations. NFS also optimizes the write-covers-entire-buffer case by avoiding the read-before-write. There is quite a bit of room for further optimization in these areas. The VM system marks pages fully-valid (AKA vm_page_t->valid = VM_PAGE_BITS_ALL) in several places, most noteably in vm_fault. This is not correct operation. The vm_pager_get_pages() code is now responsible for marking VM pages all-valid. A number of VM helper routines have been added to aid in zeroing-out the invalid portions of a VM page prior to the page being marked all-valid. This operation is necessary to properly support mmap(). The zeroing occurs most often when dealing with file-EOF situations. Several bugs have been fixed in the NFS subsystem, including bits handling file and directory EOF situations and buf->b_flags consistancy issues relating to clearing B_ERROR & B_INVAL, and handling B_DONE. getblk() and allocbuf() have been rewritten. B_CACHE operation is now formally defined in comments and more straightforward in implementation. B_CACHE for VMIO buffers is based on the validity of the backing store. B_CACHE for non-VMIO buffers is based simply on whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear, and vise-versa). biodone() is now responsible for setting B_CACHE when a successful read completes. B_CACHE is also set when a bdwrite() is initiated and when a bwrite() is initiated. VFS VOP_BWRITE routines (there are only two - nfs_bwrite() and bwrite()) are now expected to set B_CACHE. This means that bowrite() and bawrite() also set B_CACHE indirectly. There are a number of places in the code which were previously using buf->b_bufsize (which is DEV_BSIZE aligned) when they should have been using buf->b_bcount. These have been fixed. getblk() now clears B_DONE on return because the rest of the system is so bad about dealing with B_DONE. Major fixes to NFS/TCP have been made. A server-side bug could cause requests to be lost by the server due to nfs_realign() overwriting other rpc's in the same TCP mbuf chain. The server's kernel must be recompiled to get the benefit of the fixes. Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
1999-05-02 23:57:16 +00:00
*/
Revamp the default page clustering strategy that is used by the page fault handler. For roughly twenty years, the page fault handler has used the same basic strategy: Fetch a fixed number of non-resident pages both ahead and behind the virtual page that was faulted on. Over the years, alternative strategies have been implemented for optimizing the handling of random and sequential access patterns, but the only change to the default strategy has been to increase the number of pages read ahead to 7 and behind to 8. The problem with the default page clustering strategy becomes apparent when you look at how it behaves on the code section of an executable or shared library. (To simplify the following explanation, I'm going to ignore the read that is performed to obtain the header and assume that no pages are resident at the start of execution.) Suppose that we have a code section consisting of 32 pages. Further, suppose that we access pages 4, 28, and 16 in that order. Under the default page clustering strategy, we page fault three times and perform three I/O operations, because the first and second page faults only read a truncated cluster of 12 pages. In contrast, if we access pages 8, 24, and 16 in that order, we only fault twice and perform two I/O operations, because the first and second page faults read a full cluster of 16 pages. In general, truncated clusters are more common than full clusters. To address this problem, this revision changes the default page clustering strategy to align the start of the cluster to a page offset within the vm object that is a multiple of the cluster size. This results in many fewer truncated clusters. Returning to our example, if we now access pages 4, 28, and 16 in that order, the cluster that is read to satisfy the page fault on page 28 will now include page 16. So, the access to page 16 will no longer page fault and perform an I/O operation. Since the revised default page clustering strategy is typically reading more pages at a time, we are likely to read a few more pages that are never accessed. However, for the various programs that we looked at, including clang, emacs, firefox, and openjdk, the reduction in the number of page faults and I/O operations far outweighed the increase in the number of pages that are never accessed. Moreover, the extra resident pages allowed for many more superpage mappings. For example, if we look at the execution of clang during a buildworld, the number of (hard) page faults on the code section drops by 26%, the number of superpage mappings increases by about 29,000, but the number of never accessed pages only increases from 30.38% to 33.66%. Finally, this leads to a small but measureable reduction in execution time. In collaboration with: Emily Pettigrew <ejp1@rice.edu> Differential Revision: https://reviews.freebsd.org/D1500 Reviewed by: jhb, kib MFC after: 6 weeks
2015-01-16 18:17:09 +00:00
nera = VM_FAULT_READ_AHEAD_MIN;
if (era > 0) {
nera += era + 1;
if (nera > VM_FAULT_READ_AHEAD_MAX)
nera = VM_FAULT_READ_AHEAD_MAX;
Revamp the default page clustering strategy that is used by the page fault handler. For roughly twenty years, the page fault handler has used the same basic strategy: Fetch a fixed number of non-resident pages both ahead and behind the virtual page that was faulted on. Over the years, alternative strategies have been implemented for optimizing the handling of random and sequential access patterns, but the only change to the default strategy has been to increase the number of pages read ahead to 7 and behind to 8. The problem with the default page clustering strategy becomes apparent when you look at how it behaves on the code section of an executable or shared library. (To simplify the following explanation, I'm going to ignore the read that is performed to obtain the header and assume that no pages are resident at the start of execution.) Suppose that we have a code section consisting of 32 pages. Further, suppose that we access pages 4, 28, and 16 in that order. Under the default page clustering strategy, we page fault three times and perform three I/O operations, because the first and second page faults only read a truncated cluster of 12 pages. In contrast, if we access pages 8, 24, and 16 in that order, we only fault twice and perform two I/O operations, because the first and second page faults read a full cluster of 16 pages. In general, truncated clusters are more common than full clusters. To address this problem, this revision changes the default page clustering strategy to align the start of the cluster to a page offset within the vm object that is a multiple of the cluster size. This results in many fewer truncated clusters. Returning to our example, if we now access pages 4, 28, and 16 in that order, the cluster that is read to satisfy the page fault on page 28 will now include page 16. So, the access to page 16 will no longer page fault and perform an I/O operation. Since the revised default page clustering strategy is typically reading more pages at a time, we are likely to read a few more pages that are never accessed. However, for the various programs that we looked at, including clang, emacs, firefox, and openjdk, the reduction in the number of page faults and I/O operations far outweighed the increase in the number of pages that are never accessed. Moreover, the extra resident pages allowed for many more superpage mappings. For example, if we look at the execution of clang during a buildworld, the number of (hard) page faults on the code section drops by 26%, the number of superpage mappings increases by about 29,000, but the number of never accessed pages only increases from 30.38% to 33.66%. Finally, this leads to a small but measureable reduction in execution time. In collaboration with: Emily Pettigrew <ejp1@rice.edu> Differential Revision: https://reviews.freebsd.org/D1500 Reviewed by: jhb, kib MFC after: 6 weeks
2015-01-16 18:17:09 +00:00
}
if (era == VM_FAULT_READ_AHEAD_MAX)
vm_fault_dontneed(&fs, vaddr, nera);
Revamp the default page clustering strategy that is used by the page fault handler. For roughly twenty years, the page fault handler has used the same basic strategy: Fetch a fixed number of non-resident pages both ahead and behind the virtual page that was faulted on. Over the years, alternative strategies have been implemented for optimizing the handling of random and sequential access patterns, but the only change to the default strategy has been to increase the number of pages read ahead to 7 and behind to 8. The problem with the default page clustering strategy becomes apparent when you look at how it behaves on the code section of an executable or shared library. (To simplify the following explanation, I'm going to ignore the read that is performed to obtain the header and assume that no pages are resident at the start of execution.) Suppose that we have a code section consisting of 32 pages. Further, suppose that we access pages 4, 28, and 16 in that order. Under the default page clustering strategy, we page fault three times and perform three I/O operations, because the first and second page faults only read a truncated cluster of 12 pages. In contrast, if we access pages 8, 24, and 16 in that order, we only fault twice and perform two I/O operations, because the first and second page faults read a full cluster of 16 pages. In general, truncated clusters are more common than full clusters. To address this problem, this revision changes the default page clustering strategy to align the start of the cluster to a page offset within the vm object that is a multiple of the cluster size. This results in many fewer truncated clusters. Returning to our example, if we now access pages 4, 28, and 16 in that order, the cluster that is read to satisfy the page fault on page 28 will now include page 16. So, the access to page 16 will no longer page fault and perform an I/O operation. Since the revised default page clustering strategy is typically reading more pages at a time, we are likely to read a few more pages that are never accessed. However, for the various programs that we looked at, including clang, emacs, firefox, and openjdk, the reduction in the number of page faults and I/O operations far outweighed the increase in the number of pages that are never accessed. Moreover, the extra resident pages allowed for many more superpage mappings. For example, if we look at the execution of clang during a buildworld, the number of (hard) page faults on the code section drops by 26%, the number of superpage mappings increases by about 29,000, but the number of never accessed pages only increases from 30.38% to 33.66%. Finally, this leads to a small but measureable reduction in execution time. In collaboration with: Emily Pettigrew <ejp1@rice.edu> Differential Revision: https://reviews.freebsd.org/D1500 Reviewed by: jhb, kib MFC after: 6 weeks
2015-01-16 18:17:09 +00:00
} else {
/*
* This is a non-sequential fault.
Revamp the default page clustering strategy that is used by the page fault handler. For roughly twenty years, the page fault handler has used the same basic strategy: Fetch a fixed number of non-resident pages both ahead and behind the virtual page that was faulted on. Over the years, alternative strategies have been implemented for optimizing the handling of random and sequential access patterns, but the only change to the default strategy has been to increase the number of pages read ahead to 7 and behind to 8. The problem with the default page clustering strategy becomes apparent when you look at how it behaves on the code section of an executable or shared library. (To simplify the following explanation, I'm going to ignore the read that is performed to obtain the header and assume that no pages are resident at the start of execution.) Suppose that we have a code section consisting of 32 pages. Further, suppose that we access pages 4, 28, and 16 in that order. Under the default page clustering strategy, we page fault three times and perform three I/O operations, because the first and second page faults only read a truncated cluster of 12 pages. In contrast, if we access pages 8, 24, and 16 in that order, we only fault twice and perform two I/O operations, because the first and second page faults read a full cluster of 16 pages. In general, truncated clusters are more common than full clusters. To address this problem, this revision changes the default page clustering strategy to align the start of the cluster to a page offset within the vm object that is a multiple of the cluster size. This results in many fewer truncated clusters. Returning to our example, if we now access pages 4, 28, and 16 in that order, the cluster that is read to satisfy the page fault on page 28 will now include page 16. So, the access to page 16 will no longer page fault and perform an I/O operation. Since the revised default page clustering strategy is typically reading more pages at a time, we are likely to read a few more pages that are never accessed. However, for the various programs that we looked at, including clang, emacs, firefox, and openjdk, the reduction in the number of page faults and I/O operations far outweighed the increase in the number of pages that are never accessed. Moreover, the extra resident pages allowed for many more superpage mappings. For example, if we look at the execution of clang during a buildworld, the number of (hard) page faults on the code section drops by 26%, the number of superpage mappings increases by about 29,000, but the number of never accessed pages only increases from 30.38% to 33.66%. Finally, this leads to a small but measureable reduction in execution time. In collaboration with: Emily Pettigrew <ejp1@rice.edu> Differential Revision: https://reviews.freebsd.org/D1500 Reviewed by: jhb, kib MFC after: 6 weeks
2015-01-16 18:17:09 +00:00
*/
nera = 0;
}
if (era != nera) {
/*
* A read lock on the map suffices to update
* the read ahead count safely.
*/
Revamp the default page clustering strategy that is used by the page fault handler. For roughly twenty years, the page fault handler has used the same basic strategy: Fetch a fixed number of non-resident pages both ahead and behind the virtual page that was faulted on. Over the years, alternative strategies have been implemented for optimizing the handling of random and sequential access patterns, but the only change to the default strategy has been to increase the number of pages read ahead to 7 and behind to 8. The problem with the default page clustering strategy becomes apparent when you look at how it behaves on the code section of an executable or shared library. (To simplify the following explanation, I'm going to ignore the read that is performed to obtain the header and assume that no pages are resident at the start of execution.) Suppose that we have a code section consisting of 32 pages. Further, suppose that we access pages 4, 28, and 16 in that order. Under the default page clustering strategy, we page fault three times and perform three I/O operations, because the first and second page faults only read a truncated cluster of 12 pages. In contrast, if we access pages 8, 24, and 16 in that order, we only fault twice and perform two I/O operations, because the first and second page faults read a full cluster of 16 pages. In general, truncated clusters are more common than full clusters. To address this problem, this revision changes the default page clustering strategy to align the start of the cluster to a page offset within the vm object that is a multiple of the cluster size. This results in many fewer truncated clusters. Returning to our example, if we now access pages 4, 28, and 16 in that order, the cluster that is read to satisfy the page fault on page 28 will now include page 16. So, the access to page 16 will no longer page fault and perform an I/O operation. Since the revised default page clustering strategy is typically reading more pages at a time, we are likely to read a few more pages that are never accessed. However, for the various programs that we looked at, including clang, emacs, firefox, and openjdk, the reduction in the number of page faults and I/O operations far outweighed the increase in the number of pages that are never accessed. Moreover, the extra resident pages allowed for many more superpage mappings. For example, if we look at the execution of clang during a buildworld, the number of (hard) page faults on the code section drops by 26%, the number of superpage mappings increases by about 29,000, but the number of never accessed pages only increases from 30.38% to 33.66%. Finally, this leads to a small but measureable reduction in execution time. In collaboration with: Emily Pettigrew <ejp1@rice.edu> Differential Revision: https://reviews.freebsd.org/D1500 Reviewed by: jhb, kib MFC after: 6 weeks
2015-01-16 18:17:09 +00:00
fs.entry->read_ahead = nera;
}
/*
* Prepare for unlocking the map. Save the map
* entry's start and end addresses, which are used to
* optimize the size of the pager operation below.
* Even if the map entry's addresses change after
* unlocking the map, using the saved addresses is
* safe.
*/
e_start = fs.entry->start;
e_end = fs.entry->end;
}
/*
* Call the pager to retrieve the page if there is a chance
* that the pager has it, and potentially retrieve additional
* pages at the same time.
*/
if (fs.object->type != OBJT_DEFAULT) {
/*
* Release the map lock before locking the vnode or
* sleeping in the pager. (If the current object has
* a shadow, then an earlier iteration of this loop
* may have already unlocked the map.)
*/
unlock_map(&fs);
if (fs.object->type == OBJT_VNODE &&
(vp = fs.object->handle) != fs.vp) {
/*
* Perform an unlock in case the desired vnode
* changed while the map was unlocked during a
* retry.
*/
unlock_vp(&fs);
locked = VOP_ISLOCKED(vp);
if (locked != LK_EXCLUSIVE)
locked = LK_SHARED;
/*
* We must not sleep acquiring the vnode lock
* while we have the page exclusive busied or
* the object's paging-in-progress count
* incremented. Otherwise, we could deadlock.
*/
error = vget(vp, locked | LK_CANRECURSE |
LK_NOWAIT, curthread);
if (error != 0) {
vhold(vp);
release_page(&fs);
unlock_and_deallocate(&fs);
error = vget(vp, locked | LK_RETRY |
LK_CANRECURSE, curthread);
vdrop(vp);
fs.vp = vp;
KASSERT(error == 0,
("vm_fault: vget failed"));
goto RetryFault;
}
fs.vp = vp;
}
KASSERT(fs.vp == NULL || !fs.map->system_map,
("vm_fault: vnode-backed object mapped by system map"));
1994-05-24 10:09:53 +00:00
/*
* Page in the requested page and hint the pager,
* that it may bring up surrounding pages.
1994-05-24 10:09:53 +00:00
*/
if (nera == -1 || behavior == MAP_ENTRY_BEHAV_RANDOM ||
P_KILLED(curproc)) {
behind = 0;
ahead = 0;
} else {
/* Is this a sequential fault? */
if (nera > 0) {
behind = 0;
ahead = nera;
} else {
/*
* Request a cluster of pages that is
* aligned to a VM_FAULT_READ_DEFAULT
* page offset boundary within the
* object. Alignment to a page offset
* boundary is more likely to coincide
* with the underlying file system
* block than alignment to a virtual
* address boundary.
*/
cluster_offset = fs.pindex %
VM_FAULT_READ_DEFAULT;
behind = ulmin(cluster_offset,
atop(vaddr - e_start));
ahead = VM_FAULT_READ_DEFAULT - 1 -
cluster_offset;
}
ahead = ulmin(ahead, atop(e_end - vaddr) - 1);
}
rv = vm_pager_get_pages(fs.object, &fs.m, 1,
&behind, &ahead);
1994-05-24 10:09:53 +00:00
if (rv == VM_PAGER_OK) {
faultcount = behind + 1 + ahead;
hardfault = true;
break; /* break to PAGE HAS BEEN FOUND */
1994-05-24 10:09:53 +00:00
}
if (rv == VM_PAGER_ERROR)
1998-07-22 09:38:04 +00:00
printf("vm_fault: pager read error, pid %d (%s)\n",
curproc->p_pid, curproc->p_comm);
/*
* If an I/O error occurred or the requested page was
* outside the range of the pager, clean up and return
* an error.
*/
if (rv == VM_PAGER_ERROR || rv == VM_PAGER_BAD) {
vm_page_lock(fs.m);
if (fs.m->wire_count == 0)
vm_page_free(fs.m);
else
vm_page_xunbusy_maybelocked(fs.m);
vm_page_unlock(fs.m);
fs.m = NULL;
unlock_and_deallocate(&fs);
return (rv == VM_PAGER_ERROR ? KERN_FAILURE :
KERN_PROTECTION_FAILURE);
1994-05-24 10:09:53 +00:00
}
/*
* The requested page does not exist at this object/
* offset. Remove the invalid page from the object,
* waking up anyone waiting for it, and continue on to
* the next object. However, if this is the top-level
* object, we must leave the busy page in place to
* prevent another process from rushing past us, and
* inserting the page in that object at the same time
* that we are.
*/
if (fs.object != fs.first_object) {
vm_page_lock(fs.m);
if (fs.m->wire_count == 0)
vm_page_free(fs.m);
else
vm_page_xunbusy_maybelocked(fs.m);
vm_page_unlock(fs.m);
fs.m = NULL;
1994-05-24 10:09:53 +00:00
}
}
1994-05-24 10:09:53 +00:00
/*
* We get here if the object has default pager (or unwiring)
* or the pager doesn't have the page.
1994-05-24 10:09:53 +00:00
*/
if (fs.object == fs.first_object)
fs.first_m = fs.m;
1994-05-24 10:09:53 +00:00
/*
These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a seperate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
* Move on to the next object. Lock the next object before
* unlocking the current one.
1994-05-24 10:09:53 +00:00
*/
next_object = fs.object->backing_object;
1994-05-24 10:09:53 +00:00
if (next_object == NULL) {
/*
These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a seperate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
* If there's no object left, fill the page in the top
* object with zeros.
1994-05-24 10:09:53 +00:00
*/
if (fs.object != fs.first_object) {
vm_object_pip_wakeup(fs.object);
Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho
2013-03-09 02:32:23 +00:00
VM_OBJECT_WUNLOCK(fs.object);
1994-05-24 10:09:53 +00:00
fs.object = fs.first_object;
fs.pindex = fs.first_pindex;
fs.m = fs.first_m;
Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho
2013-03-09 02:32:23 +00:00
VM_OBJECT_WLOCK(fs.object);
1994-05-24 10:09:53 +00:00
}
fs.first_m = NULL;
1994-05-24 10:09:53 +00:00
The VFS/BIO subsystem contained a number of hacks in order to optimize piecemeal, middle-of-file writes for NFS. These hacks have caused no end of trouble, especially when combined with mmap(). I've removed them. Instead, NFS will issue a read-before-write to fully instantiate the struct buf containing the write. NFS does, however, optimize piecemeal appends to files. For most common file operations, you will not notice the difference. The sole remaining fragment in the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache coherency issues with read-merge-write style operations. NFS also optimizes the write-covers-entire-buffer case by avoiding the read-before-write. There is quite a bit of room for further optimization in these areas. The VM system marks pages fully-valid (AKA vm_page_t->valid = VM_PAGE_BITS_ALL) in several places, most noteably in vm_fault. This is not correct operation. The vm_pager_get_pages() code is now responsible for marking VM pages all-valid. A number of VM helper routines have been added to aid in zeroing-out the invalid portions of a VM page prior to the page being marked all-valid. This operation is necessary to properly support mmap(). The zeroing occurs most often when dealing with file-EOF situations. Several bugs have been fixed in the NFS subsystem, including bits handling file and directory EOF situations and buf->b_flags consistancy issues relating to clearing B_ERROR & B_INVAL, and handling B_DONE. getblk() and allocbuf() have been rewritten. B_CACHE operation is now formally defined in comments and more straightforward in implementation. B_CACHE for VMIO buffers is based on the validity of the backing store. B_CACHE for non-VMIO buffers is based simply on whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear, and vise-versa). biodone() is now responsible for setting B_CACHE when a successful read completes. B_CACHE is also set when a bdwrite() is initiated and when a bwrite() is initiated. VFS VOP_BWRITE routines (there are only two - nfs_bwrite() and bwrite()) are now expected to set B_CACHE. This means that bowrite() and bawrite() also set B_CACHE indirectly. There are a number of places in the code which were previously using buf->b_bufsize (which is DEV_BSIZE aligned) when they should have been using buf->b_bcount. These have been fixed. getblk() now clears B_DONE on return because the rest of the system is so bad about dealing with B_DONE. Major fixes to NFS/TCP have been made. A server-side bug could cause requests to be lost by the server due to nfs_realign() overwriting other rpc's in the same TCP mbuf chain. The server's kernel must be recompiled to get the benefit of the fixes. Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
1999-05-02 23:57:16 +00:00
/*
* Zero the page if necessary and mark it valid.
*/
if ((fs.m->flags & PG_ZERO) == 0) {
pmap_zero_page(fs.m);
The VFS/BIO subsystem contained a number of hacks in order to optimize piecemeal, middle-of-file writes for NFS. These hacks have caused no end of trouble, especially when combined with mmap(). I've removed them. Instead, NFS will issue a read-before-write to fully instantiate the struct buf containing the write. NFS does, however, optimize piecemeal appends to files. For most common file operations, you will not notice the difference. The sole remaining fragment in the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache coherency issues with read-merge-write style operations. NFS also optimizes the write-covers-entire-buffer case by avoiding the read-before-write. There is quite a bit of room for further optimization in these areas. The VM system marks pages fully-valid (AKA vm_page_t->valid = VM_PAGE_BITS_ALL) in several places, most noteably in vm_fault. This is not correct operation. The vm_pager_get_pages() code is now responsible for marking VM pages all-valid. A number of VM helper routines have been added to aid in zeroing-out the invalid portions of a VM page prior to the page being marked all-valid. This operation is necessary to properly support mmap(). The zeroing occurs most often when dealing with file-EOF situations. Several bugs have been fixed in the NFS subsystem, including bits handling file and directory EOF situations and buf->b_flags consistancy issues relating to clearing B_ERROR & B_INVAL, and handling B_DONE. getblk() and allocbuf() have been rewritten. B_CACHE operation is now formally defined in comments and more straightforward in implementation. B_CACHE for VMIO buffers is based on the validity of the backing store. B_CACHE for non-VMIO buffers is based simply on whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear, and vise-versa). biodone() is now responsible for setting B_CACHE when a successful read completes. B_CACHE is also set when a bdwrite() is initiated and when a bwrite() is initiated. VFS VOP_BWRITE routines (there are only two - nfs_bwrite() and bwrite()) are now expected to set B_CACHE. This means that bowrite() and bawrite() also set B_CACHE indirectly. There are a number of places in the code which were previously using buf->b_bufsize (which is DEV_BSIZE aligned) when they should have been using buf->b_bcount. These have been fixed. getblk() now clears B_DONE on return because the rest of the system is so bad about dealing with B_DONE. Major fixes to NFS/TCP have been made. A server-side bug could cause requests to be lost by the server due to nfs_realign() overwriting other rpc's in the same TCP mbuf chain. The server's kernel must be recompiled to get the benefit of the fixes. Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
1999-05-02 23:57:16 +00:00
} else {
VM_CNT_INC(v_ozfod);
The VFS/BIO subsystem contained a number of hacks in order to optimize piecemeal, middle-of-file writes for NFS. These hacks have caused no end of trouble, especially when combined with mmap(). I've removed them. Instead, NFS will issue a read-before-write to fully instantiate the struct buf containing the write. NFS does, however, optimize piecemeal appends to files. For most common file operations, you will not notice the difference. The sole remaining fragment in the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache coherency issues with read-merge-write style operations. NFS also optimizes the write-covers-entire-buffer case by avoiding the read-before-write. There is quite a bit of room for further optimization in these areas. The VM system marks pages fully-valid (AKA vm_page_t->valid = VM_PAGE_BITS_ALL) in several places, most noteably in vm_fault. This is not correct operation. The vm_pager_get_pages() code is now responsible for marking VM pages all-valid. A number of VM helper routines have been added to aid in zeroing-out the invalid portions of a VM page prior to the page being marked all-valid. This operation is necessary to properly support mmap(). The zeroing occurs most often when dealing with file-EOF situations. Several bugs have been fixed in the NFS subsystem, including bits handling file and directory EOF situations and buf->b_flags consistancy issues relating to clearing B_ERROR & B_INVAL, and handling B_DONE. getblk() and allocbuf() have been rewritten. B_CACHE operation is now formally defined in comments and more straightforward in implementation. B_CACHE for VMIO buffers is based on the validity of the backing store. B_CACHE for non-VMIO buffers is based simply on whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear, and vise-versa). biodone() is now responsible for setting B_CACHE when a successful read completes. B_CACHE is also set when a bdwrite() is initiated and when a bwrite() is initiated. VFS VOP_BWRITE routines (there are only two - nfs_bwrite() and bwrite()) are now expected to set B_CACHE. This means that bowrite() and bawrite() also set B_CACHE indirectly. There are a number of places in the code which were previously using buf->b_bufsize (which is DEV_BSIZE aligned) when they should have been using buf->b_bcount. These have been fixed. getblk() now clears B_DONE on return because the rest of the system is so bad about dealing with B_DONE. Major fixes to NFS/TCP have been made. A server-side bug could cause requests to be lost by the server due to nfs_realign() overwriting other rpc's in the same TCP mbuf chain. The server's kernel must be recompiled to get the benefit of the fixes. Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
1999-05-02 23:57:16 +00:00
}
VM_CNT_INC(v_zfod);
The VFS/BIO subsystem contained a number of hacks in order to optimize piecemeal, middle-of-file writes for NFS. These hacks have caused no end of trouble, especially when combined with mmap(). I've removed them. Instead, NFS will issue a read-before-write to fully instantiate the struct buf containing the write. NFS does, however, optimize piecemeal appends to files. For most common file operations, you will not notice the difference. The sole remaining fragment in the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache coherency issues with read-merge-write style operations. NFS also optimizes the write-covers-entire-buffer case by avoiding the read-before-write. There is quite a bit of room for further optimization in these areas. The VM system marks pages fully-valid (AKA vm_page_t->valid = VM_PAGE_BITS_ALL) in several places, most noteably in vm_fault. This is not correct operation. The vm_pager_get_pages() code is now responsible for marking VM pages all-valid. A number of VM helper routines have been added to aid in zeroing-out the invalid portions of a VM page prior to the page being marked all-valid. This operation is necessary to properly support mmap(). The zeroing occurs most often when dealing with file-EOF situations. Several bugs have been fixed in the NFS subsystem, including bits handling file and directory EOF situations and buf->b_flags consistancy issues relating to clearing B_ERROR & B_INVAL, and handling B_DONE. getblk() and allocbuf() have been rewritten. B_CACHE operation is now formally defined in comments and more straightforward in implementation. B_CACHE for VMIO buffers is based on the validity of the backing store. B_CACHE for non-VMIO buffers is based simply on whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear, and vise-versa). biodone() is now responsible for setting B_CACHE when a successful read completes. B_CACHE is also set when a bdwrite() is initiated and when a bwrite() is initiated. VFS VOP_BWRITE routines (there are only two - nfs_bwrite() and bwrite()) are now expected to set B_CACHE. This means that bowrite() and bawrite() also set B_CACHE indirectly. There are a number of places in the code which were previously using buf->b_bufsize (which is DEV_BSIZE aligned) when they should have been using buf->b_bcount. These have been fixed. getblk() now clears B_DONE on return because the rest of the system is so bad about dealing with B_DONE. Major fixes to NFS/TCP have been made. A server-side bug could cause requests to be lost by the server due to nfs_realign() overwriting other rpc's in the same TCP mbuf chain. The server's kernel must be recompiled to get the benefit of the fixes. Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
1999-05-02 23:57:16 +00:00
fs.m->valid = VM_PAGE_BITS_ALL;
/* Don't try to prefault neighboring pages. */
faultcount = 1;
break; /* break to PAGE HAS BEEN FOUND */
These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a seperate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
} else {
KASSERT(fs.object != next_object,
("object loop %p", next_object));
Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho
2013-03-09 02:32:23 +00:00
VM_OBJECT_WLOCK(next_object);
vm_object_pip_add(next_object, 1);
if (fs.object != fs.first_object)
vm_object_pip_wakeup(fs.object);
fs.pindex +=
OFF_TO_IDX(fs.object->backing_object_offset);
Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho
2013-03-09 02:32:23 +00:00
VM_OBJECT_WUNLOCK(fs.object);
fs.object = next_object;
1994-05-24 10:09:53 +00:00
}
}
vm_page_assert_xbusied(fs.m);
1994-05-24 10:09:53 +00:00
/*
These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a seperate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
* PAGE HAS BEEN FOUND. [Loop invariant still holds -- the object lock
* is held.]
1994-05-24 10:09:53 +00:00
*/
/*
These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a seperate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
* If the page is being written, but isn't already owned by the
* top-level object, we have to copy it into a new page owned by the
* top-level object.
1994-05-24 10:09:53 +00:00
*/
if (fs.object != fs.first_object) {
These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a seperate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
/*
* We only really need to copy if we want to write it.
1994-05-24 10:09:53 +00:00
*/
if ((fault_type & (VM_PROT_COPY | VM_PROT_WRITE)) != 0) {
These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a seperate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
/*
* This allows pages to be virtually copied from a
* backing_object into the first_object, where the
* backing object has no other refs to it, and cannot
* gain any more refs. Instead of a bcopy, we just
* move the page from the backing object to the
* first object. Note that we must mark the page
* dirty in the first object so that it will go out
* to swap when needed.
1994-05-24 10:09:53 +00:00
*/
is_first_object_locked = false;
if (
/*
* Only one shadow object
*/
(fs.object->shadow_count == 1) &&
/*
* No COW refs, except us
*/
(fs.object->ref_count == 1) &&
/*
* No one else can look this object up
*/
(fs.object->handle == NULL) &&
/*
* No other ways to look the object up
*/
((fs.object->type == OBJT_DEFAULT) ||
(fs.object->type == OBJT_SWAP)) &&
Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho
2013-03-09 02:32:23 +00:00
(is_first_object_locked = VM_OBJECT_TRYWLOCK(fs.first_object)) &&
/*
* We don't chase down the shadow chain
*/
fs.object == fs.first_object->backing_object) {
vm_page_lock(fs.m);
vm_page_remove(fs.m);
vm_page_unlock(fs.m);
vm_page_lock(fs.first_m);
vm_page_replace_checked(fs.m, fs.first_object,
fs.first_pindex, fs.first_m);
vm_page_free(fs.first_m);
vm_page_unlock(fs.first_m);
vm_page_dirty(fs.m);
Fix the root cause of the "vm_reserv_populate: reserv <address> is already promoted" panics. The sequence of events that leads to a panic is rather long and circuitous. First, suppose that process P has a promoted superpage S within vm object O that it can write to. Then, suppose that P forks, which leads to S being write protected. Now, before P's child exits, suppose that P writes to another virtual page within O. Since the pages within O are copy on write, a shadow object for O is created to house the new physical copy of the faulted on virtual page. Then, before P can fault on S, P's child exists. Now, when P faults on S, it will follow the "optimized" path for copy-on-write faults in vm_fault(), wherein the underlying physical page is moved from O to its shadow object rather than allocating a new page and copying the new page's contents from the old page. Moreover, suppose that every 4 KB physical page making up S is moved to the shadow object in this way. However, the optimized path does not move the underlying superpage reservation, which is the root cause of the panics! Ultimately, P performs vm_object_collapse() on O's shadow object, which destroys O and in doing so breaks any reservations still belonging to O. This leaves the reservation underlying S in an inconsistent state: It's simultaneously not in use and promoted. Breaking a reservation does not demote it because I never intended for a promoted reservation to be broken. It makes little sense. Finally, this inconsistency leads to an assertion failure the next time that the reservation is used. The failing assertion does not (currently) exist in FreeBSD 10.x or earlier. There, we will quietly break the promoted reservation. While illogical and unintended, breaking the reservation is essentially harmless. PR: 198163 Reviewed by: kib Tested by: pho X-MFC after: r267213 Sponsored by: EMC / Isilon Storage Division
2015-03-19 01:40:43 +00:00
#if VM_NRESERVLEVEL > 0
/*
* Rename the reservation.
*/
vm_reserv_rename(fs.m, fs.first_object,
fs.object, OFF_TO_IDX(
fs.first_object->backing_object_offset));
#endif
/*
* Removing the page from the backing object
* unbusied it.
*/
vm_page_xbusy(fs.m);
fs.first_m = fs.m;
fs.m = NULL;
VM_CNT_INC(v_cow_optim);
} else {
/*
* Oh, well, lets copy it.
*/
pmap_copy_page(fs.m, fs.first_m);
fs.first_m->valid = VM_PAGE_BITS_ALL;
if (wired && (fault_flags &
VM_FAULT_WIRE) == 0) {
vm_page_lock(fs.first_m);
vm_page_wire(fs.first_m);
vm_page_unlock(fs.first_m);
vm_page_lock(fs.m);
vm_page_unwire(fs.m, PQ_INACTIVE);
vm_page_unlock(fs.m);
}
/*
* We no longer need the old page or object.
*/
release_page(&fs);
}
/*
* fs.object != fs.first_object due to above
* conditional
*/
vm_object_pip_wakeup(fs.object);
Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho
2013-03-09 02:32:23 +00:00
VM_OBJECT_WUNLOCK(fs.object);
1994-05-24 10:09:53 +00:00
/*
These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a seperate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
* Only use the new page below...
1994-05-24 10:09:53 +00:00
*/
fs.object = fs.first_object;
fs.pindex = fs.first_pindex;
fs.m = fs.first_m;
if (!is_first_object_locked)
Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho
2013-03-09 02:32:23 +00:00
VM_OBJECT_WLOCK(fs.object);
VM_CNT_INC(v_cow_faults);
curthread->td_cow++;
These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a seperate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
} else {
prot &= ~VM_PROT_WRITE;
1994-05-24 10:09:53 +00:00
}
}
/*
These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a seperate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
* We must verify that the maps have not changed since our last
* lookup.
1994-05-24 10:09:53 +00:00
*/
if (!fs.lookup_still_valid) {
if (!vm_map_trylock_read(fs.map)) {
release_page(&fs);
unlock_and_deallocate(&fs);
goto RetryFault;
}
fs.lookup_still_valid = true;
if (fs.map->timestamp != fs.map_generation) {
result = vm_map_lookup_locked(&fs.map, vaddr, fault_type,
&fs.entry, &retry_object, &retry_pindex, &retry_prot, &wired);
/*
* If we don't need the page any longer, put it on the inactive
* list (the easiest thing to do here). If no one needs it,
* pageout will grab it eventually.
*/
if (result != KERN_SUCCESS) {
release_page(&fs);
unlock_and_deallocate(&fs);
1994-05-24 10:09:53 +00:00
/*
* If retry of map lookup would have blocked then
* retry fault from start.
*/
if (result == KERN_FAILURE)
goto RetryFault;
return (result);
}
if ((retry_object != fs.first_object) ||
(retry_pindex != fs.first_pindex)) {
release_page(&fs);
unlock_and_deallocate(&fs);
goto RetryFault;
}
1994-05-24 10:09:53 +00:00
/*
* Check whether the protection has changed or the object has
* been copied while we left the map unlocked. Changing from
* read to write permission is OK - we leave the page
* write-protected, and catch the write fault. Changing from
* write to read permission means that we can't mark the page
* write-enabled after all.
*/
prot &= retry_prot;
1994-05-24 10:09:53 +00:00
}
}
/*
* If the page was filled by a pager, save the virtual address that
* should be faulted on next under a sequential access pattern to the
* map entry. A read lock on the map suffices to update this address
* safely.
*/
if (hardfault)
fs.entry->next_read = vaddr + ptoa(ahead) + PAGE_SIZE;
vm_fault_dirty(fs.entry, fs.m, prot, fault_type, fault_flags, true);
vm_page_assert_xbusied(fs.m);
The VFS/BIO subsystem contained a number of hacks in order to optimize piecemeal, middle-of-file writes for NFS. These hacks have caused no end of trouble, especially when combined with mmap(). I've removed them. Instead, NFS will issue a read-before-write to fully instantiate the struct buf containing the write. NFS does, however, optimize piecemeal appends to files. For most common file operations, you will not notice the difference. The sole remaining fragment in the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache coherency issues with read-merge-write style operations. NFS also optimizes the write-covers-entire-buffer case by avoiding the read-before-write. There is quite a bit of room for further optimization in these areas. The VM system marks pages fully-valid (AKA vm_page_t->valid = VM_PAGE_BITS_ALL) in several places, most noteably in vm_fault. This is not correct operation. The vm_pager_get_pages() code is now responsible for marking VM pages all-valid. A number of VM helper routines have been added to aid in zeroing-out the invalid portions of a VM page prior to the page being marked all-valid. This operation is necessary to properly support mmap(). The zeroing occurs most often when dealing with file-EOF situations. Several bugs have been fixed in the NFS subsystem, including bits handling file and directory EOF situations and buf->b_flags consistancy issues relating to clearing B_ERROR & B_INVAL, and handling B_DONE. getblk() and allocbuf() have been rewritten. B_CACHE operation is now formally defined in comments and more straightforward in implementation. B_CACHE for VMIO buffers is based on the validity of the backing store. B_CACHE for non-VMIO buffers is based simply on whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear, and vise-versa). biodone() is now responsible for setting B_CACHE when a successful read completes. B_CACHE is also set when a bdwrite() is initiated and when a bwrite() is initiated. VFS VOP_BWRITE routines (there are only two - nfs_bwrite() and bwrite()) are now expected to set B_CACHE. This means that bowrite() and bawrite() also set B_CACHE indirectly. There are a number of places in the code which were previously using buf->b_bufsize (which is DEV_BSIZE aligned) when they should have been using buf->b_bcount. These have been fixed. getblk() now clears B_DONE on return because the rest of the system is so bad about dealing with B_DONE. Major fixes to NFS/TCP have been made. A server-side bug could cause requests to be lost by the server due to nfs_realign() overwriting other rpc's in the same TCP mbuf chain. The server's kernel must be recompiled to get the benefit of the fixes. Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
1999-05-02 23:57:16 +00:00
/*
* Page must be completely valid or it is not fit to
The VFS/BIO subsystem contained a number of hacks in order to optimize piecemeal, middle-of-file writes for NFS. These hacks have caused no end of trouble, especially when combined with mmap(). I've removed them. Instead, NFS will issue a read-before-write to fully instantiate the struct buf containing the write. NFS does, however, optimize piecemeal appends to files. For most common file operations, you will not notice the difference. The sole remaining fragment in the VFS/BIO system is b_dirtyoff/end, which NFS uses to avoid cache coherency issues with read-merge-write style operations. NFS also optimizes the write-covers-entire-buffer case by avoiding the read-before-write. There is quite a bit of room for further optimization in these areas. The VM system marks pages fully-valid (AKA vm_page_t->valid = VM_PAGE_BITS_ALL) in several places, most noteably in vm_fault. This is not correct operation. The vm_pager_get_pages() code is now responsible for marking VM pages all-valid. A number of VM helper routines have been added to aid in zeroing-out the invalid portions of a VM page prior to the page being marked all-valid. This operation is necessary to properly support mmap(). The zeroing occurs most often when dealing with file-EOF situations. Several bugs have been fixed in the NFS subsystem, including bits handling file and directory EOF situations and buf->b_flags consistancy issues relating to clearing B_ERROR & B_INVAL, and handling B_DONE. getblk() and allocbuf() have been rewritten. B_CACHE operation is now formally defined in comments and more straightforward in implementation. B_CACHE for VMIO buffers is based on the validity of the backing store. B_CACHE for non-VMIO buffers is based simply on whether the buffer is B_INVAL or not (B_CACHE set if B_INVAL clear, and vise-versa). biodone() is now responsible for setting B_CACHE when a successful read completes. B_CACHE is also set when a bdwrite() is initiated and when a bwrite() is initiated. VFS VOP_BWRITE routines (there are only two - nfs_bwrite() and bwrite()) are now expected to set B_CACHE. This means that bowrite() and bawrite() also set B_CACHE indirectly. There are a number of places in the code which were previously using buf->b_bufsize (which is DEV_BSIZE aligned) when they should have been using buf->b_bcount. These have been fixed. getblk() now clears B_DONE on return because the rest of the system is so bad about dealing with B_DONE. Major fixes to NFS/TCP have been made. A server-side bug could cause requests to be lost by the server due to nfs_realign() overwriting other rpc's in the same TCP mbuf chain. The server's kernel must be recompiled to get the benefit of the fixes. Submitted by: Matthew Dillon <dillon@apollo.backplane.com>
1999-05-02 23:57:16 +00:00
* map into user space. vm_pager_get_pages() ensures this.
*/
KASSERT(fs.m->valid == VM_PAGE_BITS_ALL,
("vm_fault: page %p partially invalid", fs.m));
Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho
2013-03-09 02:32:23 +00:00
VM_OBJECT_WUNLOCK(fs.object);
/*
* Put this page into the physical map. We had to do the unlock above
* because pmap_enter() may sleep. We don't put the page
* back on the active queue until later so that the pageout daemon
* won't find it (yet).
*/
pmap_enter(fs.map->pmap, vaddr, fs.m, prot,
fault_type | (wired ? PMAP_ENTER_WIRED : 0), 0);
if (faultcount != 1 && (fault_flags & VM_FAULT_WIRE) == 0 &&
wired == 0)
vm_fault_prefault(&fs, vaddr,
faultcount > 0 ? behind : PFBAK,
faultcount > 0 ? ahead : PFFOR);
Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho
2013-03-09 02:32:23 +00:00
VM_OBJECT_WLOCK(fs.object);
vm_page_lock(fs.m);
1994-05-24 10:09:53 +00:00
/*
These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a seperate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
* If the page is not wired down, then put it where the pageout daemon
* can find it.
1994-05-24 10:09:53 +00:00
*/
if ((fault_flags & VM_FAULT_WIRE) != 0) {
KASSERT(wired, ("VM_FAULT_WIRE && !wired"));
vm_page_wire(fs.m);
} else
vm_page_activate(fs.m);
if (m_hold != NULL) {
*m_hold = fs.m;
vm_page_hold(fs.m);
}
vm_page_unlock(fs.m);
vm_page_xunbusy(fs.m);
/*
* Unlock everything, and return
*/
unlock_and_deallocate(&fs);
if (hardfault) {
VM_CNT_INC(v_io_faults);
curthread->td_ru.ru_majflt++;
#ifdef RACCT
if (racct_enable && fs.object->type == OBJT_VNODE) {
PROC_LOCK(curproc);
if ((fault_type & (VM_PROT_COPY | VM_PROT_WRITE)) != 0) {
racct_add_force(curproc, RACCT_WRITEBPS,
PAGE_SIZE + behind * PAGE_SIZE);
racct_add_force(curproc, RACCT_WRITEIOPS, 1);
} else {
racct_add_force(curproc, RACCT_READBPS,
PAGE_SIZE + ahead * PAGE_SIZE);
racct_add_force(curproc, RACCT_READIOPS, 1);
}
PROC_UNLOCK(curproc);
}
#endif
} else
curthread->td_ru.ru_minflt++;
1994-05-24 10:09:53 +00:00
These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a seperate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
return (KERN_SUCCESS);
1994-05-24 10:09:53 +00:00
}
/*
Replace vm_fault()'s heuristic for automatic cache behind with a heuristic that performs the equivalent of an automatic madvise(..., MADV_DONTNEED). The current heuristic, even with the improvements that I made a few years ago, is a good example of making the wrong trade-off, or optimizing for the infrequent case. The infrequent case being reading a single file that is much larger than memory using mmap(2). And, in this case, the page daemon isn't the bottleneck; it's the I/O. In all other cases, the current heuristic has too many false positives, i.e., it caches too many pages that are later reused. To give one example, thousands of pages are cached by the current heuristic during a buildworld and all of them are reactivated before the buildworld completes. In particular, clang reads source files using mmap(2) and there are some relatively large source files in our source tree, e.g., sqlite, that are read multiple times. With the new heuristic, I see fewer false positives and they have a much lower cost. I actually tried something like this more than two years ago and it didn't perform as well as the cache behind heuristic. However, that was before the changes to the page daemon in late summer of 2013 and the existence of pmap_advise(). In particular, with the page daemon doing its work more frequently and in smaller batches, it now completes its work while the application accessing the file is blocked on I/O. Whereas previously, the page daemon appeared to hog the CPU for so long that it caused "hiccups" in the application's execution. Finally, I'll add that the elimination of cache pages is a prerequisite for NUMA support. Reviewed by: jeff, kib Sponsored by: EMC / Isilon Storage Division
2015-04-04 19:10:22 +00:00
* Speed up the reclamation of pages that precede the faulting pindex within
* the first object of the shadow chain. Essentially, perform the equivalent
* to madvise(..., MADV_DONTNEED) on a large cluster of pages that precedes
* the faulting pindex by the cluster size when the pages read by vm_fault()
* cross a cluster-size boundary. The cluster size is the greater of the
* smallest superpage size and VM_FAULT_DONTNEED_MIN.
*
* When "fs->first_object" is a shadow object, the pages in the backing object
* that precede the faulting pindex are deactivated by vm_fault(). So, this
* function must only be concerned with pages in the first object.
*/
static void
Replace vm_fault()'s heuristic for automatic cache behind with a heuristic that performs the equivalent of an automatic madvise(..., MADV_DONTNEED). The current heuristic, even with the improvements that I made a few years ago, is a good example of making the wrong trade-off, or optimizing for the infrequent case. The infrequent case being reading a single file that is much larger than memory using mmap(2). And, in this case, the page daemon isn't the bottleneck; it's the I/O. In all other cases, the current heuristic has too many false positives, i.e., it caches too many pages that are later reused. To give one example, thousands of pages are cached by the current heuristic during a buildworld and all of them are reactivated before the buildworld completes. In particular, clang reads source files using mmap(2) and there are some relatively large source files in our source tree, e.g., sqlite, that are read multiple times. With the new heuristic, I see fewer false positives and they have a much lower cost. I actually tried something like this more than two years ago and it didn't perform as well as the cache behind heuristic. However, that was before the changes to the page daemon in late summer of 2013 and the existence of pmap_advise(). In particular, with the page daemon doing its work more frequently and in smaller batches, it now completes its work while the application accessing the file is blocked on I/O. Whereas previously, the page daemon appeared to hog the CPU for so long that it caused "hiccups" in the application's execution. Finally, I'll add that the elimination of cache pages is a prerequisite for NUMA support. Reviewed by: jeff, kib Sponsored by: EMC / Isilon Storage Division
2015-04-04 19:10:22 +00:00
vm_fault_dontneed(const struct faultstate *fs, vm_offset_t vaddr, int ahead)
{
Replace vm_fault()'s heuristic for automatic cache behind with a heuristic that performs the equivalent of an automatic madvise(..., MADV_DONTNEED). The current heuristic, even with the improvements that I made a few years ago, is a good example of making the wrong trade-off, or optimizing for the infrequent case. The infrequent case being reading a single file that is much larger than memory using mmap(2). And, in this case, the page daemon isn't the bottleneck; it's the I/O. In all other cases, the current heuristic has too many false positives, i.e., it caches too many pages that are later reused. To give one example, thousands of pages are cached by the current heuristic during a buildworld and all of them are reactivated before the buildworld completes. In particular, clang reads source files using mmap(2) and there are some relatively large source files in our source tree, e.g., sqlite, that are read multiple times. With the new heuristic, I see fewer false positives and they have a much lower cost. I actually tried something like this more than two years ago and it didn't perform as well as the cache behind heuristic. However, that was before the changes to the page daemon in late summer of 2013 and the existence of pmap_advise(). In particular, with the page daemon doing its work more frequently and in smaller batches, it now completes its work while the application accessing the file is blocked on I/O. Whereas previously, the page daemon appeared to hog the CPU for so long that it caused "hiccups" in the application's execution. Finally, I'll add that the elimination of cache pages is a prerequisite for NUMA support. Reviewed by: jeff, kib Sponsored by: EMC / Isilon Storage Division
2015-04-04 19:10:22 +00:00
vm_map_entry_t entry;
vm_object_t first_object, object;
Replace vm_fault()'s heuristic for automatic cache behind with a heuristic that performs the equivalent of an automatic madvise(..., MADV_DONTNEED). The current heuristic, even with the improvements that I made a few years ago, is a good example of making the wrong trade-off, or optimizing for the infrequent case. The infrequent case being reading a single file that is much larger than memory using mmap(2). And, in this case, the page daemon isn't the bottleneck; it's the I/O. In all other cases, the current heuristic has too many false positives, i.e., it caches too many pages that are later reused. To give one example, thousands of pages are cached by the current heuristic during a buildworld and all of them are reactivated before the buildworld completes. In particular, clang reads source files using mmap(2) and there are some relatively large source files in our source tree, e.g., sqlite, that are read multiple times. With the new heuristic, I see fewer false positives and they have a much lower cost. I actually tried something like this more than two years ago and it didn't perform as well as the cache behind heuristic. However, that was before the changes to the page daemon in late summer of 2013 and the existence of pmap_advise(). In particular, with the page daemon doing its work more frequently and in smaller batches, it now completes its work while the application accessing the file is blocked on I/O. Whereas previously, the page daemon appeared to hog the CPU for so long that it caused "hiccups" in the application's execution. Finally, I'll add that the elimination of cache pages is a prerequisite for NUMA support. Reviewed by: jeff, kib Sponsored by: EMC / Isilon Storage Division
2015-04-04 19:10:22 +00:00
vm_offset_t end, start;
vm_page_t m, m_next;
vm_pindex_t pend, pstart;
vm_size_t size;
object = fs->object;
Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho
2013-03-09 02:32:23 +00:00
VM_OBJECT_ASSERT_WLOCKED(object);
first_object = fs->first_object;
if (first_object != object) {
if (!VM_OBJECT_TRYWLOCK(first_object)) {
Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho
2013-03-09 02:32:23 +00:00
VM_OBJECT_WUNLOCK(object);
VM_OBJECT_WLOCK(first_object);
Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho
2013-03-09 02:32:23 +00:00
VM_OBJECT_WLOCK(object);
}
}
Replace vm_fault()'s heuristic for automatic cache behind with a heuristic that performs the equivalent of an automatic madvise(..., MADV_DONTNEED). The current heuristic, even with the improvements that I made a few years ago, is a good example of making the wrong trade-off, or optimizing for the infrequent case. The infrequent case being reading a single file that is much larger than memory using mmap(2). And, in this case, the page daemon isn't the bottleneck; it's the I/O. In all other cases, the current heuristic has too many false positives, i.e., it caches too many pages that are later reused. To give one example, thousands of pages are cached by the current heuristic during a buildworld and all of them are reactivated before the buildworld completes. In particular, clang reads source files using mmap(2) and there are some relatively large source files in our source tree, e.g., sqlite, that are read multiple times. With the new heuristic, I see fewer false positives and they have a much lower cost. I actually tried something like this more than two years ago and it didn't perform as well as the cache behind heuristic. However, that was before the changes to the page daemon in late summer of 2013 and the existence of pmap_advise(). In particular, with the page daemon doing its work more frequently and in smaller batches, it now completes its work while the application accessing the file is blocked on I/O. Whereas previously, the page daemon appeared to hog the CPU for so long that it caused "hiccups" in the application's execution. Finally, I'll add that the elimination of cache pages is a prerequisite for NUMA support. Reviewed by: jeff, kib Sponsored by: EMC / Isilon Storage Division
2015-04-04 19:10:22 +00:00
/* Neither fictitious nor unmanaged pages can be reclaimed. */
if ((first_object->flags & (OBJ_FICTITIOUS | OBJ_UNMANAGED)) == 0) {
Replace vm_fault()'s heuristic for automatic cache behind with a heuristic that performs the equivalent of an automatic madvise(..., MADV_DONTNEED). The current heuristic, even with the improvements that I made a few years ago, is a good example of making the wrong trade-off, or optimizing for the infrequent case. The infrequent case being reading a single file that is much larger than memory using mmap(2). And, in this case, the page daemon isn't the bottleneck; it's the I/O. In all other cases, the current heuristic has too many false positives, i.e., it caches too many pages that are later reused. To give one example, thousands of pages are cached by the current heuristic during a buildworld and all of them are reactivated before the buildworld completes. In particular, clang reads source files using mmap(2) and there are some relatively large source files in our source tree, e.g., sqlite, that are read multiple times. With the new heuristic, I see fewer false positives and they have a much lower cost. I actually tried something like this more than two years ago and it didn't perform as well as the cache behind heuristic. However, that was before the changes to the page daemon in late summer of 2013 and the existence of pmap_advise(). In particular, with the page daemon doing its work more frequently and in smaller batches, it now completes its work while the application accessing the file is blocked on I/O. Whereas previously, the page daemon appeared to hog the CPU for so long that it caused "hiccups" in the application's execution. Finally, I'll add that the elimination of cache pages is a prerequisite for NUMA support. Reviewed by: jeff, kib Sponsored by: EMC / Isilon Storage Division
2015-04-04 19:10:22 +00:00
size = VM_FAULT_DONTNEED_MIN;
if (MAXPAGESIZES > 1 && size < pagesizes[1])
size = pagesizes[1];
end = rounddown2(vaddr, size);
if (vaddr - end >= size - PAGE_SIZE - ptoa(ahead) &&
(entry = fs->entry)->start < end) {
if (end - entry->start < size)
start = entry->start;
else
start = end - size;
pmap_advise(fs->map->pmap, start, end, MADV_DONTNEED);
pstart = OFF_TO_IDX(entry->offset) + atop(start -
entry->start);
m_next = vm_page_find_least(first_object, pstart);
pend = OFF_TO_IDX(entry->offset) + atop(end -
entry->start);
while ((m = m_next) != NULL && m->pindex < pend) {
m_next = TAILQ_NEXT(m, listq);
if (m->valid != VM_PAGE_BITS_ALL ||
vm_page_busied(m))
continue;
/*
* Don't clear PGA_REFERENCED, since it would
* likely represent a reference by a different
* process.
*
* Typically, at this point, prefetched pages
* are still in the inactive queue. Only
* pages that triggered page faults are in the
* active queue.
*/
Replace vm_fault()'s heuristic for automatic cache behind with a heuristic that performs the equivalent of an automatic madvise(..., MADV_DONTNEED). The current heuristic, even with the improvements that I made a few years ago, is a good example of making the wrong trade-off, or optimizing for the infrequent case. The infrequent case being reading a single file that is much larger than memory using mmap(2). And, in this case, the page daemon isn't the bottleneck; it's the I/O. In all other cases, the current heuristic has too many false positives, i.e., it caches too many pages that are later reused. To give one example, thousands of pages are cached by the current heuristic during a buildworld and all of them are reactivated before the buildworld completes. In particular, clang reads source files using mmap(2) and there are some relatively large source files in our source tree, e.g., sqlite, that are read multiple times. With the new heuristic, I see fewer false positives and they have a much lower cost. I actually tried something like this more than two years ago and it didn't perform as well as the cache behind heuristic. However, that was before the changes to the page daemon in late summer of 2013 and the existence of pmap_advise(). In particular, with the page daemon doing its work more frequently and in smaller batches, it now completes its work while the application accessing the file is blocked on I/O. Whereas previously, the page daemon appeared to hog the CPU for so long that it caused "hiccups" in the application's execution. Finally, I'll add that the elimination of cache pages is a prerequisite for NUMA support. Reviewed by: jeff, kib Sponsored by: EMC / Isilon Storage Division
2015-04-04 19:10:22 +00:00
vm_page_lock(m);
vm_page_deactivate(m);
Replace vm_fault()'s heuristic for automatic cache behind with a heuristic that performs the equivalent of an automatic madvise(..., MADV_DONTNEED). The current heuristic, even with the improvements that I made a few years ago, is a good example of making the wrong trade-off, or optimizing for the infrequent case. The infrequent case being reading a single file that is much larger than memory using mmap(2). And, in this case, the page daemon isn't the bottleneck; it's the I/O. In all other cases, the current heuristic has too many false positives, i.e., it caches too many pages that are later reused. To give one example, thousands of pages are cached by the current heuristic during a buildworld and all of them are reactivated before the buildworld completes. In particular, clang reads source files using mmap(2) and there are some relatively large source files in our source tree, e.g., sqlite, that are read multiple times. With the new heuristic, I see fewer false positives and they have a much lower cost. I actually tried something like this more than two years ago and it didn't perform as well as the cache behind heuristic. However, that was before the changes to the page daemon in late summer of 2013 and the existence of pmap_advise(). In particular, with the page daemon doing its work more frequently and in smaller batches, it now completes its work while the application accessing the file is blocked on I/O. Whereas previously, the page daemon appeared to hog the CPU for so long that it caused "hiccups" in the application's execution. Finally, I'll add that the elimination of cache pages is a prerequisite for NUMA support. Reviewed by: jeff, kib Sponsored by: EMC / Isilon Storage Division
2015-04-04 19:10:22 +00:00
vm_page_unlock(m);
}
}
}
if (first_object != object)
VM_OBJECT_WUNLOCK(first_object);
}
/*
* vm_fault_prefault provides a quick way of clustering
* pagefaults into a processes address space. It is a "cousin"
* of vm_map_pmap_enter, except it runs at page fault time instead
* of mmap time.
*/
static void
vm_fault_prefault(const struct faultstate *fs, vm_offset_t addra,
int backward, int forward)
{
pmap_t pmap;
vm_map_entry_t entry;
vm_object_t backing_object, lobject;
vm_offset_t addr, starta;
vm_pindex_t pindex;
vm_page_t m;
int i;
pmap = fs->map->pmap;
if (pmap != vmspace_pmap(curthread->td_proc->p_vmspace))
return;
entry = fs->entry;
if (addra < backward * PAGE_SIZE) {
starta = entry->start;
} else {
starta = addra - backward * PAGE_SIZE;
if (starta < entry->start)
starta = entry->start;
}
/*
* Generate the sequence of virtual addresses that are candidates for
* prefaulting in an outward spiral from the faulting virtual address,
* "addra". Specifically, the sequence is "addra - PAGE_SIZE", "addra
* + PAGE_SIZE", "addra - 2 * PAGE_SIZE", "addra + 2 * PAGE_SIZE", ...
* If the candidate address doesn't have a backing physical page, then
* the loop immediately terminates.
*/
for (i = 0; i < 2 * imax(backward, forward); i++) {
addr = addra + ((i >> 1) + 1) * ((i & 1) == 0 ? -PAGE_SIZE :
PAGE_SIZE);
if (addr > addra + forward * PAGE_SIZE)
addr = 0;
if (addr < starta || addr >= entry->end)
continue;
if (!pmap_is_prefaultable(pmap, addr))
continue;
pindex = ((addr - entry->start) + entry->offset) >> PAGE_SHIFT;
lobject = entry->object.vm_object;
VM_OBJECT_RLOCK(lobject);
while ((m = vm_page_lookup(lobject, pindex)) == NULL &&
lobject->type == OBJT_DEFAULT &&
(backing_object = lobject->backing_object) != NULL) {
KASSERT((lobject->backing_object_offset & PAGE_MASK) ==
0, ("vm_fault_prefault: unaligned object offset"));
pindex += lobject->backing_object_offset >> PAGE_SHIFT;
VM_OBJECT_RLOCK(backing_object);
VM_OBJECT_RUNLOCK(lobject);
lobject = backing_object;
}
if (m == NULL) {
VM_OBJECT_RUNLOCK(lobject);
break;
}
if (m->valid == VM_PAGE_BITS_ALL &&
(m->flags & PG_FICTITIOUS) == 0)
Change the management of cached pages (PQ_CACHE) in two fundamental ways: (1) Cached pages are no longer kept in the object's resident page splay tree and memq. Instead, they are kept in a separate per-object splay tree of cached pages. However, access to this new per-object splay tree is synchronized by the _free_ page queues lock, not to be confused with the heavily contended page queues lock. Consequently, a cached page can be reclaimed by vm_page_alloc(9) without acquiring the object's lock or the page queues lock. This solves a problem independently reported by tegge@ and Isilon. Specifically, they observed the page daemon consuming a great deal of CPU time because of pages bouncing back and forth between the cache queue (PQ_CACHE) and the inactive queue (PQ_INACTIVE). The source of this problem turned out to be a deadlock avoidance strategy employed when selecting a cached page to reclaim in vm_page_select_cache(). However, the root cause was really that reclaiming a cached page required the acquisition of an object lock while the page queues lock was already held. Thus, this change addresses the problem at its root, by eliminating the need to acquire the object's lock. Moreover, keeping cached pages in the object's primary splay tree and memq was, in effect, optimizing for the uncommon case. Cached pages are reclaimed far, far more often than they are reactivated. Instead, this change makes reclamation cheaper, especially in terms of synchronization overhead, and reactivation more expensive, because reactivated pages will have to be reentered into the object's primary splay tree and memq. (2) Cached pages are now stored alongside free pages in the physical memory allocator's buddy queues, increasing the likelihood that large allocations of contiguous physical memory (i.e., superpages) will succeed. Finally, as a result of this change long-standing restrictions on when and where a cached page can be reclaimed and returned by vm_page_alloc(9) are eliminated. Specifically, calls to vm_page_alloc(9) specifying VM_ALLOC_INTERRUPT can now reclaim and return a formerly cached page. Consequently, a call to malloc(9) specifying M_NOWAIT is less likely to fail. Discussed with: many over the course of the summer, including jeff@, Justin Husted @ Isilon, peter@, tegge@ Tested by: an earlier version by kris@ Approved by: re (kensmith)
2007-09-25 06:25:06 +00:00
pmap_enter_quick(pmap, addr, m, entry->protection);
VM_OBJECT_RUNLOCK(lobject);
}
}
/*
* Hold each of the physical pages that are mapped by the specified range of
* virtual addresses, ["addr", "addr" + "len"), if those mappings are valid
* and allow the specified types of access, "prot". If all of the implied
* pages are successfully held, then the number of held pages is returned
* together with pointers to those pages in the array "ma". However, if any
* of the pages cannot be held, -1 is returned.
*/
int
vm_fault_quick_hold_pages(vm_map_t map, vm_offset_t addr, vm_size_t len,
vm_prot_t prot, vm_page_t *ma, int max_count)
{
vm_offset_t end, va;
vm_page_t *mp;
int count;
boolean_t pmap_failed;
if (len == 0)
return (0);
end = round_page(addr + len);
addr = trunc_page(addr);
/*
* Check for illegal addresses.
*/
if (addr < vm_map_min(map) || addr > end || end > vm_map_max(map))
return (-1);
if (atop(end - addr) > max_count)
panic("vm_fault_quick_hold_pages: count > max_count");
count = atop(end - addr);
/*
* Most likely, the physical pages are resident in the pmap, so it is
* faster to try pmap_extract_and_hold() first.
*/
pmap_failed = FALSE;
for (mp = ma, va = addr; va < end; mp++, va += PAGE_SIZE) {
*mp = pmap_extract_and_hold(map->pmap, va, prot);
if (*mp == NULL)
pmap_failed = TRUE;
else if ((prot & VM_PROT_WRITE) != 0 &&
(*mp)->dirty != VM_PAGE_BITS_ALL) {
/*
* Explicitly dirty the physical page. Otherwise, the
* caller's changes may go unnoticed because they are
* performed through an unmanaged mapping or by a DMA
* operation.
*
* The object lock is not held here.
* See vm_page_clear_dirty_mask().
*/
vm_page_dirty(*mp);
}
}
if (pmap_failed) {
/*
* One or more pages could not be held by the pmap. Either no
* page was mapped at the specified virtual address or that
* mapping had insufficient permissions. Attempt to fault in
* and hold these pages.
*/
for (mp = ma, va = addr; va < end; mp++, va += PAGE_SIZE)
if (*mp == NULL && vm_fault_hold(map, va, prot,
VM_FAULT_NORMAL, mp) != KERN_SUCCESS)
goto error;
}
return (count);
error:
for (mp = ma; mp < ma + count; mp++)
if (*mp != NULL) {
vm_page_lock(*mp);
vm_page_unhold(*mp);
vm_page_unlock(*mp);
}
return (-1);
}
1994-05-24 10:09:53 +00:00
/*
* Routine:
* vm_fault_copy_entry
* Function:
* Create new shadow object backing dst_entry with private copy of
* all underlying pages. When src_entry is equal to dst_entry,
* function implements COW for wired-down map entry. Otherwise,
* it forks wired entry into dst_map.
1994-05-24 10:09:53 +00:00
*
* In/out conditions:
* The source and destination maps must be locked for write.
* The source map entry must be wired down (or be a sharing map
* entry corresponding to a main map entry that is wired down).
*/
void
vm_fault_copy_entry(vm_map_t dst_map, vm_map_t src_map,
vm_map_entry_t dst_entry, vm_map_entry_t src_entry,
vm_ooffset_t *fork_charge)
1994-05-24 10:09:53 +00:00
{
vm_object_t backing_object, dst_object, object, src_object;
vm_pindex_t dst_pindex, pindex, src_pindex;
vm_prot_t access, prot;
These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a seperate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
vm_offset_t vaddr;
vm_page_t dst_m;
vm_page_t src_m;
boolean_t upgrade;
1994-05-24 10:09:53 +00:00
#ifdef lint
src_map++;
These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a seperate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
#endif /* lint */
1994-05-24 10:09:53 +00:00
upgrade = src_entry == dst_entry;
access = prot = dst_entry->protection;
1994-05-24 10:09:53 +00:00
src_object = src_entry->object.vm_object;
src_pindex = OFF_TO_IDX(src_entry->offset);
1994-05-24 10:09:53 +00:00
if (upgrade && (dst_entry->eflags & MAP_ENTRY_NEEDS_COPY) == 0) {
dst_object = src_object;
vm_object_reference(dst_object);
} else {
/*
* Create the top-level object for the destination entry. (Doesn't
* actually shadow anything - we copy the pages directly.)
*/
dst_object = vm_object_allocate(OBJT_DEFAULT,
atop(dst_entry->end - dst_entry->start));
#if VM_NRESERVLEVEL > 0
dst_object->flags |= OBJ_COLORED;
dst_object->pg_color = atop(dst_entry->start);
#endif
}
1994-05-24 10:09:53 +00:00
Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho
2013-03-09 02:32:23 +00:00
VM_OBJECT_WLOCK(dst_object);
KASSERT(upgrade || dst_entry->object.vm_object == NULL,
("vm_fault_copy_entry: vm_object not NULL"));
if (src_object != dst_object) {
dst_entry->object.vm_object = dst_object;
dst_entry->offset = 0;
dst_object->charge = dst_entry->end - dst_entry->start;
}
if (fork_charge != NULL) {
KASSERT(dst_entry->cred == NULL,
("vm_fault_copy_entry: leaked swp charge"));
dst_object->cred = curthread->td_ucred;
crhold(dst_object->cred);
*fork_charge += dst_object->charge;
} else if (dst_object->cred == NULL) {
KASSERT(dst_entry->cred != NULL, ("no cred for entry %p",
dst_entry));
dst_object->cred = dst_entry->cred;
dst_entry->cred = NULL;
}
/*
* If not an upgrade, then enter the mappings in the pmap as
* read and/or execute accesses. Otherwise, enter them as
* write accesses.
*
* A writeable large page mapping is only created if all of
* the constituent small page mappings are modified. Marking
* PTEs as modified on inception allows promotion to happen
* without taking potentially large number of soft faults.
*/
if (!upgrade)
access &= ~VM_PROT_WRITE;
1994-05-24 10:09:53 +00:00
/*
* Loop through all of the virtual pages within the entry's
* range, copying each page from the source object to the
* destination object. Since the source is wired, those pages
* must exist. In contrast, the destination is pageable.
* Since the destination object does share any backing storage
* with the source object, all of its pages must be dirtied,
* regardless of whether they can be written.
1994-05-24 10:09:53 +00:00
*/
for (vaddr = dst_entry->start, dst_pindex = 0;
These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a seperate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
vaddr < dst_entry->end;
vaddr += PAGE_SIZE, dst_pindex++) {
again:
1994-05-24 10:09:53 +00:00
/*
These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a seperate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
* Find the page in the source object, and copy it in.
* Because the source is wired down, the page will be
* in memory.
1994-05-24 10:09:53 +00:00
*/
if (src_object != dst_object)
VM_OBJECT_RLOCK(src_object);
object = src_object;
pindex = src_pindex + dst_pindex;
while ((src_m = vm_page_lookup(object, pindex)) == NULL &&
(backing_object = object->backing_object) != NULL) {
/*
* Unless the source mapping is read-only or
* it is presently being upgraded from
* read-only, the first object in the shadow
* chain should provide all of the pages. In
* other words, this loop body should never be
* executed when the source mapping is already
* read/write.
*/
KASSERT((src_entry->protection & VM_PROT_WRITE) == 0 ||
upgrade,
("vm_fault_copy_entry: main object missing page"));
VM_OBJECT_RLOCK(backing_object);
pindex += OFF_TO_IDX(object->backing_object_offset);
if (object != dst_object)
VM_OBJECT_RUNLOCK(object);
object = backing_object;
}
KASSERT(src_m != NULL, ("vm_fault_copy_entry: page missing"));
if (object != dst_object) {
/*
* Allocate a page in the destination object.
*/
dst_m = vm_page_alloc(dst_object, (src_object ==
dst_object ? src_pindex : 0) + dst_pindex,
VM_ALLOC_NORMAL);
if (dst_m == NULL) {
VM_OBJECT_WUNLOCK(dst_object);
VM_OBJECT_RUNLOCK(object);
VM_WAIT;
VM_OBJECT_WLOCK(dst_object);
goto again;
}
pmap_copy_page(src_m, dst_m);
VM_OBJECT_RUNLOCK(object);
dst_m->valid = VM_PAGE_BITS_ALL;
dst_m->dirty = VM_PAGE_BITS_ALL;
} else {
dst_m = src_m;
if (vm_page_sleep_if_busy(dst_m, "fltupg"))
goto again;
vm_page_xbusy(dst_m);
KASSERT(dst_m->valid == VM_PAGE_BITS_ALL,
("invalid dst page %p", dst_m));
}
Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho
2013-03-09 02:32:23 +00:00
VM_OBJECT_WUNLOCK(dst_object);
1994-05-24 10:09:53 +00:00
/*
* Enter it in the pmap. If a wired, copy-on-write
* mapping is being replaced by a write-enabled
* mapping, then wire that new mapping.
1994-05-24 10:09:53 +00:00
*/
pmap_enter(dst_map->pmap, vaddr, dst_m, prot,
access | (upgrade ? PMAP_ENTER_WIRED : 0), 0);
1994-05-24 10:09:53 +00:00
/*
These changes embody the support of the fully coherent merged VM buffer cache, much higher filesystem I/O performance, and much better paging performance. It represents the culmination of over 6 months of R&D. The majority of the merged VM/cache work is by John Dyson. The following highlights the most significant changes. Additionally, there are (mostly minor) changes to the various filesystem modules (nfs, msdosfs, etc) to support the new VM/buffer scheme. vfs_bio.c: Significant rewrite of most of vfs_bio to support the merged VM buffer cache scheme. The scheme is almost fully compatible with the old filesystem interface. Significant improvement in the number of opportunities for write clustering. vfs_cluster.c, vfs_subr.c Upgrade and performance enhancements in vfs layer code to support merged VM/buffer cache. Fixup of vfs_cluster to eliminate the bogus pagemove stuff. vm_object.c: Yet more improvements in the collapse code. Elimination of some windows that can cause list corruption. vm_pageout.c: Fixed it, it really works better now. Somehow in 2.0, some "enhancements" broke the code. This code has been reworked from the ground-up. vm_fault.c, vm_page.c, pmap.c, vm_object.c Support for small-block filesystems with merged VM/buffer cache scheme. pmap.c vm_map.c Dynamic kernel VM size, now we dont have to pre-allocate excessive numbers of kernel PTs. vm_glue.c Much simpler and more effective swapping code. No more gratuitous swapping. proc.h Fixed the problem that the p_lock flag was not being cleared on a fork. swap_pager.c, vnode_pager.c Removal of old vfs_bio cruft to support the past pseudo-coherency. Now the code doesn't need it anymore. machdep.c Changes to better support the parameter values for the merged VM/buffer cache scheme. machdep.c, kern_exec.c, vm_glue.c Implemented a seperate submap for temporary exec string space and another one to contain process upages. This eliminates all map fragmentation problems that previously existed. ffs_inode.c, ufs_inode.c, ufs_readwrite.c Changes for merged VM/buffer cache. Add "bypass" support for sneaking in on busy buffers. Submitted by: John Dyson and David Greenman
1995-01-09 16:06:02 +00:00
* Mark it no longer busy, and put it on the active list.
1994-05-24 10:09:53 +00:00
*/
Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho
2013-03-09 02:32:23 +00:00
VM_OBJECT_WLOCK(dst_object);
if (upgrade) {
if (src_m != dst_m) {
vm_page_lock(src_m);
vm_page_unwire(src_m, PQ_INACTIVE);
vm_page_unlock(src_m);
vm_page_lock(dst_m);
vm_page_wire(dst_m);
vm_page_unlock(dst_m);
} else {
KASSERT(dst_m->wire_count > 0,
("dst_m %p is not wired", dst_m));
}
} else {
vm_page_lock(dst_m);
vm_page_activate(dst_m);
vm_page_unlock(dst_m);
}
vm_page_xunbusy(dst_m);
1994-05-24 10:09:53 +00:00
}
Switch the vm_object mutex to be a rwlock. This will enable in the future further optimizations where the vm_object lock will be held in read mode most of the time the page cache resident pool of pages are accessed for reading purposes. The change is mostly mechanical but few notes are reported: * The KPI changes as follow: - VM_OBJECT_LOCK() -> VM_OBJECT_WLOCK() - VM_OBJECT_TRYLOCK() -> VM_OBJECT_TRYWLOCK() - VM_OBJECT_UNLOCK() -> VM_OBJECT_WUNLOCK() - VM_OBJECT_LOCK_ASSERT(MA_OWNED) -> VM_OBJECT_ASSERT_WLOCKED() (in order to avoid visibility of implementation details) - The read-mode operations are added: VM_OBJECT_RLOCK(), VM_OBJECT_TRYRLOCK(), VM_OBJECT_RUNLOCK(), VM_OBJECT_ASSERT_RLOCKED(), VM_OBJECT_ASSERT_LOCKED() * The vm/vm_pager.h namespace pollution avoidance (forcing requiring sys/mutex.h in consumers directly to cater its inlining functions using VM_OBJECT_LOCK()) imposes that all the vm/vm_pager.h consumers now must include also sys/rwlock.h. * zfs requires a quite convoluted fix to include FreeBSD rwlocks into the compat layer because the name clash between FreeBSD and solaris versions must be avoided. At this purpose zfs redefines the vm_object locking functions directly, isolating the FreeBSD components in specific compat stubs. The KPI results heavilly broken by this commit. Thirdy part ports must be updated accordingly (I can think off-hand of VirtualBox, for example). Sponsored by: EMC / Isilon storage division Reviewed by: jeff Reviewed by: pjd (ZFS specific review) Discussed with: alc Tested by: pho
2013-03-09 02:32:23 +00:00
VM_OBJECT_WUNLOCK(dst_object);
if (upgrade) {
dst_entry->eflags &= ~(MAP_ENTRY_COW | MAP_ENTRY_NEEDS_COPY);
vm_object_deallocate(src_object);
}
}
/*
* Block entry into the machine-independent layer's page fault handler by
* the calling thread. Subsequent calls to vm_fault() by that thread will
* return KERN_PROTECTION_FAILURE. Enable machine-dependent handling of
* spurious page faults.
*/
int
vm_fault_disable_pagefaults(void)
{
return (curthread_pflags_set(TDP_NOFAULTING | TDP_RESETSPUR));
}
void
vm_fault_enable_pagefaults(int save)
{
curthread_pflags_restore(save);
}